PyCon 2016 in Portland, OR

Saturday 1:20 p.m.–4:40 p.m.

A Tour of Large-Scale Data Analysis Tools in Python

Sarah Guido, Sean O'Connor

Audience level: Intermediate
Category: Best Practices & Patterns

Description

Large-scale data analysis is complicated. There’s a limit to how much data you can analyze on a single box, but it is relatively inexpensive to get access to a large number of commodity servers. In this tutorial, you’ll learn how to leverage the power of distributed computing tools to do large-scale data analysis quickly and affordably using pure Python, Hadoop MapReduce, and Apache Spark.

Abstract

Large-scale data analysis can be complicated. There’s a limit to how much data you can analyze on a single box, but thankfully it is relatively inexpensive to get access to a large number of commodity servers, and there are well-established tools and frameworks for taking advantage of them. The tools covered in this tutorial are among the most widely used for large-scale data analysis. We’ll show you how to analyze data in three different ways (each is sketched in the examples below):

1. Pure Python, to give you a baseline for what we’re going to accomplish.
2. MapReduce on Hadoop, the tried-and-true method of analyzing large amounts of data.
3. Apache Spark, a promising newcomer to the data processing scene that comes with a suite of distributed machine learning and statistical tools.

You’ll learn how to leverage the power of distributed computing tools to do large-scale data analysis quickly and affordably, and you’ll leave with a general understanding of commonly available solutions for analyzing big data and the basic concepts of data processing in pure Python, Hadoop MapReduce, and Apache Spark.
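As a concrete taste of the three approaches, here is the classic word-count task sketched in each style. These sketches are illustrative only and are not taken from the tutorial materials: the input path corpus.txt, the class and application names, and the choice of the mrjob library for running Python on Hadoop are all assumptions.

First, the single-machine pure-Python baseline:

    # Baseline: count words in a file on a single machine.
    # "corpus.txt" is a placeholder for whatever dataset you use.
    from collections import Counter

    def count_words(path):
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts.update(line.split())
        return counts

    if __name__ == "__main__":
        for word, n in count_words("corpus.txt").most_common(10):
            print(word, n)

Next, the same job as a Hadoop MapReduce program. mrjob is one common way to write Hadoop Streaming jobs in Python; the tutorial itself may use a different tool.

    # Word count as a MapReduce job, written with mrjob (an assumption,
    # shown only to illustrate the mapper/reducer structure).
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Emit each word with a count of 1.
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # Sum the partial counts for each word across all mappers.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

Finally, the same job in PySpark, Spark’s Python API:

    # Word count as a Spark job; the app name and path are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")
    counts = (sc.textFile("corpus.txt")
                .flatMap(lambda line: line.split())   # one record per word
                .map(lambda word: (word, 1))          # pair each word with 1
                .reduceByKey(lambda a, b: a + b))     # sum counts per word
    for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, n)
    sc.stop()

The mrjob version runs locally with "python wordcount.py corpus.txt" and on a Hadoop cluster with a runner flag change, and the Spark version is submitted with spark-submit: the same logic scales from a laptop to a cluster of commodity servers, which is the point of the tutorial.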

Student Handout

No handouts have been provided yet for this tutorial.