Thursday 1:20 p.m.–4:40 p.m.

Parallel Data Analysis

Ben Zaitlen, Matthew Rocklin, Min Ragan-Kelley, Olivier Grisel

Description

An overview of parallel computing techniques available from Python and hands-on experience with a variety of frameworks. This course has two primary goals: 1. Teach students how to reason about parallel computing 2. Provide hands-on experience with a variety of different parallel computing frameworks Students will walk away with both a high-level understanding of parallel problems and how to select and use an appropriate parallel computing framework for their problem. They will get hands-on experience using tools both on their personal laptop, and on a cluster environment that will be provided for them at the tutorial. For the first half we cover programming patterns for parallelism found across many tools, notably map, futures, and big-data collections. We investigate these common APIs by diving into a sequence of examples that require increasingly complex tools. We learn the benefits and costs of each API and the sorts of problems where each is appropriate. For the second half, we focus on the performance aspects of frameworks and give intuition on how to pick the right tool for the job. This includes common challenges in parallel analysis, such as communication costs, debugging parallel code, as well as deployment and setup strategies.

Student Handout

No handouts have been provided yet for this tutorial