
Friday 4:30 p.m.–5 p.m.

Losing your Loops: Fast Numerical Computing with NumPy

Jake VanderPlas

Audience level: Intermediate
Category: Science

Description

NumPy, the core array computing library for Python, provides tools for flexible and powerful data analysis, and is the basis for most scientific code written in Python. Getting the most out of NumPy, though, might require slightly changing how you think about writing code: this talk will outline the basic strategies essential to performing fast numerical computations in Python with NumPy.

Abstract

While Python is sometimes maligned as too slow for real-world applications, many scientists, statisticians, and other data-oriented programmers find it efficient and powerful for large-scale data analysis. The vast majority of this kind of analysis in Python is performed using NumPy, a package that extends Python with a powerful array-oriented computing interface. These arrays are flexible enough to handle all kinds of data: scientific imaging, sales records, statistical model parameters, logfile results, and more. The tools in NumPy form the core of other well-known Python data science packages such as Pandas, Scikit-Learn, SciPy, Matplotlib, and many more.

Designing efficient data-intensive algorithms in Python requires not just using NumPy, but using it effectively. In broad strokes, this amounts to replacing slow loops over datasets with more efficient vectorized operations. In this talk I’ll briefly cover why loops in CPython tend to be slow, and why vectorizing these operations with NumPy can often provide speedups of 100-1000x. I’ll go on to introduce the four practical concepts essential to fast data analytics using NumPy: *aggregation functions*, *universal functions*, *broadcasting*, and *fancy indexing*. With a firm understanding of these four patterns, you’ll be well on your way to writing fast and efficient data-intensive code in Python.
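As a rough illustration of the four patterns named in the abstract, here is a minimal sketch. It is not taken from the talk itself: the array names, sizes, and the helper `reciprocal_loop` are made up for the example, and it only shows what each pattern looks like next to an explicit loop, without claiming any particular speedup.

```python
import numpy as np

# Illustrative data only; the seed and sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
x = rng.random(1000)
m = np.arange(12, dtype=float).reshape(3, 4)


def reciprocal_loop(values):
    """Explicit Python loop: the style the abstract suggests replacing."""
    out = np.empty(len(values))
    for i in range(len(values)):
        out[i] = 1.0 / (values[i] + 1.0)
    return out


# 1. Universal functions: element-wise arithmetic without a Python loop.
recip_vec = 1.0 / (x + 1.0)

# 2. Aggregation functions: reduce an array to a summary value.
total = x.sum()
col_means = m.mean(axis=0)  # one mean per column, shape (4,)

# 3. Broadcasting: combine arrays of different shapes, here (3, 4) with (4,).
centered = m - col_means

# 4. Fancy indexing: select elements with an index array or a boolean mask.
picked_rows = m[np.array([0, 2])]  # rows 0 and 2, shape (2, 4)
large = x[x > 0.5]                 # only the values above 0.5
```

The loop and the vectorized expression compute the same values; comparing them with `np.allclose(reciprocal_loop(x), recip_vec)` is one way to check a vectorized rewrite before timing it.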