Data analysis in Python with pandas

Type: Tutorial
Audience level: Experienced
Category: Science
March 7th 1:20 p.m. – 4:40 p.m.

Description

The tutorial will give a hands-on introduction to manipulating and analyzing large and small structured data sets in Python using the pandas library. While the focus will be on learning the nuts and bolts of the library's features, I also aim to demonstrate a different way of thinking regarding structuring data in memory for manipulation and analysis.

Abstract

What the tutorial will teach students

The tutorial will teach the mechanics of the most important features of pandas. It will focus on the nuts and bolts of the two main data structures, Series (1D) and DataFrame (2D), as they relate to a variety of common data handling problems in Python. The tutorial will be supplemented by a collection of scripts and example data sets for participants to run while following along with the material. As such, a significant part of the tutorial will be spent doing interactive data exploration and working through examples from within the IPython console.
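
As a rough preview of the two structures named above, here is a minimal sketch. The labels, column names, and values are made up for illustration and are not taken from the tutorial's example data sets; the code is written against a recent pandas release, which may differ from the version used in the tutorial.

```python
import pandas as pd

# Series: a 1D array of values with an index of labels.
prices = pd.Series([10.0, 12.5, 11.0], index=["a", "b", "c"])

# DataFrame: a 2D table whose columns are Series sharing one row index.
# The "volume" Series deliberately has no entry for label "b".
frame = pd.DataFrame(
    {"price": prices, "volume": pd.Series([100, 250], index=["a", "c"])}
)

# Selecting by label works on both structures.
first_price = prices["a"]
price_column = frame["price"]

# Label "b" has no volume, so the constructor fills that cell with NaN.
missing_volume = frame["volume"].isnull()
```

Building the DataFrame from a dict of Series shows the row labels being combined automatically, which is the behaviour the later outline items on alignment and missing data build on.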

The tutorial will also teach participants best practices for structuring data in memory and the do's and don'ts of high-performance computing with large data sets in Python. For participants who have never used IPython, the tutorial will also provide a gentle introduction to interactive scientific computing with IPython.
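
The abstract does not list the specific do's and don'ts, so the following is only an assumed example of the kind of contrast such advice usually involves: an element-by-element Python loop versus a single vectorized expression over whole columns. The frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; real data sets would be much larger.
frame = pd.DataFrame(
    {"x": np.arange(5, dtype=float), "y": np.arange(5, dtype=float) * 2}
)

# Pattern to avoid on large data: a Python-level loop that handles
# one element at a time.
looped = [frame["x"][i] * frame["y"][i] for i in range(len(frame))]

# Preferred pattern: one vectorized expression over entire columns,
# which is evaluated in compiled NumPy code.
vectorized = frame["x"] * frame["y"]
```

Both forms compute the same products; the difference is only in where the looping happens.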

Prerequisites

Participants should have:

  • Intermediate to advanced Python experience with a solid grasp of the built-in Python data structures and types: strings, lists, dicts, tuples
  • Basic experience with NumPy: ndarray objects, data types, and vectorized operations on arrays

Laptop prerequisites

  • Python 2.7
  • IPython >= 0.11
  • A text editor of some kind (we will be running code within IPython)
  • Latest official releases of NumPy, SciPy, and matplotlib
  • A functioning GUI backend ("ipython --pylab" should not error)
  • Latest release of pandas and dependencies
  • python-dateutil

Optional, but not required:

  • PyTables
  • scikits.statsmodels
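
One possible way to check a laptop against the list above is a short script like this sketch. The helper `installed_version` and the two lists are illustrative names, not part of the tutorial materials. The script uses the import names `tables` for PyTables and `dateutil` for python-dateutil, and it does not test the GUI backend, which still needs the "ipython --pylab" check.

```python
from __future__ import print_function

import importlib
import sys


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Any failure to import is treated as "not usable" for this check.
        return None
    return getattr(module, "__version__", "unknown")


# Import names for the required and optional packages listed above.
REQUIRED = ["IPython", "numpy", "scipy", "matplotlib", "pandas", "dateutil"]
OPTIONAL = ["tables", "scikits.statsmodels"]


def report():
    """Print one line per package, marking the ones that failed to import."""
    print("Python", sys.version.split()[0])
    for name in REQUIRED + OPTIONAL:
        version = installed_version(name)
        kind = "required" if name in REQUIRED else "optional"
        status = version if version is not None else "MISSING"
        print("{0:<22} {1:<10} {2}".format(name, kind, status))
```

Calling `report()` from an IPython session or a script prints the table; a "MISSING" entry next to a required package means it needs to be installed before the tutorial.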

Outline

  • Brief overview of IPython, necessary for rest of tutorial

    • Tab completion, introspection
    • Getting help
    • Running scripts
    • Interactive debugger
  • Intro to indexed data structures

    • Series: 1d vector with labels
    • DataFrame: 2d tabular data structure with row and column labels
    • Common construction patterns, dicts of arrays, nested dicts, etc.
  • Indexing mechanics

    • Selecting data by label
    • Selecting subsets of data and slicing
    • Advanced label-based indexing
    • What if you have no index?
  • DataFrame: dict-like and matrix-like container for data

    • Fundamental data types
    • Column insertion and deletion
    • Converting column types, e.g. strings to datetimes, and other casting
    • Transposing and limitations
  • Loading and storing data from various sources

    • Flat files (CSV, delimited)
    • Parser options: indexing, date parsing
    • SQL databases
    • PyTables / HDF5: HDFStore class (helpful to have PyTables installed)
  • Iterating over data structures

    • Iteration basics
    • Efficiently iterating over DataFrame rows and columns
  • Data alignment

    • Binary operations between Series, DataFrame objects
    • Automatic and explicit data alignment
    • The reindex and align functions
  • Data Analysis 101

    • Handling missing data
    • Descriptive statistics
    • General function application with apply
  • Hierarchical (multi-level) indexing

    • Creating hierarchical indexes from flat files
    • Creating your own hierarchical index (MultiIndex) objects
    • Selecting data groups by levels
  • Joining and merging data sets

    • Joining on index
    • Joining on single or multiple keys
    • Relationship to analogous SQL operations or Excel VLOOKUP-type operations
  • Reshaping and pivoting data

    • Creating Excel / spreadsheet-style pivot tables using DataFrame
    • Reshaping operations, integration with hierarchical indexing
    • larry, datarray
  • Time series functionality

    • Date range generation
    • Frequency conversions
    • Resampling and interpolation
    • Plotting
    • Moving window functions
  • GroupBy: working with naturally grouped datasets

    • Grouping by DataFrame columns and arbitrary keys
    • Iterating over groups
    • Aggregating groups
    • Transforming groups
    • More general / advanced function application and combining results
    • Non-trivial examples
      • Time-series related aggregation
      • Random sampling from groups
  • Relating GroupBy to pivoting and reshaping

    • GroupBy operations expressed as reshaping-based operations
    • Building some more intuition for stacking / unstacking
  • Data visualization

    • Integration with matplotlib
  • Other topics

    • Integration with statsmodels for statistical modeling
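
To give a feel for how several of the outline topics connect, here is a small sketch covering data alignment, missing data, GroupBy, and unstacking. All data, labels, and column names are invented for illustration, and the code is written against a recent pandas release rather than the version used in the tutorial.

```python
import pandas as pd

# Data alignment: arithmetic matches values by label, not by position.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
# Labels "a" and "d" appear in only one operand, so their sums are NaN.
total = s1 + s2

# Missing data: drop the unmatched labels, or fill them with a default.
matched = total.dropna()
filled = total.fillna(0.0)

# GroupBy: split rows by a key column, then aggregate each group.
sales = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west"],
        "product": ["x", "y", "x", "y"],
        "units": [5, 7, 3, 9],
    }
)
by_region = sales.groupby("region")["units"].sum()

# Grouping by two keys yields a result with a hierarchical index.
by_both = sales.groupby(["region", "product"])["units"].sum()

# Unstacking the inner level reshapes that result into a pivot-style
# table with regions as rows and products as columns.
table = by_both.unstack("product")
```

The last two steps illustrate the outline item on relating GroupBy to reshaping: the grouped result and the pivot-style table hold the same numbers in two different layouts.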