pandas: Powerful data analysis tools for Python

Type: Talk
Audience level: Intermediate
Category: Science
March 9th 5:20 p.m. – 6 p.m.

Description

pandas is a Python library providing fast, expressive data structures for working with structured or relational data sets. In addition to serving general-purpose data manipulation and analysis, it has been designed to make Python a competitive statistical computing platform. In this talk, I will discuss the library's features and show a variety of topical examples.

Abstract

Overview

pandas is a Python library providing fast, expressive data structures for working with structured or relational data sets. In addition to serving general-purpose data manipulation and analysis, it has been designed to make Python a competitive statistical computing platform. In this talk, I will discuss the library's features and show a variety of topical examples.

In contrast to comparable tools in domain-specific data analysis languages such as R, pandas features deeply integrated axis indexing, which enables intuitive data alignment, pivoting and reshaping, joining and merging, and other standard relational data manipulations. In particular, I would like to show why the library should be of interest to many Python users outside the scientific Python community.
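
As a minimal sketch of what label-based alignment means in practice, the snippet below adds two labeled vectors whose labels only partly overlap. The data and variable names are invented for illustration, and the calls assume a reasonably recent pandas release.

```python
import pandas as pd

# Two labeled vectors whose labels only partly overlap (made-up data).
q1 = pd.Series([10.0, 20.0, 30.0], index=["apples", "pears", "plums"])
q2 = pd.Series([1.0, 2.0, 3.0], index=["pears", "plums", "figs"])

# Arithmetic aligns on labels, not positions. A label present in only
# one operand yields a missing value (NaN) in the result.
total = q1 + q2

# Alternatively, treat a label missing from one side as zero.
total_filled = q1.add(q2, fill_value=0)
```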

Outline

  • Talk overview (1 min)

    • Examples and survey of ecosystem
    • What matters for structured / relational data problems
    • pandas library deep dive with demos
    • Example application or two
    • Extensions and future direction
  • Motivating examples (3 min)

    • SQL databases
    • Flat files
    • Spreadsheet operations and pivot tables
    • Statistical data sets (with missing data)
    • Time series data
  • Brief survey of non-scientific Python ecosystem (2 min)

    • Python stdlib: collections, itertools, etc.
    • Miscellaneous libraries: CSVKit, asciitable, others
    • Database abstraction layers: SQLAlchemy, etc.
  • Scientific Python Libraries (1 min)

    • NumPy
    • PyTables / HDF5
    • pandas
    • tabular
    • larry, datarray
  • Features that matter for structured data (5 min)

    • Tabular data structures with heterogeneous columns
    • Rich indexing capability, including hierarchical indexing
    • Flexible organization of irregular data
    • Integrated data alignment
    • Missing data handling
    • Group by / aggregation functionality
    • Joining / merging capability
    • Reshaping / pivoting operations
    • Other "domain specific" features

      • Data visualization (matplotlib)
      • Time series functionality
      • Integration with statistical models
  • pandas library overview (3 min)

    • A high-performance, NumPy-based library designed to solve all of the above problems
    • Intro to the labeled vector object (Series) and the tabular data structure (DataFrame)
    • Size mutability (table column insertion and deletion)
  • Indexing (5 min)

    • Simple indexing with unique labels
    • Data alignment on labels
    • Hierarchical indexing
    • Reindexing and subset selection
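
The indexing topics above might look roughly like the following sketch. The labels and values are hypothetical, and the calls assume a reasonably recent pandas release rather than the exact API shown in the talk.

```python
import pandas as pd

# A Series with a two-level (hierarchical) index; the data is made up.
index = pd.MultiIndex.from_tuples(
    [("2011", "Q1"), ("2011", "Q2"), ("2012", "Q1")],
    names=["year", "quarter"],
)
sales = pd.Series([100, 120, 130], index=index)

# Selecting on the outer level returns the matching inner-level slice.
sales_2011 = sales.loc["2011"]

# A full key selects a single element.
q1_2012 = sales.loc[("2012", "Q1")]

# Reindexing conforms data to a new set of labels: existing labels are
# reordered, and labels absent from the original become NaN.
flat = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
subset = flat.reindex(["c", "a", "z"])
```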
  • GroupBy and aggregation (5 min)

    • How GroupBy relates to indexing
    • Iterating over groups
    • Aggregating, transforming
    • Building custom grouping logic
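
A small sketch of the GroupBy operations listed above, using an invented table of team scores. It assumes a reasonably recent pandas release; the column names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [1.0, 3.0, 10.0, 20.0],
})

# Iterating over groups yields (key, sub-frame) pairs.
group_sizes = {key: len(group) for key, group in df.groupby("team")}

# Aggregating: one value per group.
means = df.groupby("team")["score"].mean()

# Transforming: the result has the same shape as the input, here used
# to subtract each group's mean from its members.
demeaned = df["score"] - df.groupby("team")["score"].transform("mean")

# Custom grouping logic: a function applied to each index label.
parity = df["score"].groupby(lambda i: "even" if i % 2 == 0 else "odd").sum()
```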
  • Pivoting and reshaping data sets (5 min)

    • Simple spreadsheet example (Excel, OpenOffice)
    • Show same pivot table example in Python with pandas
    • Reshaping functions: stack and unstack
    • Relationship to GroupBy
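
To illustrate how a spreadsheet-style pivot table relates to GroupBy and to stack/unstack, here is a sketch on made-up sales records. The column names are hypothetical and the calls assume a reasonably recent pandas release.

```python
import pandas as pd

records = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["x", "y", "x", "y"],
    "units": [1, 2, 3, 4],
})

# Spreadsheet-style pivot table: regions as rows, products as columns.
table = records.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

# The same table built from a group-by followed by unstack, which moves
# the inner index level ("product") into the columns.
via_groupby = (
    records.groupby(["region", "product"])["units"].sum().unstack("product")
)

# stack is the inverse: column labels move back into the row index.
long_form = table.stack()
```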
  • Data analysis and visualization (5 min)

    • Descriptive statistics with missing data
    • Binary operations between pandas data structures
    • Integration with matplotlib
    • Integration with statsmodels for statistics
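
The first two items above can be sketched as follows, on a tiny invented DataFrame with missing values. This assumes a reasonably recent pandas release; plotting and statsmodels integration are left out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Descriptive statistics skip missing values by default.
col_means = df.mean()
counts = df.count()  # number of non-missing entries per column

# Passing skipna=False propagates missing values instead.
strict_means = df.mean(skipna=False)

# Binary operation between a DataFrame and a Series: the Series labels
# are aligned with the column labels, then broadcast down the rows.
centered = df - col_means
```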
  • Time series functionality (3 min)

    • Date range generation
    • Resampling / frequency conversion
    • Time series plotting
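
A brief sketch of date range generation and frequency conversion on synthetic daily data. The values are invented, and the method names assume a reasonably recent pandas release; plotting is omitted.

```python
import numpy as np
import pandas as pd

# Date range generation: 60 consecutive days starting on a Sunday.
days = pd.date_range("2012-01-01", periods=60, freq="D")
daily = pd.Series(np.arange(60, dtype=float), index=days)

# Resampling: downsample to weekly means. The "W" frequency labels each
# bin with the Sunday that ends the week.
weekly = daily.resample("W").mean()

# Frequency conversion without aggregation: keep every second day.
every_other = daily.asfreq("2D")
```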
  • Some basic performance benchmarks

  • Comparisons with other tools (2 min)

    • NumPy structured arrays (a.k.a. record arrays)
    • Other SciPy libraries: larry, tabular
    • R language
  • Future directions and other ideas (3 min)

    • pandas-optimized database drivers (ODBC, MySQL, Postgres, etc.)
    • Adapt pandas for big data processing, map reduce