Storing, manipulating and visualizing timeseries using open source packages in Python

Jonathan Rocher

Type: Talk
Audience level: Intermediate
Category: Big Data

March 10th 11:05 a.m. – 11:45 a.m.

Description

Analyzing, storing and visualizing time-series efficiently are recurring yet difficult tasks in many areas of scientific data analysis, such as meteorological forecasting and financial modeling. In this talk we will explore the current Python ecosystem for doing this effectively, comparing the available options and using only open source packages that are mature yet still under active development.

Abstract

Detailed description

We will first discuss the data structures available to load and hold time-series efficiently: standard NumPy arrays, NumPy's recently revised datetime support, and domain-specific libraries such as scikits.timeseries, Pandas and Larry. We will walk through the strengths and use cases of each of these tools.
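As a rough illustration of the first of these options, the sketch below holds a time-series as two parallel NumPy arrays, one of `datetime64` timestamps and one of values. The hourly readings are made-up example data, and the names `timestamps`, `values` and `morning` are ours, not taken from the talk.

```python
import numpy as np

# Hypothetical example data: one reading per hour over a single day.
timestamps = np.arange("2012-03-10T00", "2012-03-11T00", dtype="datetime64[h]")
values = np.linspace(10.0, 21.5, timestamps.size)

# A boolean mask on the time axis selects a window without a Python loop.
morning = (timestamps >= np.datetime64("2012-03-10T06")) & (
    timestamps < np.datetime64("2012-03-10T12")
)
morning_mean = values[morning].mean()

# Subtracting two datetime64 values yields a timedelta64.
step = timestamps[1] - timestamps[0]
```

Keeping timestamps and values in separate arrays is the simplest layout; the domain-specific libraries mentioned above wrap the same idea in a labelled container with alignment and resampling helpers.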

Options for storing time-series have evolved drastically in recent years. All common relational databases (SQLite, PostgreSQL, Oracle, MySQL, ...) can be accessed from Python through a set of standard modules. Python also provides access to standard hierarchical data formats (HDF5, NetCDF, ...) through packages that read and write them efficiently (PyTables or h5py). We will illustrate with a few examples how relational databases and hierarchical datasets can be used for storing time-series. The focus will be on the trade-offs between the two data models.
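A minimal sketch of the relational side of that comparison, using only the `sqlite3` module from the standard library. The table name `readings`, its columns and the sample data are hypothetical; timestamps are stored as ISO 8601 text so that plain string comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE readings ("
    " sensor TEXT NOT NULL,"
    " ts TEXT NOT NULL,"  # ISO 8601 timestamp
    " value REAL NOT NULL,"
    " PRIMARY KEY (sensor, ts))"
)

# Hypothetical example data: 24 hourly readings from one sensor.
start = datetime(2012, 3, 10)
rows = [
    ("station_a", (start + timedelta(hours=h)).isoformat(), 10.0 + 0.5 * h)
    for h in range(24)
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
conn.commit()

# Select a time window; the primary key on (sensor, ts) supports this range scan.
window = conn.execute(
    "SELECT ts, value FROM readings"
    " WHERE sensor = ? AND ts >= ? AND ts < ? ORDER BY ts",
    ("station_a", "2012-03-10T06:00:00", "2012-03-10T12:00:00"),
).fetchall()
```

The relational layout stores one row per observation, which makes ad hoc queries and joins easy. A hierarchical format such as HDF5 would instead typically store each series as a contiguous array, which tends to favour bulk reads of long series.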

Data analysis in Python can likewise leverage many open source packages in SciPy and the scikits ecosystem around it. We will review the parts of SciPy that are useful for statistical analysis and for Monte Carlo simulations that incorporate forecasts and statistical knowledge. Additional regression and statistical tools can be found in the statsmodels scikit, which will be illustrated if time permits.
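To give a flavour of a Monte Carlo forecast, the sketch below simulates many random-walk continuations of a series and summarizes them as a percentile band. It is only one simple model among many: the drift, volatility and starting value are assumed numbers, and it uses NumPy's random generator alone rather than any SciPy routine.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

n_paths, horizon = 5000, 30  # simulated paths and forecast steps
last_value = 100.0           # assumed last observed value of the series
drift, volatility = 0.1, 1.0 # assumed per-step mean and standard deviation

# Each row is one simulated future: a cumulative sum of normal increments.
steps = rng.normal(loc=drift, scale=volatility, size=(n_paths, horizon))
paths = last_value + np.cumsum(steps, axis=1)

# Summarize the simulated futures per time step.
median = np.percentile(paths, 50, axis=0)
lower, upper = np.percentile(paths, [5, 95], axis=0)
```

The band between `lower` and `upper` widens with the horizon, which reflects the growing uncertainty of a random-walk forecast.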

The final part of the talk will present tools to visualize these time-series using a powerful 2D visualization library: Chaco. Part of the Enthought Tool Suite, Chaco is an open source package that focuses on dealing with large datasets. It makes it quick to develop custom tools for interacting with a plot, such as selection tools and overlays.

Combining the Python language with powerful packages like NumPy, PyTables and Chaco allows one to create a time-series data management platform that is robust, fast, maintainable and open.

Target audience

This talk is aimed at intermediate-level scientific programmers with an interest in time-series management. It assumes no prior knowledge of the packages discussed beyond regular Python and NumPy.