Using Python with SciDB

Shane Grigsby

Audience level:
Intermediate
Category:
Databases

Description

Python provides an easy and often surprisingly efficient array computing environment, provided that your data fit in memory. For 'Big Data' that can't fit in memory, one option is SciDB, an array database. Using Python with SciDB supports two workflows for large datasets: first, as a data storage system for large arrays; second, as a way of running distributed operations.

Abstract

This project presents two different ways of using Python to process earth science data with SciDB. First, we use SciDB as an external database when calculating snow grain size and fractional cover from MODIS satellite data. By taking advantage of typed arrays in NumPy, and indexes and labels in pandas, we load data into SciDB through Python. We then retrieve arrays piece by piece to run our analysis in Python. Second, we use SciDB as both a data store and a linear algebra engine. By executing matrix inversions in SciDB, we can implement kriging interpolation, prediction, and simulation on much larger arrays than is possible directly within NumPy and SciPy, while keeping NumPy and SciPy syntax.
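
To make the linear algebra step concrete, here is a minimal in-memory sketch of ordinary kriging in plain NumPy. It is an illustration under stated assumptions and not the project's code: the function names, the exponential covariance model, and the `sill` and `corr_range` parameters are all hypothetical, and it uses no SciDB calls. It also solves the kriging system with `np.linalg.solve` instead of forming an explicit inverse. The dense solve on the (n + 1) x (n + 1) kriging matrix is the part that stops fitting in memory as the number of observations grows, which is the kind of operation the abstract describes handing to SciDB.

```python
import numpy as np


def exponential_covariance(h, sill=1.0, corr_range=10.0):
    """Assumed covariance model: C(h) = sill * exp(-h / corr_range)."""
    return sill * np.exp(-h / corr_range)


def ordinary_kriging(coords, values, targets, sill=1.0, corr_range=10.0):
    """Predict values at `targets` from observations at `coords`.

    coords:  (n, d) observation locations
    values:  (n,)   observed values
    targets: (m, d) prediction locations
    """
    n = coords.shape[0]

    # Pairwise distances between observation locations, shape (n, n).
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    # Kriging matrix: covariances bordered by a row and column of ones
    # (the Lagrange multiplier that forces the weights to sum to one).
    K = np.ones((n + 1, n + 1))
    K[:n, :n] = exponential_covariance(d_obs, sill, corr_range)
    K[n, n] = 0.0

    # Right-hand side: covariances between observations and targets.
    d_tgt = np.linalg.norm(coords[:, None, :] - targets[None, :, :], axis=-1)
    rhs = np.ones((n + 1, targets.shape[0]))
    rhs[:n, :] = exponential_covariance(d_tgt, sill, corr_range)

    # Dense solve; this is the memory-bound step for large n.
    weights = np.linalg.solve(K, rhs)[:n, :]  # drop the multiplier row

    return weights.T @ values


# Usage with made-up sample points (not MODIS data).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
values = np.array([1.0, 2.0, 3.0, 4.0])
print(ordinary_kriging(coords, values, np.array([[0.5, 0.5]])))
```

Without a nugget term, this formulation reproduces the observed values exactly at the observation locations, which gives a quick sanity check on any larger-scale version of the same computation.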