Connecting Disparate Data Sources with Toothpick

Andrew Roberts

Audience level: Intermediate
Category: Databases/NoSQL

Description

Toothpick is an open-source data-mapping framework written in Python for building lightweight, easy-to-use models of diverse, schema-optional data sources. It is geared

Abstract

Toothpick was developed to support present-day genomics research, which increasingly depends on integrating the large volumes of heterogeneous data being produced. The Broad Institute’s Genome Sequencing Center supports large-scale comparative genomic and metagenomic infectious disease research projects, which need to relate clinical metadata to raw sequences and derived features. Our research relies on the ability to connect many disparate data sources, and developing solutions with existing Python libraries has proven cumbersome.

Toothpick enables Broad developers to build scientific research products quickly by providing a common API that can combine genomic sequences and annotations from a RESTful web service with project metadata from document-based NoSQL data stores. This allows us to define a standard means of modeling and interacting with data objects, regardless of the backing source, and greatly reduces development time.
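As a rough illustration of what a common API over different backing sources can look like, the following sketch uses plain Python only. It does not use Toothpick and should not be read as Toothpick's actual API: the names `Source`, `DictSource`, `CallableSource`, and `find` are hypothetical, and the in-memory sample data is invented to stand in for a document store and a web service.

```python
from typing import Any, Callable, Dict, Optional, Protocol


class Source(Protocol):
    """Hypothetical minimal interface that every backing source provides."""

    def fetch(self, key: str) -> Optional[Dict[str, Any]]:
        ...


class DictSource:
    """Stand-in for a document store: records live in a dict keyed by ID."""

    def __init__(self, documents: Dict[str, Dict[str, Any]]) -> None:
        self._documents = documents

    def fetch(self, key: str) -> Optional[Dict[str, Any]]:
        doc = self._documents.get(key)
        # Return a copy so callers cannot mutate the stored record.
        return dict(doc) if doc is not None else None


class CallableSource:
    """Stand-in for a web service: each lookup is delegated to a function."""

    def __init__(self, getter: Callable[[str], Optional[Dict[str, Any]]]) -> None:
        self._getter = getter

    def fetch(self, key: str) -> Optional[Dict[str, Any]]:
        return self._getter(key)


def find(source: Source, key: str) -> Dict[str, Any]:
    """Look up a record the same way, whatever the backing source is."""
    record = source.fetch(key)
    if record is None:
        raise KeyError(key)
    return record


# Invented sample data, for illustration only.
metadata = DictSource({"sample-1": {"organism": "E. coli", "country": "DE"}})
sequences = CallableSource(
    lambda key: {"sequence": "ACGT"} if key == "sample-1" else None
)

organism = find(metadata, "sample-1")["organism"]
sequence = find(sequences, "sample-1")["sequence"]
```

The calling code only ever touches `find`, so swapping the in-memory stand-ins for real clients would not change how records are requested.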

Toothpick’s architecture supports the combination of data from multiple sources into a single model that behaves as a unified object. Toothpick also allows for associations between models, validations, and callback hooks; no schema is required, although structural definitions may be provided in the model or enforced through validations.
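To make the idea of a single model backed by several sources more concrete, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not Toothpick's implementation or API: the `Model` class, its `validators` and `after_load` lists, and the two source functions are all hypothetical, and the data is made up.

```python
from typing import Any, Callable, Dict, List, Optional

Fields = Dict[str, Any]


class Model:
    """Hypothetical model that merges records from several sources."""

    def __init__(self, key: str, sources: List[Callable[[str], Fields]]) -> None:
        self.key = key
        self._sources = sources
        self.fields: Fields = {}
        # Each validator returns an error message, or None if the check passes.
        self.validators: List[Callable[[Fields], Optional[str]]] = []
        # Callbacks that run after a successful, validated load.
        self.after_load: List[Callable[["Model"], None]] = []

    def errors(self) -> List[str]:
        messages = (check(self.fields) for check in self.validators)
        return [message for message in messages if message]

    def load(self) -> "Model":
        # Later sources win when two sources supply the same field name.
        for fetch in self._sources:
            self.fields.update(fetch(self.key))
        problems = self.errors()
        if problems:
            raise ValueError("; ".join(problems))
        for hook in self.after_load:
            hook(self)
        return self


def metadata_source(key: str) -> Fields:
    """Invented stand-in for project metadata from a document store."""
    return {"sample_id": key, "host": "human"}


def sequence_source(key: str) -> Fields:
    """Invented stand-in for sequence data from a web service."""
    return {"sequence": "ACGTAC"}


def require_sequence(fields: Fields) -> Optional[str]:
    # No schema is imposed; structure is enforced only by this validation.
    return None if fields.get("sequence") else "sequence is required"


def add_length(model: Model) -> None:
    model.fields["length"] = len(model.fields["sequence"])


isolate = Model("isolate-1", [metadata_source, sequence_source])
isolate.validators.append(require_sequence)
isolate.after_load.append(add_length)
isolate.load()
```

After `load()`, `isolate.fields` holds the metadata and the sequence side by side, plus the `length` value added by the callback. A model built from `metadata_source` alone would fail the `require_sequence` check and raise `ValueError`.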

Here we introduce the architecture, rationale, and design goals behind the Toothpick library, show examples of where it can be more useful than traditional ORM libraries, and demonstrate how we use it at the Broad Institute to answer real-world questions about antibiotic resistance in the bacterium that causes tuberculosis, the pathogenicity of a sudden E. coli outbreak in Europe, and the microbial communities found in and on the human body.

This project has been funded in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN272200900018C.