
Nanometer-scale Pandas: A Data Science Approach to Structural Biology

Christopher Ing

Audience level:


The dynamic motions of biomolecules like DNA and proteins are of primary interest in the study of health and medicine. By performing novel analysis on time series data from supercomputer-generated models, I can test scientific hypotheses with a statistical certainty rarely seen in this field. I apply this process to an example protein and discuss how it generalizes to a high-throughput workflow.


The dynamic motions of biomolecules like DNA, RNA, and proteins are integral to our understanding of drugs, disease, and human health. By performing expensive supercomputer simulations, one can model the motion of proteins in a controlled manner not possible in a test tube or petri dish. Data science and statistical approaches are ubiquitous in genomics, which deals primarily with large genetic sequence or expression datasets, but they are infrequently used in structural biology. This is changing with advances in GPU computing and supercomputer availability. In this work I present a novel approach, documented in the open-source repository ["PandasMD"][2] (New repo commit imminent!), to analyze time series data derived from supercomputer simulations by leveraging the functionality of Pandas. By applying approaches commonly used in data science, I am able to perform meta-analysis across different biological models and across different simulation repeats. This lets me test scientific hypotheses with improved statistical certainty over the traditional analysis conducted in this field. I demonstrate the approach on an example protein, a voltage-gated ion channel found in human neurons, in an application that culminated in a [peer-reviewed publication][3]. I also discuss how the approach generalizes to a "high-throughput" methodology.

[1]:
[2]:
[3]:
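As a rough illustration of the kind of analysis across models and simulation repeats described above, the sketch below uses plain Pandas on synthetic data. It is not taken from PandasMD: the column names (`model`, `repeat`, `time_ns`, `distance_nm`), the model labels, and the random values are all assumptions made for the example. Each repeat is first reduced to a single average, since frames within one trajectory are correlated, and the repeats are then summarized per model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical tidy table: one row per frame, per repeat, per model.
# Real values would come from simulation trajectories; these are synthetic.
frames = []
for model, centre in [("wild_type", 1.0), ("mutant", 1.5)]:
    for repeat in range(4):
        frames.append(pd.DataFrame({
            "model": model,
            "repeat": repeat,
            "time_ns": np.arange(100) * 0.1,
            "distance_nm": rng.normal(centre, 0.2, size=100),
        }))
traj = pd.concat(frames, ignore_index=True)

# Step 1: average within each repeat, treating a repeat as one sample.
per_repeat = traj.groupby(["model", "repeat"])["distance_nm"].mean()

# Step 2: summarize across repeats for each model
# (mean, standard error of the mean, number of repeats).
summary = per_repeat.groupby(level="model").agg(["mean", "sem", "count"])
print(summary)
```

Keeping every frame in one long table means that adding another model or another repeat only adds rows, and the same two `groupby` steps apply unchanged.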