Sunday 10 a.m.–1 p.m. in

BAMnostic: an OS-agnostic port of genomic sequence analysis

Marcus Sherman

Description

As genome sequencing and testing get [cheaper](https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/) and become more [mainstream](http://bgr.com/2017/11/27/23andme-dna-test-price-drop-amazon-cyber-monday/), the amount of data being generated is staggering. As in other scientific fields, **Python** has become one of the predominant programming languages used to process such data. What most people do not know is that a majority of genome analytics can be boiled down to clever string comparison and matching algorithms. The caveat is that a single file can be ≥**300 Gb** in its [compressed binary encoded format](https://samtools.github.io/hts-specs/SAMv1.pdf). A high-throughput sequencing library ([htslib](https://github.com/samtools/htslib)) was developed to establish a standard encoding and compression schema that gives researchers random access to these large files. As it stands, htslib is the industry standard in the realm of genomics. One of the most popular Python libraries for handling genomic data ([PySAM](http://pysam.readthedocs.io/en/latest/)) is essentially a wrapper for htslib. As widely used as both htslib and PySAM are among developers, a large contingent of users (both end users and developers) is excluded simply because htslib and PySAM do not support [***Windows***](https://github.com/pysam-developers/pysam/issues/575) environments outside of contrived builds and dependencies that many end users would not be willing to implement. To overcome this issue, pure Python ports of the random access, unpacking, and decoding components of htslib were developed as a lightweight toolkit called **BAMnostic**. BAMnostic was developed to be a drop-in alternative for a majority of PySAM's workload when working in a Windows environment or in projects that require an OS-agnostic approach. As a drop-in, it retains the same interface as PySAM for each of its supported functions.
This interface also provides a means of simple extensibility for machine learning and statistical analysis through libraries such as [TensorFlow](https://www.tensorflow.org/api_docs/python/), [scikit-learn](http://scikit-learn.org/stable/), and [statsmodels](http://www.statsmodels.org/stable/index.html). Additionally, it makes piping desired data into data visualization libraries, such as [Plotly](https://plot.ly/), a simple task. Lastly, because it is pure Python, it can be easily embedded into a socketed [Flask](http://flask.pocoo.org/) web server or used as a standalone application.
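
To make the "random access, unpacking, and decoding" steps more concrete, here is a minimal sketch using only the Python standard library. It is not BAMnostic's actual code: the function names (`make_bgzf_block`, `split_virtual_offset`, `read_at`, `decode_sequence`) are hypothetical and chosen for illustration. It assumes the block layout and 4-bit base encoding described in the SAM/BAM specification: a BGZF file is a series of small gzip blocks, a virtual offset packs the block's start position and the position inside the inflated block into one integer, and read sequences are stored two bases per byte.

```python
import struct
import zlib
from typing import Tuple

# 4-bit base codes as listed in the SAM/BAM specification.
SEQ_CODES = "=ACMGRSVTWYHKDBN"


def make_bgzf_block(data: bytes) -> bytes:
    """Build one BGZF block (illustrative helper, used here to create test data)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    cdata = compressor.compress(data) + compressor.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block size minus one
    # Fixed gzip header: magic bytes, deflate, FEXTRA flag, mtime, xfl, os, xlen
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # Extra subfield 'B','C' with a 2-byte payload holding the block size
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


def split_virtual_offset(voffset: int) -> Tuple[int, int]:
    """Split a virtual offset into (block start in file, offset in inflated block)."""
    return voffset >> 16, voffset & 0xFFFF


def read_at(stream: bytes, voffset: int, size: int) -> bytes:
    """Random access: inflate only the block that the virtual offset points to."""
    coffset, uoffset = split_virtual_offset(voffset)
    bsize = struct.unpack_from("<H", stream, coffset + 16)[0]
    block = stream[coffset:coffset + bsize + 1]
    inflated = zlib.decompress(block[18:-8], -15)  # skip 18-byte header, 8-byte trailer
    return inflated[uoffset:uoffset + size]


def decode_sequence(packed: bytes, length: int) -> str:
    """Unpack a 4-bit encoded read sequence of `length` bases."""
    bases = []
    for byte in packed:
        bases.append(SEQ_CODES[byte >> 4])    # high nibble holds the earlier base
        bases.append(SEQ_CODES[byte & 0x0F])  # low nibble holds the later base
    return "".join(bases[:length])


# Two blocks back to back; jump straight into the second one.
first = make_bgzf_block(b"ACGT" * 10)
second = make_bgzf_block(b"read two payload")
stream = first + second
voffset = (len(first) << 16) | 5
print(read_at(stream, voffset, 3))               # b'two'
print(decode_sequence(bytes([0x12, 0x48]), 4))   # ACGT
```

Because each block is compressed independently, a reader can seek to a block boundary and inflate only that block, which is what makes random access to very large files practical without a compiled dependency.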