PyCon 2019 in Cleveland, Ohio

Sunday 10 a.m.–1 p.m. in Expo Hall

Unsupervised Clustering of Cancer Types from Gene Expressions using Variational Autoencoders

Shruthi Ravichandran

Description

<p> While many other diseases are relatively predictable and treatable, cancer is very diverse and unpredictable, making diagnosis, treatment, and control extremely difficult. Traditional methods try to treat cancer based on the organ of origin in the body, such as breast or brain cancer, but this type of classification is often inadequate. If we are able to identify cancers based on their gene expressions, there is hope to find better medicines and treatment methods. However, gene expression data is so vast that humans cannot detect such patterns. In this project, the approach is to apply unsupervised deep learning to automatically identify cancer subtypes. In addition, we seek to organize patients based on their gene expression similarities, in order to make the recognition of similar patients easier. </p> <p>While traditional clustering algorithms use nearest neighbor methods and linear mappings, we use a recently developed technique called Variational Autoencoding (VAE) that can automatically find clinically meaningful patterns and therefore find clusters that have medicinal significance. Python-based deep learning framework, Keras, offers an elegant way of defining such a VAE model, training, and applying it. In this work, the data of 11,000 patients across 32 different cancer types was retrieved from The Cancer Genome Atlas. A VAE was used to compress 5000 dimensions into 100 clinically meaningful dimensions. Then, the data was reduced to two dimensions for visualization using tSNE (t-distributed stochastic neighbor embedding). Finally, an interactive Javascript scatter plot was created. We noticed that the VAE representation correctly clustered existing types, identified new subtypes, and pointed to similarities across cancer types. This interactive plot of patient data also allows the study of nearest patients, and when a classification task was created to validate the accuracy of the representation, it achieved 98% accuracy. The hope is that this tool will allow doctors to quickly identify specific subtypes of cancer found using gene expression and allow for further study into treatments provided to other patients who had similar gene expressions. </p>