A Deep Learning Approach to Genomics Data for Population Scale Clustering and Ethnicity Prediction
Refereed Conference Meeting Proceeding
Understanding variations in genome sequences helps us identify people who are predisposed to common diseases, solve rare diseases, and assign individuals to their population group within a larger population. Although classical machine learning techniques allow researchers to identify groups or clusters of related variables, the accuracy and effectiveness of these methods diminish for large, high-dimensional datasets such as the whole human genome. Deep learning (DL), on the other hand, can learn better representations of large-scale datasets and build models on those representations. Furthermore, Semantic Web (SW) technologies have already proved useful in life science research for large-scale data integration and querying, so standardized public data created using SW plays an increasingly important role in the field. In this paper, we propose a novel and scalable genomic data analysis approach to population-scale clustering and geographic ethnicity prediction using SW and DL-based techniques. We used genotype data from the 1000 Genomes Project, obtained by whole-genome sequencing of 2504 individuals from 26 ethnic origins and comprising 84 million variants. Experimental results in terms of accuracy and scalability show that the approach is effective and outperforms the state of the art. In particular, our deep-learning-based analytics technique, using classification and clustering algorithms, can predict targeted populations with an accuracy of 98% and group them with an adjusted Rand index (ARI) of 0.92.
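For readers unfamiliar with the clustering metric quoted in the abstract, the following is a minimal sketch of how an adjusted Rand index can be computed from two labelings of the same samples. It is not the authors' evaluation code; the function name `adjusted_rand_index` and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no sample pairs to compare; treat as trivial agreement

    # Contingency counts: samples sharing a (true group, predicted cluster) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of sample pairs placed together in both, or in each, labeling.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Toy example: cluster ids differ from the group names, but the grouping agrees.
true_groups = ["A", "A", "B", "B"]
clusters = [1, 1, 0, 0]
ari_same = adjusted_rand_index(true_groups, clusters)
ari_diff = adjusted_rand_index(true_groups, [0, 1, 0, 1])
```

The index depends only on which samples are grouped together, not on the label names, and it corrects for agreement expected by chance, so a score near 0 indicates a grouping no better than random and a score of 1 indicates identical groupings.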
Semantic Web solutions for large-scale biomedical data analytics (SeWeBMeDA)
Digital Object Identifier (DOI):
National University of Ireland, Galway (NUIG)
Open access repository: