Insight Health: Using data analysis to interpret the human genome

Submitted on Thursday, 17/02/2022

The human genome is a treasure trove of information about human diseases – why they occur and how they progress – but it is a hugely challenging area of research because so much information is contained in the genome. This is the challenge Insight@UCD PhD candidate Jinbo Zhao is working on.

One of the primary goals of human genetics research is to identify which DNA sequence variants have an impact on the onset and progression of human diseases like cancer and diabetes. When people develop these diseases, and what course those diseases take, is influenced by a combination of environmental, lifestyle, and genomic factors. Each of these has varying effects and interacts with the others in complex ways.

The challenge is that the human genome is enormous from an analytics perspective. It is composed of a series of nucleic acid sequences, comprising approximately 3.2 billion nucleotides of DNA. Whole-genome sequencing of a research cohort with a large sample size can be very expensive. Instead, academics commonly use genome-wide genotyping of genetic variants, which measures differences in individual DNA building blocks, called single nucleotide polymorphisms (SNPs).

Statistical genomicists build statistical models to detect associations between these SNPs and diseases. Over 10,000 significant associations between SNPs and diseases have been detected so far, with most variants having only small effects on the development of diseases. Polygenic risk scores (PRS) combine those small, but statistically significant, variant effects into a single score that summarises a person's inherited predisposition to a trait or disease. PRSs for many common diseases have been developed and have demonstrated potential for risk prediction. However, so far, individual-level risk prediction using PRS is not accurate enough to be clinically useful.
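To make the idea of combining many small variant effects concrete, here is a minimal sketch in Python. It assumes the common additive form of a PRS: each SNP's effect size multiplied by the number of risk alleles a person carries (0, 1 or 2), summed over SNPs. The SNP identifiers, effect sizes and genotypes are invented for illustration and do not come from any real study.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         effect_sizes: Dict[str, float]) -> float:
    """Sum effect-size-weighted allele dosages over SNPs present in both inputs.

    dosages: number of risk alleles per SNP for one person (0, 1 or 2).
    effect_sizes: per-SNP weights, e.g. log odds ratios from an association study.
    """
    return sum(weight * dosages[snp]
               for snp, weight in effect_sizes.items()
               if snp in dosages)


# Hypothetical effect sizes and genotypes, for illustration only.
effect_sizes = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}
person = {"rs0001": 2, "rs0002": 1, "rs0003": 0}

score = polygenic_risk_score(person, effect_sizes)
print(round(score, 2))
```

In practice the choice of which SNPs to include and how to weight them differs between PRS methods, which is one reason their predictive accuracy varies.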

Our research team aims to explore scenarios in which individual-level PRS are not clinically relevant, but in which aggregate PRS statistics are potentially useful. We compare how accurately existing PRS strategies stratify individuals into risk categories and then assess which factors that accuracy depends upon. When doing this, we take into account how representative the samples are of the population, the disease prevalence, the heritability and the accuracy of the polygenic scores. We carry out survival analysis with and without PRS after including traditional risk factors in the model, with the aim of identifying the conditions under which cohort-level information is of use in identifying effective early intervention strategies. Our analysis is conducted using data from participants in the UK Biobank data set. The UK Biobank project is an open-resource cohort study of approximately 500,000 individuals across the UK, recruited between 2006 and 2010, with each participant’s genetic data containing over 93 million SNPs and other types of genetic variants.
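As a rough illustration of what stratifying individuals into risk categories can look like, the sketch below ranks people by PRS, splits them into equal-sized groups and reports the observed proportion of cases in each group. This is a simplified stand-in rather than our actual pipeline: it does not perform survival analysis or adjust for traditional risk factors, and the scores and case labels are made up.

```python
from typing import List, Sequence


def stratify_by_prs(scores: Sequence[float], n_groups: int) -> List[int]:
    """Assign each person a risk group from 0 (lowest PRS) to n_groups - 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, idx in enumerate(order):
        # Integer arithmetic maps ranks onto equal-sized groups.
        groups[idx] = rank * n_groups // len(scores)
    return groups


def case_rate_by_group(groups: Sequence[int], is_case: Sequence[int],
                       n_groups: int) -> List[float]:
    """Observed proportion of cases within each risk group."""
    totals = [0] * n_groups
    cases = [0] * n_groups
    for group, case in zip(groups, is_case):
        totals[group] += 1
        cases[group] += int(case)
    return [cases[g] / totals[g] if totals[g] else float("nan")
            for g in range(n_groups)]


# Hypothetical scores and disease status (1 = case), for illustration only.
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]
is_case = [0, 1, 0, 1, 0, 0]

groups = stratify_by_prs(scores, n_groups=3)
rates = case_rate_by_group(groups, is_case, n_groups=3)
print(groups)  # [0, 2, 1, 1, 0, 2]
print(rates)   # [0.0, 0.5, 0.5]
```

Even when a score is too noisy to predict what will happen to one person, differences in case rates between groups like these are the kind of aggregate statistic that may still inform early intervention at cohort level.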