Developing computational strategies for assembly of heterozygous DNA sequence data

Term: 
2018-2019 Fall
Faculty Department of Project Supervisor: 
Sabancı University Nanotechnology Research and Application Center (SU-NUM)
Number of Students: 
2

Rapid technical improvements in high-throughput DNA sequencing have made it possible to obtain large amounts of genetic sequence data for almost any biological organism. However, all current sequencing technologies require that the cellular DNA first be split into thousands or millions of small fragments, each of which is sequenced individually. These sequence fragments (called "reads") must then be compared, their overlapping regions identified, and the reads merged to recover the original genome sequence. This process is called "genome assembly" and presents a significant computational challenge, especially given the presence of errors in the raw read data.
A number of specialized programs have been developed for sequence assembly, but in most cases they assume that the DNA sequence being reassembled is essentially haploid (that is, each element is present in a single copy). In reality, the majority of eukaryotic organisms are diploid or polyploid, with two or more copies of each chromosome. A single sample therefore contains multiple copies of each gene, which are often non-identical. Separating these different copies (alleles) and determining which of them originated from the same DNA strand (haplotypes) is essential for answering a number of important biological questions, such as how traits are inherited and what the genetic basis of a disease is.
In this project, students will work on a large whole-genome sequencing dataset from hazelnut (Corylus avellana), an agriculturally important tree species with a diploid genome that is believed to be highly heterozygous. Initial assemblies of this dataset with existing tools are significantly larger than the expected genome size and contain a large number of duplicated elements.
The project will involve using existing tools and developing new strategies to address heterozygosity and produce a more complete and realistic genome assembly.
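
To make the overlap-and-merge idea concrete, the following is a minimal sketch of a greedy overlap assembler in Python. It is only a toy illustration and not the method used in the project or by any particular assembly tool: the function names `overlap_length` and `greedy_assemble`, the `min_len` threshold, and the short example reads are all hypothetical, and the sketch assumes error-free reads from a single strand. The second example suggests one way heterozygosity could inflate an assembly: reads from two alleles that differ by a single base do not overlap across the variant site, so they are assembled into two separate contigs instead of one.

```python
def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the list of contigs left when no overlap of at least
    `min_len` bases remains. Toy version: exact matches only, no
    reverse complements, no error correction.
    """
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap_length(a, b, min_len)
                if n > best_n:
                    best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


# Haploid-like case: three overlapping reads collapse into one contig.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))

# Heterozygous case: two alleles differing at one base (G vs T)
# stay separate, so the total assembled length doubles.
print(sorted(greedy_assemble(["ATGGCGT", "GCGTACC", "ATGGCTT", "GCTTACC"])))
```

In the second call the 10-base region is represented twice in the output (20 bases in total), which mirrors, on a very small scale, an assembly that is larger than the expected genome size and contains duplicated elements.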

Related Areas of Project: 
Computer Science and Engineering
Molecular Biology, Genetics and Bioengineering

About Project Supervisors