Determining the DNA sequence, a billion-dollar logic puzzle

James A. Yorke, Institute for Physical Sciences and Technology, University of Maryland, College Park

The genome of an individual is the collection of DNA in each of that individual's cells. It can be expressed as one or more sequences of the letters A, C, G, and T. A mammalian genome has about 3 billion letters, while a bacterial genome has a couple of million. The dominant method used for determining the sequence is called whole genome shotgun assembly. Using this method, the National Institutes of Health has spent about one billion dollars determining the genomes of many species in the past five years. Some parts of a genome turn out to be easier to determine than others, but overall each genome becomes a giant jigsaw puzzle. At the University of Maryland, we try to find techniques for solving as much of the puzzle as possible. The most difficult parts of the puzzle to assemble are often the parts that have been mutating the most over the past few million years. We are also trying to determine the patterns of repeats.
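
To give a feel for the jigsaw puzzle, here is a toy sketch of the idea behind shotgun assembly: short overlapping reads are repeatedly merged at their longest overlap. It is only an illustration under simplifying assumptions (error-free reads, one strand, no repeats) and is not the method used by any production assembler. The function names, the example reads, and the min_len parameter are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_len letters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy greedy assembler: merge the best-overlapping pair until none remain.

    Assumes error-free reads from a single strand; returns a list of contigs.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Drop any read that is wholly contained in another one.
        reads = [r for r in reads
                 if not any(r != s and r in s for s in reads)]
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]


# Three made-up overlapping reads taken from a made-up 19-letter sequence.
example_reads = ["AATCGTTAGC", "ATGGCGTG", "CGTGCAATC"]
print(greedy_assemble(example_reads))
```

With repeats, the longest overlap is no longer a reliable guide: two reads can overlap perfectly yet come from different copies of a repeat, which is one reason repeated regions are among the hardest parts of the puzzle.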