Genome Assembly, Chinese Postman, and Virtual Clusters

Michael Brudno

Google Tech Talk, May 2007



In the first half of the talk I will discuss several algorithmic results affecting whole genome assembly -- the problem of assembling a genome from short pieces (called reads). This problem is often reduced to various path problems on graphs. We will first show that the problem of finding the Shortest Chinese Superwalk on a de Bruijn graph is NP-complete/hard, hence demonstrating the computational equivalence of two methods for sequenceassembly: the String Graph approach of Myers et al and the Eulerian Superpath approach of Pevzner et al. We will also demonstrate a simple polynomial time algorithm separating complimentary paths on a graph, which is equivalent to finding a path when reads on a genome come from either of the two DNA strands, thus solving a problem left open by Pevzner and colleagues in 2001. If we have time we will show algorithms for assembling the recently published genome of the sea squirt (Ciona savignyi) and the results of the assembly. Afterwards I will switch topics and demonstrate the use of virtual cluster technology for bioinformatics (and other embarassingly parallelizable) software development. A virtual cluster is a "cluster" of virtual machines, connected by a virtual network which shares the underlying physical machines with other virtual clusters. Furthermore the user of the virtual clusters may be located at a remote site, and demand low latency connections to a large data source during computation and dynamic interaction with the results. We will argue that virtualisation is an effective technique to allow for easy software development, debugging, and deployment in the context of common resources and large datasets.