Wednesday, February 23, 2005

You can see an awful lot, just by looking

(With apologies to Yogi Berra)

The genome sequencing efforts are flooding public databases with raw data. Managing what has already been logged, let alone coping with what is still coming in, is becoming a real information science problem. Yesterday I noticed a link on this topic about researchers at The Institute for Genomic Research, or TIGR. These scientists have been doing what TIGRs love best (couldn't resist the pun), which is to hunt for new microbial sequences. But their hunt took them to an unexpected place: archival data generated by the fruit fly genome projects.

The bacteria that they identified are endosymbionts of flies, and the bacterial DNA was probably a contaminant in the fly DNA samples used for sequencing. This is easy to understand, as the enzymatic procedures involved in sequencing do not distinguish the source of the DNA. What is surprising is the extent to which these microbes were sequenced, essentially by accident. One strain was represented with 95% coverage, equivalent to a first-pass shotgun run! Ten years ago, this would have been notable in itself. In fact, the coverage was good enough to allow detection of a few horizontal gene transfer events, and to compare the three new species with previously described endosymbiont bacteria.
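
For anyone wondering what a figure like "95% coverage" means in practice, here is a toy sketch in Python. It assumes the reads have already been aligned to a reference genome and reduced to (start, end) intervals; the function name and the example numbers are made up for illustration, and this is not a description of how the TIGR group did their analysis.

```python
def breadth_of_coverage(genome_length, intervals):
    """Fraction of positions in [0, genome_length) hit by at least one read.

    intervals: iterable of half-open (start, end) alignments (hypothetical input).
    """
    covered = 0
    current_end = 0  # rightmost position already counted
    for start, end in sorted(intervals):
        # Count only the part of this read that extends past what is already covered.
        start = max(start, current_end, 0)
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


# Three overlapping "reads" on an imaginary 100-base genome, leaving a gap at 70-75.
reads = [(0, 40), (30, 70), (75, 100)]
print(breadth_of_coverage(100, reads))
```

Overlapping reads are counted only once, so the result is the share of the genome seen at all, not the sequencing depth.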

So, to get back to the title of this post, a computer is getting to be as much an experimental tool as a pipetter. Have a look!

