Monday, February 07, 2005

Genomics is hard, part II

There's a nice synopsis in PLoS Biology about measuring the power of comparative sequence analysis in decoding new genomes. Basically, the mammalian genomes already sequenced confer a great deal of power to this type of analysis, but the choice of the next genomes will affect how quickly that power grows. The measure described in the article can help set sequencing priorities, especially in an era when resources are limited.

The first line of attack in understanding any newly sequenced genome is to identify sequences which are conserved between your new genome and some related organism (e.g. mouse-human). However, there is a law of diminishing returns at work, especially as you try to determine the importance of very small features of the new genome. The availability of a single reasonably close (but not too close) relative allows you to unambiguously score conservation of features down to around 50 base pairs (already incredibly small!). But to go smaller, down to 5 base pairs, you need 5 genomes, ideally from species which all diverged at the same evolutionary moment.
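You can get an intuition for why more genomes buy you smaller detectable features with a toy calculation (this is my own back-of-the-envelope model, not the one in the paper): if a neutral base matches by chance with some probability p in each informant genome, then a run of L bases conserved across n genomes occurs by chance with probability p^(n*L), and you need that to be rare across the whole genome. Here's a quick sketch:

```python
import math

def min_detectable_feature(genome_size, n_genomes, p_neutral=0.5):
    """Smallest feature length L (bp) whose perfect conservation across
    n_genomes informant genomes is unlikely to arise by chance anywhere
    in a genome of genome_size bases.

    Toy neutral model (an assumption, not the paper's measure): each
    base matches by chance with probability p_neutral, independently
    per base and per genome. We require the expected number of chance
    hits, genome_size * p_neutral**(n_genomes * L), to be below 1,
    and solve for L.
    """
    return math.log(genome_size) / (n_genomes * math.log(1.0 / p_neutral))

# Detectable feature size shrinks roughly as 1/n with more genomes:
for n in (1, 2, 5, 10):
    print(n, round(min_detectable_feature(3e9, n), 1))
```

With a 3-gigabase genome this gives roughly 30 bp for one informant genome and about 6 bp for five, which at least matches the qualitative shape of the numbers above: one relative gets you coarse features, and resolving the smallest ones takes several well-chosen genomes.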
So the take-home message (public-access article) is that successful identification of "big" features (the author gives a measure for bigness) can be achieved with a relatively small number of genomes, but going after the "smallest" features requires asymptotically more effort, with the exact position of the curve depending heavily on the evolutionary distance among the input genomes. At one point the analysis talks about 130 genomes. I wonder if that would ever happen...
