Genomic DNA sequencing has been following its own version of Moore's law: the price is dropping while the quality (read length) is increasing. (Meanwhile, methods are also being developed to assemble shorter reads.) The result is a jaw-dropping increase in incoming data, comparable to particle physics. Take a look at the physical plant needs of the Sanger Institute.
Also racing along, but not exponentially, are the methods for analyzing the raw sequence. In particular, although many genes are known, there are few reliable methods for predicting genes directly from sequence. One complication is pseudogenes, which were likely once functional and thus retain many of the sequence features expected of genes. Another is that, although the start of a gene has reasonably well-shared features, eukaryotic messenger RNAs are spliced together from pieces that can be far apart, and predicting the exact splice sites has been very hard.
A recent set of papers from the Brent lab (an open-access article here
describes analysis of chicken genomic data) has evaluated TWINSCAN,
a computer method with specific improvements in splice site prediction. TWINSCAN's improvements derive from side-by-side comparison of two related organisms, since splice sites are likely to be preserved. The approach is sensitive to exactly how related the two organisms are: too close, and too much sequence will be identical; too far, and the genes will have diverged from each other. What is important here is that the main information is the genomic sequences themselves, not accessory databases. It's all sitting on the hard drive. The level of computation required is daunting, but computation has in the best case the capability to scale up, whereas bench annotation still has to be done by us linear humans.
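To make the comparative idea concrete, here is a toy sketch (emphatically not the TWINSCAN algorithm, which uses a full probabilistic gene model over genome alignments): find candidate splice donor sites in one genome via the nearly universal "GT" dinucleotide that begins an intron, and boost their score when the same dinucleotide appears at the aligned position in a related species. The sequences and scoring scheme below are invented purely for illustration.

```python
# Toy illustration of comparative splice-site scoring.
# Assumes the two sequences are already aligned position-for-position;
# real pipelines work from whole-genome alignments, not raw strings.

def candidate_donors(seq):
    """Return positions where the canonical donor dinucleotide 'GT' occurs."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]

def score_donors(target, informant):
    """Score candidate donor sites in `target`; conserved sites score higher."""
    scores = {}
    for pos in candidate_donors(target):
        score = 1.0  # base score for a canonical GT in the target genome
        if informant[pos:pos + 2] == "GT":
            score += 1.0  # bonus: the related genome preserves the site
        scores[pos] = score
    return scores

# Two made-up, pre-aligned sequences from hypothetical related species:
target    = "ATGGTAAGTCCGTACGGTAA"
informant = "ATGGTAAGACCCTACGGTAA"
print(score_donors(target, informant))
# Conserved GT sites outrank target-only ones.
```

The point of the sketch is just the payoff of the two-genome comparison: GT occurs by chance all over a genome, but a GT preserved at the same aligned position in a suitably diverged relative is much more likely to be a real splice site.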
Once a hit comes up from TWINSCAN, the Brent lab confirms that the predicted sequence is actually made into mRNA. They used RT-PCR, a variant of PCR used for detection of RNA.
Their best case, which is what caught my eye, was comparing C. elegans to C. briggsae.
These are two species of nematodes studied by geneticists. In this case, the TWINSCAN side-by-side computation on raw genomic data yielded predictions corresponding to 60% of the known genes, and predicted additional regions (later confirmed) that had not previously been identified as genes.
Although TWINSCAN is efficient, the Brent lab themselves seem to use more than one program (ENSEMBL etc.) to predict genes, with the highest confidence given to genes predicted by more than one analysis. A particular question I didn't see addressed by TWINSCAN is the issue of splicing variants, in which different series of snips are stitched together from the same DNA sequence. On the whole, though, I think these are interesting days for gene-hunters. There's quite a lot to be found by hunting the raw reads already sitting in the public databases.