Wednesday, March 16, 2005

Too much information

An editorial in PLOS Medicine calls for reconsideration of how biologists publish work involving large datasets. Although this problem of data management is becoming widespread, the focus of the editorial (and they are really unsparing!) is the frequent poor presentation of microarray data from clinical trials, where, to quote the editorial, "studies involving large datasets ... are so poorly reported ... that many are not reproducible." In a later section they say "published studies are far from the level of evidence that would be accepted for virtually any other medical test."

Given this level of alarm, I wish they would give an example of a high-profile association (i.e., one that received media attention) that fell through.

There appear to be two sets of issues in the editorial. The first set consists of tasks made harder by big data sets: analysis by authors, review by reviewers and editors, display by journals (which often resort to extensive web supplements that are never printed) and digestion by readers. The second set lands on the backs of the researchers themselves, and seems, in aggregate, to be that people are doing the experiments without adequate training. Both sets of issues are endemic in the array-using biology world, so there's no need to trot out stereotypes of M.D.s burning down the lab. (Is there?) Still, a false clinical association could put someone's health at risk.

PLOS, which is mainly a web journal, wants publications to include everything in an interactive E-document, from "primary data to statistical methods, figures and derived data together with textual documentation." That sounds a little extreme, but they do point to such a document already in existence. I can't help thinking, though-- haven't physicists already confronted the issue of data management? Aren't netheads designing data mining software?

I am a consumer of this kind of data, and I don't think an interactive document would help me much. In my own experience with "gene-gazing," I tend to go through the raw tables in the Supplementary Info, and, if the experimental design was especially relevant, I cook up the raw numbers myself. This is because I am an "old biologist"-- actually, literally pretty aged-- and my experiments are centered around 1:1 or 1:several interactions. Maybe I should call myself a hedgehog, compared favorably to the idiot foxes.
