Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise. But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.
The argument, it seems, is one of induction on steroids: with so many data points, creating a model is unnecessary because computational pattern finding and statistical analysis make the data predictive on their own. "With enough data, the numbers," he writes, "speak for themselves."
To illustrate this point, Anderson uses Google as an example. Google's algorithm doesn't care why one page is ranked higher than another; all that matters is that the math says it is. This, of course, is a red herring. Google's algorithm is the model, and the company's continued dominance in search is the successful test. As a technology company, Google probably doesn't care what makes a page relevant, as long as its model continues to reflect what people are looking for.
A more relevant example used in the article is Craig Venter. Lamenting that our knowledge of biology and biochemistry is becoming too complex to model and predict, Anderson points to Venter's ocean and air sequencing projects as examples of science without hypotheses.
If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.
This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.
There's no doubt that Venter has made some major contributions to biology (and while he's pretty badass, I don't know if anybody considers him the greatest scientist who ever lived, as some do Darwin), but is his work really done without the scientific method? Of course not. Whatever genomes he sequences will be aligned and annotated, and gene functions will be hypothesized, all based on current theory. The massive amounts of data can be used to test evolutionary hypotheses; they can also be used to generate hypotheses.
Imagine that in the future, with all this wealth of personal genomic information available, somebody does the kind of pattern finding and statistical analysis the WIRED piece suggests and finds a novel mutation that is the cause of some disease. By itself, this tells us nothing about human biology or the etiology of the disease, and it doesn't suggest an intervention or treatment. Without the scientific method (hypothesis forming and testing), this is little more than trivia.
Science is a way of knowing, a way of exploring our world and learning about it. It's the way we test ideas, answer questions and advance technologies. Chris Anderson seems to see it as bookkeeping: simply cataloguing observations. He's certainly correct that the 'Petabyte Age' offers "huge amounts of data, along with the statistical tools to crunch these numbers", and this will undoubtedly be a powerful tool and an invaluable resource. But to say it marks the end of the scientific method is absurd. If anything, the vastness of data will provide new observations and new ideas, which is the beginning of the process, not the end. Rumours of the scientific method's death have been greatly exaggerated.
UPDATE: Good Math, Bad Math writes about the WIRED piece and large scale data analysis.