Friday, September 17, 2010

How to make your data significant

Anyone who's been in science long enough to either get a grasp of statistics, or alternatively figure out empirically how to game it, knows that given enough parameters and data points you will inevitably reach statistical significance on something. So what's the difference between a p-value of 0.051 and 0.049? Well, this guy sums it up with a good anecdote:

"About two years ago the Wall Street Journal (registration required) investigated the statistical practices of Boston Scientific, who had just introduced a new stent called the Taxus Liberté.

Boston Scientific did the proper study to show the stent worked, but analyzed their data using an unfamiliar test, which gave them a p-value of 0.049, which is statistically significant.

The WSJ re-examined the data, but used different tests (they used the same model). Their tests gave p-values from 0.051 to about 0.054; which are, by custom, not statistically significant.

Real money is involved, because if “significance” isn’t reached, Boston Scientific can’t sell their stents. But the WSJ is quibbling, because there is no real-life difference between 0.049 and 0.051. P-values do not answer the only question of interest: does the stent work?


Significance is vaguely meaningful only if both the model and the test being used are true and optimal. It gives no indication of the truth or falsity of any theory.

Statistical significance is easy to find in nearly any set of data. Remember that we can choose our model. If the first doesn’t give joy, pick another and it might. And we can keep going until one does."
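
To put a rough number on "keep going until one does": if you run m independent tests at the 0.05 level and nothing real is going on, the chance that at least one comes up significant is 1 - 0.95^m. The Python sketch below is only an illustration under stated assumptions. The data are pure noise, the 20 outcomes are independent (trying several models on the same data gives correlated tests, so the exact rate would differ), and the noise standard deviation is treated as known so that a simple z-test applies. The function names and the numbers 20 and 500 are made up for the example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def any_significant(rng, n_outcomes=20, n_per_group=20, alpha=0.05):
    """Simulate one study with no real effect; True if any outcome hits p < alpha."""
    for _ in range(n_outcomes):
        # Both groups come from the same distribution, so the null is true.
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        # Assumption: sigma = 1 is known, so the standard error is sqrt(2 / n).
        z = diff / math.sqrt(2 / n_per_group)
        if two_sided_p(z) < alpha:
            return True
    return False


def false_positive_rate(n_studies=500, seed=0):
    """Fraction of pure-noise studies that find at least one 'significant' outcome."""
    rng = random.Random(seed)
    hits = sum(any_significant(rng) for _ in range(n_studies))
    return hits / n_studies


expected = 1 - 0.95 ** 20      # analytic rate for 20 independent tests
observed = false_positive_rate()
print(f"analytic: {expected:.2f}, simulated: {observed:.2f}")
```

With 20 looks at pure noise, well over half of the simulated studies should find something to report, which is the quoted author's point.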


Rob said...

Stats like this always bother me because the quoted punchline is either, "The stent works" or "The stent doesn't work" instead of a more quantitative statement such as, "The stent might work but it isn't a large improvement". At a marginal p-value that seems like the best conclusion.

Kamel said...

Rob said "...quantitative statement such as, 'The stent might work but it isn't a large improvement'. At a marginal p-value that seems like the best conclusion"

But a P-value doesn't tell you that at all - it addresses the question of how likely a piece of data could have arisen by chance, not the magnitude of the difference in the comparison.

Clinically significant isn't the same as statistically significant. You can have very small differences that have little relevance to clinical practice but are highly unlikely to have arisen by chance (very small P-value) and vice versa.

The best conclusion for *any* P-value is "these results have a [P-value] chance of having arisen by chance." Or, put another way, "assuming the null hypothesis, we have a [P-value] probability of obtaining this result." (With, of course, the caveats of appropriate model and choice of statistical test)
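
A small sketch of the clinical-versus-statistical point, in Python. It assumes two equal-sized groups, normally distributed measurements, and a known standard deviation of 1, so a plain z-test applies. The effect sizes and sample sizes are invented for illustration and do not come from any real study.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_for_difference(mean_diff, n_per_group, sigma=1.0):
    """p-value for an observed difference in means between two equal groups.

    Assumption: sigma is known, so the standard error is sigma * sqrt(2 / n).
    """
    standard_error = sigma * math.sqrt(2 / n_per_group)
    return two_sided_p(mean_diff / standard_error)


# A negligible difference (0.02 standard deviations) in a huge sample.
p_tiny_effect = p_for_difference(mean_diff=0.02, n_per_group=100_000)

# A large difference (0.5 standard deviations) in a small sample.
p_big_effect = p_for_difference(mean_diff=0.5, n_per_group=10)

print(f"tiny effect, n = 100000 per group: p = {p_tiny_effect:.2g}")
print(f"big effect,  n = 10 per group:     p = {p_big_effect:.2g}")
```

Under these assumptions the negligible difference comes out far below 0.05 and the large one does not, so the P-value alone says nothing about how big the difference is.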

Kamel said...

(...waits for somebody with a better grasp of statistics to correct me...)

Anonymous Coward said...

I guess this guy's point is that there are many underlying assumptions to statistical tests that most "users" aren't aware of, and that statistical significance has limited "power" to test your hypothesis. It's a necessary evil, because often we have no better tool, but there is no biological difference between 0.049 and 0.051. So it should be even more tentative than your previous statement. Something like: "assuming the null hypothesis, we have a [P-value] probability of obtaining this result if our assumptions about the shape of the value distribution are correct and the correct test was administered."

Kamel said...

Even worse it should probably be "of obtaining this value of T" or something similar instead of "results" since these statistical comparisons are done on test statistics calculated from the data, not the results themselves. But my point was really addressing the notion that P-values say something about quantifying differences in compared data (how big a change/improvement/etc.) introduced in the first comment, not the OP.

I agree with the sentiment of the OP and your rewording of the conclusion phrasing. I'm sure we can refine it even further, though the point isn't so much about the convoluted language as the idea that scientists should have as good a grasp of the statistical methodology being used as they do of the experimental methodology. Courses in statistics tailored for the sciences should be available or even mandatory!

Rob said...

I didn't take a single statistics course during undergrad. I guess it shows. Your comment about statistics courses in science programs is dead on. (I took a course on differential equations instead of statistics. It has proven to be useless in comparison to what a statistics course could offer me.)

I think that I sometimes see data presented in a way which leads to my confusion. Small numerical differences are defended as biologically important because they are statistically significant.

Kamel said...

Yeah, I only took the single mandatory stats course (first year) in undergrad and kind of regret not doing more.

I remember going to the grad offices to register at UOttawa and trying to get them to let me take a stats/biostats course since I thought it would be very useful. They basically wouldn't let me. I guess I could have done it on my own dime (my tuition waiver wouldn't have applied since it wasn't a degree requirement) and time (on top of the regular course load and research requirements) but the way it all went down wasn't exactly encouraging.

"Small numerical differences are defended as biologically important because they are statistically significant."

This is definitely the case, and pretty much everybody does it. Very recently a reviewer caught a case where I had done this, which is maybe why I'm more sensitive about this stuff currently. I could probably go on for quite a while with my uninformed stats ranting.

The WSJ/Boston Scientific case is pretty interesting. I think the best thing it does is remind us that 0.05 is completely arbitrary as a cut-off for accepting/rejecting the null hypothesis and we shouldn't necessarily adhere to it so blindly and rigidly. It's also a good reminder of how dumb I am about stats.
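
For what it's worth, here is a rough Python sketch of how the same numbers can land on either side of 0.05 depending on the test. The counts are invented (15 events out of 100 versus 27 out of 100) and have nothing to do with the actual stent data. The two tests are a pooled two-proportion z-test with and without the Yates continuity correction, not whatever Boston Scientific or the WSJ used, and the function names are made up.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def two_proportion_p(events_a, n_a, events_b, n_b, continuity=False):
    """Pooled two-proportion z-test, optionally with Yates continuity correction."""
    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    diff = abs(rate_a - rate_b)
    if continuity:
        # Yates correction shrinks the observed difference slightly.
        diff = max(0.0, diff - 0.5 * (1 / n_a + 1 / n_b))
    return two_sided_p(diff / standard_error)


# Hypothetical counts, chosen only to illustrate the point.
p_plain = two_proportion_p(15, 100, 27, 100)
p_corrected = two_proportion_p(15, 100, 27, 100, continuity=True)

print(f"without correction: p = {p_plain:.3f}")
print(f"with correction:    p = {p_corrected:.3f}")
```

Both are standard textbook tests on identical data, yet one falls below the 0.05 line and the other above it.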