Tuesday, June 9, 2015

Ioannidis' "Why Most Published Research Findings Are False" - two kinds of 'false'

Ioannidis put out "Why Most Published Research Findings Are False" 10 years ago.

The title is provocative. False? Really? Oh my god, are bridges falling? (no). But are people having adverse reactions to this medication that doesn't really do anything? Very probably!

But really, what did Ioannidis mean by 'false'? Most people think of 'false' as the entire opposite of 'true' (well, there's very little denying that!). But what I mean is that 'false' is usually taken to mean that the entire opposite of the statement is true. And that subtlety makes a difference.

"This apple is red"... that is either true or false. It may be entirely green, in which case that statement is perfectly false. Or it may be mostly red with green spots. Or it may be partly green and partly red about the same area. Or maybe it is about to turn from green to red and is paradoxically half way between. You can judge for yourself how true or false each of those might be. But there is undoubtedly play or vagueness in those words, in 'red' and 'false, maybe even 'is'.

When it comes to experiments, the situation is about a number of things at once. 'All apples are red'. That is certainly not the case, because some are green. If there is at least one apple that is not red, then that statement is false (not only by everyday common sense, but by the stipulated mathematical/logical usage of quantifiers like 'all'). But scientifically, 'all apples are red' can be statistically justified (and it is accepted usage) if only a reasonably small number of them aren't red. That is, 'All apples are red, except for a few which don't really count'.

But that is my semantic analysis. It is entirely relevant to modern research, and gives a reasonable interpretation to the title of the paper. Most experimental research is trying to say something like "All X are Y, for the most part". Ioannidis, in his paper, is actually pursuing another definition of 'false'. Sorry, not another definition, but a perfectly good, indeed the best and most correct, definition of 'false'... in a particular context. In the interests of full disclosure, he uses it two ways: in the traditional context, and then also in a computational context.

How he uses 'false' is not terribly complex; it is very logical and supportable, I agree with it, and it is a useful way of using 'false'... but it's not what you expect. He takes the usual 2x2 statistical hypothesis testing paradigm with its Type I and Type II errors ("Apples are mostly red" vs "Apples are not mostly red (a non-negligible number are not red)" as competing hypotheses, tested against reality).
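To make that 2x2 setup concrete, here is a minimal sketch in Python (mine, not Ioannidis'), with entirely made-up apple counts: a one-sided binomial test of 'apples are mostly red' against a hypothetical sample.

    # A minimal sketch of the 2x2 hypothesis-testing setup, with made-up numbers.
    # H0: apples are NOT mostly red (true proportion of red apples <= 0.5)
    # H1: apples ARE mostly red (true proportion of red apples > 0.5)
    #
    #                       H0 actually true        H1 actually true
    #   test rejects H0     Type I error (alpha)    correct (power = 1 - beta)
    #   test keeps H0       correct                 Type II error (beta)

    from scipy.stats import binom

    n_apples = 100   # hypothetical sample size
    n_red = 62       # hypothetical count of red apples in the sample
    alpha = 0.05     # conventional significance cutoff

    # One-sided p-value: probability of seeing at least n_red red apples
    # if the true proportion of red apples were only 0.5.
    p_value = binom.sf(n_red - 1, n_apples, 0.5)

    print(f"p-value = {p_value:.4f}")
    if p_value < alpha:
        print("Reject H0: the sample supports 'apples are mostly red'.")
    else:
        print("Fail to reject H0.")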

What Ioannidis does is take the parameters of hypothesis testing: alpha, the probability of a false positive; beta, the probability of a false negative (1 - beta is the power, which also determines the number of instances required to have a good chance of detecting an effect if there is one); and R, the pre-study odds, i.e. the ratio of true relationships to no relationships among those tested in the field. From these he computes the PPV (positive predictive value), using very elementary and straightforward arithmetic (see his Table 1).


The PPV is TP/(TP+FP) = (1-beta)R/(R+alpha-beta*R). OK that's not 2+2, but it's not at all rocket science. He simplifies this considerably by setting alpha to be the usual cutoff for significance acceptability:

"Since usually the vast majority of investigators depend on α = 0.05, this means that a research finding is more likely true than false if (1 - β)R > 0.05."
He then goes on to show, in his Table 4, that for given beta and R (and bias u) characterizing a few typical kinds of studies (studies of each kind having roughly the same parameters), each kind of study has some probability of being... true (statistically/roughly/acceptably). A quick sketch of that arithmetic follows.
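Here is a minimal sketch of that arithmetic in Python, using the PPV formula above. The parameter combinations are illustrative guesses of mine, not the rows of Ioannidis' Table 4 (and I leave out his bias term u), but they show the pattern: well-powered studies of plausible hypotheses land above 0.5, underpowered long-shot studies land well below.

    # PPV = (1 - beta) * R / (R + alpha - beta * R), per Table 1 of the paper.
    # alpha: probability of a false positive (Type I error)
    # beta:  probability of a false negative (Type II error); power = 1 - beta
    # R:     pre-study odds that a tested relationship is true

    def ppv(alpha, beta, R):
        return (1 - beta) * R / (R + alpha - beta * R)

    alpha = 0.05  # the conventional significance cutoff

    # Illustrative study profiles (my guesses, NOT the paper's Table 4 rows):
    profiles = [
        # (description,                                beta,  R)
        ("well-powered trial, 1:1 pre-study odds",     0.20, 1.0),
        ("50% power, 1:10 pre-study odds (boundary)",  0.50, 0.1),
        ("underpowered study, long-shot hypothesis",   0.80, 0.05),
        ("exploratory screen, very long odds",         0.80, 0.001),
    ]

    for name, beta, R in profiles:
        p = ppv(alpha, beta, R)
        verdict = "more likely true than false" if p > 0.5 else "more likely false"
        print(f"{name}: PPV = {p:.3f} ({verdict})")

Note that the 'more likely true than false' verdict is exactly Ioannidis' condition (1 - beta)R > alpha.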

Presumably he considers only the top two rows of that table to be generally true, and all the rest false (the latter even in the loosey-goosey/benefit-of-the-doubt/non-categorical/for-the-most-part sense of 'true').

The press for this article often claims that Ioannidis says that '75% of studies are false'. Again, presumably, that figure comes from some weighted average over all studies (in some unspecified context) using the table above. I have not done that computation, nor the setup work of judging a large set of studies (medical?) and deciding which category each falls into.


There's lots more to say about that. But (!) there's more in the paper than that. Ioannidis also bullet-points a number of areas with systematic problems, some related to these calculations and some not, which he lists as 'corollaries' (his corollaries, my comments):

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.  - This is fairly uncontroversial. Small n bad, large n good. 
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.  Also uncontroversial, but some skepticism is warranted because of the meaning of 'effect size' (arbitrary scaling can affect it). But surely a large effect is correlated, sorry, (non-technically) connected with a true finding.
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true. I take this to be directed at running lots of hypothesis tests on a single set of data, which of course is a problem because by sheer chance some test will erroneously come out significant (see the simulation after this list).
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true. Selective publication (publishing only 'positive' results) and non-standard experimental designs lead to high variability in what is actually measured.
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. "Conflicts of interest and prejudice may increase bias" - general sociology.

Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true - because people will rush to publish, using smaller data sets and skipping quality checks.
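To make Corollary 3 concrete, here is a minimal simulation sketch (mine, not from the paper): take one data set of pure noise, test twenty candidate predictors against the outcome, and see how often at least one comes out 'significant' at the 0.05 level.

    # Multiple-comparisons sketch: 20 tests on a single data set of pure noise.
    # With 20 roughly independent tests at alpha = 0.05, the chance of at
    # least one spurious "significant" result is about 1 - 0.95**20, ~64%.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    n_datasets = 1000    # repeat the whole exercise to estimate the rate
    n_subjects = 50
    n_predictors = 20
    alpha = 0.05

    datasets_with_a_hit = 0
    for _ in range(n_datasets):
        outcome = rng.normal(size=n_subjects)
        predictors = rng.normal(size=(n_predictors, n_subjects))  # pure noise
        p_values = [pearsonr(x, outcome)[1] for x in predictors]
        if min(p_values) < alpha:
            datasets_with_a_hit += 1

    print(f"Data sets with at least one spurious 'finding': "
          f"{datasets_with_a_hit / n_datasets:.2f}")   # roughly 0.64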


None of these actually seems to be a corollary of the parameter/PPV calculation in Table 1. But they hold water pretty well. And he only tangentially refers to the 'p-value' hegemony (and not at all to the fact that, with a large enough n, just about everything is 'statistically significant').
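On that last point, here is one more minimal sketch (mine, not Ioannidis'): a real but trivially small difference between two groups becomes 'statistically significant' once n is large enough.

    # Large-n sketch: a tiny real difference (0.02 standard deviations)
    # becomes "statistically significant" once the sample size is big enough.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    tiny_effect = 0.02   # difference in means, in units of the standard deviation

    for n in (1_000, 10_000, 100_000, 1_000_000):
        a = rng.normal(loc=0.0, size=n)
        b = rng.normal(loc=tiny_effect, size=n)
        _, p = ttest_ind(a, b)
        print(f"n = {n:>9,}: p = {p:.2e}")
    # For the largest sample sizes the p-value is essentially zero,
    # even though a 0.02-sd difference is of no practical importance.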

