Sunday, November 15, 2015
In defense of publication bias and p-hacking
Publication bias and p-hacking have recently come under attack, even though neither is a new thing... not new... at all.
Publication bias is the tendency to publish only study results whose p-value (the statistical measure of 'significance') is <= .05 (5%), the magical oversimplifying cutoff originally stated by Fisher as a... well... magical oversimplification: he found that most people didn't really understand what p-values actually mean, so he offered this as a quick heuristic for significance.
'P-hacking' is the tendency of researchers to massage the data, the experimental design, or the statistical method to push the p-value just over the threshold of 'significance', mostly to get around publication bias.
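To make that concrete, here is a minimal sketch in Python (the post itself has no code; the data and the analysis 'variants' are invented for illustration). Under a true null effect, trying a handful of post-hoc analysis choices and reporting whichever p-value is smallest pushes the false positive rate well past the nominal 5%.

```python
# A minimal sketch of why p-hacking "works", assuming no real effect at all.
# Simulate null data, then try several analysis variants and keep the smallest
# p-value; the chance of crossing p <= .05 balloons well past 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000
hits_honest, hits_hacked = 0, 0

for _ in range(n_experiments):
    a = rng.normal(size=30)   # "treatment" group, no true effect
    b = rng.normal(size=30)   # "control" group, no true effect

    # Honest analysis: one pre-specified test.
    p_honest = stats.ttest_ind(a, b).pvalue

    # Hacked analysis: try a few post-hoc variants and keep the best p-value.
    variants = [
        stats.ttest_ind(a, b).pvalue,
        stats.ttest_ind(a[abs(a) < 2], b[abs(b) < 2]).pvalue,          # "remove outliers"
        stats.mannwhitneyu(a, b, alternative="two-sided").pvalue,      # "try another test"
        stats.ttest_ind(a[:20], b[:20]).pvalue,                        # "stop collecting early"
    ]
    p_hacked = min(variants)

    hits_honest += p_honest <= 0.05
    hits_hacked += p_hacked <= 0.05

print(f"false positive rate, honest: {hits_honest / n_experiments:.3f}")  # about 0.05
print(f"false positive rate, hacked: {hits_hacked / n_experiments:.3f}")  # well above 0.05
```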
These are problematic for different reasons. Publication bias leads the literature to ignore the non-information, the weak information, the non-results: things like 'X doesn't predict Y very well' or 'Treatment Z really doesn't do much'. These are things that would be nice to know, so that a researcher in the field can either avoid studying such unimportant things or discover refinements of the data that do show something important.
'P-hacking' is a bit more pernicious because it can lead in the direction of outright fabrication. Fixing poorly recorded data or removing an outlier is somewhat reasonable (though controversial), but it edges toward fabricating the data itself.
Given these well-attested problems with publication bias and p-hacking, something about them is actually not so bad; in fact, they are an outcome of very reasonable and desirable scientific behaviors. This is not a justification along the lines of 'the road to hell is paved with good intentions'; rather, some part of both is actually a good thing.
First, what is bad about publication bias (tending to publish only positive results)? From a literal, rational viewpoint, it is obviously denying half the story, suppressing all the negative results. You should report all your results, positive and negative, to get a good picture. But that is a false equivalence. Positive results are not positive instances of a coin flip. Positive results are the interesting results. Interesting means something new and compelling. The null hypothesis is dull and lifeless. We already knew the null hypothesis. The null hypothesis is the air we walk through constantly. Reporting positive results is like pointing out a new pathway in the forest. Experts in the field, especially editors of academic journals, see many, many results on slightly different phenomena. They have a good sense of what is new and important, and what has been done over and over again (and so is maybe replication), but they also have a sense of what isn't important or isn't positive in the field. I'm not saying that negative results should not be published, but I do say that they don't need the boosting that positive ones do. A negative result is usually not that interesting. (In the natural sciences, that is; in the more mathematical sciences, negative results are a different sort of thing, and often earth-shattering.)
And for p-hacking, sure, it is gaming the system. Hacking and gaming are things you do to improve something that are, let's say, hors de combat, outside the system. Once you take a measurement, it becomes something to be gamed. Two runners competing for the fastest time? Train harder, eat better, lean at the tape, use starting blocks, get better shoes, bend the rules, shave your hair, make up new rules, take meds, get surgery, make up rules about those rules. The difficulty with experimentation and science is knowing what is cheating and what is allowable. For p-hacking, there are rules: things to do, things that are encouraged, things to avoid, and things you just can't do.
P-values are a third-order measurement. First, data is the primary measurement: a stopwatch, a ruler, whatever. A statistic (like an average) is a measurement on data: you take a bunch of data and measure that set of data. For the mean, it gives you an idea of the center of the data. Then you can measure the p-value, which is a measurement of how reliable the statistic is. At each stage of measurement, gaming can take place. You can manipulate the data (remove outliers, 'fix' values), manipulate the statistic (choose another, sample differently), or manipulate the p-value (pick the best one, correct for multiple comparisons), and at each stage of gaming you can do it legitimately or not (no bias or much bias; and yes, what counts as bias is underspecified).
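Here is a minimal sketch of those three orders of measurement in Python; the numbers and the one-sample t-test are purely illustrative, not anything from the post.

```python
# A minimal sketch of the three "orders" of measurement described above.
import numpy as np
from scipy import stats

# First order: the raw data (the stopwatch or ruler readings).
data = np.array([5.1, 4.8, 5.4, 5.0, 5.3, 4.7, 5.2, 5.5])

# Second order: a statistic measured on that set of data (here, the mean).
mean = data.mean()

# Third order: a p-value measuring how reliable the statistic is --
# here, against the null hypothesis that the true mean is 5.0.
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)

print(f"data (1st order):    {data}")
print(f"mean (2nd order):    {mean:.3f}")
print(f"p-value (3rd order): {p_value:.3f}")
```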
The p-value itself is a time-honored quantity associated with statistics. It is notoriously subtle, and those subtleties are notoriously difficult to teach. But it is a very useful measure of quality. It shouldn't be thrown away, just used carefully. When you mix dangerous chemicals, you do it under a fume hood, wear goggles, and have first aid nearby. When you calculate p-values, you make sure you don't calculate many on the same data and pick the best one.
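And if you do have to compute many p-values on the same data, the 'fume hood' is a multiple-comparisons correction. A minimal sketch, again with made-up data, using the bluntest such correction (Bonferroni):

```python
# A minimal sketch: ten p-values on the same data, with no real effects.
# The smallest raw p-value may dip below .05 by luck; a Bonferroni
# correction (multiply by the number of tests) restores honesty.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
outcome = rng.normal(size=50)            # one outcome, no real effects
predictors = rng.normal(size=(50, 10))   # ten candidate predictors

# One p-value per candidate predictor, all computed on the same outcome.
p_values = np.array([stats.pearsonr(predictors[:, j], outcome)[1]
                     for j in range(predictors.shape[1])])

print("smallest raw p-value: ", p_values.min())
print("Bonferroni-corrected: ", min(p_values.min() * len(p_values), 1.0))
```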
Sure, sure, sure: publication bias and p-hacking, as stated in that manner with their expected tendentious meanings, are to be avoided as such. But the scientific processes that result in those things are not entirely evil. They are natural drives for knowledge, for the expression of knowledge, and for convincing people of knowledge. By saying 'natural' I'm not being lenient. Those drives have a correct part and an incorrect part. The part that we label bias and hacking is essentially bad, but the other part is not; it is good. Not everything about them is bad.
Labels: statistics