The National Institutes of Health (NIH) spends more than $30 billion on medical research every year. Some research is done in-house, but most is contracted out to universities and research organizations. Yet the NIH is not always getting its money’s worth when it contracts out. The selection process for awarding grants is inconsistent, and rigorous review of study design and statistical methods is missing. As a result, the quality of NIH-supported research is shockingly poor, at least for some diseases.

After becoming sick with Lyme disease in 2003, I read up on the medical literature on its treatment. Lyme disease is a bacterial infection transmitted by a tick bite. It is treated with antibiotics, but unfortunately many patients fail the recommended treatment of 2-4 weeks of antibiotics. There is currently a fierce debate among medical professionals about the optimal treatment, a debate fueled by the lack of good research. The NIH has in the past funded four clinical trials that looked at additional antibiotic therapy in Lyme patients who failed previous treatment. All four trials had serious design flaws that biased them against finding a treatment effect.

One trial, by Krupp et al. (2003), looked at Lyme patients with persistent fatigue. The trial had been designed to find no effect of additional antibiotic therapy. Patients enrolled in the study were still sick after extensive antibiotic therapy for Lyme disease, some having already been treated for years, so they were unlikely to benefit from an additional 4 weeks of therapy. Another design flaw was that the study was small: fifty-five patients were enrolled, half of them received a placebo, and seven dropped out. With a small sample, the difference between the treatment group and the placebo group must be large in order to reach statistical significance. So one way to design your study to find that treatment offers no benefit is to use a small sample.
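A rough power calculation shows just how small this study was in statistical terms. This is a sketch, not a reanalysis of the actual trial: the arm size of 24 (55 enrolled, minus 7 dropouts, split evenly) and the illustrative 50%-versus-25% response rates are assumptions, and the calculation uses the standard normal approximation for a two-sided two-proportion test:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation)."""
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_crit = norm.ppf(1 - alpha / 2)       # critical value for two-sided alpha
    z_effect = abs(p1 - p2) / se           # standardized true difference
    return norm.cdf(z_effect - z_crit)

# Assumed: ~24 patients per arm; a sizable hypothetical effect of a
# 50% response rate on treatment vs. 25% on placebo.
print(round(two_proportion_power(0.50, 0.25, 24), 2))  # → 0.46
```

At under 50 percent power, a trial of this size would miss a real effect of that magnitude more often than it would detect it.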

Another way is to enroll patients who are not that sick. Patients enrolled in the Krupp study were selected for suffering from “severe persistent fatigue.” Normally, people with severe fatigue are unable to perform everyday tasks, such as holding a job. But patients in the Krupp study worked a remarkable number of hours: 67 percent of these patients with severe fatigue held a full-time job. That is a much higher share than in the U.S. population as a whole. Data from the Bureau of Labor Statistics show that only 53 percent of adults held a full-time job in 1998, when the study was conducted. Moreover, the patients enrolled in the study were older on average, and older people are less likely to work and work fewer hours. So the patients selected for this study worked more, on average, than would be expected from a comparable cross-section of the population who did not have severe persistent fatigue.
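The employment comparison can be checked with an exact binomial test. The count here is an assumption reconstructed from the percentages in the text (67 percent of the 55 enrolled patients is roughly 37 full-time workers), tested against the 53 percent benchmark:

```python
from scipy.stats import binomtest

# Assumed count: 67% of 55 enrolled patients ~ 37 full-time workers.
# Benchmark: the 53% full-time employment rate cited from the BLS.
result = binomtest(k=37, n=55, p=0.53)
print(result.pvalue)
```

A small p-value here would suggest the gap between the patients and the general population is more than a chance fluctuation of a 55-person sample.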

Despite its design flaws, Krupp found that patients who received 28 days of IV antibiotics showed marked improvement in their level of fatigue compared to the placebo group. Fully 69 percent in the treatment group experienced less fatigue, compared to 23 percent of those who received the placebo. Rather than celebrating this encouraging finding, Krupp immediately dismissed it. She speculated that the study had been compromised because, as she notes, “more of the ceftriaxone [the antibiotic used in the study] than the placebo treated groups correctly guessed their treatment assignment.” At the end of the study, 69 percent of the ceftriaxone group correctly guessed they were in the treatment group, while 32 percent of the placebo group correctly guessed they were in the placebo group. But that means 68 percent of the placebo group thought they were in the treatment group. So nearly identical proportions, 69 percent versus 68 percent, believed they were receiving treatment, meaning the placebo effect would have been almost the same in the two groups. Contrary to what Krupp deduced, the trial was not compromised.
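The guessing data make this point directly. With assumed arm sizes of about 24 each (55 enrolled, 7 dropouts, split evenly; the counts below are reconstructed from the percentages in the text), a Fisher exact test shows the two arms expected treatment at statistically indistinguishable rates:

```python
from scipy.stats import fisher_exact

# Assumed counts: 69% of the ceftriaxone arm (~17/24) and 68% of the
# placebo arm (~16/24) believed they were receiving the antibiotic.
believed_treated = [[17, 24 - 17],   # ceftriaxone arm: believed / did not
                    [16, 24 - 16]]   # placebo arm: believed / did not
_, p = fisher_exact(believed_treated)
print(p)  # far above 0.05: no detectable difference in expectations
```

If both arms expected treatment at the same rate, any expectation-driven placebo effect would have pushed both groups equally, leaving the between-group comparison intact.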

Krupp also found other problems with the study after the fact. She argued that fatigue is a nonspecific symptom and therefore not a reliable measure of treatment effect. This makes you wonder why she bothered to do the study in the first place, and why the NIH funded it, wasting more than $1.5 million of taxpayers’ money.

Curiously, Krupp also looked at the effect of treatment on mental speed, measured by a test that Krupp had developed herself. She found that not only did antibiotics fail to improve mental speed, patients in the treatment group ended up slower-witted at the end of the study. The placebo group fared equally badly. But the study had not selected patients for mental speed, and the patients had “relatively mild cognitive deficits, which,” as Krupp notes, “may have contributed to the lack of a treatment effect on cognition.” In other words, the study found that treatment had little effect in curing a symptom the patients did not have.

Even more curiously, Krupp looked at whether an antigen test went from positive to negative. Again, a positive antigen status was not something Krupp had selected for, and only 8 of the 55 Lyme patients had a positive test at the beginning of the study. Krupp found that the test turned from positive to negative for all four patients in the treatment group, but concluded that antibiotic therapy had no effect because the result was not statistically significantly different from the placebo group, where 3 out of 4 patients’ tests turned negative. This is a perfect example of what statisticians call an underpowered test: even the largest effect possible from treatment is not statistically significant, because the sample size is ridiculously small.
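The arithmetic of this sub-analysis is easy to verify with a Fisher exact test, the standard choice for a 2x2 table this small. Given that 3 of 4 placebo patients also converted, the treatment arm's 4/4 conversion is already the largest effect possible, and it still cannot reach significance:

```python
from scipy.stats import fisher_exact

# The antigen sub-analysis described in the text: among the 8 patients
# with a positive baseline test, 4/4 on treatment turned negative
# vs. 3/4 on placebo.
observed = [[4, 0],   # treatment arm: turned negative / stayed positive
            [3, 1]]   # placebo arm:   turned negative / stayed positive
_, p = fisher_exact(observed)
print(p)  # p = 1.0: nowhere near significance
```

With four patients per arm and three placebo conversions, even a perfect response on treatment yields a p-value of 1.0; the comparison was doomed before a single patient was treated.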

The Krupp study appeared in a major medical journal, Neurology. It is shocking that it passed peer review given its poor design and the authors’ own dismissal of their main outcome measure. This should make people wonder about the quality of medical research more generally.