This piece is almost identical to today’s Spectator Health article.
This week there has been enormously wide coverage in the press for one of the worst papers on acupuncture that I’ve come across. As so often, the paper showed the opposite of what its title and press release claimed. For another stunning example of this sleight of hand, try Acupuncturists show that acupuncture doesn’t work, but conclude the opposite: journal fails (about a paper published in the British Journal of General Practice).
Presumably the wide coverage was a result of the hyped-up press release issued by the journal, BMJ Acupuncture in Medicine. That is not the British Medical Journal of course, but it is, bafflingly, published by the BMJ Press group, and if you subscribe to press releases from the real BMJ, you also get them from Acupuncture in Medicine. The BMJ group should not be mixing up press releases about real medicine with press releases about quackery. There seems to be something about quackery that’s clickbait for the mainstream media.
As so often, the press release was shockingly misleading. It said:
Acupuncture may alleviate babies’ excessive crying
Needling twice weekly for 2 weeks reduced crying time significantly
This is totally untrue. Here’s why.
Luckily the Science Media Centre was on the case quickly: read their assessment. The paper made the most elementary of all statistical mistakes. It failed to make allowance for the jelly bean problem. The paper lists 24 different tests of statistical significance and focusses attention on three that happen to give a P value (just) less than 0.05, and so were declared to be "statistically significant". If you do enough tests, some are bound to come out “statistically significant” by chance. They are false positives, and the conclusions are as meaningless as “green jelly beans cause acne” in the cartoon. This is called P-hacking and it’s a well-known cause of problems. It was evidently beyond the wit of the referees to notice this naive mistake. It’s very doubtful whether there is anything happening but random variability. And that’s before you even get to the problem of the weakness of the evidence provided by P values close to 0.05. There’s at least a 30% chance of such values being false positives, even if it were not for the jelly bean problem, and a lot more than 30% if the hypothesis being tested is implausible. I leave it to the reader to assess the plausibility of the hypothesis that a good way to stop a baby crying is to stick needles into the poor baby. If you want to know more about P values try Youtube or here, or here.
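To see how easily this happens, here is a minimal simulation. It is my own sketch, not anything from the paper: the only numbers borrowed from the study are the 24 tests and the 0.05 threshold; the group sizes and the use of a t-test are purely illustrative. It asks how often a study of pure noise produces at least one “significant” result.

```python
# Minimal sketch (not from the paper): how often do 24 tests on pure noise
# produce at least one "significant" result at P < 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies = 10_000    # simulated studies
n_tests = 24          # number of significance tests per study (as in the paper)
n_per_group = 50      # hypothetical group size; the exact value hardly matters

studies_with_false_positive = 0
for _ in range(n_studies):
    p_values = []
    for _ in range(n_tests):
        # Both groups are drawn from the SAME population, so every null hypothesis is true
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(0, 1, n_per_group)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < 0.05:
        studies_with_false_positive += 1

print(f"Fraction of pure-noise studies with at least one 'significant' test: "
      f"{studies_with_false_positive / n_studies:.2f}")
# For 24 independent tests the expected fraction is 1 - 0.95**24, roughly 0.71
```

In other words, with 24 uncorrected tests, about seven out of ten studies of nothing at all will hand you something to put in a press release.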
One of the people asked for an opinion on the paper was George Lewith, the well-known apologist for all things quackish. He described the work as being a "good sized fastidious well conducted study ….. The outcome is clear". Thus showing an ignorance of statistics that would shame an undergraduate.
On the Today Programme, I was interviewed by the formidable John Humphrys, along with the mandatory member of the flat-earth society whom the BBC seems to feel obliged to invite along for "balance". In this case it was a professional acupuncturist, Mike Cummings, who is an associate editor of the journal in which the paper appeared. Perhaps he’d read the Science Media Centre’s assessment before he came on, because he said, quite rightly, that
"in technical terms the study is negative" "the primary outcome did not turn out to be statistically significant"
to which Humphrys retorted, reasonably enough, “So it doesn’t work”. Cummings’ response to this was a lot of bluster about how unfair it was for NICE to expect a treatment to perform better than placebo. It was fascinating to hear Cummings admit that the press release by his own journal was simply wrong.
Listen to the interview here
Another obvious flaw of the study is the nature of the control group. It is not stated very clearly, but it seems that the baby was left alone with the acupuncturist for 10 minutes. A far better control would have been to have the baby cuddled by its mother, or by a nurse. That’s what was used by Olafsdottir et al (2001) in a study that showed cuddling worked just as well as another form of quackery, chiropractic, to stop babies crying.
Manufactured doubt is a potent weapon of the alternative medicine industry. It’s the same tactic as was used by the tobacco industry. You scrape together a few lousy papers like this one and use them to pretend that there’s a controversy. For years the tobacco industry used this tactic to try to persuade people that cigarettes didn’t give you cancer, and that nicotine wasn’t addictive. The mainstream media obligingly invite the representatives of the industry, who convey to the reader/listener that there is a controversy, when there isn’t.
Acupuncture is no longer controversial. It just doesn’t work; see Acupuncture is a theatrical placebo: the end of a myth. Try to imagine a pill that had been subjected to well over 3000 trials without anyone producing convincing evidence for a clinically useful effect. It would have been abandoned years ago. But by manufacturing doubt, the acupuncture industry has managed to keep its product in the news. Every paper on the subject ends with the words "more research is needed". No it isn’t.
Acupuncture is a pre-scientific idea that was moribund everywhere, even in China, until it was revived by Mao Zedong as part of the appalling Great Proletarian Cultural Revolution. Now it is big business in China, and 100 percent of the clinical trials that come from China are positive.
If you believe them, you’ll truly believe anything.
Follow-up
29 January 2017
Soon after the Today programme in which we both appeared, the acupuncturist, Mike Cummings, posted his reaction to the programme. I thought it worth posting the original version in full. Its petulance and abusiveness are quite remarkable.
I thank Cummings for giving publicity to the video of our appearance, and for referring to my Wikipedia page. I leave it to the reader to judge my competence, and his, in the statistics of clinical trials. And it’s odd to be described as a "professional blogger" when the 400+ posts on dcscience.net don’t make a penny; in fact they cost me money. In contrast, he is the salaried medical director of the British Medical Acupuncture Society.
It’s very clear that he has no understanding of the error of the transposed conditional, nor even of the multiple comparison problem (and neither, it seems, does he know the meaning of the word ‘protagonist’).
I ignored his piece, but several friends complained to the BMJ for allowing such abusive material on their blog site. As a result a few changes were made. The “baying mob” is still there, but the Wikipedia link has gone. I thought that readers might be interested to read the original unexpurgated version. It shows, better than I ever could, the weakness of the arguments of the alternative medicine community. To quote Upton Sinclair:
“It is difficult to get a man to understand something, when his salary depends upon his not understanding it.”
It also shows that the BBC still hasn’t learned the lessons in Steve Jones’ excellent “Review of impartiality and accuracy of the BBC’s coverage of science”. Every time I appear in such a programme, they feel obliged to invite a member of the flat earth society to propagate their make-believe.
Acupuncture for infantile colic – misdirection in the media or over-reaction from a sceptic blogger?
26 Jan, 17 | by Dr Mike Cummings

So there has been a big response to this paper press released by BMJ on behalf of the journal Acupuncture in Medicine. The response has been influenced by the usual characters – retired professors who are professional bloggers and vocal critics of anything in the realm of complementary medicine. They thrive on oiling up and flexing their EBM muscles for a baying mob of fellow sceptics (see my ‘stereotypical mental image’ here). Their target in this instant is a relatively small trial on acupuncture for infantile colic.[1] Deserving of being press released by virtue of being the largest to date in the field, but by no means because it gave a definitive answer to the question of the efficacy of acupuncture in the condition. We need to wait for an SR where the data from the 4 trials to date can be combined.

So what about the research itself? I have already said that the trial was not definitive, but it was not a bad trial. It suffered from under-recruiting, which meant that it was underpowered in terms of the statistical analysis. But it was prospectively registered, had ethical approval and the protocol was published. Primary and secondary outcomes were clearly defined, and the only change from the published protocol was to combine the two acupuncture groups in an attempt to improve the statistical power because of under recruitment. The fact that this decision was made after the trial had begun means that the results would have to be considered speculative. For this reason the editors of Acupuncture in Medicine insisted on alteration of the language in which the conclusions were framed to reflect this level of uncertainty.

DC has focussed on multiple statistical testing and p values. These are important considerations, and we could have insisted on more clarity in the paper. P values are a guide and the 0.05 level commonly adopted must be interpreted appropriately in the circumstances. In this paper there are no definitive conclusions, so the p values recorded are there to guide future hypothesis generation and trial design. There were over 50 p values reported in this paper, so by chance alone you must expect some to be below 0.05. If one is to claim statistical significance of an outcome at the 0.05 level, ie a 1:20 likelihood of the event happening by chance alone, you can only perform the test once. If you perform the test twice you must reduce the p value to 0.025 if you want to claim statistical significance of one or other of the tests.

So now we must come to the predefined outcomes. They were clearly stated, and the results of these are the only ones relevant to the conclusions of the paper. The primary outcome was the relative reduction in total crying time (TC) at 2 weeks. There were two significance tests at this point for relative TC. For a statistically significant result, the p values would need to be less than or equal to 0.025 – neither was this low, hence my comment on the Radio 4 Today programme that this was technically a negative trial (more correctly ‘not a positive trial’ – it failed to disprove the null hypothesis ie that the samples were drawn from the same population and the acupuncture intervention did not change the population treated). Finally to the secondary outcome – this was the number of infants in each group who continued to fulfil the criteria for colic at the end of each intervention week. There were four tests of significance so we need to divide 0.05 by 4 to maintain the 1:20 chance of a random event ie only draw conclusions regarding statistical significance if any of the tests resulted in a p value at or below 0.0125. Two of the 4 tests were below this figure, so we say that the result is unlikely to have been chance alone in this case. With hindsight it might have been good to include this explanation in the paper itself, but as editors we must constantly balance how much we push authors to adjust their papers, and in this case the editor focussed on reducing the conclusions to being speculative rather than definitive. A significant result in a secondary outcome leads to a speculative conclusion that acupuncture ‘may’ be an effective treatment option… but further research will be needed etc…

Now a final word on the 3000 plus acupuncture trials that DC loves to mention. His point is that there is no consistent evidence for acupuncture after over 3000 RCTs, so it clearly doesn’t work. He first quoted this figure in an editorial after discussing the largest, most statistically reliable meta-analysis to date – the Vickers et al IPDM.[2] DC admits that there is a small effect of acupuncture over sham, but follows the standard EBM mantra that it is too small to be clinically meaningful without ever considering the possibility that sham (gentle acupuncture plus context of acupuncture) can have clinically relevant effects when compared with conventional treatments. Perhaps now the best example of this is a network meta-analysis (NMA) using individual patient data (IPD), which clearly demonstrates benefits of sham acupuncture over usual care (a variety of best standard or usual care) in terms of health-related quality of life (HRQoL).[3]
30 January 2017
I got an email from the BMJ asking me to take part in a BMJ Head-to-Head debate about acupuncture. I did one of these before, in 2007, but it generated more heat than light (the only good thing to come out of it was the joke about leprechauns). So here is my polite refusal.
Hello

Thanks for the invitation. Perhaps you should read the piece that I wrote after the Today programme.

Why don’t you do these Head to Heads about genuine controversies? To do them about homeopathy or acupuncture is to fall for the “manufactured doubt” stratagem that was used so effectively by the tobacco industry to promote smoking. It’s the favourite tool of snake oil salesmen too, and the BMJ should see that and not fall for their tricks. Such pieces might be good clickbait, but they are bad medicine and bad ethics.

All the best

David
Just a couple of questions:
(1) Are you saying that using a cut-off value of 0.05 is fine, but if P is very close to that value then you can’t really claim that the result is statistically significant?
(2) On the point about the way they carried out 24 separate comparisons, am I right in thinking that they should have applied the Bonferroni correction? So they should have used a cut-off value of 0.05 / 24 for each individual comparison?
I don’t fully understand the stats, so I just want to check these two points. Thanks.
@af
(1) I’m saying that the term “statistically significant” should not be used at all. It’s very misleading, because a P value close to 0.05 provides only weak evidence that there is a real effect. In particular, it does not mean that there is a 5 percent probability that the results are due to chance; in fact observing a P value close to 0.05 implies that the chance that you have a false positive is at least 30 percent. To understand the crucial distinction between P values and the false positive rate, you need to follow the links that I gave: Youtube or here, or here.
(2) Yes, that’s right. Some correction should have been applied. Bonferroni is one way to do it, though there are better (more powerful) methods, such as Benjamini & Hochberg’s method. Whichever method is used, it’s unlikely that there are any real effects at all.
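In case it helps, here is a minimal hand-rolled sketch of both corrections. It is my own illustration; the P values are invented, not taken from the paper, and the point is simply to show why Benjamini & Hochberg’s method is described as more powerful.

```python
# Sketch of the two corrections mentioned above, applied to made-up P values
# (the numbers are illustrative; they are not the paper's results).
import numpy as np

p = np.array([0.001, 0.008, 0.012, 0.04, 0.2, 0.35, 0.6, 0.9])  # hypothetical P values
alpha = 0.05
m = len(p)

# Bonferroni: compare each P value with alpha / m
bonferroni_keep = p < alpha / m

# Benjamini-Hochberg: rank the P values and compare each with alpha * rank / m;
# keep everything up to the largest rank that passes its threshold.
order = np.argsort(p)
thresholds = alpha * np.arange(1, m + 1) / m
passes = p[order] <= thresholds
k = passes.nonzero()[0].max() + 1 if passes.any() else 0
bh_keep = np.zeros(m, dtype=bool)
bh_keep[order[:k]] = True

print("Bonferroni keeps:        ", p[bonferroni_keep])   # [0.001]
print("Benjamini-Hochberg keeps:", p[bh_keep])           # [0.001 0.008 0.012]
```

In practice one would normally use a library routine (for example multipletests in statsmodels) rather than rolling it by hand, but the arithmetic is as simple as this.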
Okay, thanks for the clarification. I was a bit pressed for time, so I didn’t get a chance to follow up on the links. That’s next on my list though!
Just wondering, to what extent does this jelly bean hacking thingy affect clinical studies in pharmacology, too?
@Smith
It’s quite common in all fields for authors to fail to make appropriate corrections for multiple comparisons. Luckily it’s easy to spot once you are conscious of the problem.
There is another related problem. The corrections for multiple comparisons (eg Bonferroni, or Benjamini-Hochberg) aim to produce the specified type 1 error rate. They don’t help at all with the false positive problem. Even if they are successful in producing a type 1 error rate of 5 percent, you can still expect that, if you observe a corrected P value just below 0.05, there’s a chance of at least 30% that it’s a false positive, as discussed in the second comment, above.
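To make that last point concrete, here is the back-of-the-envelope calculation behind figures like “at least 30%”. The significance level, power and prior probability of a real effect are assumptions chosen for illustration; they are not estimates from any particular study.

```python
# False positive risk among results declared "significant" (P <= alpha).
# alpha, power and prior are assumed values for illustration only.
def false_positive_risk(alpha=0.05, power=0.8, prior=0.1):
    """Fraction of 'significant' results that are actually false positives."""
    false_pos = alpha * (1 - prior)   # true nulls that nevertheless give P <= alpha
    true_pos = power * prior          # real effects that are detected at P <= alpha
    return false_pos / (false_pos + true_pos)

print(f"false positive risk = {false_positive_risk():.0%}")  # about 36% with these numbers
# Conditioning on a P value observed close to 0.05, rather than anywhere below it,
# makes matters worse still; see the linked papers for those calculations.
```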
Thanks for that. The paper and article you link to about p-values suggest that statistical misunderstanding has meant that a great deal of medical research up till now is unreliable, and that as a result the authors’ conclusions are likely to be wrong in many cases. Have I understood correctly and stated it reasonably?
BTW Here at the top, when you say “Table 1 of the paper lists 24 different tests of statistical significance”, I think you mean Tables 2 & 3.
@Smith
I think that it would be an exaggeration to say that “a great deal of medical research up till now is unreliable”. Firstly, many P values are much less than 0.05 and leave an ample margin of safety. Secondly, the ‘crisis of reproducibility’ seems to be concentrated in relatively few areas. It’s best documented in experimental psychology, fMRI and genome-wide sequencing. And of course many areas of research don’t use P values at all. In my own area (single ion channel kinetics), I’ve hardly ever needed to do a test of significance.
Clinical trials that are organised by the drug industry tend to have, if anything, better statistical standards than those organised in academia. The sins of the drug industry come later, in things like suppression of negative results and ruthless advertising.
The problems arise mainly when effects are small (relative to noise). Those are the cases in which you get P values close to 0.05. For example, statins have a small effect on expectation of life (despite having a big effect on cholesterol levels).
This is especially true in research on alternative medicine and “supplements”. Experimental design is often poor, effect sizes are always small or non-existent, P values are marginal, and experiments that show no effect are unlikely to be published by people whose livelihood depends on there being an effect.
It’s impossible to estimate just how much is wrong in medical science as a whole. It’s been estimated that something like 50% of regular medicines work reasonably well. One wishes that were higher (some that don’t work, like “cough medicines”, are hangovers from the early days, before things were tested properly). But 50% is pretty good compared with alternative medicine, which has produced not a single treatment that works well enough to get marketing authorisation.
“…alternative medicine, which has produced not a single treatment that works well enough to get marketing authorisation.”
Not sure what kind of authorisation you mean? The ASA lists several conditions each for which they accept the claims for effectiveness of acupuncture, chiropractic and osteopathy respectively.
I do accept that the number of conditions accepted by ASA is small compared with the number for which claims are made.
However, I digress. Thank you for the eye-opener on p-values and jelly beans.
Marketing authorisation is the term used by the MHRA to describe treatments that have presented satisfactory evidence that they both work and are safe: see https://www.gov.uk/government/collections/marketing-authorisations-lists-of-granted-licences.
No alternative treatments have qualified.
The MHRA has adopted a form of (very ineffective) regulation for alternative medicines that’s based on “traditional use”. In order to be able to sell an alternative medicine you have to present evidence that it’s safe, but you don’t need any evidence that it works. Unfortunately this “traditional use” loophole allows herbal and homeopathic potions to be labelled in a misleading way.
Although the ASA usually makes quite good judgements (for example, homeopaths are no longer allowed to claim that they can cure diseases), the ultimate arbiter of benefit/cost is NICE. They recently removed acupuncture from the recommended treatments for back pain.
Perhaps most convincingly of all, a bit of pork barrel politics led the US National Institutes of Health to set up a branch to test alternative therapies. Despite being sympathetic to alternative medicine, and despite spending billions of US taxpayers’ money on testing various implausible treatments, it has failed to come up with a single useful treatment.
I recently had a fall and really banged my shoulder. I’m waiting on the results of an MRI scan; in the meantime I was referred by my GP to the physio service. Whilst I lay on the bed, the physio stuck pins in me and then sat about looking at his VDU! I don’t know how much the NHS pays for such ‘service’. How can we get this nonsense out of the hard-pressed NHS?
@DrPT
Good question. A lot of us are working to try to get rid of money-wasting nonsense. At the individual level, one thing you could do would be to decline acupuncture and then write a letter to the head of the Trust complaining about being fobbed off with quackery.
It seems that acupuncture is largely sustained by a minority of physiotherapists who aren’t very interested in evidence. That’s very sad: it disgraces their profession (see, for example, my post about Connect Physical Health). They should leave the quackery to chiropractors.
Pure quack hospitals are a smaller problem. The Royal London Homeopathic Hospital (now re-branded) has the Queen as its patron (not Charles). It’s probably too much to hope that any CEO would shut it down against her wishes. Such cowardice does them no credit, but that’s life.