This piece is almost identical with today’s Spectator Health article.
This week there has been enormously wide coverage in the press for one of the worst papers on acupuncture that I’ve come across. As so often, the paper showed the opposite of what its title and press release claimed. For another stunning example of this sleight of hand, try Acupuncturists show that acupuncture doesn’t work, but conclude the opposite: journal fails (published in the British Journal of General Practice).
Presumably the wide coverage was a result of the hyped-up press release issued by the journal, BMJ Acupuncture in Medicine. That is not the British Medical Journal, of course, but it is, bafflingly, published by the BMJ Press group, and if you subscribe to press releases from the real BMJ, you also get them from Acupuncture in Medicine. The BMJ group should not be mixing up press releases about real medicine with press releases about quackery. There seems to be something about quackery that’s clickbait for the mainstream media.
As so often, the press release was shockingly misleading. It said:
Acupuncture may alleviate babies’ excessive crying
Needling twice weekly for 2 weeks reduced crying time significantly
This is totally untrue. Here’s why.
Luckily the Science Media Centre was on the case quickly: read their assessment. The paper made the most elementary of all statistical mistakes. It failed to make allowance for the jelly bean problem. The paper lists 24 different tests of statistical significance and focusses attention on three that happen to give a P value (just) less than 0.05, and so were declared to be "statistically significant". If you do enough tests, some are bound to come out “statistically significant” by chance. They are false positives, and the conclusions are as meaningless as “green jelly beans cause acne” in the cartoon. This is called P-hacking and it’s a well-known cause of problems. It was evidently beyond the wit of the referees to notice this naive mistake. It’s very doubtful whether there is anything happening but random variability. And that’s before you even get to the problem of the weakness of the evidence provided by P values close to 0.05. There’s at least a 30% chance of such values being false positives, even if it were not for the jelly bean problem, and a lot more than 30% if the hypothesis being tested is implausible. I leave it to the reader to assess the plausibility of the hypothesis that a good way to stop a baby crying is to stick needles into the poor baby. If you want to know more about P values, try YouTube, or here, or here.
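To see how easily this happens, here is a minimal R sketch (simulated numbers only, nothing to do with the trial’s actual data). It treats all 24 outcomes as if the treatment did nothing at all, and asks how often at least one of them comes out “statistically significant” at P < 0.05.

```r
set.seed(42)
n_tests <- 24    # number of significance tests listed in the paper
alpha   <- 0.05

# chance of at least one "significant" result when every null is true
# (assumes the tests are independent, which they won't be exactly)
1 - (1 - alpha)^n_tests                              # about 0.71

# the same thing by simulation: 10,000 imaginary trials, each reporting
# 24 tests of an effect that does not exist (null P values are uniform)
n_sim <- 10000
p <- matrix(runif(n_sim * n_tests), nrow = n_sim)
mean(apply(p, 1, function(row) any(row < alpha)))    # again about 0.71
```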
One of the people asked for an opinion on the paper was George Lewith, the well-known apologist for all things quackish. He described the work as being a "good sized fastidious well conducted study … The outcome is clear", thus showing an ignorance of statistics that would shame an undergraduate.
On the Today Programme, I was interviewed by the formidable John Humphrys, along with the mandatory member of the flat-earth society whom the BBC seems to feel obliged to invite along for "balance". In this case it was a professional acupuncturist, Mike Cummings, who is an associate editor of the journal in which the paper appeared. Perhaps he’d read the Science Media Centre’s assessment before he came on, because he said, quite rightly, that
"in technical terms the study is negative" "the primary outcome did not turn out to be statistically significant"
to which Humphrys retorted, reasonably enough, “So it doesn’t work”. Cummings’ response to this was a lot of bluster about how unfair it was for NICE to expect a treatment to perform better than placebo. It was fascinating to hear Cummings admit that the press release by his own journal was simply wrong.
Listen to the interview here
Another obvious flaw of the study is the nature of the control group. It is not stated very clearly but it seems that the baby was left alone with the acupuncturist for 10 minutes. A far better control would have been to have the baby cuddled by its mother, or by a nurse. That’s what was used by Olafsdottir et al (2001) in a study that showed cuddling worked just as well as another form of quackery, chiropractic, to stop babies crying.
Manufactured doubt is a potent weapon of the alternative medicine industry. It’s the same tactic as was used by the tobacco industry. You scrape together a few lousy papers like this one and use them to pretend that there’s a controversy. For years the tobacco industry used this tactic to try to persuade people that cigarettes didn’t give you cancer, and that nicotine wasn’t addictive. The mainstream media obligingly invite the representatives of the industry, who convey to the reader/listener that there is a controversy, when there isn’t.
Acupuncture is no longer controversial. It just doesn’t work -see Acupuncture is a theatrical placebo: the end of a myth. Try to imagine a pill that had been subjected to well over 3000 trials without anyone producing convincing evidence for a clinically useful effect. It would have been abandoned years ago. But by manufacturing doubt, the acupuncture industry has managed to keep its product in the news. Every paper on the subject ends with the words "more research is needed". No it isn’t.
Acupuncture is a pre-scientific idea that was moribund everywhere, even in China, until it was revived by Mao Zedong as part of the appalling Great Proletarian Cultural Revolution. Now it is big business in China, and 100 percent of the clinical trials that come from China are positive.
If you believe them, you’ll truly believe anything.
Follow-up
29 January 2017
Soon after the Today programme in which we both appeared, the acupuncturist, Mike Cummings, posted his reaction to the programme. I thought it worth posting the original version in full. Its petulance and abusiveness are quite remarkable.
I thank Cummings for giving publicity to the video of our appearance, and for referring to my Wikipedia page. I leave it to the reader to judge my competence, and his, in the statistics of clinical trials. And it’s odd to be described as a "professional blogger" when the 400+ posts on dcscience.net don’t make a penny -in fact they cost me money. In contrast, he is the salaried medical director of the British Medical Acupuncture Society.
It’s very clear that he has no understanding of the error of the transposed conditional, nor even the multiple comparison problem (and neither, it seems, does he know the meaning of the word ‘protagonist’).
I ignored his piece, but several friends complained to the BMJ for allowing such abusive material on their blog site. As a result a few changes were made. The “baying mob” is still there, but the Wikipedia link has gone. I thought that readers might be interested to read the original unexpurgated version. It shows, better than I ever could, the weakness of the arguments of the alternative medicine community. To quote Upton Sinclair:
“It is difficult to get a man to understand something, when his salary depends upon his not understanding it.”
It also shows that the BBC still hasn’t learned the lessons in Steve Jones’ excellent “Review of impartiality and accuracy of the BBC’s coverage of science“. Every time I appear in such a programme, they feel obliged to invite a member of the flat earth society to propagate their make-believe.
Acupuncture for infantile colic – misdirection in the media or over-reaction from a sceptic blogger? 26 Jan, 17, by Dr Mike Cummings

So there has been a big response to this paper press released by BMJ on behalf of the journal Acupuncture in Medicine. The response has been influenced by the usual characters – retired professors who are professional bloggers and vocal critics of anything in the realm of complementary medicine. They thrive on oiling up and flexing their EBM muscles for a baying mob of fellow sceptics (see my ‘stereotypical mental image’ here). Their target in this instant is a relatively small trial on acupuncture for infantile colic.[1] Deserving of being press released by virtue of being the largest to date in the field, but by no means because it gave a definitive answer to the question of the efficacy of acupuncture in the condition. We need to wait for an SR where the data from the 4 trials to date can be combined.

So what about the research itself? I have already said that the trial was not definitive, but it was not a bad trial. It suffered from under-recruiting, which meant that it was underpowered in terms of the statistical analysis. But it was prospectively registered, had ethical approval and the protocol was published. Primary and secondary outcomes were clearly defined, and the only change from the published protocol was to combine the two acupuncture groups in an attempt to improve the statistical power because of under recruitment. The fact that this decision was made after the trial had begun means that the results would have to be considered speculative. For this reason the editors of Acupuncture in Medicine insisted on alteration of the language in which the conclusions were framed to reflect this level of uncertainty.

DC has focussed on multiple statistical testing and p values. These are important considerations, and we could have insisted on more clarity in the paper. P values are a guide and the 0.05 level commonly adopted must be interpreted appropriately in the circumstances. In this paper there are no definitive conclusions, so the p values recorded are there to guide future hypothesis generation and trial design. There were over 50 p values reported in this paper, so by chance alone you must expect some to be below 0.05. If one is to claim statistical significance of an outcome at the 0.05 level, ie a 1:20 likelihood of the event happening by chance alone, you can only perform the test once. If you perform the test twice you must reduce the p value to 0.025 if you want to claim statistical significance of one or other of the tests.

So now we must come to the predefined outcomes. They were clearly stated, and the results of these are the only ones relevant to the conclusions of the paper. The primary outcome was the relative reduction in total crying time (TC) at 2 weeks. There were two significance tests at this point for relative TC. For a statistically significant result, the p values would need to be less than or equal to 0.025 – neither was this low, hence my comment on the Radio 4 Today programme that this was technically a negative trial (more correctly ‘not a positive trial’ – it failed to disprove the null hypothesis ie that the samples were drawn from the same population and the acupuncture intervention did not change the population treated).

Finally to the secondary outcome – this was the number of infants in each group who continued to fulfil the criteria for colic at the end of each intervention week. There were four tests of significance so we need to divide 0.05 by 4 to maintain the 1:20 chance of a random event ie only draw conclusions regarding statistical significance if any of the tests resulted in a p value at or below 0.0125. Two of the 4 tests were below this figure, so we say that the result is unlikely to have been chance alone in this case. With hindsight it might have been good to include this explanation in the paper itself, but as editors we must constantly balance how much we push authors to adjust their papers, and in this case the editor focussed on reducing the conclusions to being speculative rather than definitive. A significant result in a secondary outcome leads to a speculative conclusion that acupuncture ‘may’ be an effective treatment option… but further research will be needed etc…

Now a final word on the 3000 plus acupuncture trials that DC loves to mention. His point is that there is no consistent evidence for acupuncture after over 3000 RCTs, so it clearly doesn’t work. He first quoted this figure in an editorial after discussing the largest, most statistically reliable meta-analysis to date – the Vickers et al IPDM.[2] DC admits that there is a small effect of acupuncture over sham, but follows the standard EBM mantra that it is too small to be clinically meaningful without ever considering the possibility that sham (gentle acupuncture plus context of acupuncture) can have clinically relevant effects when compared with conventional treatments. Perhaps now the best example of this is a network meta-analysis (NMA) using individual patient data (IPD), which clearly demonstrates benefits of sham acupuncture over usual care (a variety of best standard or usual care) in terms of health-related quality of life (HRQoL).[3]
30 January 2017
I got an email from the BMJ asking me to take part in a BMJ Head-to-Head debate about acupuncture. I did one of these before, in 2007, but it generated more heat than light (the only good thing to come out of it was the joke about leprechauns). So here is my polite refusal.
Hello
Thanks for the invitation. Perhaps you should read the piece that I wrote after the Today programme.
Why don’t you do these Head to Heads about genuine controversies? To do them about homeopathy or acupuncture is to fall for the “manufactured doubt” stratagem that was used so effectively by the tobacco industry to promote smoking. It’s the favourite tool of snake oil salesmen too, and the BMJ should see that and not fall for their tricks. Such pieces might be good clickbait, but they are bad medicine and bad ethics.
All the best
David
This post arose from a recent meeting at the Royal Society. It was organised by Julie Maxton to discuss the application of statistical methods to legal problems. I found myself sitting next to an Appeal Court Judge who wanted more explanation of the ideas. Here it is.
Some preliminaries
The papers that I wrote recently were about the problems associated with the interpretation of screening tests and tests of significance. They don’t allude to legal problems explicitly, though the problems are the same in principle. They are all open access. The first appeared in 2014:
http://rsos.royalsocietypublishing.org/content/1/3/140216
Since the first version of this post, March 2016, I’ve written two more papers and some popular pieces on the same topic. There’s a list of them at http://www.onemol.org.uk/?page_id=456.
I also made a video for YouTube of a recent talk.
In these papers I was interested in the false positive risk (also known as the false discovery rate) in tests of significance. It turned out to be alarmingly large. That has serious consequences for the credibility of the scientific literature. In legal terms, the false positive risk means the proportion of cases in which, on the basis of the evidence, a suspect is found guilty when in fact they are innocent. That has even more serious consequences.
Although most of what I want to say can be said without much algebra, it would perhaps be worth getting two things clear before we start.
The rules of probability.
(1) To get any understanding, it’s essential to understand the rules of probabilities, and, in particular, the idea of conditional probabilities. One source would be my old book, Lectures on Biostatistics (now free). The account on pages 19 to 24 gives a pretty simple (I hope) description of what’s needed. Briefly, a vertical line is read as “given”, so Prob(evidence | not guilty) means the probability that the evidence would be observed given that the suspect was not guilty.
(2) Another potential confusion in this area is the relationship between odds and probability. The relationship between the probability of an event occurring, and the odds on the event, can be illustrated by an example. If the probability of being right-handed is 0.9, then the probability of not being right-handed is 0.1. That means that 9 people out of 10 are right-handed, and one person in 10 is not. In other words, for every person who is not right-handed there are 9 who are right-handed. Thus the odds that a randomly-selected person is right-handed are 9 to 1. In symbols this can be written
\[ \mathrm{probability=\frac{odds}{1 + odds}} \]
In the example, the odds on being right-handed are 9 to 1, so the probability of being right-handed is 9 / (1+9) = 0.9.
Conversely,
\[ \mathrm{odds =\frac{probability}{1 - probability}} \]
In the example, the probability of being right-handed is 0.9, so the odds of being right-handed are 0.9 / (1 – 0.9) = 0.9 / 0.1 = 9 (to 1).
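For anyone who wants to play with the conversion, the two formulas above are one-liners. A minimal R sketch (the function names are just for illustration):

```r
odds_from_prob <- function(p) p / (1 - p)   # odds = probability / (1 - probability)
prob_from_odds <- function(o) o / (1 + o)   # probability = odds / (1 + odds)

odds_from_prob(0.9)   # 9, i.e. 9 to 1 on being right-handed
prob_from_odds(9)     # 0.9
```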
With these preliminaries out of the way, we can proceed to the problem.
The legal problem
The first problem lies in the fact that the answer depends on Bayes’ theorem. Although that was published in 1763, statisticians are still arguing about how it should be used to this day. In fact whenever it’s mentioned, statisticians tend to revert to internecine warfare, and forget about the user.
Bayes’ theorem can be stated in words as follows
\[ \mathrm{\text{posterior odds ratio} = \text{prior odds ratio} \times \text{likelihood ratio}} \]
“Posterior odds ratio” means the odds that the person is guilty, relative to the odds that they are innocent, in the light of the evidence, and that’s clearly what one wants to know. The “prior odds” are the odds that the person was guilty before any evidence was produced, and that is the really contentious bit.
Sometimes the need to specify the prior odds has been circumvented by using the likelihood ratio alone, but, as shown below, that isn’t a good solution.
The analogy with the use of screening tests to detect disease is illuminating.
Screening tests
A particularly straightforward application of Bayes’ theorem is in screening people to see whether or not they have a disease. It turns out, in many cases, that screening gives a lot more wrong results (false positives) than right ones. That’s especially true when the condition is rare (the prior odds that an individual suffers from the condition are small). The process of screening for disease has a lot in common with the screening of suspects for guilt. It matters because false positives in court are disastrous.
The screening problem is dealt with in sections 1 and 2 of my paper, or on this blog (and here). A bit of animation helps the slides, so you may prefer the YouTube version.
The rest of my paper applies similar ideas to tests of significance. In that case the prior probability is the probability that there is in fact a real effect, or, in the legal case, the probability that the suspect is guilty before any evidence has been presented. This is the slippery bit of the problem both conceptually, and because it’s hard to put a number on it.
But the examples below show that to ignore it, and to use the likelihood ratio alone, could result in many miscarriages of justice.
In the discussion of tests of significance, I took the view that it is not legitimate (in the absence of good data to the contrary) to assume any prior probability greater than 0.5. To do so would presume you know the answer before any evidence was presented. In the legal case a prior probability of 0.5 would mean assuming that there was a 50:50 chance that the suspect was guilty before any evidence was presented. A 50:50 probability of guilt before the evidence is known corresponds to a prior odds ratio of 1 (to 1). If that were true, the likelihood ratio would be a good way to represent the evidence, because the posterior odds ratio would be equal to the likelihood ratio.
It could be argued that 50:50 represents some sort of equipoise, but in the example below it is clearly too high, and if it is less than 50:50, use of the likelihood ratio runs a real risk of convicting an innocent person.
The following example is modified slightly from section 3 of a book chapter by Mortera and Dawid (2008). Philip Dawid is an eminent statistician who has written a lot about probability and the law, and he’s a member of the legal group of the Royal Statistical Society.
My version of the example removes most of the algebra, and uses different numbers.
Example: The island problem
The “island problem” (Eggleston 1983, Appendix 3) is an imaginary example that provides a good illustration of the uses and misuses of statistical logic in forensic identification.
A murder has been committed on an island, cut off from the outside world, on which 1001 (= N + 1) inhabitants remain. The forensic evidence at the scene consists of a measurement, x, on a “crime trace” characteristic, which can be assumed to come from the criminal. It might, for example, be a bit of the DNA sequence from the crime scene.
Say, for the sake of example, that the probability of a random member of the population having characteristic x is P = 0.004 (i.e. 0.4% ), so the probability that a random member of the population does not have the characteristic is 1 – P = 0.996. The mainland police arrive and arrest a random islander, Jack. It is found that Jack matches the crime trace. There is no other relevant evidence.
How should this match evidence be used to assess the claim that Jack is the murderer? We shall consider three arguments that have been used to address this question. The first is wrong. The second and third are right. (For illustration, we have taken N = 1000, P = 0.004.)
(1) Prosecutor’s fallacy
Prosecuting counsel, arguing according to his favourite fallacy, asserts that the probability that Jack is guilty is 1 – P , or 0.996, and that this proves guilt “beyond a reasonable doubt”.
The probability that Jack would show characteristic x if he were not guilty is 0.4%, i.e. Prob(Jack has x | not guilty) = 0.004, so the prosecutor argues that the probability that Jack is guilty, given that he has x, must be 1 – 0.004 = 0.996.
But the 0.004 is Prob(evidence | not guilty), which is not what we want. What we need is the probability that Jack is guilty, given the evidence, Prob(Jack is guilty | Jack has characteristic x).
To quote the former as though it were the latter is the prosecutor’s fallacy, or the error of the transposed conditional.
Dawid gives an example that makes the distinction clear.
“As an analogy to help clarify and escape this common and seductive confusion, consider the difference between “the probability of having spots, if you have measles” – which is close to 1 – and “the probability of having measles, if you have spots” – which, in the light of the many alternative possible explanations for spots, is much smaller.”
(2) Defence counter-argument
Counsel for the defence points out that, while the guilty party must have characteristic x, he isn’t the only person on the island to have this characteristic. Among the remaining N = 1000 innocent islanders, 0.4% have characteristic x, so the number who have it will be NP = 1000 × 0.004 = 4. Hence the total number of islanders that have this characteristic must be 1 + NP = 5. The match evidence means that Jack must be one of these 5 people, but does not otherwise distinguish him from the other members of this group. Since just one of these is guilty, the probability that this is Jack is thus 1/5, or 0.2 – very far from being “beyond all reasonable doubt”.
(3) Bayesian argument
The probability of having characteristic x (the evidence) would be Prob(evidence | guilty) = 1 if Jack were guilty, but if Jack were not guilty it would be 0.4%, i.e. Prob(evidence | not guilty) = P. Hence the likelihood ratio in favour of guilt, on the basis of the evidence, is
\[ LR=\frac{\text{Prob(evidence } | \text{ guilty})}{\text{Prob(evidence }|\text{ not guilty})} = \frac{1}{P}=250 \]
In words, the evidence would be 250 times more probable if Jack were guilty than if he were innocent. While this seems strong evidence in favour of guilt, it still does not tell us what we want to know, namely the probability that Jack is guilty in the light of the evidence: Prob(guilty | evidence), or, equivalently, the odds ratio – the odds of guilt relative to the odds of innocence, given the evidence.
To get that we must multiply the likelihood ratio by the prior odds on guilt, i.e. the odds on guilt before any evidence is presented. It’s often hard to get a numerical value for this. But in our artificial example, it is possible. We can argue that, in the absence of any other evidence, Jack is no more nor less likely to be the culprit than any other islander, so that the prior probability of guilt is 1/(N + 1), corresponding to prior odds on guilt of 1/N.
We can now apply Bayes’s theorem to obtain the posterior odds on guilt:
\[ \text {posterior odds} = \text{prior odds} \times LR = \left ( \frac{1}{N}\right ) \times \left ( \frac{1}{P} \right )= 0.25 \]
Thus the odds of guilt in the light of the evidence are 4 to 1 against. The corresponding posterior probability of guilt is
\[ Prob( \text{guilty } | \text{ evidence})= \frac{1}{1+NP}= \frac{1}{1+4}=0.2 \]
This is quite small –certainly no basis for a conviction.
This result is exactly the same as that given by the Defence Counter-argument’, (see above). That argument was simpler than the Bayesian argument. It didn’t explicitly use Bayes’ theorem, though it was implicit in the argument. The advantage of using the former is that it looks simpler. The advantage of the explicitly Bayesian argument is that it makes the assumptions more clear.
In summary: the prosecutor’s fallacy suggested, quite wrongly, that the probability that Jack was guilty was 0.996. The likelihood ratio was 250, which also seems to suggest guilt, but it doesn’t give us the probability that we need. In stark contrast, the defence counsel’s argument, and, equivalently, the Bayesian argument, suggested that the probability of Jack’s guilt was 0.2, or odds of 4 to 1 against guilt. The potential for wrong conviction is obvious.
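The whole island calculation takes only a few lines. Here is a minimal R sketch that reproduces the numbers used above (N = 1000 innocent islanders, P = 0.004):

```r
N <- 1000     # innocent islanders
P <- 0.004    # frequency of characteristic x in the population

1 - P                  # 0.996: the prosecutor's fallacy, NOT Prob(guilty | evidence)

LR <- 1 / P            # 250: the evidence is 250 times more probable if Jack is guilty
prior_odds <- 1 / N    # Jack no more likely to be guilty than any other islander

posterior_odds <- prior_odds * LR        # 0.25, i.e. 4 to 1 against guilt
posterior_odds / (1 + posterior_odds)    # Prob(guilty | evidence) = 0.2 = 1/(1 + N*P)
```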
Conclusions.
Although this argument uses an artificial example that is simpler than most real cases, it illustrates some important principles.
(1) The likelihood ratio is not a good way to evaluate evidence, unless there is good reason to believe that there is a 50:50 chance that the suspect is guilty before any evidence is presented.
(2) In order to calculate what we need, Prob(guilty | evidence), we need to give a numerical value for how common the possession of characteristic x (the evidence) is in the whole population of possible suspects (a reasonable value might be estimated in the case of DNA evidence). We also need to know the size of that population. In the case of the island example, this was 1000, but in general that would be hard to answer, and any answer might well be contested by an advocate who understood the problem.
These arguments lead to four conclusions.
(1) If a lawyer uses the prosecutor’s fallacy, (s)he should be told that it’s nonsense.
(2) If a lawyer advocates conviction on the basis of likelihood ratio alone, s(he) should be asked to justify the implicit assumption that there was a 50:50 chance that the suspect was guilty before any evidence was presented.
(3) If a lawyer uses the Defence counter-argument, or, equivalently, the version of the Bayesian argument given here, (s)he should be asked to justify the estimates of the numerical value given to the prevalence of x in the population (P) and the numerical value of the size of this population (N). A range of values of P and N could be used, to provide a range of possible values of the final result, the probability that the suspect is guilty in the light of the evidence.
(4) The example that was used is the simplest possible case. For more complex cases it would be advisable to ask a professional statistician. Some reliable people can be found at the Royal Statistical Society’s section on Statistics and the Law.
If you do ask a professional statistician, and they present you with a lot of mathematics, you should still ask these questions about precisely what assumptions were made, and ask for an estimate of the range of uncertainty in the value of Prob(guilty | evidence) which they produce.
Postscript: real cases
Another paper by Philip Dawid, Statistics and the Law, is interesting because it discusses some recent real cases: for example the wrongful conviction of Sally Clark because of the wrong calculation of the statistics for Sudden Infant Death Syndrome.
On Monday 21 March, 2016, Dr Waney Squier was struck off the medical register by the General Medical Council because they claimed that she misrepresented the evidence in cases of Shaken Baby Syndrome (SBS).
This verdict was questioned by many lawyers, including Michael Mansfield QC and Clive Stafford Smith, in a letter. “General Medical Council behaving like a modern inquisition”
The latter has already written “This shaken baby syndrome case is a dark day for science – and for justice”.
The evidence for SBS is based on the existence of a triad of signs (retinal bleeding, subdural bleeding and encephalopathy). It seems likely that these signs will be present if a baby has been shaken, i.e. Prob(triad | shaken) is high. But this is irrelevant to the question of guilt. For that we need Prob(shaken | triad). As far as I know, the data to calculate what matters are just not available.
It seems that the GMC may have fallen for the prosecutor’s fallacy. Or perhaps the establishment won’t tolerate arguments. One is reminded, once again, of the definition of clinical experience: “Making the same mistakes with increasing confidence over an impressive number of years.” (from A Sceptic’s Medical Dictionary by Michael O’Donnell, BMJ Publishing, 1997).
Appendix (for nerds). Two forms of Bayes’ theorem
The form of Bayes’ theorem given at the start is expressed in terms of odds ratios. The same rule can be written in terms of probabilities. (This was the form used in the appendix of my paper.) For those interested in the details, it may help to define explicitly these two forms.
In terms of probabilities, the probability of guilt in the light of the evidence (what we want) is
\[ \text{Prob(guilty } | \text{ evidence}) = \text{Prob(evidence } | \text{ guilty}) \times \frac{\text{Prob(guilty)}}{\text{Prob(evidence)}} \]
In terms of odds ratios, the odds ratio on guilt, given the evidence (which is what we want) is
\[ \frac{ \text{Prob(guilty } | \text{ evidence})}{\text{Prob(not guilty } | \text{ evidence})} =
\left ( \frac{ \text{Prob(guilty)}}{\text{Prob(not guilty)}} \right )
\times \left ( \frac{ \text{Prob(evidence } | \text{ guilty})}{\text{Prob(evidence } | \text{ not guilty})} \right ) \]
or, in words,
\[ \text{posterior odds of guilt } =\text{prior odds of guilt} \times \text{likelihood ratio} \]
This is the precise form of the equation that was given in words at the beginning.
A derivation of the equivalence of these two forms is sketched in a document which you can download.
Follow-up
23 March 2016
It’s worth pointing out the following connection between the legal argument (above) and tests of significance.
(1) The likelihood ratio works only when there is a 50:50 chance that the suspect is guilty before any evidence is presented (so the prior probability of guilt is 0.5, or, equivalently, the prior odds ratio is 1).
(2) The false positive rate in significance testing is close to the P value only when the prior probability of a real effect is 0.5, as shown in section 6 of the P value paper.
However there is another twist in the significance testing argument. The statement above is right if we take as a positive result any P < 0.05. If we want to interpret a value of P = 0.047 in a single test, then, as explained in section 10 of the P value paper, we should restrict attention to only those tests that give P close to 0.047. When that is done the false positive rate is 26% even when the prior is 0.5 (and much bigger than 30% if the prior is smaller – see extra Figure). That justifies the assertion that if you claim to have discovered something because you have observed P = 0.047 in a single test then there is a chance of at least 30% that you’ll be wrong. Is there, I wonder, any legal equivalent of this argument?
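For anyone who wants to check the “P close to 0.047” claim, here is a minimal R simulation sketch. It assumes the sort of illustrative setup used in the 2014 paper (two groups of 16 observations, SD = 1, a true difference of 1 SD, so power is close to 0.8) and a prior probability of a real effect of 0.5, and then asks: among simulated experiments that happen to give P near 0.047, what fraction came from the no-effect simulations? Under these assumptions the answer comes out at roughly a quarter to a third, in line with the figures quoted above.

```r
set.seed(1)
nsim <- 100000
n    <- 16                                   # observations per group

row_var <- function(m) rowSums((m - rowMeans(m))^2) / (ncol(m) - 1)

sim_p <- function(delta) {                   # two-sided two-sample t test, vectorised
  x  <- matrix(rnorm(nsim * n, 0, 1),     nrow = nsim)
  y  <- matrix(rnorm(nsim * n, delta, 1), nrow = nsim)
  sp <- (row_var(x) + row_var(y)) / 2        # pooled variance
  t  <- (rowMeans(y) - rowMeans(x)) / sqrt(sp * 2 / n)
  2 * pt(-abs(t), df = 2 * n - 2)
}

p_null <- sim_p(0)    # no real effect
p_real <- sim_p(1)    # true difference of 1 SD (power about 0.78 at alpha = 0.05)

# a prior of 0.5: equal numbers of null and real experiments were simulated
window <- function(p) p > 0.045 & p < 0.05   # "observed P close to 0.047"
n0 <- sum(window(p_null)); n1 <- sum(window(p_real))
n0 / (n0 + n1)                               # false positive risk, roughly 0.25-0.3
```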
“Statistical regression to the mean predicts that patients selected for abnormalcy will, on the average, tend to improve. We argue that most improvements attributed to the placebo effect are actually instances of statistical regression.”
“Thus, we urge caution in interpreting patient improvements as causal effects of our actions and should avoid the conceit of assuming that our personal presence has strong healing powers.”
In 1955, Henry Beecher published "The Powerful Placebo". I was in my second undergraduate year when it appeared, and for many decades after that I took it literally. He looked at 15 studies and found that, on average, 35% of patients got "satisfactory relief" when given a placebo. This number got embedded in pharmacological folk-lore. He also mentioned that the relief provided by placebo was greatest in patients who were most ill.
Consider the common experiment in which a new treatment is compared with a placebo, in a double-blind randomised controlled trial (RCT). It’s common to call the responses measured in the placebo group the placebo response. But that is very misleading, and here’s why.
The responses seen in the group of patients that are treated with placebo arise from two quite different processes. One is the genuine psychosomatic placebo effect. This effect gives genuine (though small) benefit to the patient. The other contribution comes from the get-better-anyway effect. This is a statistical artefact and it provides no benefit whatsoever to patients. There is now increasing evidence that the latter effect is much bigger than the former.
How can you distinguish between real placebo effects and the get-better-anyway effect?
The only way to measure the size of genuine placebo effects is to compare in an RCT the effect of a dummy treatment with the effect of no treatment at all. Most trials don’t have a no-treatment arm, but enough do that estimates can be made. For example, a Cochrane review by Hróbjartsson & Gøtzsche (2010) looked at a wide variety of clinical conditions. Their conclusion was:
“We did not find that placebo interventions have important clinical effects in general. However, in certain settings placebo interventions can influence patient-reported outcomes, especially pain and nausea, though it is difficult to distinguish patient-reported effects of placebo from biased reporting.”
In some cases, the placebo effect is barely there at all. In a non-blind comparison of acupuncture and no acupuncture, the responses were essentially indistinguishable (despite what the authors and the journal said). See "Acupuncturists show that acupuncture doesn’t work, but conclude the opposite"
So the placebo effect, though a real phenomenon, seems to be quite small. In most cases it is so small that it would be barely perceptible to most patients. Most of the reason why so many people think that medicines work when they don’t isn’t a result of the placebo response, but it’s the result of a statistical artefact.
Regression to the mean is a potent source of deception
The get-better-anyway effect has a technical name, regression to the mean. It has been understood since Francis Galton described it in 1886 (see Senn, 2011 for the history). It is a statistical phenomenon, and it can be treated mathematically (see references, below). But when you think about it, it’s simply common sense.
You tend to go for treatment when your condition is bad, and when you are at your worst, a bit later you’re likely to be better. The great biologist, Peter Medawar, comments thus.
"If a person is (a) poorly, (b) receives treatment intended to make him better, and (c) gets better, then no power of reasoning known to medical science can convince him that it may not have been the treatment that restored his health"
(Medawar, P.B. (1969:19). The Art of the Soluble: Creativity and originality in science. Penguin Books: Harmondsworth).
This is illustrated beautifully by measurements made by McGorry et al., (2001). Patients with low back pain recorded their pain (on a 10 point scale) every day for 5 months (they were allowed to take analgesics ad lib).
The results for four patients are shown in their Figure 2. On average they stay fairly constant over five months, but they fluctuate enormously, with different patterns for each patient. Painful episodes that last for 2 to 9 days are interspersed with periods of lower pain or none at all. It is very obvious that if these patients had gone for treatment at the peak of their pain, then a while later they would feel better, even if they were not actually treated. And if they had been treated, the treatment would have been declared a success, despite the fact that the patient derived no benefit whatsoever from it. This entirely artefactual benefit would be the biggest for the patients that fluctuate the most (e.g. those in panels a and d of the Figure).
Figure 2 from McGorry et al, 2000. Examples of daily pain scores over a 6-month period for four participants. Note: Dashes of different lengths at the top of a figure designate an episode and its duration.
The effect is illustrated well by an analysis of 118 trials of treatments for non-specific low back pain (NSLBP), by Artus et al., (2010). The time course of pain (rated on a 100 point visual analogue pain scale) is shown in their Figure 2. There is a modest improvement in pain over a few weeks, but this happens regardless of what treatment is given, including no treatment whatsoever.
FIG. 2 Overall responses (VAS for pain) up to 52-week follow-up in each treatment arm of included trials. Each line represents a response line within each trial arm. Red: index treatment arm; Blue: active treatment arm; Green: usual care/waiting list/placebo arms. ____: pharmacological treatment; – – – -: non-pharmacological treatment; . . .. . .: mixed/other.
The authors comment
"symptoms seem to improve in a similar pattern in clinical trials following a wide variety of active as well as inactive treatments.", and "The common pattern of responses could, for a large part, be explained by the natural history of NSLBP".
In other words, none of the treatments work.
This paper was brought to my attention through the blog run by the excellent physiotherapist, Neil O’Connell. He comments
"If this finding is supported by future studies it might suggest that we can’t even claim victory through the non-specific effects of our interventions such as care, attention and placebo. People enrolled in trials for back pain may improve whatever you do. This is probably explained by the fact that patients enrol in a trial when their pain is at its worst which raises the murky spectre of regression to the mean and the beautiful phenomenon of natural recovery."
O’Connell has discussed the matter in a recent paper, O’Connell (2015), from the point of view of manipulative therapies. That’s an area where there has been resistance to doing proper RCTs, with many people saying that it’s better to look at “real world” outcomes. This usually means that you look at how a patient changes after treatment. The hazards of this procedure are obvious from Artus et al., Fig 2, above. It maximises the risk of being deceived by regression to the mean. As O’Connell commented
"Within-patient change in outcome might tell us how much an individual’s condition improved, but it does not tell us how much of this improvement was due to treatment."
In order to eliminate this effect it’s essential to do a proper RCT with control and treatment groups tested in parallel. When that’s done the control group shows the same regression to the mean as the treatment group, and any additional response in the latter can confidently be attributed to the treatment. Anything short of that is whistling in the wind.
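The deception is easy to reproduce with made-up numbers. Here is a minimal R sketch (entirely simulated, nothing to do with the McGorry or Artus data): each ‘patient’ has a stable underlying pain level plus random day-to-day fluctuation, and is given a completely useless ‘treatment’ only on a day when the pain is bad.

```r
set.seed(1)
n_patients <- 5000
true_level <- rnorm(n_patients, mean = 5, sd = 1)   # each patient's long-run average pain score
fluct      <- 1.5                                   # SD of day-to-day fluctuation

pain_at_visit <- true_level + rnorm(n_patients, 0, fluct)
pain_later    <- true_level + rnorm(n_patients, 0, fluct)   # the "treatment" does nothing

treated <- pain_at_visit > 7          # people seek treatment only when pain is bad
mean(pain_at_visit[treated])          # about 8 at the time of "treatment"
mean(pain_later[treated])             # about 6 a while later: apparent improvement
                                      # from a treatment with no effect whatsoever

# a parallel, randomised control group drawn from the same presenting patients
# would show exactly the same fall, which is how a proper RCT unmasks the artefact
```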
Needless to say, the suboptimal methods are most popular in areas where real effectiveness is small or non-existent. This, sad to say, includes low back pain. It also includes just about every treatment that comes under the heading of alternative medicine. Although these problems have been understood for over a century, it remains true that
"It is difficult to get a man to understand something, when his salary depends upon his not understanding it."
Upton Sinclair (1935)
Responders and non-responders?
One excuse that’s commonly used when a treatment shows only a small effect in proper RCTs is to assert that the treatment actually has a good effect, but only in a subgroup of patients ("responders") while others don’t respond at all ("non-responders"). For example, this argument is often used in studies of anti-depressants and of manipulative therapies. And it’s universal in alternative medicine.
There’s a striking similarity between the narrative used by homeopaths and those who are struggling to treat depression. The pill may not work for many weeks. If the first sort of pill doesn’t work try another sort. You may get worse before you get better. One is reminded, inexorably, of Voltaire’s aphorism "The art of medicine consists in amusing the patient while nature cures the disease".
There is only a handful of cases in which a clear distinction can be made between responders and non-responders. Most often what’s observed is a smear of different responses to the same treatment -and the greater the variability, the greater is the chance of being deceived by regression to the mean.
For example, Thase et al., (2011) looked at responses to escitalopram, an SSRI antidepressant. They attempted to divide patients into responders and non-responders. An example (Fig 1a in their paper) is shown.
The evidence for such a bimodal distribution is certainly very far from obvious. The observations are just smeared out. Nonetheless, the authors conclude
"Our findings indicate that what appears to be a modest effect in the grouped data – on the boundary of clinical significance, as suggested above – is actually a very large effect for a subset of patients who benefited more from escitalopram than from placebo treatment. "
I guess that interpretation could be right, but it seems more likely to be a marketing tool. Before you read the paper, check the authors’ conflicts of interest.
The bottom line is that analyses that divide patients into responders and non-responders are reliable only if that can be done before the trial starts. Retrospective analyses are unreliable and unconvincing.
Some more reading
Senn, 2011 provides an excellent introduction (and some interesting history). The subtitle is
"Here Stephen Senn examines one of Galton’s most important statistical legacies – one that is at once so trivial that it is blindingly obvious, and so deep that many scientists spend their whole career being fooled by it."
The examples in this paper are extended in Senn (2009), “Three things that every medical writer should know about statistics”. The three things are regression to the mean, the error of the transposed conditional and individual response.
You can read slightly more technical accounts of regression to the mean in McDonald & Mazzuca (1983) "How much of the placebo effect is statistical regression" (two quotations from this paper opened this post), and in Stephen Senn (2015) "Mastering variation: variance components and personalised medicine". In 1988 Senn published some corrections to the maths in McDonald (1983).
The trials that were used by Hróbjartsson & Gøtzsche (2010) to investigate the comparison between placebo and no treatment were looked at again by Howick et al., (2013), who found that in many of them the difference between treatment and placebo was also small. Most of the treatments did not work very well.
Regression to the mean is not just a medical deceiver: it’s everywhere
Although this post has concentrated on deception in medicine, it’s worth noting that the phenomenon of regression to the mean can cause wrong inferences in almost any area where you look at change from baseline. A classical example concerns the effectiveness of speed cameras. They tend to be installed after a spate of accidents, and if the accident rate is particularly high in one year it is likely to be lower the next year, regardless of whether a camera had been installed or not. To find the true reduction in accidents caused by installation of speed cameras, you would need to choose several similar sites and allocate them at random to have a camera or no camera. As in clinical trials, looking at the change from baseline can be very deceptive.
Statistical postscript
Lastly, remember that if you avoid all of these hazards of interpretation, and your test of significance gives P = 0.047, that does not mean you have discovered something. There is still a risk of at least 30% that your ‘positive’ result is a false positive. This is explained in Colquhoun (2014), "An investigation of the false discovery rate and the misinterpretation of p-values". I’ve suggested that one way to solve this problem is to use different words to describe P values: something like this.
P > 0.05 very weak evidence
P = 0.05 weak evidence: worth another look
P = 0.01 moderate evidence for a real effect
P = 0.001 strong evidence for real effect
But notice that if your hypothesis is implausible, even these criteria are too weak. For example, if the treatment and placebo are identical (as would be the case if the treatment were a homeopathic pill) then it follows that 100% of positive tests are false positives.
Follow-up
12 December 2015
It’s worth mentioning that the question of responders versus non-responders is closely related to the classical topic of bioassays that use quantal responses. In that field it was assumed that each participant had an individual effective dose (IED). That’s reasonable for the old-fashioned LD50 toxicity test: every animal will die after a sufficiently big dose. It’s less obviously right for ED50 (effective dose in 50% of individuals). The distribution of IEDs is critical, but it has very rarely been determined. The cumulative form of this distribution is what determines the shape of the dose-response curve for the fraction of responders as a function of dose. Linearisation of this curve, by means of the probit transformation, used to be a staple of biological assay. This topic is discussed in Chapter 10 of Lectures on Biostatistics. And you can read some of the history on my blog about Some pharmacological history: an exam from 1959.
Chalkdust is a magazine published by students of maths from the UCL Mathematics department. Judging by its first issue, it’s an excellent vehicle for popularisation of maths. I have a piece in the second issue.
You can view the whole second issue on line, or download a pdf of the whole issue. Or a pdf of my bit only: On the Perils of P values.
The piece started out as another exposition of the interpretation of P values, but the whole of the first part turned into an explanation of the principles of randomisation tests. It beats me why anybody still does a Student’s t test. The idea of randomisation tests is very old. They are as powerful as t tests when the assumptions of the latter are fulfilled but a lot better when the assumptions are wrong (in the jargon, they are uniformly-most-powerful tests).
Not only that, but you need no mathematics to do a randomisation test, whereas you need a good deal of mathematics to follow Student’s 1908 paper. And the randomisation test makes transparently clear that random allocation of treatments is a basic and essential assumption that’s necessary for the validity of any test of statistical significance.
I made a short video that explains the principles behind the randomisation tests, to go with the printed article (a bit of animation always helps).
When I first came across the principles of randomisation tests, I was entranced by the simplicity of the idea. Chapters 6 – 9 of my old textbook were written to popularise them. You can find much more detail there.
In fact it’s only towards the end that I reiterate the idea that P values don’t answer the question that experimenters want to ask, namely: if I claim I have made a discovery because P is small, what’s the chance that I’ll be wrong?
If you want the full story on that, read my paper. The story it tells is not very original, but it still isn’t known to most experimenters (because most statisticians still don’t teach it on elementary courses). The paper must have struck a chord because it’s had over 80,000 full text views and more than 10,000 pdf downloads. It reached an altmetric score of 975 (since when it has been mysteriously declining). That’s gratifying, but it is also a condemnation of the use of metrics. The paper is not original and it’s quite simple, yet it’s had far more "impact" than anything to do with my real work.
If you want simpler versions than the full paper, try this blog (part 1 and part 2), or the Youtube video about misinterpretation of P values.
The R code for doing 2-sample randomisation tests
You can download a pdf file that describes the two R scripts. There are two different R programs.
One re-samples randomly a specified number of times (the default is 100,000 times, but you can do any number). Download two_sample_rantest.R
The other uses every possible sample – in the case of the two samples of 10 observations, it gives the distribution for all 184,756 ways of selecting 10 observations from 20. Download 2-sample-rantest-exact.R
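The downloadable scripts are the ones to use; purely to show how little code the idea needs, here is a stripped-down R sketch of the resampling version, with two made-up samples of 10 observations (for these invented numbers the result comes out close to 0.05):

```r
set.seed(1)
a <- c(4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 5.0)   # invented sample A
b <- c(5.3, 6.1, 4.5, 6.6, 5.0, 5.9, 6.2, 4.8, 6.4, 5.7)   # invented sample B

obs_diff <- mean(b) - mean(a)
pooled   <- c(a, b)
n_a      <- length(a)
n_resamp <- 100000

# re-allocate the 20 observations to two groups at random, many times,
# to get the distribution of the difference when treatment labels are meaningless
null_diff <- replicate(n_resamp, {
  idx <- sample(length(pooled), n_a)
  mean(pooled[-idx]) - mean(pooled[idx])
})

# two-tailed P value: the fraction of random allocations that give a difference
# at least as big as the one observed
mean(abs(null_diff) >= abs(obs_diff))
```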
The launch party
Today the people who organise Chalkdust magazine held a party in the mathematics department at UCL. The editorial director is a graduate student in maths, Rafael Prieto Curiel. He was, at one time, in the Mexican police force (he said he’d suffered more crime in London than in Mexico City). He, and the rest of the team, are deeply impressive. They’ve done a terrific job. Support them.
The party cakes
Rafael Prieto doing the introduction
Rafael Prieto and me
I got the T shirt
Decoding the T shirt
The top line is "i" because that’s the usual symbol for the square root of -1.
The second line is one of many equations that describe a heart shape. It can be plotted by calculating a matrix of values of the left hand side for a range of values of x and y, and then plotting the contour for the values of x and y for which the left hand side is equal to 1. Download R script for this. (Method suggested by Rafael Prieto Curiel.)
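I haven’t reproduced the T-shirt’s own equation here, but the method just described is easy to illustrate. A minimal R sketch using one well-known implicit heart curve, (x² + y² − 1)³ − x²y³ = 0, plotted as its zero contour; for the T-shirt’s equation you would swap in its left-hand side and set the contour level to 1.

```r
# evaluate the left-hand side on a grid, then draw a single contour
lhs <- function(x, y) (x^2 + y^2 - 1)^3 - x^2 * y^3
x <- seq(-1.5, 1.5, length.out = 400)
y <- seq(-1.5, 1.5, length.out = 400)
z <- outer(x, y, lhs)                     # matrix of left-hand-side values
contour(x, y, z, levels = 0, drawlabels = FALSE, asp = 1, col = "red", lwd = 2)
```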
Follow-up
5 November 2015
The Mann-Whitney test
I was stimulated to write this follow-up because yesterday I was asked by a friend to comment on the fact that five different tests all gave identical P values, P = 0.0079. The paper in question was in Science magazine (see Fig. 1), so it wouldn’t surprise me if the statistics were done badly, but in this case there is an innocent explanation.
The Chalkdust article, and the video, are about randomisation tests done using the original observed numbers, so look at them before reading on. There is a more detailed explanation in Chapter 9 of Lectures on Biostatistics. Before it became feasible to do this sort of test, there was a simpler, and less efficient, version in which the observations were ranked in ascending order, and the observed values were replaced by their ranks. This was known as the Mann–Whitney test. It had the virtue that because all the ‘observations’ were now integers, the number of possible results of resampling was limited, so it was possible to construct tables to allow one to get a rough P value. Of course, replacing observations by their ranks throws away some information, and now that we have computers there is no need to use a Mann–Whitney test ever. But that’s what was used in this paper.
In the paper (Fig 1) comparisons are made between two groups (assumed to be independent) with 5 observations in each group. The 10 observations are just the ranks, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
To do the randomisation test we select 5 of these numbers at random for sample A, and the other 5 are sample B. (Of course this supposes that the treatments were applied randomly in the real experiment, which is unlikely to be true.) In fact there are only 10!/(5!·5!) = 252 possible ways to select a sample of 5 from 10, so it’s easy to list all of them. In the case where there is no overlap between the groups, one group will contain the smallest observations (ranks 1, 2, 3, 4, 5) and the other group will contain the highest observations (ranks 6, 7, 8, 9, 10).
In this case, the sum of the ‘observations’ in group A is 15, and the sum for group B is 40. These add up to the sum of the first 10 integers, 10 × (10 + 1)/2 = 55. The mean (which corresponds to a difference between means of zero) is 55/2 = 27.5.
There are two ways of getting an allocation as extreme as this (first group low, as above, or second group low, the other tail of the distribution). The two tailed P value is therefore 2/252 = 0.0079. This will be the result whenever the two groups don’t overlap, regardless of the numerical values of the observations. It’s the smallest P value the test can produce with 5 observations in each group.
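The counting is easy to check by brute-force enumeration. A minimal R sketch that lists all 252 possible allocations of the ranks 1 to 10 into two groups of five (it also gives the counts for other totals, discussed below):

```r
combos <- combn(1:10, 5)          # each column is one possible set of ranks for group A
sumsA  <- colSums(combos)

ncol(combos)                      # 252 possible allocations
sum(sumsA == 15)                  # 1: only ranks 1-5 give the smallest possible total
2 / ncol(combos)                  # 0.0079: two-tailed P for the no-overlap case
table(sumsA)[as.character(15:17)] # 1, 1, 2 ways of getting totals 15, 16, 17
table(sumsA)[c("27", "28")]       # 20 ways each for the two central totals
```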
The whole randomisation distribution looks like this
In this case, the abscissa is the sum of the ranks in sample A, rather than the difference between means for the two groups (the latter is easily calculated from the former). The red line shows the observed value, 15. There is only one way to get a total of 15 for group A: it must contain the lowest 5 ranks (group A = 1, 2, 3, 4, 5). There is also only one way to get a total of 16 (group A = 1, 2, 3, 4, 6), and there are two ways of getting a total of 17 (group A = 1, 2, 3, 4, 7, or 1, 2, 3, 5, 6). But there are 20 different ways of getting a sum of 27, and another 20 of getting 28 (the two totals that straddle the mean, 27.5). The printout (.txt file) from the R program that was used to generate the distribution is as follows.
Randomisation test: exact calculation, all possible samples
INPUTS: exact calculation: all possible samples
OUTPUTS: result of t test
[only the headings of the printout are recoverable here]
Some problems. Figure 1 alone shows 16 two-sample comparisons, but no correction for multiple comparisons seems to have been made. A crude Bonferroni correction would require replacement of a P = 0.05 threshold with P = 0.05/16 = 0.003. None of the 5 tests that gave P = 0.0079 reaches this level (of course the whole idea of a threshold level is absurd anyway).
Furthermore, even a single test that gave P = 0.0079 would be expected to have a false positive rate of around 10 percent.
The two posts on this blog about the hazards of significance testing have proved quite popular. See Part 1: the screening problem, and Part 2: the false discovery rate. They’ve had over 20,000 hits already (though I still have to find a journal that will print the paper based on them).
Yet another Alzheimer’s screening story hit the headlines recently and the facts got sorted out in the follow-up section of the screening post. If you haven’t read that already, it might be helpful to do so before going on to this post.
This post has already appeared on the Sense about Science web site. They asked me to explain exactly what was meant by the claim that the screening test had an "accuracy of 87%". That was mentioned in all the media reports, no doubt because it was the only specification of the quality of the test in the press release. Here is my attempt to explain what it means.
The "accuracy" of screening tests
Anything about Alzheimer’s disease is front line news in the media. No doubt that had not escaped the notice of King’s College London when they issued a press release about a recent study of a test for development of dementia based on blood tests. It was widely hailed in the media as a breakthrough in dementia research. The BBC report, for example, was far from accurate. The main reason for the inaccurate reports is, as so often, the press release. It said
"They identified a combination of 10 proteins capable of predicting whether individuals with MCI would develop Alzheimer’s disease within a year, with an accuracy of 87 percent"
The original paper says
"Sixteen proteins correlated with disease severity and cognitive decline. Strongest associations were in the MCI group with a panel of 10 proteins predicting progression to AD (accuracy 87%, sensitivity 85% and specificity 88%)."
What matters to the patient is the probability that, if they come out positive when tested, they will actually get dementia. The Guardian quoted Dr James Pickett, head of research at the Alzheimer’s Society, as saying
"These 10 proteins can predict conversion to dementia with less than 90% accuracy, meaning one in 10 people would get an incorrect result."
That statement simply isn’t right (or, at least, it’s very misleading). The proper way to work out the relevant number has been explained in many places -I did it recently on my blog.
The easiest way to work it out is to make a tree diagram. The diagram is like that previously discussed here, but with a sensitivity of 85% and a specificity of 88%, as specified in the paper.
In order to work out the number we need, we have to specify the true prevalence of people who will develop dementia, in the population being tested. In the tree diagram, this has been taken as 10%. The diagram shows that, out of 1000 people tested, there are 85 + 108 = 193 with a positive test result. Out of these 193, rather more than half (108) are false positives, so if you test positive there is a 56% chance that it’s a false alarm (108/193 = 0.56). A false discovery rate of 56% is far too high for a good test.
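For anyone who wants to check the arithmetic of the tree diagram, here is a minimal R sketch (using the paper’s sensitivity and specificity, and the assumed prevalence of 10%):

```r
sens <- 0.85; spec <- 0.88; prev <- 0.10; N <- 1000
true.pos  <- N * prev * sens               # 85 correct positives
false.pos <- N * (1 - prev) * (1 - spec)   # 108 false positives
true.neg  <- N * (1 - prev) * spec         # 792 correct negatives
false.pos / (true.pos + false.pos)         # 0.56: the false discovery rate
(true.pos + true.neg) / N                  # 0.877: the 'accuracy' discussed below
```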
This figure of 56% seems to be the basis for a rather good post by NHS Choices with the title “Blood test for Alzheimer’s ‘no better than coin toss’”.
If the prevalence were taken as 5% (a value that’s been given for the over-60 age group) that fraction of false alarms would rise to a disastrous 73%.
How are these numbers related to the claim that the test is "87% accurate"? That claim was parroted in most of the media reports, and it is why Dr Pickett said "one in 10 people would get an incorrect result".
The paper itself didn’t define "accuracy" anywhere, and I wasn’t familiar with the term in this context (though Stephen Senn pointed out that it is mentioned briefly in the Wikipedia entry for Sensitivity and Specificity). The senior author confirmed that "accuracy" means the total fraction of tests, positive or negative, that give the right result. We see from the tree diagram that, out of 1000 tests, there are 85 correct positive tests and 792 correct negative tests, so the accuracy (with a prevalence of 0.1) is (85 + 792)/1000 = 88%, close to the value that’s cited in the paper.
Accuracy, defined in this way, seems to me not to be a useful measure at all. It conflates positive and negative results and they need to be kept separate to understand the problem. Inspection of the tree diagram shows that it can be expressed algebraically as
accuracy = (sensitivity × prevalence) + (specificity × (1 − prevalence))
It is therefore merely a weighted mean of sensitivity and specificity (weighted by the prevalence). With the numbers in this case, it varies from 0.88 (when prevalence = 0) to 0.85 (when prevalence = 1). Thus it will inevitably give a much more flattering view of the test than the false discovery rate.
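In R, that weighted mean is a one-liner (a sketch using the paper’s values):

```r
accuracy <- function(sens, spec, prev) sens * prev + spec * (1 - prev)
accuracy(0.85, 0.88, prev = c(0, 0.1, 1))   # 0.880, 0.877, 0.850
```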
No doubt, it is too much to expect that a hard-pressed journalist would have time to figure this out, though it isn’t clear that they wouldn’t have time to contact someone who understands it. But it is clear that it should have been explained in the press release. It wasn’t.
In fact, reading the paper shows that the test was not being proposed as a screening test for dementia at all. It was proposed as a way to select patients for entry into clinical trials. The population that was being tested was very different from the general population of old people, being patients who come to memory clinics in trials centres (the potential trials population).
How best to select patients for entry into clinical trials is a matter of great interest to people who are running trials. It is of very little interest to the public. So all this confusion could have been avoided if Kings had refrained from issuing a press release at all for a paper like this.
I guess universities think that PR is more important than accuracy.
That’s a bad mistake in an age when pretensions get quickly punctured on the web.
This post first appeared on the Sense about Science web site.
This post is now a bit out of date: a summary of my more recent efforts (papers, videos and pop stuff) can be found on Prof Sivilotti’s OneMol pages.
What follows is a simplified version of part of a paper that appeared as a preprint on arXiv in July. It appeared as a peer-reviewed paper on 19th November 2014, in the new Royal Society Open Science journal. If you find anything wrong, or obscure, please email me. Be vicious.
There is also a simplified version, given as a talk on YouTube.
It’s a follow-up to my very first paper, which was written in 1959 – 60, while I was a fourth year undergraduate (the history is at a recent blog). I hope this one is better.
“. . . before anything was known of Lydgate’s skill, the judgements on it had naturally been divided, depending on a sense of likelihood, situated perhaps in the pit of the stomach, or in the pineal gland, and differing in its verdicts, but not less valuable as a guide in the total deficit of evidence” George Eliot (Middlemarch, Chap. 45)
“The standard approach in teaching, of stressing the formal definition of a p-value while warning against its misinterpretation, has simply been an abysmal failure” Sellke et al. (2001) The American Statistician (55), 62–71
The last post was about screening. It showed that most screening tests are useless, in the sense that a large proportion of people who test positive do not have the condition. This proportion can be called the false discovery rate. You think you’ve discovered the condition, but you were wrong.
Very similar ideas can be applied to tests of significance. If you read almost any scientific paper you’ll find statements like “this result was statistically significant (P = 0.047)”. Tests of significance were designed to prevent you from making a fool of yourself by claiming to have discovered something, when in fact all you are seeing is the effect of random chance. In this case we define the false discovery rate as the probability that, when a test comes out as ‘statistically significant’, there is actually no real effect.
You can also make a fool of yourself by failing to detect a real effect, but this is less harmful to your reputation.
It’s very common for people to claim that an effect is real, not just chance, whenever the test produces a P value of less than 0.05, and when asked, it’s common for people to think that this procedure gives them a chance of 1 in 20 of making a fool of themselves. Leaving aside the fact that 1 in 20 already seems rather too often to make a fool of yourself, this interpretation is simply wrong.
The purpose of this post is to justify the following proposition.
If you observe a P value close to 0.05, your false discovery rate will not be 5%. It will be at least 30% and it could easily be 80% for small studies.
This makes slightly less startling the assertion in John Ioannidis’ (2005) article, Why Most Published Research Findings Are False. That paper caused quite a stir. It’s a serious allegation. In fairness, the title was a bit misleading. Ioannidis wasn’t talking about all science. But it has become apparent that an alarming number of published works in some fields can’t be reproduced by others. The worst offenders seem to be clinical trials, experimental psychology and neuroscience, some parts of cancer research and some attempts to associate genes with disease (genome-wide association studies). Of course the self-correcting nature of science means that the false discoveries get revealed as such in the end, but it would obviously be a lot better if false results weren’t published in the first place.
How can tests of significance be so misleading?
Tests of statistical significance have been around for well over 100 years now. One of the most widely used is Student’s t test. It was published in 1908. ‘Student’ was the pseudonym for William Sealy Gosset, who worked at the Guinness brewery in Dublin. He visited Karl Pearson’s statistics department at UCL because he wanted statistical methods that were valid for testing small samples. The example that he used in his paper was based on data from Arthur Cushny, the first holder of the chair of pharmacology at UCL (subsequently named the A.J. Clark chair, after its second holder).
The outcome of a significance test is a probability, referred to as a P value. First, let’s be clear what the P value means. It will be simpler to do that in the context of a particular example. Suppose we wish to know whether treatment A is better (or worse) than treatment B (A might be a new drug, and B a placebo). We’d take a group of people and allocate each person to take either A or B and the choice would be random. Each person would have an equal chance of getting A or B. We’d observe the responses and then take the average (mean) response for those who had received A and the average for those who had received B. If the treatment (A) was no better than placebo (B), the difference between means should be zero on average. But the variability of the responses means that the observed difference will never be exactly zero. So how big does it have to be before you discount the possibility that random chance is all you were seeing? You do the test and get a P value. Given the ubiquity of P values in scientific papers, it’s surprisingly rare for people to be able to give an accurate definition. Here it is.
The P value is the probability that you would find a difference as big as that observed, or a still bigger value, if in fact A and B were identical.
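It’s easy to check this definition by simulation. Here is a minimal R sketch (my own illustration, with 10 observations per group): when A and B really are identical, a P value below 0.05 turns up in about 5% of tests, just as the definition implies.

```r
set.seed(1)
p <- replicate(10000, t.test(rnorm(10), rnorm(10))$p.value)  # A and B identical
mean(p < 0.05)                                               # close to 0.05
```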
If this probability is low enough, the conclusion would be that it’s unlikely that the observed difference (or a still bigger one) would have occurred if A and B were identical, so we conclude that they are not identical, i.e. that there is a genuine difference between treatment and placebo.
This is the classical way to avoid making a fool of yourself by claiming to have made a discovery when you haven’t. It was developed and popularised by the greatest statistician of the 20th century, Ronald Fisher, during the 1920s and 1930s. It does exactly what it says on the tin. It sounds entirely plausible.
What could possibly go wrong?
Another way to look at significance tests
One way to look at the problem is to notice that the classical approach considers only what would happen if there were no real effect or, as a statistician would put it, what would happen if the null hypothesis were true. But there isn’t much point in knowing that an event is unlikely when the null hypothesis is true unless you know how likely it is when there is a real effect.
We can look at the problem a bit more realistically by means of a tree diagram, very like that used to analyse screening tests, in the previous post.
In order to do this, we need to specify a couple more things.
First we need to specify the power of the significance test. This is the probability that we’ll detect a difference when there really is one. By ‘detect a difference’ we mean that the test comes out with P < 0.05 (or whatever level we set). So it’s analogous with the sensitivity of a screening test. In order to calculate sample sizes, it’s common to set the power to 0.8 (obviously 0.99 would be better, but that would often require impracticably large samples).
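As an illustration (not a number from any of the papers discussed here), R’s built-in power calculator shows the sort of sample size needed to reach a power of 0.8 when the true difference is one standard deviation:

```r
power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)
# gives n of about 17 per group
```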
The second thing that we need to specify is a bit trickier, the proportion of tests that we do in which there is a real difference. This is analogous to the prevalence of the disease in the population being tested in the screening example. There is nothing mysterious about it. It’s an ordinary probability that can be thought of as a long-term frequency. But it is a probability that’s much harder to get a value for than the prevalence of a disease.
If we were testing a series of 30C homeopathic pills, all of the pills, regardless of what it says on the label, would be identical with the placebo controls so the prevalence of genuine effects, call it P(real), would be zero. So every positive test would be a false positive: the false discovery rate would be 100%. But in real science we want to predict the false discovery rate in less extreme cases.
Suppose, for example, that we test a large number of candidate drugs. Life being what it is, most of them will be inactive, but some will have a genuine effect. In this example we’d be lucky if 10% had a real effect, i.e. were really more effective than the inactive controls. So in this case we’d set the prevalence to P(real) = 0.1.
We can now construct a tree diagram exactly as we did for screening tests.
Suppose that we do 1000 tests. In 90% of them (900 tests) there is no real effect: the null hypothesis is true. If we use P = 0.05 as a criterion for significance then, according to the classical theory, 5% of them (45 tests) will give false positives, as shown in the lower limb of the tree diagram. If the power of the test was 0.8 then we’ll detect 80% of the real differences so there will be 80 correct positive tests.
The total number of positive tests is 45 + 80 = 125, and the proportion of these that are false positives is 45/125 = 36 percent. Our false discovery rate is far bigger than the 5% that many people still believe they are attaining.
In contrast, 98% of negative tests are right (though this is less surprising because 90% of experiments really have no effect).
The equation
You can skip this section without losing much.
As in the case of screening tests, this result can be calculated from an equation. The same equation works if we substitute power for sensitivity, P(real) for prevalence, and siglev for (1 – specificity) where siglev is the cut off value for “significance”, 0.05 in our examples.
The false discovery rate (the probability that, if a “significant” result is found, there is actually no real effect) is given by
\[FDR = \frac{siglev\left(1-P(real)\right)}{power.P(real) + siglev\left(1-P(real)\right) }\; \]
In the example above, power = 0.8, siglev = 0.05 and P(real) = 0.1, so the false discovery rate is
\[\frac{0.05 (1-0.1)}{0.8 \times 0.1 + 0.05 (1-0.1) }\; = 0.36 \]
So 36% of “significant” results are wrong, as found in the tree diagram.
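The equation is easily turned into an R function (a sketch), which also shows how strongly the answer depends on the prevalence of real effects:

```r
fdr <- function(power, siglev, p.real) {
  siglev * (1 - p.real) / (power * p.real + siglev * (1 - p.real))
}
fdr(power = 0.8, siglev = 0.05, p.real = 0.1)   # 0.36, as in the tree diagram
fdr(power = 0.8, siglev = 0.05, p.real = 0.5)   # 0.059, the 6% mentioned below
```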
Some subtleties
The argument just presented should be quite enough to convince you that significance testing, as commonly practised, will lead to disastrous numbers of false positives. But the basis of how to make inferences is still a matter that’s the subject of intense controversy among statisticians, so what is an experimenter to do?
It is difficult to give a consensus of informed opinion because, although there is much informed opinion, there is rather little consensus. A personal view follows. Colquhoun (1970), Lectures on Biostatistics, pp 94-95.
This is almost as true now as it was when I wrote it in the late 1960s, but there are some areas of broad agreement.
There are two subtleties that cause the approach outlined above to be a bit contentious. The first lies in the problem of deciding the prevalence, P(real). You may have noticed that if the frequency of real effects were 50% rather than 10%, the approach shown in the diagram would give a false discovery rate of only 6%, little different from the 5% that’s embedded in the consciousness of most experimentalists.
But this doesn’t get us off the hook, for two reasons. For a start, there is no reason at all to think that there will be a real effect there in half of the tests that we do. Of course if P(real) were even bigger than 0.5, the false discovery rate would fall to zero, because when P(real) = 1, all effects are real and therefore all positive tests are correct.
There is also a more subtle point. If we are trying to interpret the result of a single test that comes out with a P value of, say, P = 0.047, then we should not be looking at all significant results (those with P < 0.05), but only at those tests that come out with P = 0.047. This can be done quite easily by simulating a long series of t tests, and then restricting attention to those that come out with P values between, say, 0.045 and 0.05. When this is done we find that the false discovery rate is at least 26%. That’s for the best possible case where the sample size is good (power of the test is 0.8) and the prevalence of real effects is 0.5. When, as in the tree diagram, the prevalence of real effects is 0.1, the false discovery rate is 76%. That’s enough to justify Ioannidis’ statement that most published results are wrong.
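Here is a rough sketch in R of that sort of simulation (the number of simulated tests, and the sample size of 16 per group that gives a power of roughly 0.8, are my own choices for illustration):

```r
set.seed(1)
n.sim <- 50000; n <- 16                     # n = 16 per group gives a power of roughly 0.8
real <- runif(n.sim) < 0.5                  # P(real) = 0.5: half the tests have a true effect of 1 SD
p <- sapply(real, function(r)
  t.test(rnorm(n, mean = if (r) 1 else 0), rnorm(n, mean = 0))$p.value)
keep <- p > 0.045 & p < 0.05                # look only at P values close to 0.05
mean(!real[keep])                           # fraction that are false positives: roughly a quarter
```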
One problem with all of the approaches mentioned above was the need to guess at the prevalence of real effects (that’s what a Bayesian would call the prior probability). James Berger and colleagues (Sellke et al., 2001) have proposed a way round this problem by looking at all possible prior distributions and so coming up with a minimum false discovery rate that holds universally. The conclusions are much the same as before. If you claim to have found an effect whenever you observe a P value just less than 0.05, you will come to the wrong conclusion in at least 29% of the tests that you do. If, on the other hand, you use P = 0.001, you’ll be wrong in only 1.8% of cases. Valen Johnson (2013) has reached similar conclusions by a related argument.
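For the record, here is my implementation of that bound as an R sketch (it assumes prior odds of 1, i.e. P(real) = 0.5). It reproduces the 29% and 1.8% just quoted, and, approximately, the "around 10 percent" mentioned earlier for P = 0.0079.

```r
# Minimum false discovery rate from the Sellke et al. bound, -e * p * log(p)
fdr.min <- function(p) {
  B0 <- -exp(1) * p * log(p)   # lower bound on the odds in favour of the null (valid for p < 1/e)
  B0 / (1 + B0)                # posterior probability of the null, with prior odds of 1
}
fdr.min(0.05)     # about 0.29
fdr.min(0.001)    # about 0.018
fdr.min(0.0079)   # about 0.09
```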
A three-sigma rule
As an alternative to insisting on P < 0.001 before claiming you’ve discovered something, you could use a 3-sigma rule. In other words, insist that an effect is at least three standard deviations away from the control value (as opposed to the two standard deviations that correspond to P = 0.05).
The three sigma rule means using P= 0.0027 as your cut off. This, according to Berger’s rule, implies a false discovery rate of (at least) 4.5%, not far from the value that many people mistakenly think is achieved by using P = 0.05 as a criterion.
Particle physicists go a lot further than this. They use a 5-sigma rule before announcing a new discovery. That corresponds to a P value of less than one in a million (0.57 × 10⁻⁶). According to Berger’s rule this corresponds to a false discovery rate of (at least) around 20 per million. Of course their experiments can’t usually be randomised, so it’s as well to be on the safe side.
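The correspondence between sigma rules and P values is just the tail area of the normal distribution, easily checked in R:

```r
2 * pnorm(-3)   # 0.0027: the three-sigma criterion
2 * pnorm(-5)   # 5.7e-07: the five-sigma criterion, about 0.57 per million
```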
Underpowered experiments
All of the problems discussed so far concern the near-ideal case. They assume that your sample size is big enough (power about 0.8 say) and that all of the assumptions made in the test are true, that there is no bias or cheating and that no negative results are suppressed. The real-life problems can only be worse. One way in which it is often worse is that sample sizes are too small, so the statistical power of the tests is low.
The problem of underpowered experiments has been known since 1962, but it has been ignored. Recently it has come back into prominence, thanks in large part to John Ioannidis and the crisis of reproducibility in some areas of science. Button et al. (2013) said
“We optimistically estimate the median statistical power of studies in the neuroscience field to be between about 8% and about 31%”
This is disastrously low. Running simulated t tests shows that with a power of 0.2, not only do you have only a 20% chance of detecting a real effect, but that when you do manage to get a “significant” result there is a 76% chance that it’s a false discovery.
And furthermore, when you do find a “significant” result, the size of the effect will be over-estimated by a factor of nearly 2. This “inflation effect” happens because only those experiments that happen, by chance, to have a larger-than-average effect size will be deemed to be “significant”.
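The inflation effect is easy to demonstrate by simulation. Here is a rough R sketch (my own illustration), with a true difference of one standard deviation and only 4 observations per group, which gives a power of roughly 0.2:

```r
set.seed(1)
sims <- replicate(20000, {
  x <- rnorm(4, mean = 1); y <- rnorm(4, mean = 0)   # true difference = 1
  c(diff = mean(x) - mean(y), p = t.test(x, y)$p.value)
})
mean(sims["p", ] < 0.05)                   # power: roughly 0.2
mean(sims["diff", sims["p", ] < 0.05])     # mean 'significant' effect: close to 2, not 1
```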
What should you do to prevent making a fool of yourself?
The simulated t test results, and some other subtleties, will be described in a paper, and/or in a future post. But I hope that enough has been said here to convince you that there are real problems in the sort of statistical tests that are universal in the literature.
The blame for the crisis in reproducibility has several sources.
One of them is the self-imposed publish-or-perish culture, which values quantity over quality, and which has done enormous harm to science.
The mis-assessment of individuals by silly bibliometric methods has contributed to this harm. Of all the proposed methods, altmetrics is demonstrably the most idiotic. Yet some vice-chancellors have failed to understand that.
Another is scientists’ own vanity, which leads to the PR department issuing disgracefully hyped up press releases.
In some cases, the abstract of a paper states that a discovery has been made when the data say the opposite. This sort of spin is common in the quack world. Yet referees and editors get taken in by the ruse (e.g. see this study of acupuncture).
The reluctance of many journals (and many authors) to publish negative results biases the whole literature in favour of positive results. This is so disastrous in clinical work that a pressure group has been started: alltrials.net “All Trials Registered | All Results Reported”.
Yet another problem is that it has become very hard to get grants without putting your name on publications to which you have made little contribution. This leads to exploitation of young scientists by older ones (who fail to set a good example). Peter Lawrence has set out the problems.
And, most pertinent to this post, a widespread failure to understand properly what a significance test means must contribute to the problem. Young scientists are under such intense pressure to publish that they have no time to learn about statistics.
Here are some things that can be done.
- Notice that all statistical tests of significance assume that the treatments have been allocated at random. This means that application of significance tests to observational data, e.g. epidemiological surveys of diet and health, is not valid. You can’t expect to get the right answer. The easiest way to understand this assumption is to think about randomisation tests (which should have replaced t tests decades ago, but which are still rare). There is a simple introduction in Lectures on Biostatistics (chapters 8 and 9). There are other assumptions too (about the distribution of observations and the independence of measurements), but randomisation is the most important.
- Never, ever, use the word “significant” in a paper. It is arbitrary, and, as we have seen, deeply misleading. Still less should you use “almost significant”, “tendency to significant” or any of the hundreds of similar circumlocutions listed by Matthew Hankins on his Still not Significant blog.
- If you do a significance test, just state the P value and give the effect size and confidence intervals (but be aware that this is just another way of expressing the P value approach: it tells you nothing whatsoever about the false discovery rate).
- Observation of a P value close to 0.05 means nothing more than ‘worth another look’. In practice, one’s attitude will depend on weighing the losses that ensue if you miss a real effect against the loss to your reputation if you claim falsely to have made a discovery.
- If you want to avoid making a fool of yourself most of the time, don’t regard anything bigger than P < 0.001 as a demonstration that you’ve discovered something. Or, slightly less stringently, use a three-sigma rule.
Despite the gigantic contributions that Ronald Fisher made to statistics, his work has been widely misinterpreted. We must, however reluctantly, concede that there is some truth in the comment made by an astute journalist:
“The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.” Robert Matthews, Sunday Telegraph, 13 September 1998.
There is now a video on YouTube that attempts to explain simply the essential ideas. The video has now been updated. The new version has better volume and it uses the term ‘false positive risk’, rather than the earlier term ‘false discovery rate’, to avoid confusion with the use of the latter term in the context of multiple comparisons.
The false positive risk: a proposal concerning what to do about p-values (version 2)
Follow-up
31 March 2014. I liked Stephen Senn’s first comment on twitter (the twitter stream is storified here). He said “I may have to write a paper ‘You may believe you are NOT a Bayesian but you’re wrong’”. I maintain that the analysis here is merely an exercise in conditional probabilities. It bears a formal similarity to a Bayesian argument, but is free of the more contentious parts of the Bayesian approach. This is amplified in a comment, below.
4 April 2014
I just noticed that my first boss, Heinz Otto Schild, in his 1942 paper about the statistical analysis of 2+2 dose biological assays (written while he was interned at the beginning of the war), chose to use 99% confidence limits, rather than the now universal 95% limits. The latter are more flattering to your results, but Schild was more concerned with precision than self-promotion.
This post is about why screening healthy people is generally a bad idea. It is the first in a series of posts on the hazards of statistics.
There is nothing new about it: Graeme Archer recently wrote a similar piece in his Telegraph blog. But the problems are consistently ignored by people who suggest screening tests, and by journals that promote their work. It seems that it can’t be said often enough.
The reason is that most screening tests give a large number of false positives. If your test comes out positive, your chance of actually having the disease is almost always quite small. False positive tests cause alarm, and they may do real harm, when they lead to unnecessary surgery or other treatments.
Tests for Alzheimer’s disease have been in the news a lot recently. They make a good example, if only because it’s hard to see what good comes of being told early on that you might get Alzheimer’s later when there are no good treatments that can help with that news. But worse still, the news you are given is usually wrong anyway.
Consider a recent paper that described a test for "mild cognitive impairment" (MCI), a condition that may be, but often isn’t, a precursor of Alzheimer’s disease. The 15-minute test was published in the Journal of Neuropsychiatry and Clinical Neurosciences by Scharre et al (2014). The test sounded pretty good. It had a specificity of 95% and a sensitivity of 80%.
Specificity (95%) means that 95% of people who are healthy will get the correct diagnosis: the test will be negative.
Sensitivity (80%) means that 80% of people who have MCI will get the correct diagnosis: the test will be positive.
To understand the implication of these numbers we need to know also the prevalence of MCI in the population that’s being tested. That was estimated as 1% overall, or 5% for people over 60. Now the calculation is easy. Suppose 10,000 people are tested. 1% (100 people) will have MCI, of which 80% (80 people) will be diagnosed correctly. And 9,900 do not have MCI, of which 95% will test negative (correctly). The numbers can be laid out in a tree diagram.
The total number of positive tests is 80 + 495 = 575, of which 495 are false positives. The fraction of tests that are false positives is 495/575 = 86%.
Thus there is only a 14% chance that, if you test positive, you actually have MCI. 86% of those who test positive will be alarmed unnecessarily.
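For anyone who wants to check the arithmetic, here it is as a few lines of R (a sketch using the numbers above):

```r
sens <- 0.80; spec <- 0.95; prev <- 0.01; N <- 10000
true.pos  <- N * prev * sens               # 80
false.pos <- N * (1 - prev) * (1 - spec)   # 495
false.pos / (true.pos + false.pos)         # 0.86: fraction of positive tests that are false alarms
```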
Even for people over 60, among whom 5% of the population have MCI, the test gives the wrong result (54%) more often than it gives the right result (46%).
The test is clearly worse than useless. That was not made clear by the authors, or by the journal. It was not even made clear by NHS Choices.
It should have been.
It’s easy to put the tree diagram in the form of an equation. Denote sensitivity as sens, specificity as spec and prevalence as prev.
The probability that a positive test means that you actually have the condition is given by
\[\frac{sens.prev}{sens.prev + \left(1-spec\right)\left(1-prev\right) }\; \]
In the example above, sens = 0.8, spec = 0.95 and prev = 0.01, so the fraction of positive tests that give the right result is
\[\frac{0.8 \times 0.01}{0.8 \times 0.01 + \left(1 - 0.95 \right)\left(1 - 0.01\right) }\; = 0.139 \]
So 13.9% of positive tests are right, and 86% are wrong, as found in the tree diagram.
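As an R function (a sketch), the equation also gives the numbers for the other tests discussed below:

```r
ppv <- function(sens, spec, prev) sens * prev / (sens * prev + (1 - spec) * (1 - prev))
ppv(0.80, 0.95, 0.01)   # 0.139, as in the tree diagram
ppv(0.90, 0.90, 0.01)   # 0.083: the lipid test, prevalence 1%
ppv(0.90, 0.90, 0.05)   # 0.32:  the lipid test, prevalence 5%
```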
The lipid test for Alzheimer’s
Another Alzheimer’s test has been in the headlines very recently. It performs even worse than the 15-minute test, but nobody seems to have noticed. It was published in Nature Medicine, by Mapstone et al. (2014). According to the paper, the sensitivity is 90% and the specificity is 90%, so, by constructing a tree, or by using the equation, the probability that you are ill, given that you test positive, is a mere 8% (for a prevalence of 1%). And even for over-60s (prevalence 5%), the value is only 32%, so two-thirds of positive tests are still wrong. Again this was not pointed out by the authors. Nor was it mentioned by Nature Medicine in its commentary on the paper. And once again, NHS Choices missed the point.
Why does there seem to be a conspiracy of silence about the deficiencies of screening tests? It has been explained very clearly by people like Margaret McCartney who understand the problems very well. Is it that people are incapable of doing the calculations? Surely not. Is it that it’s better for funding to pretend you’ve invented a good test, when you haven’t? Do journals know that anything to do with Alzheimer’s will get into the headlines, and don’t want to pour cold water on a good story?
Whatever the explanation, it’s bad science that can harm people.
Follow-up
March 12 2014. This post was quickly picked up by the ampp3d blog, run by the Daily Mirror. Conrad Quilty-Harper showed some nice animations under the heading How a “90% accurate” Alzheimer’s test can be wrong 92% of the time.
March 12 2014.
As so often, the journal promoted the paper in a way that wasn’t totally accurate. Hype is more important than accuracy, I guess.
June 12 2014.
The empirical evidence shows that “general health checks” (a euphemism for mass screening of the healthy) simply don’t help. See review by Gøtzsche, Jørgensen & Krogsbøll (2014) in BMJ. They conclude
“Doctors should not offer general health checks to their patients, and governments should abstain from introducing health check programmes, as the Danish minister of health did when she learnt about the results of the Cochrane review and the Inter99 trial. Current programmes, like the one in the United Kingdom, should be abandoned.”
8 July 2014
Yet another over-hyped screening test for Alzheimer’s in the media. And once again, the hype originated in the press release, from Kings College London this time. The press release says
"They identified a combination of 10 proteins capable of predicting whether individuals with MCI would develop Alzheimer’s disease within a year, with an accuracy of 87 percent"
The term “accuracy” is not defined in the press release. And it isn’t defined in the original paper either. I’ve written to the senior author, Simon Lovestone, to try to find out what it means. The original paper says
"Sixteen proteins correlated with disease severity and cognitive decline. Strongest associations were in the MCI group with a panel of 10 proteins predicting progression to AD (accuracy 87%, sensitivity 85% and specificity 88%)."
A simple calculation, as shown above, tells us that with sensitivity 85% and specificity 88%, the fraction of people with a positive test who are diagnosed correctly is 44%. So 56% of positive results are false alarms. These numbers assume that the prevalence of the condition in the population being tested is 10%, a higher value than assumed in other studies. If the prevalence were only 5% the results would be still worse: 73% of positive tests would be wrong. Either way, that’s not good enough to be useful as a diagnostic method.
In one of the other recent cases of Alzheimer’s tests, six months ago, NHS Choices fell into the same trap. They changed it a bit after I pointed out the problem in the comments. They seem to have learned their lesson because their post on this study was titled “Blood test for Alzheimer’s ‘no better than coin toss’ ”. That’s based on the 56% of false alarms mentioned above.
The reports on BBC News and other media totally missed the point. But, as so often, their misleading reports were based on a misleading press release. That means that the university, and ultimately the authors, are to blame.
I do hope that the hype has no connection with the fact that the Conflicts of Interest section of the paper says
"SL has patents filed jointly with Proteome Sciences plc related to these findings"
What it doesn’t mention is that, according to Google patents, Kings College London is also a patent holder, and so has a vested interest in promoting the product.
Is it really too much to expect that hard-pressed journalists might do a simple calculation, or phone someone who can do it for them? Until that happens, misleading reports will persist.
9 July 2014
It was disappointing to see that the usually excellent Sarah Boseley in the Guardian didn’t spot the problem either. And still more worrying that she quotes Dr James Pickett, head of research at the Alzheimer’s Society, as saying
These 10 proteins can predict conversion to dementia with less than 90% accuracy, meaning one in 10 people would get an incorrect result.
That number is quite wrong. It isn’t 1 in 10, it’s rather more than 1 in 2.
A resolution
After corresponding with the author, I now see what is going on more clearly.
The word "accuracy" was not defined in the paper, but was used in the press release and widely cited in the media. What it means is the ratio of the total number of true results (true positives + true negatives) to the total number of all results. This doesn’t seem to me to be useful number to give at all, because it conflates false negatives and false positives into a single number. If a condition is rare, the number of true negatives will be large (as shown above), but this does not make it a good test. What matters most to patients is not accuracy, defined in this way, but the false discovery rate.
The author makes it clear that the results are not intended to be a screening test for Alzheimer’s. It’s obvious from what’s been said that it would be a lousy test. Rather, the paper was intended to identify patients who would eventually (well, within only 18 months) get dementia. The denominator (always the key to statistical problems) in this case is the highly atypical population of patients who come to memory clinics in trials centres (the potential trials population). The prevalence in this very restricted population may indeed be higher than the 10 percent that I used above.
Reading between the lines of the press release, you might have been able to infer some of this (though not the meaning of “accuracy”). The fact that the media almost universally wrote up the story as a “breakthrough” in Alzheimer’s detection is a consequence of the press release and of not reading the original paper.
I wonder whether it is proper for press releases to be issued at all for papers like this, which address a narrow technical question (selection of patients for trials). That is not a topic of great public interest. It’s asking for misinterpretation and that’s what it got.
I don’t suppose that it escaped the attention of the PR people at Kings that anything that refers to dementia is front page news, whether it’s of public interest or not. When we had an article in Nature in 2008, I remember long discussions about a press release with the arts graduate who wrote it (not at our request). In the end we decided that the topic was not of sufficient public interest to merit a press release and insisted that none was issued. Perhaps that’s what should have happened in this case too.
This discussion has certainly illustrated the value of post-publication peer review. See, especially, the perceptive comments, below, from Humphrey Rang and from Dr Aston and from Dr Kline.
14 July 2014. Sense about Science asked me to write a guest blog to explain more fully the meaning of "accuracy", as used in the paper and press release. It’s appeared on their site and will be reposted on this blog soon.
Last year, I was sent my answer paper for one of my final exams, taken in 1959. This has triggered a bout of shamelessly autobiographical nostalgia.
The answer sheets that I wrote had been kept by one of my teachers at Leeds, Dr George Mogey. After he died in 2003, aged 86, his widow, Audrey, found them and sent them to me. And after a hunt through the junk piled high in my office, I found the exam papers from that year too. George Mogey was an excellent teacher and a kind man. He gave most of the lectures to medical students, which we, as pharmacy/pharmacology students, attended. His lectures were inspirational.
Today, 56 years on, I can still recall vividly his lecture on anti-malarial drugs. At the end he paused dramatically and said “Since I started speaking, 100 people have died from malaria” (I don’t recall the exact number). He was the perfect antidote to people who say you learn nothing from lectures. Straight after the war (when he had seen the problem of malaria at first hand) he went to work at the Wellcome Research Labs in Beckenham, Kent. The first head of the Wellcome Lab was Henry Dale. It had a distinguished record of basic research as well as playing a crucial role in vaccine production and in development of the safe use of digitalis. In the 1930s it had an important role in the development of proper methods for biological standardisation. This was crucial for ensuring that, for example, each batch of tincture of digitalis had the same potency (it has been described previously on this blog in Plants as Medicines).
When George Mogey joined the Wellcome lab, its head was J.W. Trevan (1887 – 1956) (read his Biographical Memoir, written by J.H. Gaddum). Trevan’s most memorable contributions were in improving the statistics of biological assays. The ideas of individual effective dose and median effective dose were developed by him. His 1927 paper The Error of Determination of Toxicity is a classic of pharmacology. His advocacy of the well-defined quantity, median effective dose, as a replacement for the ill-defined minimum effective dose was influential in the development of proper statistical analysis of biological assays in the 1930s.
Trevan is something of a hero to me. And he was said to be very forgetful. Gaddum, in his biographical memoir, recounts this story
“One day when he had lost something and suspected that it had been tidied away by his secretary, he went round muttering ‘It’s all due to this confounded tidiness. It always leads to trouble. I won’t have it in my lab.’ “
Trevan coined the abbreviation LD50 for the median lethal dose of a drug. George Mogey later acquired the car number plate LD50, in honour of Trevan, and his widow, Audrey, still has it (picture on right).
Mogey wrote several papers with Trevan. In 1948 he presented one at a meeting of the Physiological Society. The programme also included A.V. Hill, E.J. Denton, Bernhard [sic] Katz, J.Z. Young and Richard Keynes (Keynes was George Henry Lewes Student at Cambridge: Lewes was the Victorian polymath with whom the novelist George Eliot lived, openly unmarried, and a founder of the Physiological Society. He probably inspired the medical content of Eliot’s best known novel, Middlemarch).
Mogey may not have written many papers, but he was the sort of inspiring teacher that universities need. He had a letter in Nature on Constituents of Amanita Muscaria, the fly agaric toadstool, which appeared in 1965. That might explain why we went on a toadstool-hunting field trip.
The tradition of interest in statistics and biological assay must have rubbed off on me, because the answers I gave in the exam were very much in that tradition. Here is a snippet (click to download the whole answer sheet).
A later answer was about probit analysis, an idea introduced by the statistician Chester Bliss (1899–1979) in 1934, as a direct extension of Trevan’s work. (I met Bliss in 1970 or 1971 when I was at Yale – we had dinner, went to a theatre, and then went back to his apartment, where he insisted on showing me his collection of erotic magazines!)
This paper was a pharmacology paper in my first final exam at the end of my third year. The external examiner was Walter Perry, head of pharmacology in Edinburgh (he went on to found the Open University). He had previously been head of Biological Standards at the National Institute for Medical Research, a job in which he had to know some statistics. In the oral exam he asked me a killer question: “What is the difference between confidence limits and fiducial limits?”. I had no real idea (and, as I discovered later, neither did he). After that, I went on to do the 4th year where we specialised in pharmacology, and I spent quite a lot of time trying to answer that question. The result was my first ever paper, published in the University of Leeds Medical Journal. I hinted, obliquely, that the idea of fiducial inference was probably Ronald Fisher’s only real mistake. I think that is the general view now, but Fisher was such a towering figure in statistics that nobody said that straight out (he was still alive when this was written – he died in 1962).
It is well worth looking at a paper that Fisher gave to the Royal Statistical Society in 1935, The Logic of Inductive Inference. Then, as now, it was the custom for a paper to be followed by a vote of thanks, and a seconder. These, and the subsequent discussion, are all printed, and they could be quite vicious in a polite way. Giving the vote of thanks, Professor A.L. Bowley said
“It is not the custom, when the Council invites a member to propose a vote of thanks on a paper, to instruct him to bless it. If to some extent I play the inverse role of Balaam, it is not without precedent;”
And the seconder, Dr Isserlis, said
“There is no doubt in my mind at all about that, but Professor Fisher, like other fond parents, may perhaps see in his offspring qualities which to his mind no other children possess; others, however, may consider that the offspring are not unique.”
Post-publication peer review was already alive and well in 1935.
I was helped enormously in writing this paper by Dr B.L.Welch (1911 – 1989), whose first year course in statistics for biologists was a compulsory part of the course. Welch was famous particularly for having extended Student’s t distribution to the case where the variances in two samples being compared are unequal (Welch, 1947). He gave his whole lecture with his back to the class while writing what he said on a set of blackboards that occupied the whole side of the room. No doubt he would have failed any course about how to give a lecture. I found him riveting. He went slowly, and you could always check your notes because it was all there on the blackboards.
Walter Perry seemed to like my attempt to answer his question, despite the fact that it failed. After the 4th year final (a single 3 hour essay on drugs that affect protein synthesis) he offered me a PhD place in Edinburgh. He was one of my supervisors, though I never saw him except when he dropped into the lab for a cigarette between committee meetings. While in Edinburgh I met the famous statistician David Finney, whose definitive book on the Statistics of Biological Assay was an enormous help when I later wrote Lectures on Biostatistics and a great help in getting my first job at UCL in 1964. Heinz Otto Schild, then the famous head of department, had written a paper in 1942 about the statistical analysis of 2+2 dose biological assays, while interned at the beginning of the war. He wanted someone to teach it to students, so he gave me a job. That wouldn’t happen now, because that sort of statistics would be considered too difficult. Incidentally, I notice that Schild uses 99% confidence limits in his paper, not the usual 95% limits which make your results look better.
It was clear even then, that the basis of statistical inference was an exceedingly contentious matter among statisticians. It still is, but the matter has renewed importance in view of the crisis of reproducibility in science. The question still fascinates me, and I’m planning to update my first paper soon. This time I hope it will be a bit better.
Postscript: some old pictures
While in nostalgic mood, here are a few old pictures. First, the only picture I have from undergraduate days. It was taken on a visit to May and Baker (of sulphonamide fame) in February 1957 (so I must have been in my first year). There were 15 or so in the class for the first three years (now, you can get 15 in a tutorial group). I’m in the middle of the back row (with hair!). The only names that I recall are those of the other two who went into the 4th year with me, Ed Abbs (rightmost on back row) and Stella Gregory (2nd from right, front row). Ed died young and Stella went to Australia. Just in front of me are James Dare (with bow tie) and Mr Nelson (who taught old fashioned pharmacognosy).
James Dare taught pharmaceutics, but he also had a considerable interest in statistics and we did lots of calculations with electromechanical calculators – the best of them was a Monroe (here’s a picture of one with the case removed to show the amazingly intricate mechanism).
Monroe 8N-213, from http://www.science.uva.nl/museum/calclist.php
The history of UCL’s pharmacology goes back to 1905. For most of that time, it’s been a pretty good department. It got top scores in all the research assessments until it was abolished by Malcolm Grant in 2007. That act of vandalism is documented in my diary section.
For most of its history, there was one professor who was head of the department. That tradition ended in 1983, when Humphrey Rang left for Novartis. The established chair was then empty for two years, until Donald Jenkinson, then head of department, insisted, with characteristic modesty, that I rather than he should take the chair. Some time during the subsequent reign of David Brown, it was decided to name the chairs, and mine became the A.J. Clark chair. It was decided that the headship of the department would rotate, between Donald, David Brown and me. But when it came to my turn, I decided I was much too interested in single ion channels to spend time pushing paper, and David Brown nobly extended his term. The A.J. Clark chair was vacant after I ‘retired’ in 2004, but in 2014, Lucia Sivilotti was appointed to the chair, a worthy successor in its quantitative tradition.
The first group picture of UCL’s Pharmacology department was from 1972. Heinz Schild is in the middle of the front row, with Desmond Laurence on his left. Between them they dominated the textbook market: Schild edited A.J. Clark’s Pharmacology (now known as Rang and Dale). Laurence wrote a very successful text, Clinical Pharmacology. Click on the picture for a bigger version, with names, as recalled by Donald Jenkinson (DHJ). I doubt whether many people now remember Ada Corbett (the tea lady) or Frank Ballhatchet from the mechanical workshop. He could do superb work, though the price was to spend 10 minutes chatting about his Land Rover, or listening to reminiscences of his time working on Thames barges. I still have a beautiful 8-way tap that he made, with a jerk-free indexing mechanism.
The second Departmental picture was taken in June 1980. Humphrey Rang was head of department then. My colleagues David Ogden and Steven Siegelbaum are there. In those days we had a tea lady too, Joyce Mancini. (Click pictures to enlarge)
Follow-up
After the announcement that the University of Central Lancashire (Uclan) was suspending its homeopathy “BSc” course, it seems that their vice chancellor has listened to the pressure, both internal and external, to stop bringing his university into disrepute.
An internal review of all their courses in alternative medicine was announced shortly after the course closure. Congratulations to Malcolm McVicar for grasping the nettle at last. Let’s hope other universities follow his example soon.
I have acquired, indirectly, a copy of the announcement of the welcome news.
Homeopathy, Herbalism and Acupuncture
Concern has been expressed by some colleagues as to whether the University should offer courses in homeopathy, Herbalism and Acupuncture. Therefore, to facilitate proper discussion on this matter I have set up a working party to review the issues. I have asked Eileen Martin, Pro Vice-Chancellor and Dean of the Faculty of Health, to lead this working party and report to me as soon as possible. Whilst the review is taking place, we need to recognise that there are students and staff studying and teaching on these courses which have satisfied the University’s quality assurance procedures and been duly validated. I would therefore ask that colleagues would refrain from comment or speculation which would cause concern to these students and staff. Staff who wish to express their views on this issue should direct these to Eileen Martin, by the end of September.
Regards
Malcolm McVicar
Vice-Chancellor
Times Higher Education today reports
“The University of Central Lancashire is to review all its courses in homoeopathy, herbalism and acupuncture after some staff said it should not be offering degrees in “quackery”, Times Higher Education has learnt.
A university spokesman said: “As a university we value and practise transparency and tolerance and welcome all academic viewpoints.”
(Later, an almost identical version of the story ran on the Times Online.)
So far, so good. But of course the outcome of a committee depends entirely on who is appointed to it. Quite often such committees do no more than provide an internal whitewash.
It does seem a bit odd to appoint as chair the dean of the faculty where all these courses are run, and presumably generate income. Eileen Martin has often appeared to be proud of them in the past. Furthermore, the whole investigation will (or should) turn on the assessment of evidence. It needs some knowledge of the design of clinical trials and their statistical analysis. As far as I can see, Ms Martin has essentially no research publications whatsoever.
I also worry a bit about “satisfied the University’s quality assurance procedures and been duly validated”. One point of the investigation should be to recognise frankly that the validation process is entirely circular, and consequently worth next to nothing. It must be hard for a vice-chancellor to admit that, but it will be an essential step in restoring confidence in Uclan.
Let’s not prejudge though. If there are enough good scientists on the committee, the result will be good.
I hope that transparency extends to letting us know who will be doing the judging. Everything depends on that.
Follow-up
Well well, there’s a coincidence. Once again, the week after there is an announcement about degrees in witchcraft, what should pop up again in the column of the inimitable Laurie Taylor in THE? The University of Poppleton’s own Department of Palmistry.
Letter to the editor
Dear Sir
I was shocked to see yet another scurrilous attack upon the work of my department in The Poppletonian. Although Palmistry is in its early days as an academic discipline it cannot hope to progress while there are people like your correspondent who insist on referring to it as “a load of superstitious nonsense which doesn’t deserve a place on the end of the pier let alone in a university”. A large number of people claim to have derived considerable benefit from learning about life lines, head lines and heart lines and the role of the six major mounts in predicting their future. All of us in the Palmistry Department believe it vitally important that these claims are rigorously examined. How else can science advance?
Yours sincerely,
The article below is an editorial that I was asked to write for the New Zealand Medical Journal, as a comment on article in today’s edition about the misuse of the title ‘doctor’ by chiropractors [download pdf]. Titles are not the only form of deception used by chiropractors, so the article looks at some of the others too. For a good collection of articles that reveal chiropractic for what it is, look at Chirobase
THE NEW ZEALAND
MEDICAL JOURNAL
Journal of the New Zealand Medical Association
NZMJ 25 July 2008, Vol 121 No 1278; ISSN 1175 8716
URL: http://www.nzma.org.nz/journal/121-1278/3158/ ©NZMA
Doctor Who?
Inappropriate use of titles by some alternative “medicine” practitioners
David Colquhoun
Who should use the title ‘doctor’? The title is widely abused as shown by Gilbey1 in this issue of the NZMJ in an article entitled Use of inappropriate titles by New Zealand practitioners of acupuncture, chiropractic, and osteopathy. Meanwhile, Evans and colleagues 2, also in this issue, discuss usage and attitudes to alternative treatments.
Gilbey finds that the abuse of the title doctor is widespread and that chiropractors are the main culprits. An amazing 82% of 146 chiropractors used the title Doctor, and most of them used the title to imply falsely that they were registered medical practitioners.
Although it is illegal in New Zealand to do that, it seems clear that the law is not being enforced and it is widely flouted. This is perhaps not surprising given the history of chiropractic. It has had a strong element of ruthless salesmanship since it was started in Davenport, Iowa by D.D. Palmer (1845–1913). His son, B.J. Palmer, said that their chiropractic school was founded on “a business, not a professional basis. We manufacture chiropractors. We teach them the idea and then we show them how to sell” (Shapiro 2008)3. It is the same now. You can buy advice on how to build “high-volume, subluxation-based, cash-driven, lifetime family wellness practices”.
In her recent book3, Rose Shapiro comments on the founder of chiropractic as follows.
“By the 1890s Palmer had established a magnetic healing practice in Davenport, Iowa, and was styling himself “doctor”. Not everyone was convinced, as a piece about him in an 1894 edition of the local paper, the Davenport Leader, shows.
A crank on magnetism has a crazy notion that he can cure the sick and crippled with his magnetic hands. His victims are the weak-minded, ignorant and superstitious, those foolish people who have been sick for years and have become tired of the regular physician and want health by the short-cut method. He has certainly profited by the ignorance of his victim. His increase in business shows what can be done in Davenport, even by a quack.”
D.D. Palmer was a curious mixture: grocer, spiritual healer, magnetic therapist, fairground huckster, religious cult leader—and above all, a salesman. He finally found a way to get rich by removing entirely imaginary “subluxations”.
Over 100 years later, it seems that the “weak-minded, ignorant, and superstitious” include the UK’s Department of Health, who have given chiropractors a similar status to the General Medical Council.
The intellectual standards of a 19th Century Mid-Western provincial newspaper journalist are rather better than the intellectual standards of the UK’s Department of Health, and of several university vice-chancellors in 2007.
Do the treatments work?
Neither Gilbey nor Evans et al. really grasp the nettle of judging efficacy. The first thing one wants to know about any treatment —alternative or otherwise — is whether it works. Until that is decided, all talk of qualifications, regulation, and so on is just vacuous bureaucratese. No policy can be framed sensibly until the question of efficacy has been addressed honestly.
It is one good effect of the upsurge of interest in alternative treatments that there are now quite a lot of good trials of the most popular forms of treatments (as well as many more bad trials). Some good summaries of the results are now available too. Cochrane reviews set the standard for good assessment of evidence. New Zealand’s Ministry of Health commissioned the Complementary and Alternative Medicine
website to assess the evidence, and that seems to have done a good job too. Their assessment of chiropractic treatment of low back pain is as follows:
There appears to be some evidence from one systematic review and four other studies, although not conclusive, that chiropractic treatment is as effective as other therapies but this may be due to chance. There is very little evidence that chiropractic is more effective than other therapies.
And two excellent summaries have been published as books this year. Both are by people who have had direct experience of alternative treatments, but who have no financial interest in the outcome of their assessment of evidence. The book by Singh and Ernst4 summarises the evidence on all the major alternative treatments, and the book by Bausell5 concentrates particularly on acupuncture, because the author was for 5 years involved in research in that area. Both of these books come to much the same conclusion about chiropractic. It is now really very well-established that chiropractic is (at best) no more effective than conventional treatment. But it has the disadvantage of being surrounded by gobbledygook about “subluxations” and, more importantly, it kills the occasional patient.
Long (2004)7 said “the public should be informed that chiropractic manipulation is the number one reason for people suffering stroke under the age of 45.”
The chiropractors of Alberta (Canada) and the Alberta Government are now facing a class-action lawsuit8. The lead plaintiff is Sandra Nette. Formerly she was a fit 41-year-old. Now she is tetraplegic. Immediately after neck manipulation by a chiropractor she had a massive stroke as a result of a torn vertebral artery.
Acupuncture comes out of the assessments equally badly. Bausell (2007) concludes that it is no more than a theatrical placebo.
Are the qualifications even real?
It is a curious aspect of the alternative medicine industry that its practitioners are often keen to reject conventional science, yet long for academic respectability. One aspect of this is claiming academic titles on the flimsiest of grounds. You can still be held to have misled the public into thinking you are a medical practitioner, even if you have a real doctorate. But it often pays to look into where the qualifications come from.
A celebrated case in the UK concerned the ‘lifestyle nutritionist’, TV celebrity and multi-millionaire, Dr Gillian McKeith, PhD. A reader of Ben Goldacre’s excellent blog, badscience.net did a little investigation. The results appeared in Goldacre’s Bad Science column in the Guardian9.
She claimed that her PhD came from the American College of Nutrition, but it turned out to come from a correspondence course from a non-accredited US ‘college’. McKeith also boasted of having “professional membership” of the American Association of Nutritional Consultants, for which she provided proof of her degree and three professional references.
The value of this qualification can be judged by the fact that Goldacre sent an application and $60 and as a result “My dead cat Hettie is also a “certified professional member” of the AANC. I have the certificate hanging in my loo”.
Is the solution government regulation?
In New Zealand the law about misleading the public into believing you are a medical practitioner already exists. The immediate problem would be solved if that law were taken seriously, but it seems that it is not.
It is common in both the UK and in New Zealand to suggest that some sort of official government regulation is the answer. That solution is proposed in this issue of NZMJ by Evans et al2. A similar thing has been proposed recently in the UK by a committee headed by Michael Pittilo, vice-chancellor of Robert Gordon’s University, Aberdeen.
I have written about the latter under the heading A very bad report. The Pittilo report recommends both government regulation and more degrees in alternative medicine. Given that we now know that most alternative medicine doesn’t work, the idea of giving degrees in such subjects must be quite ludicrous to any thinking person.
The magazine Nature6 recently investigated the 16 UK universities that run such degrees. In the UK, first-year students at the University of Westminster are taught that “amethysts emit high yin energy”. Their vice-chancellor, Professor Geoffrey Petts, describes himself as a geomorphologist, but he cannot be tempted to express an opinion about the curative power of amethysts.
There has been a tendency to a form of grade inflation in universities—higher degrees for less work gets bums on seats. For most of us, getting a doctorate involves at least 3 years of hard experimental research in a university. But in the USA and Canada you can get a ‘doctor of chiropractic’ degree and most chiropractic (mis)education is not even in a university but in separate colleges.
Florida State University famously turned down a large donation to start a chiropractic school because they saw, quite rightly, that to do so would damage their intellectual reputation. This map, now widely distributed on the Internet, was produced by one of their chemistry professors, and it did the trick.
Other universities have been less principled. The New Zealand College of Chiropractic [whose President styles himself “Dr Brian Kelly”, though his only qualification is B. App Sci (chiro)] is accredited by the New Zealand Qualifications Authority (NZQA). Presumably they, like their UK equivalent (the QAA), are not allowed to take into account whether what is being taught is nonsense or not. Nonsense courses are accredited by experts in nonsense. That is why much accreditation is not worth the paper it’s written on.
Of course the public needs some protection from dangerous or fraudulent practices, but that can be done better (and more cheaply) by simply enforcing existing legislation on unfair trade practices, and on false advertising. Recent changes in the law on unfair trading in the UK have made it easier to take legal action against people who make health claims that cannot be justified by evidence, and that seems the best
way to regulate medical charlatans.
Conclusion
For most forms of alternative medicine—including chiropractic and acupuncture—the evidence is now in. There is now better reason than ever before to believe that they are mostly elaborate placebos and, at best, no better than conventional treatments. It is about time that universities and governments recognised the evidence and stopped talking about regulation and accreditation.
Indeed, “falsely claiming that a product is able to cure illnesses, dysfunction, or malformations” is illegal in Europe10.
Making unjustified health claims is a particularly cruel form of unfair trading practice. It calls for prosecutions, not accreditation.
Competing interests: None.
Author information: David Colquhoun, Research Fellow, Dept of Pharmacology, University College London, United Kingdom (http://www.ucl.ac.uk/Pharmacology/dc.html)
Correspondence: Professor D Colquhoun, Dept of Pharmacology, University College London, Gower Street, London WC1E 6BT, United Kingdom. Fax: +44(0)20 76797298; email: d.colquhoun@ucl.ac.uk
References:
1. Gilbey A. Use of inappropriate titles by New Zealand practitioners of acupuncture, chiropractic, and osteopathy. N Z Med J. 2008;121(1278). [pdf]
2. Evans A, Duncan B, McHugh P, et al. Inpatients’ use, understanding, and attitudes towards traditional, complementary and alternative therapies at a provincial New Zealand hospital. N Z Med J. 2008;121(1278).
3. Shapiro R. Suckers: How Alternative Medicine Makes Fools of Us All. London: Random House; 2008. (reviewed here)
4. Singh S, Ernst E. Trick or Treatment. Bantam Press; 2008. (reviewed here)
5. Bausell RB. Snake Oil Science: The Truth about Complementary and Alternative Medicine. Oxford University Press; 2007. (reviewed here)
6. Colquhoun D. Science degrees without the Science, Nature 2007;446:373–4. See also here.
7. Long PH. Stroke and spinal manipulation. J Quality Health Care. 2004;3:8–10.
8. Libin K. Chiropractors called to court. National Post (Canada); June 21, 2008.
9. Goldacre B. A menace to science. London: Guardian; February 12, 2007.
10. Department for Business Enterprise & Regulatory Reform (BERR). Consumer Protection from Unfair Trading Regulations 2008. UK: Office of Fair Trading.
This is a fuller version, with links, of the comment piece published in Times Higher Education on 10 April 2008. Download newspaper version here.
If you still have any doubt about the problems of directed research, look at the trenchant editorial in Nature (3 April 2008). Look also at the editorial in Science by Bruce Alberts. The UK’s establishment is busy pushing an agenda that is already fading in the USA.
Since this went to press, more sense about “Brain Gym” has appeared. First Jeremy Paxman had a good go on Newsnight. Skeptobot has posted links to the videos of the broadcast, which have now appeared on YouTube.
Then, in the Education Guardian, Charlie Brooker started his article about “Brain Gym” thus
Dr Aust’s cogent comments are at “Brain Gym” loses its trousers.
The Times Higher’s subeditor removed my snappy title and substituted this.
So here it is.
“HR is like many parts of modern businesses: a simple expense, and a burden on the backs of the productive workers”, “They don’t sell or produce: they consume. They are the amorphous support services”.
So wrote Luke Johnson recently in the Financial Times. He went on, “Training advisers are employed to distract everyone from doing their job with pointless courses”. Luke Johnson is no woolly-minded professor. He is in the Times’ Power 100 list; he organised the acquisition of PizzaExpress before he turned 30, and he now runs Channel 4 TV.
Why is it that Human Resources (you know, the folks we used to call Personnel) have acquired such a bad public image? It is not only in universities that this has happened. It seems to be universal, and worldwide. Well here are a few reasons.
Like most groups of people, HR is intent on expanding its power and status. That is precisely why they changed their name from Personnel to HR. As Personnel Managers they were seen as a service, and even, heaven forbid, on the side of the employees. As Human Resources they become part of the senior management team, and see themselves not as providing a service, but as managing people. My concern is the effect that change is having on science, but it seems that the effects on pizza sales are not greatly different.
The problem with having HR people (or lawyers, or any other non-scientists) managing science is simple. They have no idea how it works. They seem to think that every activity can be run as though it was Wal-Mart. That idea is old-fashioned even in management circles. Good employers have hit on the bright idea that people work best when they are not constantly harassed and when they feel that they are assessed fairly. If the best people don’t feel that, they just leave at the first opportunity. That is why the culture of managerialism and audit, though rampant, will do harm in the end to any university that embraces it.
As it happens, there was a good example this week of the damage that can be inflicted on intellectual standards by the HR mentality. As a research assistant, I was sent the Human Resources Division Staff Development and Training booklet. Some of the courses they run are quite reasonable. Others amount to little more than the promotion of quackery. Here are three examples. We are offered courses in “Self-hypnosis”, in “Innovations for Researchers” and in “Communication and Learning: Recent Theories and Methodologies”. What’s wrong with them?
“Self-hypnosis” seems to be nothing more than a pretentious word for relaxation. The person who is teaching researchers to innovate left science straight after his PhD and then did courses in “neurolinguistic programming” and life-coaching (the Carole Caplin of academia perhaps?). How that qualifies him to teach scientists to be innovative in research may not be obvious.
The third course teaches, among other things, the “core principles” of neurolinguistic programming, the Sedona method (“Your key to lasting happiness, success, peace and well-being”), and, wait for it, Brain Gym. This booklet arrived within a day or two of Ben Goldacre’s spectacular demolition of Brain Gym: “Nonsense dressed up as neuroscience”.
“Brain Gym is a set of perfectly good fun exercise break ideas for kids, which costs a packet and comes attached to a bizarre and entirely bogus pseudoscientific explanatory framework”
“This ridiculousness comes at very great cost, paid for by you, the taxpayer, in thousands of state schools. It is peddled directly to your children by their credulous and apparently moronic teachers”
And now, it seems, peddled to your researchers by your credulous and
moronic HR department.
Neurolinguistic programming is an equally discredited form of psycho-babble, the dubious status of which was highlighted in Beyerstein’s 1995 review from Simon Fraser University.
“ Pop-psychology. The human potential movement and the fringe areas of psychotherapy also harbor a number of other scientifically questionable panaceas. Among these are Scientology, Neurolinguistic Programming, Re-birthing and Primal Scream Therapy which have never provided a scientifically acceptable rationale or evidence to support their therapeutic claims.”
The intellectual standards for many of the training courses that are inflicted on young researchers seem to be roughly on a par with the self-help pages of a downmarket women’s magazine. It is the Norman Vincent Peale approach to education. Uhuh, sorry, not education, but training. Michael O’Donnell defined Education as “Elitist activity. Cost ineffective. Unpopular with Grey Suits. Now largely replaced by Training.”
In the UK most good universities have stayed fairly free of quackery (the exceptions being the sixteen post-1992 universities that give BSc degrees in things like homeopathy). But now it is creeping in through the back door of credulous HR departments. Admittedly UCL Hospitals Trust recently advertised for spiritual healers, but that is the NHS, not a university. The job specification form for spiritual healers was, it’s true, a pretty good example of the HR box-ticking mentality. You are in as long as you can tick the box to say that you have a “Full National Federation of Spiritual Healer certificate, or a full Reiki Master qualification, and two years post certificate experience”. To the HR mentality, it doesn’t matter a damn if you have a certificate in balderdash, as long as you have the piece of paper. How would they know the difference?
A lot of the pressure for this sort of nonsense comes, sadly, from a government that is obsessed with measuring the unmeasurable. Again, real management people have already worked this out. The management editor of the Guardian said
“What happens when bad measures drive out good is strikingly described in an article in the current Economic Journal. Investigating the effects of competition in the NHS, Carol Propper and her colleagues made an extraordinary discovery. Under competition, hospitals improved their patient waiting times. At the same time, the death-rate for emergency heart-attack admissions substantially increased.”
Two new government initiatives provide beautiful examples of the HR mentality in action. They are Skills for Health, and the recently created Complementary and Natural Healthcare Council (already dubbed OfQuack).
The purpose of the Natural Healthcare Council seems to be to implement a box-ticking exercise that will have the effect of giving a government stamp of approval to treatments that don’t work. Polly Toynbee summed it up when she wrote about “Quackery and superstition – available soon on the NHS”. The advertisement for its CEO has already appeared. It says that the main function of the new body will be to enhance public protection and confidence in the use of complementary therapists. Shouldn’t it be decreasing confidence in quacks, not increasing it? But, disgracefully, they will pay no attention at all to whether the treatments work. And the advertisement refers you to
the Prince of Wales’ Foundation for Integrated Health for more information (hang on, aren’t we supposed to have a constitutional monarchy?).
Skills for Health, or rather that unofficial branch of government, the Prince of Wales’ Foundation, had been busy making ‘competences’ for distant healing, with a helpful bulleted list.
“This workforce competence is applicable to:
- healing in the presence of the client
- distant healing in contact with the client
- distant healing not in contact with the client”
And they have done the same for homeopathy and its kindred delusions. The one thing they never consider is whether they are writing ‘competences’ in talking gobbledygook. When I phoned them to try to find out who was writing this stuff (they wouldn’t say), I made a passing joke about writing competences in talking to trees. The answer came back, in all seriousness,
“You’d have to talk to LANTRA, the land-based organisation for that”,
“LANTRA which is the sector council for the land-based industries uh, sector, not with us sorry . . . areas such as horticulture etc.”.
Anyone for competences in sense of humour studies?
The “unrepentant capitalist” Luke Johnson, in the FT, said
“I have radically downsized HR in several companies I have run, and business has gone all the better for it.”
Now there’s a thought.
The follow-up
The provost’s newsletter for 24th June 2008 could just be a delayed reaction to this piece. For no obvious reason, it starts thus.
“(1) what’s management about?
Human resources often gets a bad name in universities, because as academics we seem to sense instinctively that management isn’t for us. We are autonomous lone scholars who work hours well beyond those expected, inspired more by intellectual curiosity than by objectives and targets. Yet a world-class institution like UCL obviously requires high quality management, a theme that I reflect on whenever I chair the Human Resources Policy Committee, or speak at one of the regular meetings to welcome new staff to UCL. The competition is tough, and resources are scarce, so they need to be efficiently used. The drive for better management isn’t simply a preoccupation of some distant UCL bureaucracy, but an important responsibility for all of us. UCL is a single institution, not a series of fiefdoms; each of us contributes to the academic mission and good management permeates everything we do. I despair at times when quite unnecessary functional breakdowns are brought to my attention, sometimes even leading to proceedings in the Employment Tribunal, when it is clear that early and professional management could have stopped the rot from setting in years before. UCL has long been a leader in providing all newly appointed heads of department with special training in management, and the results have been impressive. There is, to say the least, a close correlation between high performing departments and the quality of their academic leadership. At its best, the ethos of UCL lies in working hard but also in working smart; in understanding that UCL is a world-class institution and not the place for a comfortable existence free from stretch and challenge; yet also a good place for highly-motivated people who are also smart about getting the work-life balance right.”
I don’t know quite what to make of this. Is it really a defence of the Brain Gym mentality?
Of course everyone wants good management. That’s obvious, and we really don’t need a condescending lecture about it. The interesting question is whether we are getting it.
There is nothing one can really object to in this lecture, apart from the stunning post hoc ergo propter hoc fallacy implicit in “UCL has long been a leader in providing all newly appointed heads of department with special training in management, and the results have been impressive.” That’s worthy of a nutritional therapist.
Before I started writing this response at 08.25 I had already got an email from a talented and hard-working senior postdoc. “Let’s start our beautiful working day with this charging thought of the week:”.
He was obviously rather insulted at the suggestion that it was necessary to lecture academics with words like “not the place for a comfortable existence free from stretch and challenge; yet also a good place for highly-motivated people who are also smart about getting the work-life balance right”. I suppose nobody had thought of that until HR wrote it down in a “competence”?
To provoke this sort of reaction in our most talented young scientists could, arguably, be regarded as unfortunate.
I don’t blame the postdoc for feeling a bit insulted by this little homily.
So do I.
Now back to science.
“the report is more hypothesis-generating for future research than a rigorous scientific study. Find us some money and we will do a proper job. You can quote me for that.” Professor David Smith (Oxford). Scientific adviser for Food for the Brain.
For a quick synopsis, look at Holfordmyths.org.
Patrick Holford and Drew Fobbester are joint researchers and authors of the Food for the Brain Child Survey, September 2007 (pdf). Holfordwatch has made a very thorough study of this report, in eight parts (so far). They conclude
“HolfordWatch can not share the optimism for these claimed benefits and finds that there is insufficient data to support them in a robust manner.”
There are many detailed questions, but the basic problem with the report is very simple. The fact that it is (a) self-selected and (b) not randomised makes it just another naive observational study. The stunningly obvious confounder in this case is, as so often, the socio-economic background of the kids. That was not even assessed, never mind any attempt being made to allow for it.
This isn’t just pedantry because what matters is causality. It is worth very little to know that eating vegetables is correlated with high SAT score if the correlation is a result of having well-off parents. If that were the reason, then forcing kids with poor parents to eat vegetables would make no difference to their SAT score because their parents would still be poor. The only conclusion of the study seems to be that we should eat more fruit and vegetables, something that we are already lectured about in every waking moment.
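Here is a toy simulation of that point (my own illustration, with made-up numbers; nothing to do with the actual survey data). Vegetable intake and test score are both driven by a simulated socio-economic variable, the score is not affected by diet at all, and yet the raw correlation between diet and score looks impressive until the confounder is adjusted for.

```python
# Toy simulation: a confounder (socio-economic status) creates a diet-score
# correlation even though diet has no causal effect on the score.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

ses = rng.normal(size=n)                                # hidden confounder
veg = 3 + ses + rng.normal(size=n)                      # better-off families eat more veg
score = 100 + 10 * ses + rng.normal(scale=5, size=n)    # score depends on SES only, not on veg

print("raw correlation (veg, score):", round(np.corrcoef(veg, score)[0, 1], 2))

# Adjust for SES: regress each variable on SES and correlate the residuals.
veg_resid = veg - np.polyval(np.polyfit(ses, veg, 1), ses)
score_resid = score - np.polyval(np.polyfit(ses, score, 1), ses)
print("after adjusting for SES:", round(np.corrcoef(veg_resid, score_resid)[0, 1], 2))
```

The raw correlation comes out around 0.6; after allowing for the confounder it is essentially zero. Since the survey did not even record socio-economic background, there is no way of knowing how much of its headline association is an artefact of exactly this kind.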
Many questions about the report have not yet been answered by its authors. But the report has a panel of scientific advisors, some of whom at least seem to be very respectable (though not ‘orthomolecular medicine‘, which is a cult founded on the batty late-life beliefs of the once great Linus Pauling that Vitamin C is a magic bullet).
Furthermore they are thanked thus
As it happens, David Smith is an old friend, so I wrote to him, and also to Philip Cowen, with some detailed questions. I didn’t get detailed answers, but the responses were none the less interesting. Cowen said
“I did see the report and quite agree with your conclusions that it is an observational study and therefore not informative about causality.”
“The advice about diet seems reasonable although, as you point out, probably somewhat redundant.”
But still more interesting, David Smith told me (my emphasis)
“the survey was the largest of its kind and was done on minimal funding; hence several matters could not be dealt with and so the report is more hypothesis-generating for future research than a rigorous scientific study. Find us some money and we will do a proper job. You can quote me for that, if you wish.”
I’m grateful to David for his permission to quote this comment. It seems that Holford’s top scientific advisor agrees that it is not a rigorous study, and even agrees that the “proper job” is still to be done.
But it does seem a shame that that was not made clear in the report itself.
As I have often said, you don’t need to be a scientist to see that most alternative medicine is bunk, though it is bunk that is supported and propagated by an enormously wealthy industry.
There were two good examples this week. John Sutherland, who was until recently professor of English literature at UCL, understands it very well. And so does the political columnist Polly Toynbee.
“Complementary and Natural Healthcare Council”
Polly Toynbee’s column, “Quackery and superstition – available soon on the NHS“, was prompted by the announcement in The Times that the government was to set up a “Natural Healthcare Council”. It was soon renamed the “Complementary and Natural Healthcare Council” (CNHC). It was instantly dubbed ‘OfQuack’ in an admirable analysis by quackometer.
The very name is tendentious and offensive to any thinking person. What is “natural” about sticking needles in yourself, or taking homeopathic polonium?
Toynbee comments
“Put not your trust in princes, especially not princes who talk to plants. But that’s what the government has decided to do. The Department of Health has funded the Prince of Wales Foundation for Integrated Healthcare to set up the Natural Healthcare Council to regulate 12 alternative therapies, such as aromatherapy, reflexology and homeopathy. Modelled on the General Medical Council, it has the power to strike therapists off for malpractice.”
There was only one thing wrong in this article. Toynbee says
“The alternative lobby replies that conventional medicine can also do more harm than good. They chortle with glee at an article in the Lancet suggesting there is no scientific evidence for the efficacy of 46% of conventional NHS treatments. But that’s no reason to encourage more of it.”
Professor John Garrow has pointed out (see also Healthwatch)
“It is true they chortle, but they have got their facts wrong. The 46% of treatments which are not proven to be effective is 46% of all treatments for 240 common conditions – and very few are used in the NHS. The great majority are treatments used by alternative practitioners.”
The unconstitutional interference by the Prince of Wales in public affairs has been noted often before, and it seems that it’s happening again.
For example, there is the TV programme, “Charles, the Meddling Prince”, or, for a US view, see “Homeopathy: Holmes, Hogwarts, and the Prince of Wales“. And then there’s Michael Baum’s superb “An open letter to the Prince of Wales: with respect, your highness, you’ve got it wrong“.
It isn’t that regulation isn’t needed, but that the sort of regulation being proposed won’t do the trick. The framework for the “Natural Healthcare Council” has been set up by Professor Dame Joan Higgins, and it seems to be very much along the lines proposed by the Prince of Wales. Here’s what’s wrong.
Professor Dame Joan Higgins (Jan 10th) says “Complementary therapists have been in practice for many years” and “If complementary therapy is not to be banned, is it not, therefore, wise to regulate it and offer the public some measure of protection”. That’s fine, but I think the sort of regulation that she, and the Prince of Wales, are proposing won’t do the trick. We don’t need new laws, or new quangos, just the even-handed application of existing laws. Homeopathic arnica 30C contains no arnica, and one would expect that the Office of Fair Trading would have banned it. It is no different from selling strawberry jam that contains no strawberries. But absurd legal loopholes make homeopaths immune to prosecution for this obvious mislabeling, whereas jam fraudsters would be in deep trouble. The Advertising Standards Authority, likewise, is prevented from doing its job by legal loopholes, and by the fact that it has no jurisdiction over web advertising, which is now the main source of untrue claims. If alternative medicine advocates had to obey the same laws as the rest of us, the public would be better protected from fraud and delusion. What won’t work is to insist that homeopaths are “properly trained”. If one takes the view that medicines that contain no medicine can’t work, then years of being trained to say that they do work, and years spent memorizing the early 19th century mumbo-jumbo of homeopathy, does not protect the public, it imperils them.
The “Natural Healthcare Council” isn’t the only example either. Try Skills for Health.
Skills for Health
This appears to be a vast bureaucratic enterprise devoted to HR-style box-ticking. Just in case you don’t know about this latest bit of HR jargon, there is a flash movie that explains all.
“Competences are descriptors of the performance criteria, knowledge and understanding that are required to undertake work activities. They describe what individuals need to do, and to know, to carry out the activity – regardless of who performs it.”
That sounds OK until you realise that no attention whatsoever is paid to the little problem of whether the “knowledge and understanding” are pure gobbledygook or not. It’s rather like the HR form that assures UCLH that you are a fully-qualified spiritual healer: “Laying on of hands: just tick the box“.
It is an invidious insult to human intelligence to suppose that exercises like this are an appropriate way to select people for jobs. They have precisely the opposite effect to that intended.
An indication of the level of their critical thinking is provided by what is written about the 62 items listed under “Complementary Medicine”. These include “CHH5 Provide Healing”.
“This workforce competence is applicable to:
- healing in the presence of the client
- distant healing in contact with the client
- distant healing not in contact with the client
Both healing in the presence of the client and distant healing use exactly the same mental and spiritual processes. Clearly, however, distant healing does not involve many of the physical aspects of healing in the presence of the client. The performance criteria have been written so as to be able to be interpreted for use in any healing situation.
The workforce competence links to CHH6 which is about evaluating the effectiveness of the healing.”
It also includes homeopathy, for example “HM_2: Plan, prescribe and review homeopathic treatment“.
I sent an email to Skills for Health to ask who wrote this stuff. A reply from their Technical Development Director failed to elicit any names.
We develop competences to fit sector needs and demands. When that need is moved into a competence project we establish a number of groups from the specific area to work with us to develop the competences. One of these groups is a “reference” group which is made up of experts from the field. In effect these experts give us the content of the competences, we write them in our format. So I guess the answer as to who is the author is Skills For Health, but with more complexity behind statement. Please do not hesitate to get in touch with me for further clarity.
A conversation with Skills for Health
I did want more clarity, so I phoned Skills for Health. Here are some extracts from what I was told.
“It’s not quite as simple as that”
“the competencies on our data base are written by “experts in the field”
DC. Yes and it is their names that I was asking for
“I’m not sure I can give you the names . . . We’re starting to review them in the New Year. Those competencies are around six years old. ”
“We are working with the Prince’s Foundation for Integrated Health [FIH] via Ian Cambray-Smith to review these competencies, all the complementary therapy competences on our web site”
“They are written as a consensus decision across a wide number of stakeholders across that area of …”
DC. Written by whom though?
“written by a member of Skills for Health staff or a contractor that we employ simply to write them, and the writing is a collation of information rather than their original thoughts, if you like”
DC yes, I still think the sources can and should be given.
“FIH didn’t spend any money with us on this project. This project was funded by the Education act regulatory bodies, QCA, the Qualifications and Curriculum Authority . . . ”
“They [FIH] may well have put in and supported members of their professions or groups to do part of this . . they were there as experts on that particular area of complementary therapy ”
DC it’s their names that I was after
“There may well have been members [of FIH] on the reference groups that I’ve referred to who are members of the FiH . . . they were there as experts from that area of complementary therapies.”
DC Oh, and are the names of [the people on] these reference groups published?
“No they are not published”
DC ah, why not?
“We do not consider it necessary”
DC Well, I consider it very necessary myself
“Tell me why”
DC It’s a question of public accountability
“I guess the accountability lies with us as the owners of those competencies”
DC Uh I’m afraid your bureaucratic jargon is a bit much for me there. “The owners of those competencies”? I’m not sure what that means
“Why do you want the information?”
DC haha, well if you want me to be entirely blunt, it’s because I’m appalled that this black magic is appearing on a government web site
“. . . can I say that as an organisation funded by a number of sources, one being Department of Health England, none of our work condones the practice you’ve just suggested. Our work supports best practice in areas that are evidence- and research-based”
DC Ah would you mind pointing me to the evidence for homeopathy and distant healing?
“Uh [pause] there is [pause]”
DC Yes, go on
“Well homeopathy is a contentious issue, because every newspaper article I read seems to suggest that homeopathy, in itself, is not an appropriate, uh, not an, uhm, appropriate, uh, therapy.”
DC Yes so why are you laying down standards in it?. You know I’m curious. I’m genuinely curious about this
“The areas involved in them have asked us to, including the Prince’s Trust hence the reason we are doing . . .”
DC But the Prince’s Trust is not part of government. Ha, it behaves as though it was, I agree, sometimes, but it is surely for the Department of Health to ask you to do these things, not the Prince of Wales.
“We cover the whole health sector.. We don’t purely work for, or are an organisation of, the Department of Health.”
DC. I’m very baffled by the fact that you summarise, very accurately, the research on homeopathy, namely that it doesn’t work, but you are still setting standards for it. It’s quite baffling to me.
“Working with the Foundation for Integrated Health, as we are doing, homeopathy is one of the 10 areas that is listed for regulation by FIH ”
DC. Well yes the Prince of Wales would like that. His views on medicine are well known, and they are nothing if not bizarre. Haha are you going to have competencies in talking to trees perhaps?
“You’d have to talk to LANTRA, the land-based organisation for that.”
DC. I’m sorry, I have to talk to whom?
“LANTRA which is the sector council for the land-based industries uh, sector, not with us sorry . . . areas such as horticulture etc.”
DC. We are talking about medicine aren’t we? Not horticulture.
“You just gave me an example of talking to trees, that’s outside our remit ”
After explaining that talking to trees was a joke, the conversation continued
DC So can I clarify then? Who is it that said you must include these fairly bizarre things like distance healing and homeopathy? Who decides whether it goes in?
“We did”
“We are going to do a major review. We are doing that review in partnership with the FiH and the awarding bodies that award the qualifications that are developed from these competencies”
“When that need is moved into a competence project we establish a number of groups from the specific area to work with us to develop the competences. One of these groups is a “reference” group which is made up of experts from the field. In effect these experts give us the content of the competences, we write them in our format.”
Conclusions from this dialogue
We still don’t know the names of the people who wrote the stuff, but a Freedom of Information Act request has been submitted to find out.
The Skills for Health spokesperson seems to be a bit short of a sense of humour when it comes to talking to trees.
The statement that “Our work supports best practice in areas that are evidence- and research-based” is not true, and when pressed the spokesperson more or less admitted as much.
Most importantly, though, we do now know that the revision of this gobbledygook will be carried out entirely by the Prince’s Foundation for Integrated Health and the people who set exams in the relevant form of gobbledygook. No critical voice will have an input, so don’t expect much improvement. “We are working with the Prince’s Foundation for Integrated Health [FIH] via Ian Cambray-Smith to review these competencies”. And in case you don’t know about the medical expertise of Ian Cambray-Smith, it is described on the FIH web site. He is the FIH’s Health Professionals Manager.
Ian Cambray-Smith acts as the focus for FIH’s involvement with healthcare professionals. He works collaboratively to develop a range of work programmes, policies and initiatives to support healthcare professionals and help them to deliver a truly integrated approach to health. Ian’s background is in plastics research, project management and business development; he has an MSc in polymer technology. He joined the Foundation in 2006.
Happy new year, not least to the folks at the homeopathy4health site. They are jubilant about a “proof” that homeopathic dilutions could produce effects, albeit only on wheat seedlings. But guess what? After some questioning it was found that they hadn’t actually read the paper. Well, I have read it, and this is the result.
The paper is “A Biostatistical Insight into the As2O3 High Dilution Effects on the Rate and Variability of Wheat Seedling Growth”. Brizzi, Lazzarato, Nani, Borghini, Peruzzi and Betti, Forsch Komplementärmed Klass Naturheilkd 2005;12:277–283.
The authors compared these treatments (30 seedlings each).
- C1, C2, C3 (untreated water p.a. Merck, control);
- WP (potentized water p.a. Merck) 5x, 15x, 25x, 35x, 45x;
- AD (diluted arsenic trioxide) 10⁻⁵, 10⁻¹⁵, 10⁻²⁵, 10⁻³⁵, 10⁻⁴⁵;
- AP (potentized arsenic trioxide) 5x, 15x, 25x, 35x, 45x.
The allocation of seedlings to treatments was stated to be blind and randomised. So far, so good.
But just look at the results in Figure 1. They are all over the place, with no obvious trend as ‘potency’ (i.e. dilution) is increased. The results with homeopathic arsenic at 45x (the only effect that is claimed to be real) are very little different from those for shaken water (water that has been through the same process but with no arsenic present initially).
For some (unstated) reason the points have no standard errors on them. Using the values given in Table 3 I reckon that the observation for AP45 is 1.33 ± 0.62 and for the plain water (WP45) it is 1.05 ± 0.69. The authors claim (Table 3) that the former is ‘significant’ (with a profoundly unimpressive P = 0.04) and the latter isn’t. I can’t say that I’m convinced, and in any case, even if the effect were real, it would be tiny.
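As a rough check on that impression, here is a minimal back-of-envelope comparison (mine, not the authors’), taking 30 seedlings per group as stated in the design and trying both possible readings of the ‘±’ figures, as standard deviations or as standard errors of the mean.

```python
# Back-of-envelope Welch t-test comparing AP45 (arsenic 45x) with WP45
# (potentized water 45x), using only the summary values quoted in the text.
from math import sqrt
from scipy.stats import ttest_ind_from_stats

n = 30                              # seedlings per group (from the paper's design)
ap45_mean, ap45_pm = 1.33, 0.62     # homeopathic arsenic 45x
wp45_mean, wp45_pm = 1.05, 0.69     # potentized plain water 45x

for label, to_sd in [("± read as SD ", lambda pm: pm),
                     ("± read as SEM", lambda pm: pm * sqrt(n))]:
    t, p = ttest_ind_from_stats(ap45_mean, to_sd(ap45_pm), n,
                                wp45_mean, to_sd(wp45_pm), n,
                                equal_var=False)
    print(f"{label}: t = {t:.2f}, two-sided P = {p:.2f}")
```

On either reading, the direct comparison between arsenic 45x and shaken water 45x comes nowhere near conventional significance (P ≈ 0.1 at best), so the authors’ contrast of ‘significant’ with ‘not significant’ tells us nothing.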
Later the authors do two things that are very dubious from the statistical point of view. First they plot cumulative distributions, which are notoriously misleading about precision (because the data in adjacent bins are almost the same). Then they do some quite improper data snooping by testing only the half of the results that came out lowest. If this were legitimate (it isn’t) the results would be even worse for homeopaths, because the difference between the controls and plain water (WP45) now, they claim, comes out “significant”.
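To see why that kind of selection is improper, here is a small simulation of the general principle (my own toy example, not a reconstruction of the paper’s analysis): draw two groups from exactly the same distribution, keep only the lowest half of one of them, and test that selected half against the other group. ‘Significant’ differences appear most of the time even though there is no effect at all.

```python
# Data snooping demo: selecting the lowest half of one group before testing
# produces spurious "significant" results even when both groups are identical.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n, n_sims, false_hits = 30, 1000, 0

for _ in range(n_sims):
    control = rng.normal(size=n)
    treated = rng.normal(size=n)              # same distribution: no real effect
    lowest_half = np.sort(treated)[: n // 2]  # keep only the 15 lowest values
    if ttest_ind(control, lowest_half).pvalue < 0.05:
        false_hits += 1

print(f"'significant' at P < 0.05 in {false_hits / n_sims:.0%} of runs with no real effect")
```

Any procedure that picks out data on the basis of the outcome and then tests the selected subset is bound to generate false positives in this way.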
Homeopaths claim that the smaller the dose, the bigger the effect (so better water down your beer as much as possible, making sure to bang the glass on the bar to potentise it). I have yet to see any dose-response curve that has the claimed negative slope. Figure 1 most certainly doesn’t show it.
Of course there is no surprise at all for non-homeopaths in the discovery that arsenic 45x is indistinguishable from water 45x.
That is what we have been saying all along.
Using potassium dichromate to treat patients in intensive care (rather than to clean the glassware)?
No, that isn’t a joke. The respectable journal Chest, official journal of the American College of Chest Physicians, published an article that purported to show that homeopathic potassium dichromate (i.e. water) was a useful way to treat patients in intensive care. [Frass M, Dielacher C, Linkesch M, et al. Influence of potassium dichromate on tracheal secretions in critically ill patients. Chest 2005;127:936–941].
The title and abstract don’t mention the word ‘homeopathy’ at all. Potassium dichromate, like all hexavalent chromium compounds, is very toxic, but luckily for the patients there was no potassium dichromate present whatsoever in the treatment (it was a 30C dilution). The editor of Chest didn’t seem to think that there was anything very odd about this, but he did publish a response from me: Treating Critically Ill Patients With Sugar Pills, Chest 2007;131:645 [Get pdf].
“It is one thing to tolerate homeopathy as a harmless 19th century eccentricity for its placebo effect in minor self-limiting conditions like colds. It is quite another to have it recommended for seriously ill patients. That is downright dangerous.”
This was accompanied by an unrepentant response from Frass.
The Frass paper has now received some close attention on the Respectful Insolence blog. Someone posting under the name ‘getzal’ has done a nice analysis which shows that the control group must have contained patients who were more seriously ill than the homeopathically-treated group.