Sometimes I see data analysis that makes me rather cross
Especially when it's in Nature, even if it is only Nature Medicine
Summary: A scientific paper published recently supplied evidence that certain serious side effects from covid vaccines were rare, and that the rate of these side effects following natural infection with covid was substantially worse. The paper was used by the mass media to support the vaccination of younger individuals who were not at great risk from covid disease. In this three-part series I discuss how the authors of that paper have made a fundamental mistake in their analysis that brings their findings into question.
In part 1 I discuss the general method that they used, the self-control methodology; in part 2 I discuss how their analysis was flawed; and in part 3 I discuss how the flawed analysis will have had an impact far beyond the specialist scientific community.
Sometimes I see scientific papers that really make me cross.
The example for today is this paper in Nature:
Neurological complications after first dose of COVID-19 vaccines and SARS-CoV-2 infection
This is a study of hospitalisations for certain neurological conditions following covid infection and covid vaccination. The conclusions are that vaccination is safe and that covid is worse than the vaccines.
The problem is, I believe that they've made a fundamental error in their analysis. This isn't just an error of detail that changes things a little -- it looks like an error that has likely suppressed their estimate of vaccine side effects. What's more, it is a beginners' error -- this sort of thing was one of the first aspects of data analysis I was taught, many decades ago. I'm dumbfounded that it got into Nature -- what's happened to peer review...
Anyway, onto the analysis.
The paper is a 'self-controlled' study. This means that each person 'acts as their own control' -- in this paper the authors look for side effects in the 28 days after vaccination and compare the rate in that window with background rates before and after it.
The problem is, to use this method it is fundamentally important to check that the side-effect rate has returned to 'normal' by the time the 28 days are up.
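In code terms, the check I mean looks something like this -- a crude sketch of my own (the window boundaries match my reading of the design, and the 10% tolerance is arbitrary; nothing here comes from the paper). If the tail of the risk window is still sitting above the pooled baseline, the method's key assumption has failed:

```python
def risk_window_has_settled(daily_counts, day_index, tail=(22, 28), tol=1.10):
    """Crude check: has the event rate returned to baseline by the end of
    the 28-day risk window?  day_index gives each day relative to vaccination."""
    # Baseline: everything before day -28 and after day +28, pooled together.
    baseline = [c for c, d in zip(daily_counts, day_index) if d <= -29 or d >= 29]
    # Tail of the risk window: the last bin before we re-enter 'baseline'.
    tail_counts = [c for c, d in zip(daily_counts, day_index) if tail[0] <= d <= tail[1]]
    baseline_mean = sum(baseline) / len(baseline)
    tail_mean = sum(tail_counts) / len(tail_counts)
    return tail_mean <= tol * baseline_mean   # True if the tail is back near 'normal'
```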
I'll use some graphs to illustrate. I'll start with a near-perfect made-up example of a suitable self-control data-set.
In the graph above I've plotted the made-up, near-perfect data. I've shaded in the broad areas that are used in the data collection.
The yellow shaded areas are the baseline data. Note that there are two areas that contribute, one prior to 28 days before 'vaccine' and the other after 28 days after 'vaccine'. It is very important to note that the average for baseline is the average of both areas -- there's no separate 'before' and 'after' baseline; they're just treated as the same.
The pink shaded area is for the 28 days before vaccination. These data are ignored; the idea is that someone just diagnosed with a weird condition might wait a while to get vaccinated. Now, I'm not sure about this -- there's no evidence to support it and, indeed, there's evidence that medics were actually advising anyone 'newly ill' to get the covid vaccines as they would offer protection. But the authors treat this period differently, so I will too.
There's a gap at day 0 -- day zero (vaccination day or positive-test day) is 'a bit weird' and I agree that this should usually be treated separately.
The green shaded area is the actual effect of the vaccine.
There's a second yellow shaded area -- I'll state again that this is mixed up with the data in the first yellow shaded area to get the 'average baseline'.
I've included some horizontal red lines that give the average at each broad time period. This is across all of each coloured block, except for the green shaded area, which is divided into four equal periods (7 days each).
Right. What are the important points in the above graph? In this made-up example the vaccine induces some side effects, which then rapidly decline back to zero. Note that the effect of the vaccine in this 'perfect' example is entirely contained within the green shaded area. It is crucial that the effect of the vaccine returns to baseline before we get to the second yellow shaded area.
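To make the construction concrete, here's a minimal Python sketch (entirely invented numbers and my own window definitions, meant to mirror the shaded areas described above rather than reproduce the paper's analysis) that simulates such a near-perfect dataset and prints the per-window averages -- the equivalents of the red lines:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented daily event counts, indexed by day relative to vaccination (day 0).
# The baseline rate is constant; the 'vaccine effect' is an excess that decays
# back to zero well inside the 28-day risk window.
days = np.arange(-90, 91)
baseline_rate = 5.0                 # expected events per day (made up)
excess = np.where((days >= 1) & (days <= 28),
                  20.0 * np.exp(-(days - 1) / 5.0), 0.0)
counts = rng.poisson(baseline_rate + excess)

def window(day):
    """Label each day in the way the shaded areas above are defined."""
    if day <= -29 or day >= 29:
        return "baseline"           # both yellow areas, pooled into one average
    if day <= -1:
        return "pre-vaccine"        # pink area: collected but ignored
    if day == 0:
        return "day 0"              # treated separately
    start = 7 * ((day - 1) // 7)    # green area, split into four 7-day bins
    return f"risk days {start + 1}-{start + 7}"

labels = np.array([window(d) for d in days])
for w in sorted(set(labels), key=lambda w: days[labels == w][0]):
    print(f"{w:>15}: mean {counts[labels == w].mean():.1f} events/day")
```

Run this and the last risk bin (days 22-28) comes out indistinguishable from baseline -- that's the property the whole method depends on.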
Okay. We've seen perfect data. What about real data?
Well, there are some fairly good examples of 'that's good enough' in the paper's data-set: the data for side effects after covid infection. One of the better datasets (because the numbers are bigger) is that for the Incidence Rate Ratio (IRR) for Bell's Palsy. I've plotted these data below using the same approach as for the perfect data above.
Note that we no longer have the actual data points, only the averages shown by the horizontal red lines -- we have to 'guess in our mind's eye' what the underlying data might look like.
But the above graph looks pretty good — though note that there's a nice gotcha in it.
Here the data are for the Incidence Rate Ratio -- that is, we're using the authors' calculated incidence of Bell's Palsy relative to the baseline, which is defined to be 1. We're plotting diagnoses after (and before) covid infection -- so in this example 'day 0' is the day of the first positive PCR test. I should come clean and note that I've not included the 'day 0' data: there were quite a few Bell's Palsy cases diagnosed on the same day as the PCR test, and including them would have made the graph more difficult to interpret. In my defence, this is almost certainly due to people being tested for covid on the way in to see the consultant about their Bell's Palsy (day 0 data are always fraught with potential pitfalls).
The yellow shaded areas are baseline, again. Note that they're set equal to 1.0 because that's how we defined the IRR: the combined before-and-after baseline rate is defined to be 1, and all of the other data points are relative to it.
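For the arithmetic-minded, here's a toy Python sketch of that definition, with invented counts and person-time -- the paper's actual model will be more involved, but the core idea is a ratio of rates, which is why the baseline comes out as exactly 1:

```python
# Toy sketch of IRR arithmetic with invented counts and person-days;
# the paper's model is more sophisticated, but the core idea is a ratio.
baseline_events, baseline_days = 400, 200_000    # both yellow areas pooled
window_events,  window_days    = 45,  10_000     # e.g. one 7-day risk bin

baseline_rate = baseline_events / baseline_days  # 0.002 events per person-day
window_rate   = window_events / window_days      # 0.0045 events per person-day

print(window_rate / baseline_rate)    # 2.25 -> risk window sits at 2.25x baseline
print(baseline_rate / baseline_rate)  # 1.0  -> baseline IRR is 1 by definition
```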
We see a nice increase in the rate of Bell's Palsy in the green shaded area. It is fairly clear that covid infection results in Bell's Palsy, but also that the excess risk disappears shortly after the infection appears (probably the risk arrives with symptoms and disappears when they go).
Note that the IRR in the green section appears to go slightly below 1.0 for days 21-28 — this is likely to be due to the natural variability in the data rather than there being an actual protective effect of the infection.
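Is a dip like that plausibly just noise? A quick way to check, if we assume the counts are Poisson, is the standard approximation SE(log IRR) ≈ sqrt(1/a + 1/b) for window count a and baseline count b. A rough sketch with invented numbers (not the paper's):

```python
import math

# Rough check with invented counts: is an IRR of ~0.9 for days 21-28
# consistent with chance?
a, b = 36, 400                     # events in the window / in the baseline
irr = 0.9                          # observed ratio, after allowing for person-time

se = math.sqrt(1 / a + 1 / b)      # approximate SE of log(IRR) for Poisson counts
lo = irr * math.exp(-1.96 * se)
hi = irr * math.exp(+1.96 * se)
print(f"95% CI roughly {lo:.2f} to {hi:.2f}")   # ~0.64 to 1.27: spans 1, so noise is plausible
```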
Crucially, the diagnosis of Bell's Palsy returns to baseline before we get to the second yellow shaded area. Thus we might think these data are accurate and believable.
The 'gotcha' is the pink shaded area, the time period immediately before the positive covid test. That is, people appear to be getting Bell's Palsy in increased numbers before catching covid! How can this be? Well, we just don't know — all we can see is the data — but it is probably related to people catching covid in hospital while being treated for Bell's Palsy that they just happened to get at that point in time. Or it is possible that having Bell’s Palsy increases the risk of covid infection. The point here is that it just goes to show how difficult it can be to analyse and interpret data like these -- it is seldom as clear-cut as our 'perfect' example.
Anyway -- these data support the use of the self-control approach with a 28-day period after the positive PCR test result, for this particular example at least.
To be continued…