key: cord-0216487-lrg11s5n
authors: Stoye, Jörg
title: A Critical Assessment of Some Recent Work on COVID-19
date: 2020-05-20
journal: nan
DOI: nan
sha: f4dd66e81470d649f3fcb9da6c6cbe65851935ed
doc_id: 216487
cord_uid: lrg11s5n

abstract: I tentatively re-analyze data from two well-publicized studies on COVID-19, namely the Charité "viral load in children" study and the Bonn "seroprevalence in Heinsberg/Gangelt" study, from information available in the preprints. The studies have the following in common:
- They received worldwide attention and arguably had policy impact.
- The thrusts of their findings align with the respective lead authors' (different) public stances on the appropriate response to COVID-19.
- Tentatively, my reading of the Gangelt study neutralizes its thrust, and my reading of the Charité study reverses it.
The exercise may aid in placing these studies in the literature. With all caveats that apply to n=2 quickfire analyses based off preprints, one also wonders whether it illustrates inadvertent effects of "researcher degrees of freedom."

1 Introduction

This note takes a second look at two recent papers on COVID-19 released by leading German research groups. Both received widespread attention and arguably had policy impact. The first, by Christian Drosten's research group, was cited as evidence against opening schools; the second, by Hendrik Streeck with coauthors, was cited as evidence in favor of a partial reopening of Germany. Both made important data collection efforts and then interpreted those data. I limit my analysis exclusively to this last step of interpretation, which is also the only aspect on which I can claim expertise. I find that, based on information available in the preprints, my own analysis would plausibly have had a neutral or even the opposite tendency in each case. I next elaborate this somewhat formally and then discuss implications.

2 The Charité Study: Viral Load in Children

2.1 Background

Jones, Mühlemann, Veith, Zuchowski, Hofmann, Stein, Edelmann, Corman, and Drosten (2020, henceforth "the Charité study") analyze viral load by age in a convenience sample of COVID-positive patients in Berlin. The aim of the investigation was to see whether epidemiological and anecdotal evidence of children being less contagious could be corroborated. The study received massive media coverage in Germany and abroad. Its data are visualized in Figure 1.

[Figure 1: Data from Table 2, panel 1, in the Charité study. Error bars display ±1.96 standard errors. From the study's abstract: "[T]hese data indicate that viral loads in the very young do not differ significantly from those of adults." (Inspired by the image accompanying Textor (2020).)]

The seemingly strong association between age and viral load is corroborated by a nonparametric omnibus test of association. However, the paper next tests for significant differences across all 45 pairwise combinations of age bins (and repeats the analysis with different binning). After adjustment for multiple hypothesis testing, no specific comparison is significant. This is reported in the abstract as follows: "[T]hese data indicate that viral loads in the very young do not differ significantly from those of adults."

2.2 Reanalysis

Some disagreements with the analysis may come down to taste. The authors focus on a hypothesis test as the deliverable of their analysis; I would have recommended a nonparametric mean regression with error bands, resulting in some estimated age effect. The main reason is that substantive significance does not equal statistical significance: if we found a minuscule but highly significant effect, we would not care either. Next, the core finding is that a certain hypothesis does not get rejected. Thus, the analysis suggests absence of evidence, which is of course not evidence of absence. Although this distinction is central to how we teach hypothesis tests, including in medicine (Altman and Bland, 1995), it is at most barely maintained in the paper (see the aforecited sentence) and completely lost in its perception. As a result, the data that are visualized in Figure 1 fueled headlines like "Covid patients found to have similar virus levels across ages" (Gale and Mulier, 2020) and "Children are just as contagious as adults" (Kieselbach, 2020).

More troublingly, I am not convinced that the evidence is absent. To the statistically educated reader, the above headlines may suggest that the study tests, and fails to reject, H0: "Children have the same viral load as adults." It does not. It rather splits the sample into 10 age bins and then simultaneously tests whether any of the 45 possible pairwise combinations of age groups exhibits a significant difference in load. With proper adjustment for the possibility of false discoveries from 45 tests, not one comparison shows up as significant. This is reported as the main conclusion, even though it is in tension with a test reported elsewhere in the preprint that indicates some difference across age bins (Kruskal-Wallis test, Table A3, p = 0.8% for the bins in Figure 1). That is, the analysis finds a significant age effect; it is merely that post-hoc comparison does not single out a specific one of 45 pairs as culprit.

If one's take-home message is a finding of nonsignificance, one should make a good faith effort to apply a powerful test. That did not seem to happen here. On the contrary, artificially coarsening the age variable and, especially, pooling the comparison of interest with 44 comparisons that nobody inquired about is bound to sacrifice power. What would the result have been without these choices? To take an educated guess, I will now try to recover from the paper a test of H0: "Children have the same viral load as adults." To this purpose, I combine the first two and the remaining age bins of Figure 1 to find means of 4.74 and 5.21 with 95% confidence intervals of [4.42, 5.05] and [5.15, 5.27], respectively. A simple t-test rejects the null of equal means at a two-sided p-value of 0.4%. (That said, taking this p-value entirely at face value is not recommended, but it is also not necessary for my point.) It would seem that the lower viral load in the youngest age groups is highly significant.
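As a check on the arithmetic, this p-value can be reproduced from the reported means and confidence intervals alone. The following minimal Python sketch (variable names and the use of scipy are mine) assumes the intervals are symmetric normal intervals of the form mean ± 1.96 standard errors, which is how they are constructed above:

from math import sqrt
from scipy.stats import norm

# Means and 95% CIs as reported above: first two age bins pooled
# (n = 127) versus the remaining bins (n = 3585).
mean_young, ci_young = 4.74, (4.42, 5.05)
mean_rest, ci_rest = 5.21, (5.15, 5.27)

# Back out standard errors from the CI half-widths (CI = mean +/- 1.96*SE).
se_young = (ci_young[1] - ci_young[0]) / (2 * 1.96)
se_rest = (ci_rest[1] - ci_rest[0]) / (2 * 1.96)

# Large-sample two-sample test of equal means.
z = (mean_rest - mean_young) / sqrt(se_young**2 + se_rest**2)
p_two_sided = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")  # roughly 0.004

Replacing the two-sided tail probability with a one-sided one yields the 0.2% figure mentioned in the first bullet below.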
What gives? I reiterate that this simple test would not be my preferred analysis of the raw data. However, I am willing to defend it vis-à-vis the preprint's multiple nonparametric hypothesis tests. Specifically:
• The test only considers one rather than 45 hypotheses. I consider this a feature and not a bug. The data were collected specifically to investigate the relative viral load of children, and the methodological justification for pooling this comparison with 44 others is not elaborated. Indeed, one could arguably conduct a one-sided test and report a p-value of 0.2%.
• The test is influenced by the authors' choice of bins and by my inspection of the visualization, so it strictly speaking involved an informal specification search. However, given the study's purpose, my own pre-data plan (conditional on involving a simple hypothesis test to begin with) would have proposed more or less the same comparison. Hence, this should not be a major concern.
• The test is exact only if the relevant true distributions are normal. This is clearly not the case, neither exactly (negative values are impossible) nor approximately (the paper reports corresponding tests). However, I compare means across two samples of size 127 and 3585, and the empirical distributions appear slightly skewed but well-behaved (see Figure 1 in the Charité study). An appeal to the Central Limit Theorem would seem innocuous.

In fact, this last consideration is more pressing for the Charité study itself. They perform 45 simultaneous tests on relatively small bins and therefore encounter the one-two punch of (i) looking at the further-out-in-the-tails quantiles of sampling distributions that become relevant with Bonferroni-type adjustments and (ii) rather small sample cells. Indeed, 9 of the tests involve the "91-100" age bin, which contains 17 observations. This makes it harder to appeal to normal approximations, yet two of the three multiple hypothesis tests reported in the paper do so anyway and essentially differ from my test only through adjusting for the multiplicity of hypotheses.
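To quantify the power sacrifice described above, here is a stylized back-of-the-envelope sketch. The assumptions (a normally distributed test statistic, a Bonferroni correction at the 5% level, and the effect size z of roughly 2.87 implied by the simple t-test above) are mine and illustrate the mechanism only, not the paper's actual nonparametric procedures:

from scipy.stats import norm

alpha, n_tests = 0.05, 45
z_effect = 2.87  # effect size implied by the simple t-test above

# Two-sided critical values without and with Bonferroni adjustment.
crit_single = norm.ppf(1 - alpha / 2)            # ~1.96
crit_bonf = norm.ppf(1 - alpha / (2 * n_tests))  # ~3.26

# Approximate power: probability that the statistic clears the cutoff
# (ignoring the negligible opposite tail of the two-sided test).
power_single = norm.sf(crit_single - z_effect)   # ~0.82
power_bonf = norm.sf(crit_bonf - z_effect)       # ~0.35
print(f"power, single test: {power_single:.2f}; Bonferroni over 45: {power_bonf:.2f}")

Under these stylized assumptions, pooling the one comparison of interest with 44 others cuts the probability of detecting the children-versus-adults difference by more than half.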
This simple analysis is not meant to be conclusive. My only point is that the paper's nonsignificance result seems to be driven by researcher choices. For any conclusions beyond that, I would want to see the results of a nonparametric regression analysis that undoes the age binning of the data. The closest approximation to this that I am aware of is the insightful reanalysis by Held (2020), which qualitatively agrees with mine. The ideal solution here would be publication of the data; while this might not be legally easy, I note that a lot could be done with literally only age and viral load.

3 The Gangelt Study: Seroprevalence in Heinsberg/Gangelt

3.1 Background

Streeck, Schulte, Kümmerer, Richter, Höller, Fuhrmann, Bartok, Dolscheid, Berger, Wessendorf, Eschbach-Bludau, Kellings, Schwaiger, Coenen, Hoffmann, Stoffel-Wagner, Nöthen, Eis-Hübinger, Exner, Schmithausen, Schmid, and Hartmann (2020, henceforth "the Gangelt study") is an early seroprevalence study conducted in the small German town of Gangelt, which had witnessed a superspreader event. The study made headlines for estimating a local prevalence of 15% and especially an IFR of 0.4%, which was perceived as surprisingly low (and would certainly be considered on the very low end now). From the latter number, the authors also informally extrapolate to 1.8 million infected Germans, an extrapolation that was correspondingly perceived as high, therefore arguably reassuring, but also implausible. The study received considerable attention that was apparently sought out by the lead author, who vocally advocated for a partial reopening of North Rhine-Westphalia.

3.2 Reanalysis

It has been widely remarked that, even if one feels confident extrapolating Gangelt's IFR, one cannot extrapolate the confidence interval to either nationwide IFR or nationwide prevalence. The reason is that Gangelt's fatality count was considered nonstochastic, but for the purpose of extrapolating results to Germany, it is a realization of 8 "successes" (scare quotes very much intended) in a large trial. For illustration, a simplistic 95% confidence interval for Gangelt's expected fatality count covers all integers from 4 to 16. (This CI just inverts the Poisson distribution; it is reported to illustrate that the issue is not negligible. See Rothe (2020) for a sophisticated analysis of the multilayered inference problem. Note also that I use the higher, 8, of two fatality counts offered in the preprint.)
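An interval of this kind can be reproduced in a few lines via the standard chi-square inversion of the Poisson distribution (a sketch; the exact, Garwood-style interval used here is one convention among several, and the precise endpoints depend on that choice):

from scipy.stats import chi2

k = 8  # observed fatality count in Gangelt
# Exact (Garwood-style) 95% CI for a Poisson mean, obtained by inverting
# the Poisson cdf via its chi-square representation.
lower = chi2.ppf(0.025, 2 * k) / 2          # ~3.45
upper = chi2.ppf(0.975, 2 * (k + 1)) / 2    # ~15.76
print(f"95% CI for expected fatalities: [{lower:.2f}, {upper:.2f}]")
# Close to the 4-to-16 integer range cited above; other inversion
# conventions shift the endpoints slightly.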
But the point estimator itself may not be the only plausible choice either. It implies that only 10% of cases were detected nationwide, an implication that was picked up by the media (Burger, 2020) but also frequently called out as implausible. Indeed, the discovery rate in Gangelt was estimated to be 20%. Extrapolating from this number to Germany would imply an IFR of 0.9% rather than 0.4%.

What gives? The discrepancy stems from the fact that, if we believe Gangelt to be representative of Germany, we have overdetermined but not quite consistent estimating equations. For the sake of argument, say we have 8 fatalities, 388 officially confirmed cases, and 1956 true infections in Gangelt, while we have 7928 fatalities, 174975 confirmed cases, and an unknown true prevalence in Germany. (The nationwide numbers were reported by Johns Hopkins University on 5/14; they implied a slightly lower case fatality rate at the time of the study.) By the rule of proportions, we could back out an estimate of nationwide prevalence either as 1956 × 7928/8 = 1,938,396 (an update of the study's preferred estimate) or as 1956 × 174975/388 ≈ 882,090 (extrapolating the undercount). The numbers differ by so much because the crude case fatality rate (CFR) equals 8/388 = 2.1% for Gangelt and 7928/174975 = 4.5% for Germany. Thus, assuming that the IFR (interpreted as a population-level expectation) is actually the same, Gangelt either tested more or got really lucky. By extrapolating the local IFR, the study assumes the former, but without discussing this assumption. At the same time, proponents of extrapolating the undercount could point out that the underlying binomial samples of 8/388 versus 7928/174975 are of vastly different size. Should we really prioritize the former over the latter in extrapolation?

Or was Gangelt indeed just lucky? This is marginally implausible in that the (exact binomial) two-sided p-value for the former sample being drawn from the latter urn is 2.6%. But modest demographic differences between Gangelt and Germany, or accounting for clustered sampling of both numerator and denominator (the sample units were households; the superspreading event apparently spared local nursing homes), could easily compensate for that. What's more, the fatality count meanwhile increased to 9. While it is not clear that this fatality was among those infected during the study's time frame, adding it would change the aforementioned p-value to 5.2%. Also, it would mean that the local and nationwide numbers are consistent up to sampling error according to the more sophisticated analysis in Rothe (2020, numerical claim verified in personal communication).

Again, my point is not that the study's implicit argument is necessarily false. Gangelt's undercount might indeed be atypically low: being a hotspot, the town saw much more testing than other places. However, that does not logically imply that Gangelt had more testing relative to true prevalence. My complaint is that this case was not made, and a different extrapolation that would align the study with other recent seroprevalence studies (and have shorter confidence intervals to boot) was not discussed. A balanced analysis might well provide both and also some form of interpolation.
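For concreteness, here is a short sketch reproducing the competing extrapolations and the binomial comparisons above (the counts are those given in the text; I assume scipy's exact two-sided binomial test matches, at least approximately, the convention behind the reported 2.6% and 5.2%):

from scipy.stats import binomtest

# Gangelt: 8 fatalities, 388 confirmed cases, 1956 estimated true infections.
# Germany (JHU, 5/14): 7928 fatalities, 174975 confirmed cases.
d_g, c_g, t_g = 8, 388, 1956
d_de, c_de = 7928, 174975

# Two inconsistent rule-of-proportions extrapolations of nationwide prevalence:
print(t_g * d_de / d_g)   # 1,938,396: extrapolating the local IFR
print(t_g * c_de / c_g)   # ~882,090: extrapolating the local undercount

# Was Gangelt just lucky? Exact two-sided test of the local CFR (8/388)
# against the nationwide rate (7928/174975, ~4.5%).
print(binomtest(d_g, c_g, d_de / c_de).pvalue)      # the note reports 2.6%
print(binomtest(d_g + 1, c_g, d_de / c_de).pvalue)  # with the ninth fatality: 5.2%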
Partial identification would be one framework close to my heart, but the primary concern is transparency about how different identifying assumptions would interact with the data at hand.

4 Discussion

This is not a "gotcha" paper. Both of the aforecited studies are carefully and competently executed, especially considering the extreme time pressure. They are transparent about computations, the computations appear correct, and I have no doubt that they were executed in good faith. Compared to some concurrent work by prominent U.S. researchers (e.g., as discussed by Gelman (2020)), both studies are models of good science. Also, there are obvious limitations to quickfire reanalysis without raw data, and everything I have to say is accordingly tentative.

With that said, I stand by the following assessments: There are many good arguments against a quick reopening of schools, but the Charité study does not add to them. Neither does the Gangelt study paint a substantially more optimistic picture of IFR than other recent seroprevalence studies.

With the due caution warranted by a sample size of n = 2, it is of note that both studies were perceived as having a certain policy thrust; in both cases, this thrust correlated with public perception of some authors' stance on COVID-19 policy; and in neither case was it corroborated by my reading. That is my methodological point: specific, reasonable, but not objectively forced choices of statistical analysis (what is sometimes referred to as "researcher degrees of freedom") informed results that align with the researchers' public stances. I emphasize that I do not suggest any intent. However, even at the very top of the discipline, the effects of, for example, deciding when an analysis is complete cannot be assumed to just go away. While I think that my comments on the specific studies are of interest, an important message to you, the reader, and to me is to redouble consideration for such effects in our next study. On an immediately practical note, the exercise illustrates that a data repository for COVID-19-related research would be invaluable. Finally, it is a reminder of the importance of professional peer review. While both of the above studies were controversially discussed by experts right away, my impression is that critical voices gained traction with the public only for the Gangelt one. In sum, this may be the rare case where rapid peer review of an academic paper would plausibly have affected the news cycle.

References

Altman and Bland (1995). "Statistics notes: Absence of evidence is not evidence of absence." BMJ 311, 485.
Burger (2020). "German Study Suggests Infections Are 10 Times the Number of Confirmed Cases."
Gale and Mulier (2020). "Covid patients found to have similar virus levels across ages."
Gelman (2020). "Fatal Flaws in Stanford Study of Coronavirus Prevalence." Blog post.
Held (2020). "A discussion and reanalysis of the results."
Jones, Mühlemann, Veith, Zuchowski, Hofmann, Stein, Edelmann, Corman, and Drosten (2020). "An analysis of SARS-CoV-2 viral load by patient age." Preprint.
Kieselbach (2020). "Kinder sind genauso ansteckend wie Erwachsene" ("Children are just as contagious as adults").
"Open review of the manuscript 'An Analysis of SARS-CoV-2 Viral Load by Patient Age'" (2020).
Rothe (2020). "Combining Population and Study Data for Inference on Event Rates."
Streeck, Schulte, Kümmerer, Richter, Höller, Fuhrmann, Bartok, Dolscheid, Berger, Wessendorf, Eschbach-Bludau, Kellings, Schwaiger, Coenen, Hoffmann, Stoffel-Wagner, Nöthen, Eis-Hübinger, Exner, Schmithausen, Schmid, and Hartmann (2020). "Infection fatality rate of SARS-CoV-2 infection in a German community with a super-spreading event." Preprint.
Textor (2020). "Here's a replot of their own data. Means +/- 2*SE from their Table 2. Their conclusion: No sig. viral load difference between children and adult. Media: Children just as infectious as adults, study shows."