key: cord-0735390-p0q31ox3
title: Getting it right matters! Covid-19 pandemic analogies to everyday life in medical sciences
authors: Bothe, Tomas L.; Patzak, Andreas; Schubert, Rudolf; Pilz, Niklas
date: 2021-07-14
journal: Acta Physiol (Oxf)
DOI: 10.1111/apha.13714
sha: ad9cad739b8c14ac1ce75ae928656404f7defaa4
doc_id: 735390
cord_uid: p0q31ox3

What a time it has been since the world quite literally changed within weeks in early 2020. For all of us, as researchers and as ordinary citizens, life changed drastically. 1 Laboratory meetings turned into lockdown video calls, cultural events were deemed dispensable and a sentimental coffee on the balcony became the luxury version of any vacation plans. Now, as the pandemic is retreating in most countries and optimistic prognoses for the upcoming summer months are no longer mutually exclusive, people are taking a step back to evaluate what has happened over the last months, both on a societal and a personal level.

We wish to express our deepest condolences to everyone who has lost a loved one during the pandemic, or who themselves struggle with Covid-19 or its long-term consequences. A quick glance at history tells us that things could have turned out much, much worse. 2 That is not to discredit the personal tragedies caused by this pandemic, but rather a quest for a silver lining. Those moments of relief, of seeing the positive even in the face of severe adversity, are not only helpful in preserving mental health; they also help us make the best of the situation. On closer inspection, the pandemic has brought up many little things which could turn out positive in the long run. Admittedly, video calls can never supersede real, personal encounters, but asking which tasks researchers (and members of any other profession) can do from home could add a lot of flexibility to the way we work. The interruption of most studies relying on human participants forced many researchers to restructure their work. Taking out this cornerstone of medical science is disheartening, but then again, free time to ponder can be a blessing in disguise. [3] [4] [5]

For contemplation to be a blessing, it requires worthwhile thoughts. With regard to science, the focus on Covid-19 case numbers, herd immunity, vaccinations and the appreciation of metrics for decision-making has transformed the general thought process. Laymen reflect on incidences, R-values and essential vaccination rates, and might even have heard the terms specificity and sensitivity in connection with antigen tests. It is even more heartening when those people start to learn from analogy and use the newly acquired scientific concepts to talk about the things they care about. Seeing everyday people engaged in science bodes well for society. Learning sounds great. Scientists quickly focussed on learning about the pandemic and the virus's characteristics.
[6] [7] [8] They even presented first results on the virus's implications for other organ systems besides the respiratory system. [9] [10] [11] Yet, is there something all scientists can learn by looking more closely at the concepts Covid-19 has so vividly presented us with? It turns out there is: Because this pandemic has been so overarching, we all share common experiences that can help enhance insight into complex problems.

In this sense, let us run a little thought experiment on a particular issue related to the pandemic: Imagine taking a Covid-19 antigen test. As you are a scientist, you are aware of the concepts of sensitivity (the true positive rate) and specificity (the true negative rate). As the test is of very high quality (sensitivity = 0.99, specificity = 0.90), you feel very confident about the negative result you got and happily proceed with your plans (eg to visit your relatives). Now let us test your initial joy: We assume the current Covid-19 incidence in your area to be around 200 per 100 000 per week and your country's test rate to be 1 million per week. For argument's sake, we also assume the average Covid-19 infection (the mean time a patient might be contagious) to last 14 days, which leads us to a point prevalence of 0.4%, or 400 per 100 000, at the time you take your test. We can now enter all the information into a tree diagram and evaluate the results (Figure 1).

[Figure 1. Tree diagram portraying discovery rates in a Covid-19 antigen test paradigm. The numbers are fictional but chosen to mimic plausible real-world data. The point prevalence (the percentage of Covid-19-infected people in the general public at the time of the test), the sensitivity and the specificity are provided.]

It turns out your feeling of relief was indeed warranted. It is very unlikely for you to be one of the false negative results, and even if you had tested positive, your chance of actually being infected would only have been around 3.8%. This phenomenon of low positive predictive values, even though the test performance measures seem very high, has been described for many screening tests and should be kept firmly in mind, in particular for widely advertised tests. It is, of course, the rationale behind secondary PCR tests for individuals who test positive: We do not want to force non-contagious people into quarantine.
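For readers who prefer code to tree diagrams, the arithmetic behind Figure 1 is easy to reproduce. The sketch below is purely illustrative: it uses the fictional numbers from the example above, and the function name is ours, not from the article.

```python
# Minimal sketch of the tree-diagram arithmetic behind Figure 1: how point
# prevalence, sensitivity and specificity combine into predictive values.
# Numbers mirror the fictional antigen-test example above.

def predictive_values(prevalence, sensitivity, specificity):
    true_pos = prevalence * sensitivity                 # infected and detected
    false_neg = prevalence * (1 - sensitivity)          # infected but missed
    true_neg = (1 - prevalence) * specificity           # not infected, test negative
    false_pos = (1 - prevalence) * (1 - specificity)    # not infected, test positive
    ppv = true_pos / (true_pos + false_pos)   # chance of infection given a positive test
    npv = true_neg / (true_neg + false_neg)   # chance of no infection given a negative test
    return ppv, npv

ppv, npv = predictive_values(prevalence=0.004, sensitivity=0.99, specificity=0.90)
print(f"positive predictive value: {ppv:.1%}")   # roughly 3.8%
print(f"negative predictive value: {npv:.3%}")   # well above 99.9%
```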
Great. After all, many scientists are all too familiar with PCR, so there is not much to learn there. [12] [13] [14] [15] [16] We made some assumptions and visualized a statistical concept which most of us would admit we had not fully considered before the pandemic. So, is this it? Can we all go (or stay) home and feel satisfied, because we had a little academic workout and a better understanding of screening tests usually cared about only by a narrow circle of physicians? Not quite. One central point in learning from analogy is to have an analogy, preferably one which relates to everyday life. How, though, can we relate basic Covid-19 test statistics to the vast diversity of your, the readers', lives in science? Workdays in science are diverse. Scientists work in a multitude of different and highly specialized fields, so finding common ground sounds rather difficult. [17] [18] [19] [20] [21] It is the work of David Colquhoun which allows us to do so, by presenting a beautiful analogy between screening tests and the crisis of unreproducible results in the medical sciences. 22, 23 Unreproducible results are obstacles in the way of scientific progress that call scientific integrity into question and constrain or even terminate scientific careers. To make the analogy, we must think about what unreproducible results are and then figure out a way to stay on top of this issue. Unreproducible results can involve inappropriate data acquisition and handling, or false positive results. We will focus on the latter, as they are very common and can readily be pinned down.

To reduce the occurrence of false positive results, the scientific community has taken measures such as reporting the P value of statistical tests and stating the assumed power in study design. It has recently been brought back to our attention why P values alone are not a good representation of experimental results, as they only provide a measure of certainty without reflecting the practical relevance of a shown effect. 24 Moreover, keep in mind that P values work similarly to a test's specificity: they only provide insight into cases without a real effect, that is, only into the cases shown in the right-hand part of the tree in Figure 1.

Let us go a step further and think about what the P value and the statistical power are, and why they, in their current form, may lead to unreproducible results. To do so, we will draw up another tree diagram, for which we need some assumptions: You, your working group and everyone in your institute together test roughly 300 hypotheses each year, which adds up to 10 500 lifetime hypotheses over 35 years in research. We then assume that in about 20% of these cases there is a real underlying effect, whether we can measure it or not. This number will most likely be much smaller, since we often explore the unknown. 22 We further assume that everyone designs their experiments in line with best practice, choosing an alpha level (significance level) of 0.05 and a statistical power of 0.80, and that all projects yield perfectly randomized, normally distributed results which are tested for one hypothesis only. All these assumptions portray an ideal scenario, which will lead to the best-case number of correctly reported and therefore most likely reproducible results.

Now for some clearing up: The P value does not tell us the likelihood that a shown effect is fallacious. It tells us how likely we are to observe an effect of this size (or a more extreme one) by chance in the particular case that there is no true underlying effect. In analogy to our antigen test example, our chosen alpha takes the role of the test's specificity (strictly speaking, alpha equals one minus the specificity): it tells us only how likely a hypothesis without a true effect is to yield a false positive result. Again, keep in mind that this number is calculated only over the hypotheses without a true effect, neglecting the existence of hypotheses with true effects. Similarly, the statistical power corresponds to the antigen test's sensitivity: it tells us how many of the true positives our test will detect. With this in mind, we can draw up our tree diagram (Figure 2).

[Figure 2. Tree diagram portraying lifetime research. The numbers are fictional but chosen to mimic plausible real-world data. The probability of a true underlying effect is provided as P(true effect). The statistical power and the chosen alpha level are provided in parallel to the sensitivity and specificity in Figure 1.]

Taking a close look, we see that even this best-case scenario yields grim results. If we divide the true positives by all the positives our statistical tests yield over 35 years, only 1680/(1680 + 420) = 80% of all positive-appearing results are truly positive. Consequently, 20% of all positive (and therefore probably reported) experimental results are false discoveries. Keep in mind, this is the best-case scenario.
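The arithmetic of Figure 2 can also be scripted so that readers may substitute their own assumptions. This is a minimal sketch using the fictional numbers from the text; the helper name is ours.

```python
# Minimal sketch of the Figure 2 arithmetic: what fraction of "significant"
# results is false, given a prior probability of a true effect, a statistical
# power and an alpha level. Numbers are the fictional ones used in the text.

def false_discovery_fraction(n_hypotheses, p_true_effect, power, alpha):
    true_positives = n_hypotheses * p_true_effect * power
    false_positives = n_hypotheses * (1 - p_true_effect) * alpha
    return false_positives / (true_positives + false_positives)

# Best case described above: 10 500 hypotheses, 20% true effects, power 0.8, alpha 0.05
print(false_discovery_fraction(10_500, 0.20, 0.80, 0.05))   # 0.20 -> 20% false discoveries
# The slightly less optimistic case discussed next: 10% true effects, power 0.6
print(false_discovery_fraction(10_500, 0.10, 0.60, 0.05))   # ~0.43 -> 43% false discoveries
```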
When we tweak the numbers only slightly, to a more realistic power of 60% and a true effect rate of 10%, we find that only 57% of all reported results are genuine, leaving us with a false discovery risk of 43%. And this still assumes the best-case scenario of completely randomized, normally distributed, single-variable tests; the assumptions we made are not unrealistic and, if anything, lean towards the optimistic side. Button et al stated in a widely recognized paper that a serious estimate of the median statistical power in published neuroscience research projects may be as low as 8% to 31%. 25

All of this is already bad enough, but it gets even worse: When you accept low statistical power, you do not only sacrifice the reliability of your results; your reported effect sizes will also inflate beyond the actual effect. The reason for this inflation comes quite naturally: Picture the possible measured effect sizes as normally distributed around their true mean. A power of 0.3 tells you that only 30% of your experiments with a true underlying effect will show a true positive outcome. Quite intuitively, the positive test results skew strongly towards those instances in which the measured effect happens to be large. Plainly said: If your observed effect is twice as large as the real effect, your test is far more likely to detect it as positive. This leads to reported effect sizes being overblown and, yet again, not reproducible. The lower the statistical power, the more false negative results your test produces and the more its detections skew towards inflated effect sizes.
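This winner's-curse effect is easy to demonstrate with a small simulation. The sketch below rests on assumptions that are ours, not the article's: a two-sample t-test, a true standardized effect of 0.5 and 16 observations per group, which corresponds to a power of roughly 0.3 at an alpha level of 0.05.

```python
# Small simulation of effect-size inflation under low power ("winner's curse").
# Assumptions (ours, for illustration): two-sample t-test, true standardized
# effect d = 0.5, n = 16 per group, alpha = 0.05 -> power of roughly 0.3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
true_d, n_per_group, alpha, n_experiments = 0.5, 16, 0.05, 20_000

all_effects, significant_effects = [], []
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_d, 1.0, n_per_group)
    t_stat, p_value = stats.ttest_ind(treated, control)
    observed_effect = treated.mean() - control.mean()   # sd is 1 by construction
    all_effects.append(observed_effect)
    if p_value < alpha and observed_effect > 0:
        significant_effects.append(observed_effect)

print(f"empirical power:               {len(significant_effects) / n_experiments:.2f}")
print(f"mean effect over all runs:     {np.mean(all_effects):.2f}")          # close to the true 0.5
print(f"mean effect, significant only: {np.mean(significant_effects):.2f}")  # clearly inflated above 0.5
```

Only the experiments whose observed difference happens to clear the significance threshold are counted as positive, so the average reported effect among them is noticeably larger than the true effect.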
The situation is dire, to say the least. With our analogy in mind, we can now picture why there are so many unreproducible results published in the scientific literature. So, is it time to throw our hands up and accept that we are powerless against the statistical fate of science we have just worked out? It is not. As with most problems, the solution begins with becoming aware of it, and in this case awareness already takes us halfway home. Listen to Colquhoun and take some easy measures to drastically mitigate these effects. 22, 23

First, getting things right matters! We must make sure that we set up our experiments properly. Statistical power analyses are not just a statistician's way of passing time; they matter. A power of at least 0.8 helps tremendously to keep inflated effect sizes and false positive results in check. Second, there is the P value, and we should simply be more stringent with it. If we apply a three-sigma rule (P ≤ .001) instead of the current two-sigma rule (P ≤ .05), we can shrink the risk of false positives to below 2%, which is in fact the range the scientific community targeted as acceptable when it settled on 0.05 as its alpha level of choice. 22 Additionally, changing our perception of P values would go a long way. There are experimental settings which will not allow a P value below .001. In those cases (P close to .05), we have to be cautious when interpreting our results as proof of an effect. When knowledge is meagre, a significant result makes the existence of an effect more likely, but doubts remain. Such results only indicate something interesting that warrants further investigation. For the reasons shown, however, drawing conclusions, or even making treatment decisions, based on such results is irresponsible. Only when existing knowledge provides strong evidence for a probable effect can a significant result be interpreted with confidence that there is a true effect. 26

Ultimately, reporting the false positive risk along with P values, 95% confidence intervals and observed effect sizes in relation to relevant effect sizes can prevent us and our fellow researchers from misinterpreting scientific results. Anyone would be more than cautious about taking a study with a high false positive risk (eg 30%) as proof of an actual effect. All we need for the calculation of the false positive risk is the number of biological samples in our study, the observed P value, the effect size and an estimate of the prior probability of a real effect. 23 The first three values are easily obtained when performing any statistical test. The prior probability can only be approximated, but a value of 10% seems reasonable. Alternatively, one may set the prior probability to 0.5 and calculate the corresponding false positive risk (FPR(0.5)). Notably, higher prior probabilities imply that you are studying something well established, whereas lower prior probabilities increase the false positive risk. Thus, a prior probability of 0.5 provides a minimum estimate for a meaningful false positive risk. 27 There are simple, free-of-charge online tools for calculating the false positive risk. 28 It is now in our hands as members of the research community to make use of these tools and provide our readers with reliable estimates of our studies' false positive risks.
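As a rough illustration of how such an estimate behaves, the sketch below uses the simple "p-less-than" approximation: among results declared significant at the observed P value, which fraction comes from hypotheses without a true effect? The power and prior probability are assumptions supplied by the user. Note that this is not the exact method behind the cited calculator; Colquhoun's tool (reference 28) uses a more precise "p-equals" calculation and typically returns higher values.

```python
# Rough sketch of a false positive risk estimate ("p-less-than" approximation).
# Inputs (power, prior probability of a real effect) are user assumptions;
# the online calculator cited as reference 28 uses a more exact "p-equals"
# calculation and generally gives larger numbers for the same inputs.

def false_positive_risk(p_observed, power, prior_prob_real_effect):
    false_pos = p_observed * (1 - prior_prob_real_effect)   # significant results from null hypotheses
    true_pos = power * prior_prob_real_effect                # significant results from real effects
    return false_pos / (false_pos + true_pos)

print(false_positive_risk(p_observed=0.05, power=0.80, prior_prob_real_effect=0.10))  # ~0.36
print(false_positive_risk(p_observed=0.05, power=0.80, prior_prob_real_effect=0.50))  # ~0.06
```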
In parallel to the pandemic, there is a silver lining on the horizon, at least when looking in the right direction. We can take the necessary steps to modify the scientific process sufficiently to allow warranted confidence in its results. Just as societies are striving to use the lessons learned to build the post-pandemic future, we as scientists can strive to tackle the crisis of unreproducible results. All we have to do is what humans are so wonderfully proficient at: trying to get things right and learning from analogy.

References
SARS-CoV-2: what do we know so far?
Reassessing the global mortality burden of the 1918 influenza pandemic
Passive exposure to heat improves glucose metabolism in overweight humans
The concept of skeletal muscle memory: Evidence from animal and human studies
The importance of being rhythmic: living in harmony with your body clocks
How simulations may help us to understand the dynamics of COVID-19 spread - visualizing non-intuitive behaviours of a pandemic (pansim.uni-jena.de)
microRNAs as new possible actors in gender disparities of Covid-19 pandemic
An update on ACE2 amplification and its therapeutic potential
Neuroinfection may contribute to pathophysiology and clinical manifestations of COVID-19
SARS-CoV-2 effects on the renin-angiotensin-aldosterone system, therapeutic implications
Covid-19, ACE2 and the kidney
Overexpression of the histidine triad nucleotide-binding protein 2 protects cardiac function in the adult mice after acute myocardial infarction
Th17/Treg imbalance modulates rat myocardial fibrosis and heart failure by regulating LOX expression
Attenuation of inward rectifier potassium current contributes to the α1-adrenergic receptor-induced proarrhythmicity in the caval vein myocardium
Regulation of integrin α6A by lactogenic hormones in rat pancreatic β-cells: implications for the physiological adaptation to pregnancy
Natriuretic peptides relax human intrarenal arteries through natriuretic peptide receptor type-A recapitulated by soluble guanylyl cyclase agonists
D-serine - a useful biomarker for renal injury?
A triple sense of oxygen promotes neurovascular angiogenesis in NG2-derived cells
Prolonged therapeutic effects of photoactivated adipose-derived stem cells following ischaemic injury
Exercise-dependent increases in protein synthesis are accompanied by chromatin modifications and increased MRTF-SRF signalling
Co-cultures of renal progenitors and endothelial cells on kidney decellularized matrices replicate the renal tubular environment in vitro
An investigation of the false discovery rate and the misinterpretation of p-values
The reproducibility of research and the misinterpretation of p-values
Significant significance?
Power failure: Why small sample size undermines the reliability of neuroscience
The false positive risk: a proposal concerning what to do about p-values
False Positive Risk Web Calculator

We thank Laura Josefa Dippel for insightful comments and discussions.