key: cord-0779425-yhtg39rz authors: Silverstein, Todd P. title: CoViD‐19 epidemic follows the “kinetics” of enzymes with cooperative substrate binding date: 2020-06-30 journal: Biochem Mol Biol Educ DOI: 10.1002/bmb.21397 sha: a090930dd2eae8b81f9ce3c8031b5d9677ead532 doc_id: 779425 cord_uid: yhtg39rz The Hill equation, which models the cooperative ligand‐receptor binding equilibrium, turns out to be useful in modeling the progression of infectious disease outbreaks such as CoViD‐19. The equation fits well the data for total and daily case numbers, allows tentative predictions for the half‐point and end point of the epidemic, and presents a mathematical characterization of how social interventions “flatten the curve” of the disease progression. Those who followed the CoViD-19 epidemic (and who did not?) watched with trepidation as the number of cases rose steadily. A common question was "when will this end and life return to normal?" The answer was invariably "who knows?" Well, it turns out, biochemists know. A glance at the plot of total number of cases vs. day showed a sigmoidal rise. This curve is familiar to biochemists as the saturation plot for a protein that binds ligand cooperatively, that is, with a Hill coefficient ≥ 2. There are a number of sophisticated mathematical models of infectious disease progression. [1] [2] [3] [4] They can be stochastic or deterministic, statistical or mechanistic. The most complex models can account for variables like reproduction number (R0 = infectivity), steady state, vaccination, social distancing, hospital capacity, etc. But wouldn't it be nice if there was a simple, familiar mathematical model that we all had access to and could use to model the severity and timeline of a pandemic? Enter, the Hill equation. The Hill equation describes how ligand binding to a protein receptor saturates with the concentration of free ligand, [L]: where f b is the fraction of total receptor that has bound n ligands, and K d is the PL n dissociation equilibrium constant; the exponent n is known as the Hill coefficient. Because K d is the free ligand concentration that gives f b = 0.5, also known as [L] 50 , Equation (1) can be written as. For an enzyme, the enzyme-substrate complex (E n = P n ) goes on to catalyze the reaction, so the Hill equation can be transformed to describe the dependence of initial reaction rate (v 0 ) on initial substrate concentration, [S] 0 : where V max is the asymptotic v 0 at saturating [S] Here t represents days, t 50 is the day that you reach half of the final number of cases (cases final ), and n controls the steepness of the sigmoidal rise. Together, t 50 and n provide information on how long the outbreak will last. For example, Figure 1a shows sigmoidal plots for n = 3, 5, and 8, given cases final = 100 and t 50 = 40 days. Note that a lower n flattens the curve, so that the rate of increase of cases is shallower. Although this does not decrease the final number of cases, it is advantageous in that it would decrease the number of hospital admittances on a particular day. The curve is also flattened by decreasing cases final (Figure 1b ) and by increasing t 50 (Figure 1c ). Many print media stories that discussed flattening the CoViD-19 epidemic curve illustrated the effect not with total cases vs. time curve, but with the bell-shaped daily cases vs. time curve. Daily cases as a function of time is the first derivative of the total cases function: In Figure 2 we can see how the three fit parameters affect the shape of the first derivative plot. Lowering cases final (Figure 2a ) decreases the amplitude, that is, the number of daily cases on each day, but leaves the relative width and time-axis position of the bell-curve unchanged. Lowering n (Figure 2b ) increases the relative width of the curve, and shifts its peak time to the left (to shorter times). This has the effect of increasing the number of daily new cases early and late in the epidemic (before day 25-30, and after day 55-60), while dramatically decreasing this number at the peak of the epidemic. Finally, increasing t 50 (Figure 2c ) also increases the relative width of the curve, while shifting the peak time to the right (later times). This spreads out the number of new daily cases, decreasing it early in the epidemic, while increasing it later on. From Figure 2c we can see that the curve is asymmetric-it rises faster than it falls. Furthermore, this asymmetry declines as n increases (Figure 2b , n = 8). Another surprising observation regarding the first derivative curve is that its peak is always ≤ t 50 (vertical dotted line in Figure 2a ,b). The peak approaches t 50 as n increases, rising from about 0.6 × t 50 for n = 2, to 0.975 × t 50 for n = 8. To explore some real epidemic data sets, Figure 3 presents a plot of the number of CoViD-19 cases in China and South Korea. Clearly, the sigmoidal fits to Equation (4) are good, although not perfect: For example, the early undershoot and late overshoot of the China curve are due to the fact that Chinese authorities changed their method of diagnosing the disease after day 24, suddenly increasing the number of cases. Before we go any further, it is important to state up front that fitting infection data to the Hill equation is purely a statistical process. There is no theoretical model explaining why this equation fits the data, thus there is no way to connect the Hill equation parameters to viral or human biology, or to social policy implementation. On the other hand, one might hypothesize that faster viral reproduction inside the host would raise both n and cases final ; the same would be expected for anything that raises infectivity, for example, more infectious viral particles, easier viral spread from asymptomatic individuals, higher frequency of close human contact, etc. Hill equation fits are quite good for almost all data sets (R 2 = 0.9940-0.9997). Table 1 lists Equation (4) fitted parameters for data sets from 35 countries (fitted plots for each data set can be found in the Figures S2-S38 ). The Hill coefficients (n) vary about two-fold (from 4 to 8), and half-times (t 50 ) vary about five-fold (from 20 to 100 days); on the other hand, cases final vary over 3,000-fold, from 384 for Taiwan to 1.4 million for the United States. Above, we speculated that anything that enhanced infection rate might raise both n and cases final . In fact, the fitted data from 35 countries show no strong relationship between any of the three fit parameters ( Figure S1 ). For example, the United States had a cases final value almost six times higher than the next-highest country, but its t 50 and n Hill are in the middle of the range of values. In general, countries with the lowest and highest values of t 50 span the entire range of n Hill ; the same is true of the lowest and highest values of cases final . There may be a weak correlation between t 50 and cases final ( Figure S1 ): Countries with the lowest t 50 values (< 36 days) all had low cases final (< 30,000); conversely, countries with the highest cases final (> 170,000) all had moderate to high t 50 values (≥ 40 days). It seems logical that strong social intervention, including Table 1 T A B L E 1 Equation (4) wide-spread testing, contact tracing, and quarantining, would lower both t 50 and cases final . Only three data sets were not well-fit by Equation (4), and their deviations allow some interesting speculation on the effectiveness of social intervention. The Taiwan and Hong Kong ( Figure 4 ) data sets had to be fit with a non-zero intercept. This could be because their interventions, after the first scores of cases were identified, were so effective that these initial cases did not lead to a sigmoidal rise; only after the first 120 cases appeared in Hong Kong did the disease spread as an epidemic, presumably via community transmission. The South Korea data set showed a sigmoidal rise in cases until day 25, after which the cases numbers rose more slowly ( Figure 5 ). One could speculate that South Korea, which was the second nation to experience an outbreak of CoViD-19, managed to mount an intensive social intervention around day 25, including for example, extensive testing, contact tracing, quarantining, etc. This would be another example of "flattening the curve". Note that although the curve was indeed flattened from day 25-55, the number of cases final was unchanged by this intervention: 10,594 ± 26 for the blue dashed curve (i.e., the normal sigmoidal rise without the intervention) vs. 10,550 ± 110 for the red solid curve. In this case, flattening the curve did not decrease the eventual number of cases, it just slowed the rate at which those cases appeared every day, allowing hospitals to accommodate the influx of patients. As mentioned above, in the midst of a pandemic, one question on everyone's mind is "when will it be over?" The nature of an asymptote is that you never actually reach it. However, one can ask how long it would take to get close to the end, for example, within 90 or 95% of the maximum value. Manipulating Equation (4), we can show that if f is the fraction of the total you are interested in (e.g., 0.9 or 0.95), then the time to reach that value (t f ) can be calculated: (4) with a zero y-intercept (red curve), and with a non-zero intercept (black curve) of 122 ± 4 cases F I G U R E 5 South Korea CoViD-19 cases fit to Equation (4) with all points (red solid curve), and with days 25-55 omitted (blue dashed curve) We can see from Equation (6) that, unsurprisingly, t f increases linearly with t 50 , but it also decreases with n (larger n gives a steeper rise, which gives a shorter t f ). For example, if n = 3 and t 50 = 20 days, then t 90 = 42 days, but for n = 4, t 90 = 35 days. Again, decreasing n flattens the curve, which spreads out the duration of the epidemic (Figure 2b ). Using Equation (6) and fit parameters from Table 1 , t 90 for the United States and New York state come out to 92 and 72 days, respectively; that would be May 24 (2020) for the United States as a whole, and May 6 for New York state. A better way to judge if you are close enough to the end of an epidemic to ease social restrictions is to identify a target maximum number of new daily cases. The World Health Organization has stated that an outbreak is deemed "contained" when < 10% of tests for the disease are positive; to be even safer, we can aim for < 1%. Harvard University's Global Health Institute has opined that the "bare minimum" of testing capacity in the United States should be 500,000 per day (0.15% of the population), so 1% would be 5,000 positive tests per day. Taking the fit parameters from Table 1 and using the first derivative equation (Equation 5), we can predict that the outbreak could be deemed contained in the U.S. by day 99 (May 31). For New York, the target would be < 300 new daily cases (0.015% × 20 million), which should occur by day 106 (June 9). Of course, it is easy to fit an asymptotic curve when your data approach the asymptote, as they do in Figure 3 . But what if you are in the middle of the epidemic, and you want to predict the end? This is more difficult. And here the function and limitations of datafitting must be considered. By fitting an entire data set to a mathematical equation, we can then use the fit parameters to predict the value of the dependent variable (y) for a value of the independent variable (x) that has not been measured, but lies within the range of measured values. Furthermore, the R 2 -value gives you a notion of the quality of the fit, and if we know the uncertainties in the fit parameters, we can calculate the uncertainty in the predicted y-value. However, that is not the situation in the middle of an epidemic. Here we are collecting data points one day at a time, and trying to use the fit equation to predict y-values appertaining to x-values that lie far beyond any that we have measured so far. In other words, we are using nonlinear regression to attempt to fit an incomplete data set. In this case, the R 2 value and the fit parameter uncertainties do not mean the same thing that they do when working with a complete data set. Consider the countries that were below 70% completion as of April 27, 2020, that is, the last ten countries in Table 1 . One can see (Figure 6 ) that the relative uncertainties of the cases final and t 50 parameters are substantially higher than for the countries closer to the end of their epidemic. On the other hand, n Hill relative uncertainties are only slightly higher for these bottom 10 countries, compared to the other countries ( Figure 6) . Table 2 affords a closer look at this effect. It turns out that even the high uncertainties for the low-completion data sets are woefully underestimated. And it goes almost without saying that an R 2 value above 0.99 for fitting a data set that is less than 50% complete is meaningless. One way to examine this important issue is to take a complete data set, and remove points from the end, examining how this affects the fit. For example, Germany's data set was very well-fit by Equation (4), with R 2 = 0.9996, and its final reported number of total cases (from April 27) was 94% of the fitted cases final value. Thus, its fit parameters are extremely well-determined: cases final = 171,400 ± 900; t 50 = 42.44 ± 0.12 days; and n Hill = 5.48 ± 0.06. Removing points from the end of the data set allows us to examine the fit parameters at 90% completion, 80%, and so on. T A B L E 2 Average relative uncertainties (in %) for cases final , t 50 , and n Hill for all countries in Table 1 Completion ≥90% F I G U R E 6 Relative uncertainties for cases final (red), t 50 (blue), and n Hill (black), plotted against the percentage completion of the case load for all countries listed in Table 1 In Figure 7 we plot the fitted values for cases final at completion percentages of 94% down to 14.5% (below 14% the fitting process failed). The first thing to notice is that the uncertainties in cases final returned by nonlinear regression, plotted as vertical error bars, do not reach the value obtained for the full data set (dotted line). From this we learn that when using nonlinear regression to fit an incomplete data set, the reported uncertainties are greatly underestimated. The second observation is that below 30% completion, the cases final value is far too low, by 2-to three-fold. If your set of reported total cases ended within this range, that is, well below t 50 , then one cannot expect the fitted parameters returned by nonlinear regression to be accurate (compared to values obtained from the full data set). Similar errors were noted for fitted values of t 50 and n Hill below 30% completion (data not shown). Another way to look at this problem is to examine t 90 predictions at various points of completion %. Figure 8 plots t 90 (calculated from the Equation (4) fitted parameters, using Equation (6)) for the Germany data set ending on day 31 (14.5% completion), day 32 (17% completion), etc., up to the full 68-day data set (94% completion). Note that t 90 rises dramatically from day 31 to day 39 (39% completion), after which it continues to rise more slowly all the way to the end of the data set on day 69 (94% completion). This receding "goal line" is another example of the dangers of trying to fit an incomplete data set. Consider the Germany data set ending on day 34, which is 22% complete. From fitting this incomplete data set, one would expect t 90 to be day 47 (Figure 8) , and the final number of total cases to end up around 75,000 (Figure 7) . But moving forward 13 days to day 47, t 90 is still 12 days away (talk about running and getting nowhere!), and the predicted final number of cases has more than doubled, to 155,000. Here is another example of this problem: An earlier draft of this paper included data up to March 31 (2020). At that point (day 39), the U.S. case load was 213,000, and fitting suggested cases final = 440,000 ± 14,000 and t 90 = day 51. As I write this on May 1 (day 69), the current U.S. case load is 1.1 million, and t 90 = day 83! The bottom line is that one has to take these fitting results with a grain of salt, especially when conditions such as testing frequency and social intervention change during the course of the epidemic. Having said that, the predictions do become fairly reliable above 60% completion. Of course there is an endless loop problem here, because one does not know the actual cases final until the data set is complete, so your calculation of % completion will have an uncertain denominator. However, Figure 7 allows us to calculate that this error should be less than ≈10%. A word is in order concerning the use of diagnosed cases, as opposed to hospitalizations or deaths. I have chosen to use reported cases because they are a current indicator of the epidemic, whereas deaths and hospitalizations are trailing indicators. However, it is definitely true that increasing testing during the course of the epidemic skews the fits. This explains, at least partially, the modest rise in cases final seen in Figure 7 after 40% completion. The skewing effect of increased testing can also be seen in the slight upward tilt of the California total cases curve after day 48 ( Figure S31) , and in the increase in cases final and t 50 , and decrease in n Hill and completion % from April 27 to May 3 for almost all of the countries that were re-fit on May 3 ( Table 1 ). The good news is that hospitalization and death data sets follow the same trend as total cases (only delayed by 1 or 3 weeks, respectively), and can also be fit to the Hill equation (total cases, Equation 4) and its first derivative (daily cases, Equation 5). Let us close with an example of the usefulness of this fitting procedure. On April 8 (2020), the governor of New York announced that, based on the models his experts were using, he expected the peak of the COViD-19 outbreak would be reached in mid-June. Assuming that by "peak" he meant the maximum of the new daily cases (first derivative) curve, simple plotting and fitting would have shown him that New York state had in fact already passed its peak of new infections earlier that week, on April 4! Some experts predict that the world will have to deal with a resurgence of CoViD-19 in the fall of 2020, and perhaps even in years afterward. Even if not, we are only one interspecies virus hop and genetic recombination away from the next epidemic. In the meantime, biochemistry students can be regaled with the unexpected usefulness of the Hill equation in predicting the progression of epidemics. ACKNOWLEDGMENTS I wish to thank my colleagues J. Charles Williamson for his data-fitting insight, and Mark Janeba for his help with equation differentiation, and my partner Rosalie Schele for her help with data entry. Todd P. Silverstein https://orcid.org/0000-0003-3349-2876 Additional supporting information may be found online in the Supporting Information section at the end of this article. How to cite this article: Silverstein TP. CoViD-19 epidemic follows the "kinetics" of enzymes with cooperative substrate binding. Biochem Mol Biol Educ. 2020;1-8. https://doi.org/10.1002/bmb.21397 A contribution to the mathematical theory of epidemics Infectious diseases of humans: dynamics and control Mathematical models in population biology and epidemiology Mathematical modelling of infectious disease