Deep Epidemiological Modeling by Black-box Knowledge Distillation: An Accurate Deep Learning Model for COVID-19
Dongdong Wang, Shunpu Zhang, Liqiang Wang
2021-01-20

Abstract: An accurate and efficient forecasting system is imperative to the prevention of emerging infectious diseases such as COVID-19 in public health. Such a system requires accurate transient modeling, low computation cost, and few observation data. To tackle these three challenges, we propose a novel deep learning approach that uses black-box knowledge distillation for accurate and efficient transmission dynamics prediction in a practical manner. First, we leverage mixture models to develop an accurate, comprehensive, yet impractical simulation system. Next, we use simulated observation sequences to query the simulation system and retrieve simulated projection sequences as knowledge. Then, with the obtained query data, sequence mixup is proposed to improve query efficiency, increase knowledge diversity, and boost distillation model accuracy. Finally, we train a student deep neural network with the retrieved and mixed observation-projection sequences for practical use. The case study on COVID-19 shows that our approach accurately projects infections with much lower computation cost when observation data are limited.

The spread of infectious diseases is a serious threat to public health and may cause millions of deaths every year. To effectively battle infectious diseases, accurate modeling of their transmission patterns is critical. This issue becomes more pressing when the infectious disease, like COVID-19, is unprecedented, its transmission dynamics are complex, and observation data are limited. Due to data limitation, we need to solve this problem with the help of conventional physics-based epidemiological models. However, it is still difficult to accurately describe complex dynamics with a single model.

Mixture models are widely used to solve complex transient modeling problems accurately. They can refine the temporal scale into several states with different onsets, model these states separately, and then mix the modeling results to represent complex dynamics. Although this refinement of the temporal scale depicts the variation in a physical system more accurately, the difficulty of calibrating a mixture model and its computational complexity can increase exponentially, since the refinement results in a very large parameter space, i.e., the curse of dimensionality. When prior knowledge about an infectious disease, such as COVID-19, is limited, exhaustive search in such a large space is inevitable for accurate model calibration, which can easily render a mixture model impractical. In practice, some modelers adopt assumptions that truncate the search space with a coarse grid to trade accuracy for efficiency and feasibility, but this can cause large uncertainty and model degradation.

To address this problem, we formulate a new approach based on black-box knowledge distillation. The approach is developed with three objectives: higher prediction accuracy, lower modeling cost, and higher data efficiency. To achieve higher prediction accuracy, we first leverage mixture models to create a comprehensive, accurate, but probably impractical epidemic simulation system.
This system is viewed as a black-box teacher model that contains sophisticated modeling knowledge. To reduce modeling cost and make this system feasible, we employ knowledge distillation to transfer the accurate modeling knowledge from the impractical black-box teacher model to a deep neural network for practical use. To realize this knowledge transfer, we collect a set of simulated observation sequences to query the teacher model and acquire their corresponding simulated projection sequences as knowledge. In particular, to improve model performance with limited data, we propose sequence mixup to augment the data pool, thus reducing model queries, increasing sequence diversity, and boosting modeling accuracy. With all retrieved and mixed observation-projection sequence pairs, we train a student deep neural network for infection prediction. This student network can predict as accurately as the teacher model while saving substantial computation cost and requiring fewer observation data.

To the best of our knowledge, we are the first to propose a black-box knowledge distillation based framework that solves epidemiological modeling by leveraging mixture models. Besides this novelty, our work makes the following contributions: (1) the distilled student deep neural network enables automatic and accurate model calibration and projection; (2) sequence mixup is proposed to reduce teacher model queries for higher efficiency, improve the coverage of the obtained data for better accuracy, and further enhance knowledge transfer with fewer observation data; (3) we validate our approach on COVID-19 infection projection, where it performs on par with or even better than state-of-the-art methods, such as the CDC Ensemble, with adequate accuracy over the evaluation period; (4) our approach provides a general solution for rendering impractical physics-based models feasible.

Epidemiological modeling has been extensively studied for decades. It focuses on accurately quantifying infectious disease transmission dynamics. The proposed methods fall into two main categories: classical physics-based modeling and data-driven approaches. For physics-based modeling, compartmental modeling such as SEIR (Kermack and McKendrick 1927) is well justified for practical projection. Different from physics-based modeling, and thanks to improvements in data collection, data-driven approaches have been developed upon statistical modeling of real observation data and are widely used for transmission dynamics projection, such as ARIMA (Benjamin, Rigby, and Stasinopoulos 2003) and ARGO (Yang, Santillana, and Kou 2015; Yang et al. 2017). With rapid advances in artificial intelligence, deep learning based modeling has been proposed as an alternative for infection projection, especially for an emergent pandemic like COVID-19 (Wu, Leung, and Leung 2020; Hu et al. 2020; Yang et al. 2020; Fong et al. 2020). However, these data-driven approaches can suffer from observation data limitations. Recently, a hybrid approach named DEFSI (Wang, Chen, and Marathe 2019) adopted compartmental modeling to alleviate the data limitation problem in deep neural network training.

Knowledge distillation (Hinton, Vinyals, and Dean 2015) is widely used to solve the deep neural network compression problem.
The conventional distillation process trains a smaller neural network, called the student model, with class probabilities, referred to as "dark knowledge", to retain the performance of the original cumbersome ensemble of models, called the teacher model. This approach can effectively reduce model size, which makes complex models feasible for real-world applications. Many complex applications in computer vision and natural language processing have confirmed its merits for model size reduction. For example, DistilBERT (Sanh et al. 2019) successfully reduces the size of the original BERT model by 40% while maintaining accuracy; TinyBERT (Jiao et al. 2019) leverages knowledge distillation in a framework for reducing transformer-based language models, which yields models with lower time and space complexity and thus facilitates their application; relational knowledge distillation (Park et al. 2019) further optimizes the distillation process and produces a more capable student model that can even outperform the teacher model. However, this effective approach has not been applied to complex epidemiological modeling, especially the infeasibility of mixture epidemiological models.

Figure 1: Modeling with black-box knowledge distillation. The teacher model is an accurate but significantly complex comprehensive simulation system. Both observation and projection sequences are simulated results. Model query is optimized by sequence mixup.

Mixup is a simple yet effective approach to augment training data and improve model performance (Zhang et al. 2017). This method improves the generalization of deep neural networks by enhancing the coverage of the data distribution, especially when training data are limited. The main idea is to synthesize data by convex combination, mixing both features and labels. It has been widely used in computer vision and natural language processing, for example Between-Class learning in speech recognition (Tokozume, Ushiku, and Harada 2017) and image classification (Tokozume, Ushiku, and Harada 2018), AutoAugment with learned augmentation strategies for classification (Cubuk et al. 2018), and wordMixup or senMixup with embedding mixup for sentence classification (Guo, Mao, and Zhang 2019). Further studies explore its potential for data-efficient learning, such as active mixup (Wang et al. 2020) and ranking distillation (Laskar and Kannala 2020). However, no prior work uses mixup to enhance epidemiological modeling efficacy and efficiency.

Figure 1 shows an overview of our approach to epidemiological modeling by black-box knowledge distillation. We leverage mixture models to build a comprehensive simulation system with accurate modeling knowledge yet significantly high complexity. Then, we use simulated observation sequences to query this system and retrieve simulated projection sequences as knowledge. To improve query efficiency and enhance knowledge transfer, sequence mixup is designed to efficiently augment the data pool. With retrieved and mixed observation-projection sequence pairs, a deep neural network is trained to retain the modeling accuracy of the original impractical simulation system and prepared for practical use.

Many approaches can be used to create mixture models and build a comprehensive simulation system M. To ensure reliability, we select the widely accepted SEIR compartmental model as the modeling approach.
In SEIR, every individual in the modeled society (the host society) is in one of four health states: susceptible, exposed, infectious, and recovered. The state transition starts from "susceptible", moves to "exposed", then to "infectious", and finally reaches the "recovered" state. The model is therefore constrained by the boundary condition N = S + E + I + R, where S, E, I, and R denote the susceptible, exposed, infectious, and recovered populations, respectively, and N is the population of the entire host society.

For an accurate depiction of transient transmission dynamics, we employ a linear mixture model (Brauer 2017) to represent the heterogeneity of the host society (Bansal, Grenfell, and Meyers 2007). The host society N is divided into several component host communities N_i via the linear combination in Equation 1, and the modeling results from these communities are mixed to represent the dynamics of the entire host society N:

N = Σ_i N_i.    (1)

The division of the host society is based on heuristics and depends on the modeling resolution. Within each community N_i, the transmission dynamics across all compartments are described by the ordinary differential equation (ODE) system in Equation 2:

dS_i/dt = α N_i − β S_i I_i / N_i − μ S_i
dE_i/dt = β S_i I_i / N_i − σ E_i − μ E_i
dI_i/dt = σ E_i − γ I_i − μ I_i        (2)
dR_i/dt = γ I_i − μ R_i

where S_i(t), E_i(t), I_i(t), and R_i(t) denote the susceptible, exposed, infectious, and recovered populations at time t, respectively; β, σ, and γ denote the infection, latent, and recovery rates over the entire incidence, respectively; and α and μ are the natural birth and death rates during this period, which are assumed to be zero in this study.

SEIR modeling is a typical boundary value problem (Farlow 1993), whose solution relies on the boundary condition (BC), initial condition (IC), and ODEs. In this study, for each component host community, a constant BC is given by the total population N_i (since there are no vital dynamics), the IC is determined by the compartment state information, and the ODEs are specified by the dynamics coefficients {β, σ, γ}. Conventional numerical modeling requires model calibration, which adjusts parameters to obtain agreement between real observation data and modeled results, using grid search for an optimal combination {BC, IC, ODEs} within the constraints of the search space. If the search space for {BC, IC, ODEs} is larger and finer-grained, the calibration results fit the real observation data better and the simulated projection results are more reliable. Therefore, we construct a comprehensive simulation system as an ensemble of simulation scenarios drawn from a large, fine-grained search space, which enables accurate model calibration and projection.

However, grid search over this simulation ensemble is extremely time-consuming due to the curse of dimensionality. For example, suppose we have just 2 options each for BC, IC, and ODEs (real problems require many more). Each component host community then has 8 simulation scenarios, and with 10 component communities, the ensemble for the entire society N reaches 8^10 simulation scenarios. It is infeasible to find an optimal solution with random grid search. Therefore, we conduct knowledge distillation to distill this ensemble simulation system into a deep neural network for practical use.
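To make a single simulation scenario concrete before turning to distillation, the sketch below integrates the per-community SEIR ODEs of Equation 2 (with α = μ = 0) and mixes the communities as in Equation 1. It is a minimal illustration assuming SciPy is available; all parameter values are placeholders, not the paper's calibrated settings.

```python
import numpy as np
from scipy.integrate import odeint

def seir_ode(state, t, beta, sigma, gamma, n_pop):
    # Right-hand side of the per-community SEIR system (Equation 2, alpha = mu = 0).
    s, e, i, r = state
    ds = -beta * s * i / n_pop
    de = beta * s * i / n_pop - sigma * e
    di = sigma * e - gamma * i
    dr = gamma * i
    return [ds, de, di, dr]

def simulate_community(ic, beta, sigma, gamma, n_pop, days):
    # Integrate one component host community; return its infectious curve I(t).
    t = np.arange(days, dtype=float)
    sol = odeint(seir_ode, ic, t, args=(beta, sigma, gamma, n_pop))
    return sol[:, 2]

# Mixture model (Equation 1): the whole-society infection curve is the sum of
# the component communities, each with its own {BC, IC, ODEs} from the grid.
communities = [  # illustrative placeholder scenarios, not calibrated values
    dict(ic=[9990.0, 5.0, 5.0, 0.0], beta=0.6, sigma=0.20, gamma=1/7, n_pop=1e4, days=140),
    dict(ic=[49980.0, 10.0, 10.0, 0.0], beta=0.4, sigma=0.25, gamma=1/7, n_pop=5e4, days=140),
]
mixture_curve = sum(simulate_community(**c) for c in communities)
```

A grid over {BC, IC, ODEs} then amounts to enumerating such scenario lists for every community, which is exactly where the 8^10-style blow-up arises.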
Conventional knowledge distillation queries the teacher model for prediction probabilities, which are referred to as "knowledge". In our problem, the "knowledge" consists of the simulated projection sequences from the simulation system, since they contain the features of the modeling process. To acquire this modeling "knowledge", we conduct model querying as follows. First, we prepare a simulated observation sequence over the calibration period with a {BC, IC, ODEs} for each host community. Each {BC, IC, ODEs} is used as a "key" to query the teacher model. Then, the teacher model uses the "key" to return a query answer: a simulated sequence over the calibration and projection periods, i.e., a projection sequence. With more queries, more projection sequences are obtained and more accurate modeling knowledge is acquired.

To ensure adequate knowledge, distillation usually requires a large amount of training data from many model queries. However, too many queries are time-consuming, and more importantly, the simulated observation sequences are still too limited to acquire diverse knowledge. To improve distillation efficacy and data diversity, we employ sequence mixup to reduce the number of queries and enlarge the knowledge coverage.

Our sequence mixup forms convex combinations of multiple observation sequences x_i and projection sequences y_i with mix rates ω_i, where Σ ω_i = 1. Equation 3 presents this mixup process, which mixes observation sequences x and projection sequences y in the same manner:

x̂ = ω_1 x_1 + ω_2 x_2 + ... + ω_n x_n
ŷ = ω_1 y_1 + ω_2 y_2 + ... + ω_n y_n        (3)

The mixup projection sequence ŷ in Equation 3 uses the same coefficients ω_1, ω_2, ..., ω_n as in x̂, so it can be taken directly as the query answer for x̂ without querying the teacher model. These mixed sequences, as an alternative to queried knowledge, efficiently augment the training data and enhance the knowledge transfer from the teacher model. All retrieved and mixed sequences together constitute a training set (X, Y).

With the acquired observation-projection sequence pairs (X, Y), a deep neural network is trained to distill the modeling knowledge within the comprehensive simulation system. The conventional distillation process minimizes the distillation loss L_dis = D_1(y_n^true, S(x_n)) + D_2(T(x_n), S(x_n)), where T(x_n) is the output of the teacher model T on data x_n, S(x_n) is the output of the student network S on x_n, D_1 is the supervised loss computed against the data label y_n^true, and D_2 is the imitation loss for imitating the model output. In our problem, there is no knowledge about the true label y_n^true for x_n, so the distillation loss reduces to the imitation loss only, as shown in Equation 5:

L_dis = D_2(T(x_n), S(x_n))        (5)

We select the mean squared error as the distillation loss function. The proposed black-box knowledge distillation is a general approach that can be applied to different student networks. For the COVID-19 problem, we use a multilayer perceptron (MLP), which is detailed in the case study.

Algorithm 1 presents the overall procedure of our black-box knowledge distillation based epidemiological modeling. Beginning with a modeling approach, a comprehensive epidemic simulation system is built as a teacher model M_T. We then pick a few simulated observation sequences x to query the teacher model and retrieve their simulated projection sequences y. With the obtained sequences (x, y), we construct a large observation-projection pool (X, Y) using sequence mixup. Finally, we train a student deep neural network M_S with (X, Y).

Algorithm 1
INPUT: A modeling approach F, such as mixture SEIR.
1: Build a comprehensive simulation system with F as a black-box teacher model M_T.
2: Prepare simulated observation sequences X_obs to query M_T and retrieve the corresponding simulated projection sequences Y_query.
3: Construct mixed sequences (X_mix, Y_mix) by sequence mixup, where ω is heuristically chosen.
4: Train a student deep neural network M_S with (X, Y) = (X_obs, Y_query) ∪ (X_mix, Y_mix) to minimize the distillation loss L_dis.
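As a concrete illustration of the query and mixup steps in Algorithm 1, here is a minimal sketch assuming NumPy; query_teacher is a hypothetical stand-in for the black-box simulation system, and pairing two sequences per mix with Dirichlet-drawn rates is only one plausible heuristic for choosing ω.

```python
import numpy as np

def sequence_mixup(x_obs, y_query, n_mixed, n_mix=2, seed=0):
    # Sequence mixup (Equation 3): convex-combine observation sequences and
    # their projection sequences with the same mix rates, so sum(w) == 1.
    rng = np.random.default_rng(seed)
    x_mix, y_mix = [], []
    for _ in range(n_mixed):
        idx = rng.choice(len(x_obs), size=n_mix, replace=False)
        w = rng.dirichlet(np.ones(n_mix))  # heuristic mix rates omega
        x_mix.append(np.tensordot(w, x_obs[idx], axes=1))
        y_mix.append(np.tensordot(w, y_query[idx], axes=1))
    return np.stack(x_mix), np.stack(y_mix)

# Step 2: query the black-box teacher with simulated observation "keys".
# x_obs = np.stack(sampled_observation_sequences)          # e.g., 1000 keys
# y_query = np.stack([query_teacher(x) for x in x_obs])    # retrieved knowledge
# Step 3: augment the pool without further queries, e.g., to 100K pairs.
# x_mix, y_mix = sequence_mixup(x_obs, y_query, n_mixed=100_000)
# Step 4 trains the student on (x_obs, y_query) together with (x_mix, y_mix).
```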
Data. We evaluate our approach on the open COVID-19 dataset provided by Johns Hopkins University (Dong, Du, and Gardner 2020). Our experiments focus on the daily increase in infection cases. From the reported data, we derive active infection cases based on a 7-day transmission duration (Thevarajan et al. 2020), as the data do not explicitly report the number of recovered patients. The observation period runs from 04/06/2020 to 08/23/2020, and the evaluation period is from 08/24/2020 to 09/13/2020.

Teacher Model. We refine the candidate parameter ranges for {BC, IC, ODEs} to more reliable ranges. With the refined parameter choices, the simulation system contains 160000^10 scenarios for the entire society N, which is impractical. To facilitate the distillation assessment, we randomly sample the system down to 10^7 scenarios as an approximate version of the teacher model for comparative study. The teacher model generates a simulated projection sequence by minimizing the mean squared error between the real observations and the simulation over the calibration period, similar to exhaustive search.

Table 1: Error assessment of model calibration (04/06 - 08/23) and projection (08/24 - 09/13).

Query Sequences and Mixup. We randomly pick 1000 {BC, IC, ODEs}s to prepare the simulated observation sequences used to query the teacher system. Note that, compared to the size of the ensemble, this number is so small that the selected sequences reveal little knowledge about the simulation system, which preserves the black-box teacher model setting. Given the 1000 query results, we construct a large pool of 100K sequences by sequence mixup, where ω is set heuristically.

Student Deep Neural Network Training. Our student network is an MLP with 3 hidden layers of 80 neurons each. The batch size is 128, the learning rate is set to 0.1, the Adam optimizer is chosen, and the weight decay is 1e-5. The total number of epochs is 300, and the learning rate is reduced by 90% after every 100 epochs. We select 1K sequences from the constructed sample pool as a training set for efficient training.

Studied Cases. We apply our black-box distillation framework to distill comprehensive infection modeling systems for the US, Mexico, the Philippines, and Brazil. The infection patterns of these countries are representative of complex dynamics that involve multiple peaks and complicate model calibration. To obtain an adequate teacher model for each studied country, we heuristically specify the search space boundaries for {BC, IC, ODEs}s using the national population, the reported positive cases on March 30th (a week before April 6th), and the outbreak severity of each country.

Metrics. We evaluate accuracy with the mean absolute percentage error (MAPE) and the root mean squared error (RMSE):

MAPE = (100% / n) Σ |(y_o − y_m) / y_o|
RMSE = √((1 / n) Σ (y_o − y_m)²)

where y_o is the real observation sequence, y_m is the modeled sequence, and n is the total number of sequences. MAPE and RMSE are two widely adopted metrics for evaluating regression models. While a lower MAPE suggests that the general trend is better captured, higher errors can occur at larger observation values. RMSE is a better indicator for large values, since it penalizes such errors more heavily. Therefore, we use both metrics for accuracy evaluation.
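In code, the two metrics read as follows; this is a minimal sketch using the standard definitions, and the exact aggregation across sequences and time points is an assumption here.

```python
import numpy as np

def mape(y_obs, y_mod):
    # Mean absolute percentage error (%), computed over one sequence.
    y_obs, y_mod = np.asarray(y_obs, float), np.asarray(y_mod, float)
    return 100.0 * float(np.mean(np.abs((y_obs - y_mod) / y_obs)))

def rmse(y_obs, y_mod):
    # Root mean squared error; penalizes large deviations more heavily.
    y_obs, y_mod = np.asarray(y_obs, float), np.asarray(y_mod, float)
    return float(np.sqrt(np.mean((y_obs - y_mod) ** 2)))

# Example with placeholder weekly-increase values:
# mape([100.0, 120.0, 90.0], [110.0, 115.0, 95.0])  # -> ~6.6
# rmse([100.0, 120.0, 90.0], [110.0, 115.0, 95.0])  # -> ~7.1
```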
As to computation efficiency, we evaluate model complexity by the number of required simulation scenarios and the total time cost for each projection query. For the student network, the network training cost is included in each query, although retraining is not always necessary.

Competing Methods. First, we compare our approach with the approximate teacher model and with coarse search to examine accuracy and efficiency. Coarse search operates on a coarse grid search space for the mixture models: we reduce the number of component communities to 5, the options for BC to 5, and the choices for each ODE coefficient to 10. This can be taken as a reduced teacher model, but it still has a complexity of 10000^5. As with the teacher model, for practical performance evaluation we reduce it to 10^5 scenarios by random sampling, which ensures a data complexity similar to that of the student network. In the following sections, the approximate teacher model and coarse grid search are referred to as the teacher model and coarse search, respectively. Next, we compare our student network with 7 state-of-the-art forecasting models reported by the CDC (Bracher et al. 2020). These models are built with machine learning based methods (e.g., UM and UCLA-SuEIR), statistical methods (e.g., DDS), physics-based models (e.g., JHU-IDD and Columbia), and ensemble approaches (e.g., UVA and CDC Ensemble (Ray et al. 2020)).

Accuracy. Our calibration and projection results are reported as weekly increases in cases in Figure 2. The student network is comparable to the teacher model and significantly outperforms coarse search. These performance differences are quantified with MAPE and RMSE in Table 1: compared to the teacher model, the student network achieves similarly low or even lower MAPE and RMSE over both the calibration and projection periods. This observation results from the approximation of the teacher model and from the sequence mixup used for student network training. Coarse search yields the highest errors due to its limited search space. We also compare our student network with the 7 state-of-the-art models in Table 2, based on the data reported by the CDC (Bracher et al. 2020). Our model consistently outperforms the CDC Ensemble, which incorporates all reported state-of-the-art models, with a 30%-50% MAPE reduction over this period. In particular, our model yields more accurate 1-week-ahead predictions and more consistent performance over three weeks compared to the other models.

Efficiency. As Table 3 shows, the student network saves both simulations and time cost by orders of magnitude. The student network and coarse search are on par in total time cost, while the network training takes approximately 300 CPU seconds in our study. This performance gain results from the optimization with sequence mixup and the lightweight network design. It confirms that our approach significantly improves modeling efficiency and can facilitate the application of complex and cumbersome epidemiological models.

Significance of Mixup. Sequence mixup, as an efficient method for data augmentation, is very important for enhancing knowledge transfer in our approach. Compared to coarse search and the teacher model, our student network can learn scenarios outside the search space thanks to sequence mixup, and this knowledge can overcome the limits of the search space, thereby even improving calibration and projection accuracy. To verify its importance, we conduct experiments with 100K, 50K, and 25K mixed sequences generated from the 1000 retrieved observation-projection sequences and evaluate the resulting differences in calibration and projection performance for the US. As Table 4 shows, reducing the number of mixed sequences causes model degradation.
The degradation becomes worse in the projection period due to calibration error propagation. Thus, sequence mixup is critical to accurate projection.

Discussion. First, a comprehensive and accurate modeling system is critical in our framework: the more complex and accurate the comprehensive teacher model, the more accurate the student network's results. Next, the student network can interpolate information in latent space, which resolves the space discretization problem in grid search. The space of grid search is often too sparse to find an optimal solution; a dense search space is therefore imperative, but its cost grows exponentially. This can be alleviated by our proposed knowledge distillation. In addition, sequence mixup improves training data coverage and boosts model distillation, which helps the student network even outperform the teacher model. This implies that our proposed knowledge distillation scheme has the potential to improve on the teacher model. Also, once a well-trained student network is obtained, the model can be reused many times, even when new data are included. In contrast, conventional random grid search, as in the teacher model or coarse search, has to be reset and must query all entries again to retrieve projection solutions. This implies that the student network can save extra query cost.

Table 3: Required simulation scenarios and total time cost per projection query.
               Complete Teacher   Approximate Teacher   Student Network   Coarse Search
Simulations    160000^10          10^7                  10^3              10^5
Time (s)       N/A                ∼3×10^4               ∼400              ∼300

Table 4: Calibration and projection errors from the student network for the US with 100K, 50K, and 25K mixed sequences.

We propose an innovative and accurate modeling approach that leverages mixture models to ensure high accuracy and employs black-box knowledge distillation to reduce complexity while improving accuracy. It consists of teacher model development, model querying, sequence mixup, and student network training. The developed teacher model is a comprehensive simulation system that can accurately model challenging transient dynamics but is impractical. We prepare simulated observation sequences to query this simulation system and retrieve simulated projection sequences as knowledge for distillation. In particular, to reduce the number of queries and enhance knowledge transfer, sequence mixup is designed and effectively augments the training data. With the retrieved and mixed observation-projection sequences, a student deep neural network is trained as a distilled model for practical use. Our COVID-19 case study on the US, Mexico, the Philippines, and Brazil shows that this approach achieves high accuracy at much lower complexity. Our approach also outperforms some state-of-the-art methods, such as the CDC Ensemble, over the studied period. In the future, this work will be extended and applied to more epidemiological studies.
References
When individual behaviour matters: homogeneous and network models in epidemiology
Generalized autoregressive moving average models
Evaluating epidemic forecasts in an interval format
Mathematical epidemiology: Past, present, and future
AutoAugment: Learning augmentation policies from data
An interactive web-based dashboard to track COVID-19 in real time
Partial differential equations for scientists and engineers
Composite Monte Carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction
Augmenting data with mixup for sentence classification: An empirical study
Distilling the knowledge in a neural network
Artificial intelligence forecasting of COVID-19 in China
TinyBERT: Distilling BERT for natural language understanding
A contribution to the mathematical theory of epidemics
Data-efficient ranking distillation for image retrieval
The reproductive number of COVID-19 is higher compared to SARS coronavirus
Relational knowledge distillation
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Breadth of concomitant immune responses prior to patient recovery: a case report of non-severe COVID-19
Learning from between-class examples for deep sound recognition
Between-class learning for image classification
Neural networks are more productive teachers than human raters: Active mixup for data-efficient knowledge distillation from a black-box model
DEFSI: Deep learning based epidemic forecasting with synthetic information
Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
Using electronic health records and Internet search information for accurate influenza forecasting
Accurate estimation of influenza epidemics using Google search data via ARGO
Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions
mixup: Beyond empirical risk minimization

Acknowledgments
This project was supported in part by NSF 1704309 and the UCF COVID-19 Artificial Intelligence and Big Data Initiative.