key: cord-1000139-1kj3ivn9 authors: Lin, Mengxuan; Chen, Hui; Wang, Yuqi; Qiu, Shaofu; Yang, Mingjuan; Du, Xinying; Zheng, Tao; Song, Hongbin; Wang, Ligui title: A Model Study on Predicting New COVID-19 Cases in China Based on Social and News Media date: 2022-01-10 journal: J Infect DOI: 10.1016/j.jinf.2022.01.009 sha: 2b43e4e92660f6ee90450e5c23c296406a3adfd5 doc_id: 1000139 cord_uid: 1kj3ivn9

media-based predictions of the COVID-19 pandemic were sourced only from single social media platforms, such as Twitter and Weibo. Because users under 30 years of age account for more than half of Twitter and Weibo users, the age composition of these platforms is skewed, which is undesirable for statistical analysis; Twitter and Weibo therefore cannot represent all social media. 5 In addition, few reports use news media data, which limits the comprehensiveness and objectivity of the predictions.

In this study, we collected the daily new confirmed cases in China released by the National Health Commission from January 1, 2020, to March 18, 2020, a total of 78 days, as the data for analysis and prediction (see Supplementary Appendix). 8 We chose these dates because health commissions at all levels in China officially began counting newly confirmed cases daily on January 1, 2020, and newly confirmed cases in Wuhan, China, fell to 0 on March 18, 2020. Correspondingly, web crawler technology was used to capture public information from major news websites, electronic newspapers, Weibo, WeChat, and other apps. The meta-search crawler obtained data from search engine webpages using 32 keywords, such as "fever", "pyrexia", and "cough" (see Supplementary Appendix). We included more than 1,000 mainstream news outlets and electronic newspapers in China, such as China News, China Daily, and People's Daily.
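The study window described above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `study_days` and `align` are hypothetical, and filling days without crawled mentions with 0 is an assumption, not something the study states.

```python
from datetime import date, timedelta

# Study window as stated in the text (both endpoints inclusive).
START = date(2020, 1, 1)
END = date(2020, 3, 18)


def study_days(start, end):
    """Return every calendar day from start to end, inclusive."""
    span = (end - start).days
    return [start + timedelta(days=i) for i in range(span + 1)]


def align(series, days, fill=0):
    """Place a sparse {date: value} series on the full daily grid.

    Days missing from the series get `fill` (an assumption made here
    for illustration only).
    """
    return [series.get(day, fill) for day in days]


DAYS = study_days(START, END)

# Hypothetical keyword counts observed on only two days.
toy_counts = {date(2020, 1, 2): 3, date(2020, 3, 18): 7}
toy_aligned = align(toy_counts, DAYS)
```

Because 2020 is a leap year, the grid contains 31 + 29 + 18 = 78 days, which agrees with the count given in the text. Aligning every keyword series to the same grid keeps each media index comparable with the case counts on the same day.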
By analyzing data from major news media websites, electronic newspapers, Weibo, WeChat, and other apps related to the COVID-19 pandemic, we obtained the daily total relative index of each keyword from the different platforms. We calculated the daily total relative indexes of the 32 keywords and their correlation coefficients with daily new confirmed COVID-19 cases in China (Figure 1A). The keywords showing strong correlation (Pearson correlation coefficient >0·8) with new confirmed cases were "fever", "cough", "fatigue", "coronavirus", and "novel coronavirus". We plotted the daily relative indexes of these five keywords against the trend curve of daily new confirmed cases in China for visual analysis (Figure 1B). The trend curves showed that "coronavirus" and "novel coronavirus" had the best correlation with new confirmed cases in China, which is consistent with the histogram results.

Table 1 shows that best subset selection, partial least squares regression, stepwise regression, and elastic net regression all achieved good performance and prediction accuracy. According to these parameters, partial least squares regression was the best model we identified. It had the best performance and the lowest error among the five models (adjusted R² = 93·75%, cross-validation R² = 88·26%, RSS = 7·143, MSE = 0·101). The results of 0.632 bootstrapping on the training set and prediction set confirm the results in Table 1 (see Supplementary Appendix). When the data before February 19, 2020, were used as the training set to predict future cases, the R² values of principal component analysis, best subset selection, partial least squares regression, stepwise regression, and elastic net regression were 68·54%, 79·32%, 89·19%, 79·32%, and 77·60%, respectively. Partial least squares regression had the best goodness of fit.

In this article, we analyzed and calculated the correlation and hysteresis (time lag) between more than 1,000 social and news media sources and COVID-19 cases.
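The correlation screening and hysteresis analysis described above can be illustrated with a minimal sketch. It assumes each keyword has one index value per day aligned with the daily case counts; the function names, the toy numbers, and the choice to search lags by maximizing the Pearson coefficient are illustrative assumptions rather than the authors' implementation. Only the 0.8 threshold comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def lagged_pearson(index, cases, lag):
    """Correlate the media index on day t with cases on day t + lag."""
    if lag < 0 or lag > len(cases) - 2:
        raise ValueError("lag out of range")
    if lag == 0:
        return pearson(index, cases)
    return pearson(index[:-lag], cases[lag:])


def best_lag(index, cases, max_lag):
    """Lag (in days) at which the index correlates best with later cases."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_pearson(index, cases, k))


def screen_keywords(indexes, cases, threshold=0.8):
    """Keep keywords whose same-day correlation exceeds the threshold."""
    return sorted(k for k, v in indexes.items()
                  if pearson(v, cases) > threshold)


# Toy data, invented for illustration only.
cases = [1, 2, 4, 8, 12, 9, 6, 3, 2, 1]
toy_indexes = {
    "cough": [2, 4, 8, 16, 24, 18, 12, 6, 4, 2],   # tracks cases exactly
    "weather": [5, 1, 4, 2, 5, 1, 4, 2, 5, 1],     # unrelated pattern
}
fever = [4, 8, 12, 9, 6, 3, 2, 1, 1, 1]            # leads cases by two days

selected = screen_keywords(toy_indexes, cases)
fever_lead = best_lag(fever, cases, max_lag=4)
```

On this toy data the screen keeps only "cough", and the "fever" series is found to lead the case curve by two days. A lead of this kind is what allows a media index to be used as a predictor of cases a few days ahead.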
The results showed that, compared with social media, news media had a stronger average correlation with case counts, played a more important role in COVID-19 prediction, and is a data source that should not be ignored. Using social and news media data, we proposed five different models to predict the daily new confirmed cases in China, compared them, and selected partial least squares regression as the optimal model. This comprehensive model had high accuracy and low error and can effectively predict the daily new confirmed cases in China 3 days in advance based on social and news media data. In the future, our proposed model could be a powerful supplement to traditional methods of infectious disease surveillance.

This work was financially supported by grants from the China Mega-Project on Infectious Disease Prevention (No. 2017ZX10303401).

Note: 1. The final regression equation and the evaluation indexes of best subset selection and stepwise regression were the same. 2. NA represents the default value; for model verification, partial least squares regression does not require calculation of the response degree.

1. Social media WeChat infers the development trend of COVID-19
2. A review of infectious disease surveillance to inform public health action against the novel coronavirus SARS-CoV-2. SocArXiv
3. Ensemble predictions of coronavirus disease 2019 (COVID-19) in the US
4. Epidemic model guided machine learning for COVID-19 predictions in the United States
5. Prediction of COVID-19 waves using social media and Google search: a case study of the US and Canada
6. Using reports of symptoms and diagnoses on social media to predict COVID-19 case counts in Mainland China: observational infoveillance study
7. incidence using anosmia and other COVID-19 symptomatology: preliminary analysis using Google and Twitter
8. National Health Commission of the People's Republic of China