key: cord-0878195-3y2t7l45
authors: Dong, Chunjiao; Qiao, Yixian; Shang, Chunheng; Liao, Xiwen; Yuan, Xiaoning; Cheng, Qin; Li, Yuxuan; Zhang, Jianan; Wang, Yunfeng; Chen, Yahong; Ge, Qinggang; Bao, Yurong
title: Non-contact screening system based for COVID-19 on XGBoost and logistic regression
date: 2021-11-03
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2021.105003
sha: 88a63ce613cb7fa499f6fb8cf8b02e23f9af366b
doc_id: 878195
cord_uid: 3y2t7l45

BACKGROUND: The coronavirus disease (COVID-19) caused a global health crisis in 2019, 2020, and beyond. Currently, methods such as temperature detection, clinical manifestations, and nucleic acid testing are used together to determine whether patients are infected with the severe acute respiratory syndrome coronavirus 2. However, during the peak period of COVID-19 outbreaks and in underdeveloped regions, medical staff and high-tech detection equipment were limited, resulting in the continued spread of the disease. Thus, a more portable, cost-effective, and automated auxiliary screening method is necessary. OBJECTIVE: We aim to apply a machine learning algorithm and a non-contact monitoring system to automatically screen potential COVID-19 patients. METHODS: We used impulse-radio ultra-wideband radar to detect respiration, heart rate, body movement, sleep quality, and various other physiological indicators. We collected 140 radar monitoring records from 23 COVID-19 patients in Wuhan Tongji Hospital and compared them with 144 radar monitoring records from healthy controls. Then, the XGBoost and logistic regression (XGBoost + LR) algorithms were used to classify the records as belonging to patients or healthy subjects. RESULTS: The XGBoost + LR algorithm demonstrated excellent discrimination (precision = 92.5%, recall rate = 96.8%, AUC = 98.0%), outperforming other single machine learning algorithms. Furthermore, the SHAP values indicate that the number of apneas during REM, mean heart rate, and some sleep parameters are important features for classification. CONCLUSION: The XGBoost + LR-based screening system can accurately identify COVID-19 patients and can be applied in hotels, nursing homes, wards, and other crowded locations to effectively help medical staff.

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has resulted in a large-scale global pandemic owing to its extreme contagiousness and high fatality rate [1]. temperature testing were proposed as alternative diagnostic methods in hospitals [2, 3]. However, owing to the relative novelty of the disease, doctors and medical personnel encounter challenges in accurately identifying coronavirus disease (COVID-19) cases in underdeveloped regions. As a result, many suspected cases cannot be tested, treated, and quarantined in time; thus, the spread of the virus may continue [4-6]. Moreover, because SARS-CoV-2 is highly infectious, the risk of infection is three times higher in doctors and nurses than in the general population, as they are closer to patients during treatment [7]. Therefore, this study aims to find a non-contact automated detection device and method for testing SARS-CoV-2 infections without assistance from professional physicians.

Infected patients experience symptoms such as fever, fatigue, dry cough, and dyspnea [8]. Some researchers have attempted to detect these symptoms using non-contact devices and machine learning algorithms [9]. They focused on two main issues: the features extracted from the subject and the algorithms used to recognize infected subjects.

Regarding the features, many infectious disease detection systems have focused on detecting abnormal changes in heart rate, respiration rate, and facial temperature. Yang et al. [10] designed a contactless dengue fever screening system with microwave sensors (detecting heart rate, respiration rate, and the standard deviation of the heartbeat interval), using a neural network and the SoftMax function to determine whether the disease is infectious as well as the probability of infection, and achieved 98% accuracy. However, they believed that facial temperature could be greatly influenced by the environment. Matsui et al. [18, 19] used respiratory rate, heart rate, and facial thermal imaging to rapidly screen for influenza based on linear discriminant analysis (LDA). Their results revealed that the system has higher accuracy than systems with only thermal imaging, achieving 88.9% accuracy.

Regarding the classification methods, most studies applied a single machine learning classifier to detect infections, for example, logistic regression [10], support vector machine (SVM) [11, 12], k-means [13], k-nearest neighbor (KNN) [14], decision trees [15], and random forest (RF) [16, 17].

In this study, a non-contact vital sign monitoring system was used to monitor COVID- The sleep monitoring report has 25 data points, as shown in Table 1, which fully reflect the patient's nighttime breathing, heartbeat, sleep structure, body movement, apnea, and other aspects. The data distribution of patients was significantly different from that of healthy subjects. Therefore, we can classify patients and healthy subjects based on these features.

where the hypothesis $f_k(x_i)$ is the output of the $k$-th tree, $\hat{y}_i^{(t)}$ is the current output of the model, and $y_i$ is the actual result. $K$ represents the number of decision trees, and $t$ represents the $t$-th iteration; that is, at every iteration we find an optimal model to add to the existing model so that the predicted value moves closer to the real value. We build the optimal model by minimizing the loss function. When the training dataset is small, the model over-fits easily; therefore, it is generally necessary to add a regularization term to reduce the complexity of the model.

where $\mathcal{F}$ is the hypothesis space and $\Omega(f)$ controls the complexity of the model. Therefore, the objective function is given by the following equation:

The first part on the right side of equation (4) is the training error, and the middle part is the penalty on model complexity (the sum of the complexity of all trees), which contains two parts: the number of leaf nodes and the value of each leaf node. The expression is given as

where $T$ is the number of leaf nodes, $\lVert w \rVert$ is the norm of the vector of leaf node values, $\gamma$ controls the difficulty of splitting a node, and $\lambda$ is the L2 regularization coefficient.
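The displayed equations did not survive in this copy of the text. As a hedged reconstruction, the symbol definitions above match the standard XGBoost formulation, which can be written as follows. The notation ($l$ for the per-sample loss, $n$ for the number of samples) and the exact form are assumptions taken from that standard formulation, not recovered from the source, and the equations are left unnumbered for that reason.

```latex
% Assumed standard XGBoost formulation; not copied from the original paper.
\begin{align*}
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i)
                 = \hat{y}_i^{(t-1)} + f_t(x_i),
                 \qquad f_k \in \mathcal{F} \\
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\bigl(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\bigr)
                    + \sum_{k=1}^{t} \Omega(f_k) \\
\Omega(f) &= \gamma T + \tfrac{1}{2}\,\lambda\,\lVert w \rVert^{2}
\end{align*}
```

At iteration $t$ the complexity terms of the earlier trees ($k < t$) are fixed, so only $\Omega(f_t)$ affects the choice of the new tree.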

We expand the loss function with a second-order Taylor approximation and use a greedy algorithm to solve for the model parameters.
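To make the second-order expansion and the greedy search concrete, the following is a minimal sketch under the standard XGBoost derivation, which is an assumption here rather than the authors' code. With first and second derivatives $g_i$ and $h_i$ of the loss, a leaf with sums $G$ and $H$ has optimal value $-G/(H+\lambda)$, and a candidate split is scored by the gain $\tfrac{1}{2}[G_L^2/(H_L+\lambda) + G_R^2/(H_R+\lambda) - G^2/(H+\lambda)] - \gamma$. The function names, the logistic loss, and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y_true, raw_pred):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    g, h = [], []
    for y, z in zip(y_true, raw_pred):
        p = sigmoid(z)
        g.append(p - y)
        h.append(p * (1.0 - p))
    return g, h


def leaf_weight(G, H, lam):
    """Optimal leaf value from the second-order expansion: -G / (H + lambda)."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return G * G / (H + lam)


def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Greedy scan over one feature; returns (threshold, gain) of the best split."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best = (None, 0.0)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += g[i]
        HL += h[i]
        if x[i] == x[nxt]:
            continue  # cannot split between identical feature values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
                      - leaf_score(G, H, lam)) - gamma
        if gain > best[1]:
            best = ((x[i] + x[nxt]) / 2.0, gain)
    return best


# Hypothetical toy data: one feature, label 1 = patient, 0 = healthy.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]
g, h = grad_hess(y, [0.0] * len(y))  # start from a raw score of 0 for every sample
threshold, gain = best_split(x, g, h)
```

In this toy case the scan picks the threshold between the two groups, and the leaf holding the healthy samples receives a negative value, which lowers their predicted probability.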

LR is a basic binary classification model. Based on the importance of the feature values, the top eight features, namely, "REMSATims," "meanHR," "slepMin," "latnMin," "AHI," "meanNMD," "maxRR," and "medHR," are selected for model training.
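One way to read the XGBoost + LR combination described here is: rank features by importance, keep the top ones, and fit a logistic regression on them. The sketch below illustrates that reading with the standard library only. The importance scores, the feature values, the learning rate, and the helper names are hypothetical and are not taken from the paper; only the feature names come from the text.

```python
import math


def top_k_features(importance, k):
    """Return the k feature names with the largest importance score."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


# Hypothetical importance scores (e.g., from a tree model) and standardized records.
importance = {"REMSATims": 0.30, "meanHR": 0.25, "slepMin": 0.10, "AHI": 0.05}
records = [
    {"REMSATims": 1.2, "meanHR": 0.9, "slepMin": -0.3, "AHI": 0.1},
    {"REMSATims": 0.8, "meanHR": 1.1, "slepMin": 0.2, "AHI": -0.2},
    {"REMSATims": -1.0, "meanHR": -0.7, "slepMin": 0.1, "AHI": 0.3},
    {"REMSATims": -0.9, "meanHR": -1.2, "slepMin": -0.1, "AHI": 0.0},
]
labels = [1, 1, 0, 0]  # 1 = patient, 0 = healthy

selected = top_k_features(importance, k=2)
X = [[r[name] for name in selected] for r in records]
w, b = fit_logistic(X, labels)
```

The study keeps eight features; the sketch keeps two only to stay small. Calling `predict_proba(w, b, x)` on a new record restricted to the selected features returns the estimated probability of the patient class.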

We selected precision, recall, and the receiver operating characteristic (ROC) curves as the evaluation criteria [25]. The precision measures the accuracy of the model in terms of false positives, that is, the number of healthy subjects misdiagnosed with the disease. The lower the false positive rate, the higher the precision of the model.

The recall rate provides information regarding false-negative cases, that is, the number of infected patients predicted to be healthy. In the context of disease screening, missing an infected patient is considered more serious than misclassifying a healthy subject. Therefore, when evaluating models, recall is more significant than precision.
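The evaluation criteria above can be sketched as follows, assuming the usual definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), and computing the area under the ROC curve as the fraction of patient/healthy pairs in which the patient receives the higher score. The labels, scores, and the 0.5 threshold are hypothetical illustration values, not data from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with label 1 = patient and 0 = healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of predicted patients who are truly patients."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of true patients who are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
prec = precision(tp, fp)
rec = recall(tp, fn)
area = auc(y_true, scores)
```

Lowering the decision threshold trades precision for recall, which is the direction a screening setting would usually prefer given the cost of a missed patient.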


In this study, six machine learning algorithms, The 10-fold cross-validation method [26] was used to further divide the training set into 10 folds; each fold was used in turn as the validation set to select the optimal parameters. Table 2 lists the final parameter settings for each algorithm. The confusion matrices of the six algorithms are shown in Figure 5. Here, it can be seen that the RF shown in Figure 6. The classification performance of XGBoost+LR is also better than that of the other algorithms, with an AUC of 0.988. In summary, we demonstrated the relationship between the data obtained from non-

Clinical features of patients infected with 2019 novel coronavirus in Wuhan

New standards for devices used for the measurement of human body temperature

Medical applications of infrared thermography: A review

The non-contact handheld cutaneous infra-red thermometer for fever screening during the COVID-19 global emergency

Symptom Screening at Illness Onset of Health Care Personnel With SARS-CoV-2 Infection in King County

Symptom-based screening for COVID-19 in health care workers: The importance of fever

Use of Physiological Data From a Wearable Device to Identify SARS-CoV-2 Infection and Symptoms and Predict COVID-19 Diagnosis: Observational Study

Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study

How machine learning could be used in clinical practice during an epidemic. Critical care

Dengue Fever Screening Using Vital Signs by Contactless Microwave Radar and Machine Learning

Support-vector networks

A non-contact infection screening system using medical radar and Linux-embedded FPGA: Implementation and preliminary validation

A novel infection screening method using a neural network and k-means clustering algorithm which can be applied for screening of unknown or unexpected infectious diseases

Multiple Vital-Sign-Based Infection Screening Outperforms 5

A novel machine-learning-based infection screening system via 2013-2017 seasonal influenza patients' vital signs as training datasets. The Journal of infection

Random Forests

Pattern Recognition and Machine Learning

A novel screening method for influenza patients using a newly developed non-contact screening system

Short Time and Contactless Virus Infection Screening System with Discriminate Function Using Doppler Radar. Bio-inspired Computing: Theories and Applications

infectious diseases: IJID: official publication of the International Society for Infectious Diseases

XGBoost: A Scalable Tree Boosting System

Noncontact accurate measurement of cardiopulmonary activity using a compact quadrature Doppler radar sensor

A Unified Approach to Interpreting Model Predictions

A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation

A study of cross-validation and bootstrap for accuracy estimation and model selection

COVID-19 diagnosis is difficult in underdeveloped and understaffed areas

Physiological indicators, e.g., heart rate and sleep quality, were measured

XGBoost and logistic regression were combined to classify patient data

Achieved precision = 92.5%, recall rate = 96.8%, and AUC = 98.0%

Apneas during REM, mean heart rate, and sleep parameters are key features