key: cord-0915959-vxtcsd7o
authors: Fathi-Kazerooni, Sina; Rojas-Cessa, Roberto; Dong, Ziqian; Umpaichitra, Vatcharapan
title: Correlation of subway turnstile entries and COVID-19 prevalence and deaths in New York City
date: 2020-12-04
journal: Infect Dis Model
DOI: 10.1016/j.idm.2020.11.006
sha: 4efcfb0c1ea0ce976a7f34e8b3cac4608f880503
doc_id: 915959
cord_uid: vxtcsd7o

In this paper, we show a strong correlation between turnstile entries data of the New York City (NYC) subway provided by NYC Metropolitan Transport Authority and COVID-19 deaths and cases reported by the NYC Department of Health from March to May 2020. This correlation is obtained through linear regression and confirmed through the prediction of the number of deaths by a Long Short-Term Memory neural network. The correlation is significantly accentuated through the consideration of incubation and symptomatic phases of this disease as experienced by people who died from it. We extend the analysis to each individual NYC borough. We also estimate the dates when the number of COVID-19 deaths and cases would approach zero by using the Auto-Regressive Integrated Moving Average model on the reported deaths and cases. We also backward forecast the dates when the first cases and deaths may have occurred.

New York City (NYC) was the epicenter of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in early 2020 [16] . Many people have been hospitalized and also many others lost their lives to this disease. It is unknown when the first COVID-19 cases occurred in the city, but the first case was reported on Feb. 29, 2020. The COVID-19 cases and deaths in NYC increased rapidly after that date, peaked in mid-April 2020 [21] , and started to decline after the city implemented a series of containment measures, including closing schools, social distancing, business shut-downs, and shelter-in-place [20] . As a result, the COVID-19 pandemic has brought and continues to bring economic hardship to NYC residents and people worldwide, and especially those in the lower economical strata [31] .

As city dwellers mainly rely on public transportation and more notably the subway to move around, it is expected that a highly contagious virus such as SARS-CoV-2 would easily spread through the NYC population. It is expected that the NYC subway, being an enclosed environment and also a major indicator of people's social activity, would be directly correlated with the spreading of SARS-CoV-2.

Although this correlation has been described by reported cases throughout neighborhoods, the correlation between the subway ridership and COVID-19 prevalence and deaths has not been shown. In this paper, we show this correlation be-tween the public turnstile entry data of NYC subway provided by the NYC Metropolitan Transportation Authority (MTA) [23, 32] and the statistics of the confirmed COVID-19 deaths provided by the NYC Department of Health (DOH) [22] . We used linear regression and tested the turnstile data from a few days before the day of the reported deaths and cases to support this correlation.

There are many prediction models for forecasting the spread of COVID-19 [2, 3, 6-8, 10, 14, 15, 17, 18, 24-26, 28, 30, 33, 34] . Much of the literature focuses on modeling the prevalence of COVID-19 in different countries and partially enclosed environments, such as cruise ships [6, 8, 14, 17, 28, 30, 34] . These models use local and regional data to examine the correlation of COVID-19 cases with environmental or social factors. Many of the approaches use the Susceptible-Exposed-Infective-Recovered epidemiological model to analyze data on this outbreak. These models rely on the accuracy of the reported cases. However, the accuracy of case reports is expected low at the early stage of this new outbreak and that may add significant errors in their analysis and forecast. To avoid such errors, some models forecast the number of COVID-19 cases according to the population size [13, 25, 26] . There is an increasing interest in models that estimate the impact of public transit on prevalence and the possibilities of contagion by airborne transmissions [10, 18] .

Our analysis leverages on the reported number of deaths because their count has higher accuracy than the reported number of cases. However, the data on cases is larger than the one on deaths, so that we also analyze the reported cases. We employ linear regression for analyzing the correlation of the reported number of COVID-19 deaths and the NYC subways turnstile entries and Long Short-Term Memory (LSTM) neural network [12] for predicting the number of deaths and cases through time-series data analysis to confirm the correlation between these two data sets. We extend this analysis to each of the five NYC boroughs: Bronx, Brooklyn, Manhattan, and Staten Island.

We also use the Auto-Regressive Integrated Moving Average (ARIMA) model to 1) predict the dates when COVID-19 deaths and cases would approach zero according to the reported data, and to 2) estimate the time when the contagion might have started in NYC by identifying the dates when the first case and death may have occurred through analysis of the provided cases and deaths data.

The analysis performed in this paper uses exclusively the mentioned public datasets, and so it leaves out factors that may have affected the actual counts of cases and deaths. For example, the data analyzed in this paper was recorded while no face coverings were widely considered by NYC dwellers.

The remainder of this paper is organized as follows. Section 2 describes the models and methods used in our data analysis. Section 3 describes the obtained results. Section 4 presents a discussion and remaining questions. Section 5 presents our conclusions.

In this section, we present the data sources and analysis tools used in this paper.

The NYC DOH dataset reports the first COVID-19 case on Feb. 29, 2020 and the first death on Mar. 11, 2020. The MTA turnstile entry data includes the number of subway entries and exits per station in NYC, recorded every four hours each day from a total of 379 subway stations [23, 32] . We calculate the average daily entries of NYC subway stations per day for our analysis. We also use the number of daily cases and the number of daily deaths of the COVID-19 dataset published by the NYC DOH [22] .

We consider the reported COVID-19 incubation and symptomatic periods [5, 11] , as a combined period, in the reported subway entry data as features in the analysis. We call these features shifted days. For example, a shifted day is a day prior to the day under prediction. The number of daily subway entries is denoted as 0 for a given day 0 , and the number of turnstile entries of days before 0 is denoted as 0 − . We use 0 and 0 − as features to predict the number of deaths and cases for day 0 , where 1 ≤ ≤ 25. We use 20% of the NYC DOH data as test data for validation and the rest for training.

In our analysis, we use linear regression with L1 regularization to find the correlation between the two datasets and the precedence of the different shifted days. We use LSTM, a recurrent neural network (RNN) [12] that learns the time evolution of data sequences [12, 27] , to predict the number of daily deaths and the number of daily cases and evaluate their accuracy as verification of the correlation in NYC and its five boroughs.

We use ARIMA, a tool to forecast time-series data [4, 9] , to predict the number of daily deaths and the number of daily cases, evaluate their accuracy, and estimate the dates of when the first case and death may have occurred. For the latter use, we analyze the data in a backward (time) sequence.

We adopt multiple linear regression with independent variables [19] . The estimated valuê in the linear regression model follows:

where represents the model weights or regression coefficients and is the error. L1 regularization is the absolute sum of the model weights multiplied by a shrinkage value ( ). It is formulated as ∑ | | and the regression model goal is to minimize:

where is the actual data point and is the value of the independent variables for each data point.

We evaluate the accuracy of the prediction performed through LSTM as an indicator of the correlation extent between the two datasets. We use a Savitzky-Golay filter to smooth out the prediction results of LSTM [29] .

ARIMA consists of three models: auto-regressive (AR), integration, and moving average (MA). It uses three hyper parameters: , , and , where is the order of auto-regressive model, is the degree of difference, and is the order of moving average [4] . The ARIMA( , , ) is defined as

where is the coefficient of discrete time linear equation of AR, is a time lag operator defined as = −1 , is the observed value at time , is the coefficient for the noise term, , in MA [9] .

The performance of the prediction models is evaluated by the 2 score, mean absolute error (MAE), and root mean square error (RMSE). The 2 score indicates how close the data are from the obtained regression [19] . 2 score is calculated as

where is the residual sum of squares and is the total sum of squares.

is calculated as 

where is the number of observations. An MAE that approaches 0 means that the model predicts data points with high accuracy. We use MAE to measure the training and validation losses of LSTM and, in turn, to show whether the prediction is a good fit and not overfitting or underfitting occurs. If the training loss is significantly lower than the vali-dation loss, the model is overfitting the training data, which means the model is learning complex patterns of training data that may not generalize in predicting unseen test data and it results in poor performance. If the validation loss is significantly lower than training loss, the model is underfitting the training data, which means that the model is unable to learn important patterns of the training data, and it results in poor model performance. The training is stopped and the model is improved to avoid overfitting and underfitting if the losses grow (although in different directions) as the training progresses [1] . RMSE is defined as

In other words, RMSE is the standard deviation of residuals. With normally distributed data points, one can expect 68% of predicted points to be within one RMSE from the mean and 95% to be within two RMSE from the mean. In LSTM predictions, we consider the combination of 2 scores and RMSE to assess accuracy. Figure 1 shows the number of subway turnstile entries and the reported daily COVID-19 deaths and cases from Mar. 1, 2020 to May 15, 2020. The trend of deaths trails the cases by a few days. The turnstile data include variations of ridership before the city executed a lockdown. We tested the correlation between NYC subway turnstile entry data and reported daily COVID-19 deaths with turnstile entry data of the predicted day 0 and shifted days ( 0 − ) for 1 ≤ ≤ 40, using linear regression. Figure 2 shows the calculated 2 score for 0 and each shifted day. The score of each considered shifted day is about or larger than 0.2, showing a strong correlation between these two data sets. Next, we predict the cases and deaths using LSTM to verify the correlation of the data sets. From the different number of shifted days shown in Figure  2 , we selected those that correspond to the possible reported incubation period of SARS-CoV-2 in a person plus possible symptomatic days. We used three feature sets (Feature sets A, B, and C). Feature set A uses the entries of the predicted day and the 14th prior day ( 0 , 0 − ; where = 14). Feature set B uses entries of the predicted day and five prior days: 10th, 12th, 13th, 14th, and 17th prior days ( 0 , 0 − , where = 10, 12, 13, 14, 17) . Feature set C includes the entries of the predicted day and prior-day entries of each of the considered 25 days ( 0 , 0 − ; where = 1, ..., 25) . This set is selected for completeness and as a reference.

We used the feature sets described in Section 2 on LSTM to predict the number of deaths and the number of cases. We compared the predicted numbers using the different features with the actual reported numbers. Figures 3(a) and 3(b) show the MAE of LSTM training and validation losses for the prediction of the number of deaths using Feature sets A and C, respectively. Feature Figure 3 (c) shows the prediction of the number of deaths using Feature sets A and C, respectively. Feature set C is the top performing feature set and Feature set B is the bottom performing one for the prediction of deaths, as shown in Table 2 . The results show that the correlation is strong with one shifted day and the correlation strengthens by adding more shifted days to the feature set as the scores show for the three considered feature sets. We also tested the use of the three feature sets on LSTM to predict the number of cases. Figures 4(a) and 4(b) show the MAE of training and validation losses in the prediction of cases for Feature sets A and C, respectively. The 2 score on the test data is 0.74 and the RMSE is 1,510.43 for Feature set A, and the 2 score and the test data is 0.99 and the RMSE is 389.71 for Feature set C. The LSTM predicted cases using Feature set A (Figure 4(c) ) follows the actual data with, however; small discrepancies. The prediction using Feature set C also follows the actual data; however, it is closer than that predicted with Feature set A. Table 2 summarizes the 2 scores and RMSE for each feature set in the prediction of cases. The RMSE of the predictions of the number of cases is much higher than that of the prediction of the number of deaths because the number of cases each day is larger than deaths. These losses further confirm the lower accuracy of the reported cases despite the high R 2 scores.

We analyzed the LSTM predictions of turnstile data and COVID-19 cases and deaths for each NYC borough to show the correlation for the different boroughs of NYC. Each borough may have different socioeconomic status. Table 3 lists the 2 scores and RMSE for each feature set and borough.

We also evaluated a prediction of the number of deaths and cases per borough considering the turnstile entry data of the boroughs. Table 3 summarizes the 2 and RMSE of the prediction of deaths per borough. Each of these scores was estimated using NYC-subway turnstile entry data per borough for each of the three feature sets used; A, B, and C. The table shows that 2 is the highest for Brooklyn and Queens using Feature sets A and C, and Bronx and Queens using Feature set B.

The RMSE is the smallest for Manhattan and Staten Island for all feature sets. These values combined with their 2 shows that the correlation is high. The RMSE is highest for Brooklyn and Queens in general. During the analyzed period of time, these two boroughs reported the highest number of deaths. Table 3 also shows 2 and RMSE for the number of cases per borough. Different from deaths, cases shows high 2 for the five boroughs; and reaching above 0.81 with Fea-ture set C. However, their RMSE is also higher than those for deaths. These scores present close similarities to those obtained at NYC level (i.e., all boroughs together). Figure 5 shows the predicted number of deaths for each of the five NYC boroughs. This figure shows the bottom and top performing feature sets; Feature sets A and C, respectively. The figure shows that for Feature set A, Brooklyn shows the highest similarity to the actual values, and Manhattan and Staten Island the lowest similarity to the actual values. As for Feature set C, the prediction of deaths is very similar to the actual data for the five boroughs. Manhattan and Staten Island reported the smallest number of cases during this period. Therefore, Feature set C confirms the high correlation of deaths and turnstile data for these boroughs. Figure 6 shows the predicted number of cases for each of the five NYC boroughs. Bronx, Brooklyn, and Queens reported the larger number of cases during this period. Different from the predicted number of deaths, Feature sets A and C closely follow the actual data on reported cases for the five boroughs. Such predictions are very similar to the one obtained for NYC as a whole.

These results show that the LSTM predictions can be transported to smaller segments of the city. They are consistent with those obtained for the entire city.

Figure 7(a) shows that considering the time series of the number of deaths on May 16, 2020, the lower bound (with 95% confidence interval) of ARIMA forecast zero deaths on that date. On the other hand, the upper bound does not cross zero, indicating that the number of deaths could also increase. Figure 7 (b) shows the ARIMA backward forecast of the number of deaths to estimate the date when the first death may have occurred in NYC. The graph shows that the forecast date of the first death could have occurred on Mar. 9, 2020, two days earlier than the first reported death. Figure 7(c) shows the ARIMA forecast of the progression of the number of COVID-19 cases in NYC after May 15, 2020 (last day of reported data). The forecast follows the decreasing trend of the number of cases, reaching zero on Jun. 20, 2020. However, this is a very early date as the number of cases has not reached zero by the time this paper was written. Figure  7 (d) shows the backward forecast of the number of cases to estimate the date of the first case in NYC. The lower and upper bounds of 95% confidence interval of ARIMA forecast shows that the first case of COVID-19 in NYC might have occurred between Jan. 28 and Mar. 20, 2020. The first case was reported on Feb. 29, 2020, but the actual first case is unknown. Table 4 shows the first reported COVID-19 cases 

The reporting of cases has been affected by the deployment and execution of testing for SARS-CoV-2. The effective and widespread testing offers better identification of the cases and their locations. But because of the time needed to detect a likely case, the limitation of testing capabilities at the beginning of the pandemic, and the possible occurrence of asymptomatic cases, the reported COVID-19 cases are likely inaccurate.

The estimations performed with ARIMA are based on the provided data, with no additional information. The discrepancies on the estimated dates and the reported dates are the product of the model using solely past data from the two analyzed datasets. Public health suggestions on the use of face covering and social distancing are not considered in the models to reflect the multi-variant effects in linear regression. The use of face coverings adopted at some point in time or neglecting social distancing may affect the actual trends that past data may not be able to forecast.

Looking ahead, comparisons of the prevalence of COVID-19 in NYC with cities that experienced an early face-covering measure is of interest for forecasting future prevalence or containment of the spread of this virus.

We have shown that the NYC turnstile entry data and the COVID-19 deaths and cases data in NYC from March 1, 2020 to May 15, 2020 are highly correlated. We have also shown that the correlation is also detectable for the data of each individual borough of NYC. However, each borough shows a different correlation strength that is probably associated with their geographical location and economical status. The consideration of different incubation-symptomatic phases for subway users as features through Long Short-Term Memory analysis confirms and enhances this correlation. Based on the dataset on COVID-19 deaths and cases in NYC, we have also shown the predicted progression of deaths and cases beyond the dates of the reported data using the Auto-Regressive Integrated Moving Average forecast model. This progression shows the possible decline of COVID-19 prevalence but also a probable increase. We have also shown a backward forecast of the possible dates when the first case and death might have occurred at the beginning of the pandemic in NYC. The backward forecast of the first death shows high accuracy. 'DWHV 1XPEHURIGHDWKV 'HDWKV )RUHFDVW FRQILGHQFHLQWHUYDO 

TensorFlow: Large-scale machine learning on heterogeneous systems

Optimization method for forecasting confirmed cases of COVID-19 in China

Databased analysis, modelling and forecasting of the COVID-19 outbreak

Time series analysis: forecasting and control

Interim Clinical Guidance for Management of Patients with Confirmed Coronavirus Disease (COVID-19)

Estimation of COVID-19 prevalence in Italy, Spain, and France. Science of The Total Environment

Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis

Analysis and forecast of COVID-19 spreading in China, Italy and France

ARIMA models

Analysing the link between public transport use and airborne transmission: mobility and contagion in the london underground

Clinical characteristics of coronavirus disease 2019 in china

Long short-term memory

A contribution to the mathematical theory of epidemics

Early dynamics of transmission and control of COVID-19: a mathematical modelling study

Prediction of the epidemic peak of coronavirus disease in Japan

Why New York is the epicenter of the American coronavirus outbreak

Propagation analysis and prediction of the COVID-19

Modeling epidemic spreading through public transit using timevarying encounter network

Introduction to linear regression analysis

The socio-economic implications of the coronavirus pandemic (covid-19): A review

COVID-19 Data

COVID-19 Dataset

Turnstile data

Forecasting COVID-19

Forecasting the novel coronavirus COVID-19

Prediction of number of cases of 2019 novel coronavirus (COVID-19) using social media search index

TensorFlow for deep learning: from linear regression to reinforcement learning

Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil

What is a savitzky-golay filter?

Correlation between weather and Covid-19 pandemic in Jakarta

Variation in COVID-19 hospitalizations and deaths across New York City boroughs

Taming the mta's unruly turnstile data

Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions

Estimation of the reproductive number of novel coronavirus (COVID-19) and the probable outbreak size on the Diamond Princess cruise ship: A data-driven analysis