key: cord-0843126-cqeaz6dx authors: Chire Saire, J. E.; Pineda-Briseno, A. title: Text Mining Approach to Analyze Coronavirus Impact: Mexico City as Case of Study date: 2020-05-12 journal: nan DOI: 10.1101/2020.05.07.20094466 sha: e6c010432ace4b0f6e84d012145f513d13b1eef2 doc_id: 843126 cord_uid: cqeaz6dx

The epidemiological outbreak of a novel coronavirus (2019-nCoV or Covid-19) in China, and its rapid spread, gave rise to the first pandemic of the digital age. In response, many countries adopted different strategies to stop the infection. In this context, one of the greatest challenges for the scientific community is monitoring the global population in real time to get immediate feedback on what is happening to people during this public health contingency. Social networks are an interesting and affordable way to do this. In a social network, people can act as sensors, providing not only personal data but also data derived from their behavior. This paper aims to analyze the publications of people in Mexico using a Text Mining approach. Specifically, Mexico City is presented as a case study to help understand the impact of the spread of Covid-19 on society.

A novel infectious disease (2019-nCoV, better known as Covid-19), identified at the end of 2019 in Wuhan, China, caused an outbreak of viral pneumonia earlier this year. Due to the rapid dissemination of this respiratory disease, increased deaths and resource depletion, the World Health Organization (WHO) declared the coronavirus a pandemic [1] [2]. This international public health emergency is rapidly mobilizing government, industry and the scientific community to respond effectively to this new disease, which represents a high risk for the human population and has a negative impact on other areas such as the economy, education and politics.
By April 30, 2020, 3,096,626 confirmed cases of Covid-19 and 217,896 deaths had been reported globally to the World Health Organization (WHO) [3]. On February 29, 2020, the first two cases of Covid-19 were reported in Mexico. By April 30, this North American country had recorded 17,799 infections and 1,732 deaths [4]. In Mexico City, by April 30, 2020, 6,412 cases had been confirmed and around 379 people had died from Covid-19 [5].

Over the past few years, the explosive growth of Social Networks (SNs) indicates that more people are using them to connect and communicate their ideas, interests, feelings, experiences and events, including health-related information. SNs are therefore a huge source of heterogeneous data that can be exploited for public health monitoring and surveillance purposes [6] [7]. Among SNs, Twitter is one of the most widely used micro-blogs. It is a free service for sharing short text messages, or "tweets", limited to 280 characters. Twitter currently has around 340 million active users worldwide [8]. In Mexico, with a population of around 125 million people, it has an audience of 9.45 million users, and Mexico City has the highest number of active Twitter users [9].

The main contribution of this work is to evaluate the keywords related to Covid-19 in an effort to understand how a public health emergency of international concern plays out in social media, and on Twitter in particular, in Mexico City.

The remainder of the paper is organized as follows. Section 2 presents related work on retrieving infectious disease information from social media. Section 3 presents the data collection methodology for extracting relevant Covid-19 information from Twitter. Section 4 describes the experimental findings and discusses the analysis. Finally, conclusions and future work are described in Section 5.
Many research papers discuss the uses of Twitter data and the valuable information it contributes to the field of public health. This section summarizes some important approaches. The works [10] and [11] study the power of Natural Language Processing (NLP) techniques to generate new information from data collected on Twitter for public health research. Other contributions, such as [12], report experimental findings that motivate future research on when to use Twitter information for public health. The analyses in [13] and [7] examine the effectiveness of social media information as a strategy for public health research. Both papers conclude with a recommendation to combine social media information with other techniques for disease surveillance and for measuring spread. In [14], a real-time system is presented for predicting and detecting the proliferation of an epidemic by identifying disease-related tweets by geographical location. A seminal paper [15] proposes combining Twitter data with Google Trends data to track the spread of infectious diseases. The study in [16] analyzed Twitter data collected during several infectious disease outbreaks. Its experimental findings helped to learn more about how people with panic disorder act during a health contingency.

Other interesting proposals are surveillance applications that use Twitter as a data source, for example to monitor the H1N1 pandemic [17], Dengue in Brazil [18], Covid-19 symptoms in Bogota, Colombia [19], and Covid-19 in the South American population [20]. These previous works have examined Twitter content around infectious diseases such as Ebola, Dengue, H1N1 and Covid-19.
Because very little scientific literature has analyzed how the spread of the new coronavirus Covid-19 is affecting different aspects of society, this paper proposes a text mining approach to analyze the impact of Covid-19 on Mexican society, particularly on the population of Mexico City, one of the most populated cities in the world and in Latin America.

The present work performs experiments on source data from Twitter using Natural Language Processing and Data Mining (Text Mining). The steps are:
• Choose terms to search on Twitter
• Set up the parameters of the query for Twitter and collect data
• Pre-process the data to eliminate words with no relevance (stopwords)
• Visualization

Considering news about Covid-19 and after preliminary queries, the following terms were chosen:
• 'coronavirus', 'covid19'

However, people do not always write these official names, and special characters such as @, #, - and . appear. For this reason, variations of the previous terms were created, e.g. { '@coronavirus', '#covid-19', '@covid 19' }.

Tweets were extracted through the Twitter API with the following parameters:
• date: 13-03-2020 to 20-03-2020

The following figures present the results of the experiments and answer some questions that help to understand the effect of the pandemic on the population of Mexico City.

A. How was the progress of Covid-19 in Mexico City?

Figure 2 shows the daily and accumulated positive cases of Covid-19 in Mexico City, from the first outbreaks reported in late February to April 30, 2020 [5]. As can be seen, positive cases began to increase steadily from the third week of March.

First, Figure 5 shows a cloud of the top thirty terms per day in Mexico City to help identify population concerns related to Covid-19. This experimental scenario considers all users with posts between 13-03-2020 and 20-03-2020.
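The term selection, pre-processing and per-day counting steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the short stopword list and the sample tweets are hypothetical, and the Twitter API call itself is omitted.

```python
import re
from collections import Counter, defaultdict

# Base search terms taken from the text.
BASE_TERMS = ["coronavirus", "covid19"]

def term_variants(term):
    """Return spelling variants of a term: '@' and '#' prefixes and,
    for terms ending in digits, '-' or ' ' inserted before the digits."""
    stems = [term]
    match = re.fullmatch(r"([a-z]+)(\d+)", term)
    if match:
        word, digits = match.groups()
        stems += [f"{word}-{digits}", f"{word} {digits}"]
    return [prefix + stem for stem in stems for prefix in ("", "@", "#")]

# Hypothetical, very short Spanish stopword list (assumption);
# a real run would use a fuller list.
STOPWORDS = {"de", "la", "el", "en", "y", "a", "los", "que", "por", "del"}

# A token is a word, optionally prefixed by '@' or '#'.
TOKEN_RE = re.compile(r"[@#]?\w+")

def tokenize(text):
    """Lowercase the text, split it into tokens and drop stopwords."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]

def top_terms_per_day(tweets, n=30):
    """tweets: iterable of (day, text) pairs.
    Returns {day: [(term, count), ...]} with the n most frequent terms."""
    counts = defaultdict(Counter)
    for day, text in tweets:
        counts[day].update(tokenize(text))
    return {day: counter.most_common(n) for day, counter in counts.items()}

# Made-up sample tweets, for illustration only.
SAMPLE = [
    ("13-03-2020", "Primer caso de #coronavirus en la ciudad"),
    ("13-03-2020", "Medidas por el coronavirus: #coronavirus en Mexico"),
    ("14-03-2020", "Casos de covid en Mexico"),
]

if __name__ == "__main__":
    for base in BASE_TERMS:
        print(base, "->", term_variants(base))
    print(top_terms_per_day(SAMPLE, n=3))
```

The per-day term counts produced this way are the kind of input a word cloud or bar chart of frequent terms would need; the plotting step is left out here.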
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version was posted May 12, 2020.

Second, Figure 6 displays the fifty most frequent terms related to cases of coronavirus in Mexico City. Figure 7 shows a cloud of those terms to aid the visualization and analysis of the extracted information. This experimental scenario considers only the top users between 13-03-2020 and 20-03-2020.

After analyzing the relation between the number of tweets and the frequency of each term in both experimental scenarios of this section, it is concluded that the concerns are related to health risks and to the economic crisis that Covid-19 could trigger. It is important to remember that Mexico promotes a different strategy for the Covid-19 contingency from that of other countries.

It is important and interesting to know who the most active Twitter users were during the Covid-19 outbreak in Mexico City. For this experimental scenario only the top users were selected (the same data used in Figure 4). Figure 8 shows the user names and their number of posts. As can be seen in the graphic, the user accounts with the highest number of tweets belong to the media (written press, radio and television), which account for 40% of the total Twitter messages posted between 13-03-2020 and 20-03-2020, with an average of 408.6 posts per username in this period.

First, Figure 9 shows the fifty most frequent terms related to Covid-19 in Mexico City. Figure 10 shows a word cloud to aid the visualization and analysis of those terms. This experimental scenario considers the Top 50 words published by the Top 50 users between 13-03-2020 and 20-03-2020.
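The per-user statistics discussed above (most active accounts, the share of all posts attributable to a group of accounts, and the average number of posts per account) can be computed along the following lines. This is a hedged sketch, not the authors' code: the function names and account handles are hypothetical, and which accounts count as media is assumed to be a manual label.

```python
from collections import Counter

def top_users(authors, k=50):
    """authors: one handle per tweet. Returns the k most active
    (handle, post_count) pairs, most active first."""
    return Counter(authors).most_common(k)

def group_share(authors, group):
    """Fraction of all posts written by handles in `group`."""
    total = len(authors)
    if total == 0:
        return 0.0
    return sum(1 for a in authors if a in group) / total

def mean_posts(authors, group):
    """Average number of posts per handle in `group`,
    counting only handles that posted at least once."""
    counts = Counter(a for a in authors if a in group)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)

# Made-up data: one handle per collected tweet.
AUTHORS = ["@media_a"] * 5 + ["@media_b"] * 3 + ["@user_c"] * 2
# Hypothetical manual labelling of media accounts (assumption).
MEDIA = {"@media_a", "@media_b"}

if __name__ == "__main__":
    print(top_users(AUTHORS, k=2))
    print(group_share(AUTHORS, MEDIA))
    print(mean_posts(AUTHORS, MEDIA))
```

Keeping the media labels in a separate set makes it easy to repeat the same statistics with a different grouping of accounts.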
Another experimental scenario removed the tweets of media accounts (written press, radio and television) from the Top 50 Twitter users, in order to find the most important topics or concerns related to the pandemic and to identify whether they are the same as those of the media, with 6% more posts reported. The experimental results are presented in Figure 11.

After analyzing the experimental results reported in the two previous figures, it is concluded that the topics of interest to the population during the Covid-19 outbreak in Mexico City include the first infections in Mexico, health sector prevention strategies and fear of contagion, among others.

This paper presented a Text Mining approach to help visualize and understand the impact of Covid-19 on the population of Mexico City. The experimental scenarios focused on extracting and analyzing the most published terms and the most active Twitter users, both from all the data obtained between 13-03-2020 and 20-03-2020 and from a subset of it. It is concluded ...
Figure (most frequent terms, in extracted order): coronavirus, covid, #coronavirus, mexico, casos, @lopezdoriga, #covid19, mas, si, presidente, medidas, salud, gobierno, mx, pandemia, personas, mil, hoy, lopez, evitar, caso, informacion, primer, virus, ser, pais, #ultimahora, crisis, confirmados, asi, marzo, @beltrandelrio, nuevo, italia, positivo, mundo, amlo, paises, autoridades, #covid, propagacion, ebrard, dias, #lomasleido, secretaria, primera, @lopezobrador, @carlosloret

[1] WHO statement regarding cluster of pneumonia cases in Wuhan, China
[2] A novel coronavirus outbreak of global health concern
[3] Coronavirus (Covid-19)
[4] Mexico: Covid-19 cases and deaths 2020
[5] Sitio oficial del gobierno de México sobre el coronavirus
[6] New technologies in predicting, preventing and controlling emerging infectious diseases
[7] Social media mining for public health monitoring and surveillance
[8] Global social networks ranked by number of users 2020
[9] Latin America: Twitter users 2020, by country
[10] Twitter as a tool for health research: a systematic review
[11] Social media as a tool to increase the impact of public health research
[12] Investigating public health surveillance using Twitter
[13] Using social media for actionable disease surveillance and outbreak management: a systematic literature review
[14] Medical analysis and visualisation of diseases using tweet data
[15] A social media platform for infectious disease analytics
[16] Moral panic through the lens of Twitter: an analysis of infectious disease outbreaks
[17] Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak
[18] Building intelligent indicators to detect dengue epidemics in Brazil using social networks
[19] What is the people posting about symptoms related to coronavirus in Bogota, Colombia
[20] Infoveillance based on social sensors to analyze the impact of Covid19 in South American population