key: cord-0684739-dk5k5ihz
authors: Edo-Osagie, Oduwa; De La Iglesia, Beatriz; Lake, Iain; Edeghere, Obaghe
title: A scoping review of the use of Twitter for public health research
date: 2020-05-16
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2020.103770
sha: 415d62019921a186943fac8ec56d7a0e91a7654d
doc_id: 684739
cord_uid: dk5k5ihz

Public health practitioners and researchers have used traditional medical databases to study and understand public health for a long time. Recently, social media data, particularly Twitter, has seen some use for public health purposes. Every large technological development in history has had an impact on the behaviour of society. The advent of the internet and social media is no different. Social media creates public streams of communication, and scientists are starting to understand that such data can provide some level of access into people's opinions and situations. As such, this paper aims to review and synthesize the literature on Twitter applications for public health, highlighting current research and products in practice. A scoping review methodology was employed and four leading health, computer science and cross-disciplinary databases were searched. A total of 755 articles were retrieved, 92 of which met the criteria for review. From the reviewed literature, six domains for the application of Twitter to public health were identified: (i) Surveillance; (ii) Event Detection; (iii) Pharmacovigilance; (iv) Forecasting; (v) Disease Tracking; and (vi) Geographic Identification. From our review, we were able to obtain a clear picture of the use of Twitter for public health. We gained insights into how the popularity of different domains changed with time, the diseases and conditions studied and the different approaches to understanding each disease, and which algorithms and techniques were popular within each domain.

A Scoping Review of the use of Twitter for Public Health Research

Objective: As there exists no broad and recent evidence-based review on the use of Twitter data for public health, we perform a scoping review on the subject, focusing specifically on research on the monitoring, detection and forecasting of public health conditions.

We draw upon a broad range of public health and Information Technology related literature in an attempt to elicit what is known on this topic. We follow a specific scoping review methodology to get an exploratory map of the key problems and concepts being tackled in public health through the use of Twitter data.

Results: We find the key areas of application in which Twitter data has been used, mainly surveillance, event detection and pharmacovigilance. We also find specific infectious and non-infectious conditions which are being studied, like influenza and mental health conditions. Finally, we find trends in the use of specific statistical and machine learning algorithms, such as an early focus on Support Vector Machines and Bayesian learning and a more recent focus on Deep Learning.

Conclusions: Twitter data can be used to aid in public health efforts but some gaps in research exist, for example wider use of semi-supervised techniques and translation of research into practice.

Surveillance, described by the World Health Organisation (WHO) as "the cornerstone of public health security" [1], is aimed at the detection of elevated disease and death rates, implementation of control measures and reporting to the WHO of any event that may constitute a public health emergency or international concern. Syndromic surveillance can be described as the real-time (or near real-time) collection, analysis, interpretation, and dissemination of health-related data, to enable the early identification of the impact (or absence of impact) of potential human or veterinary public health threats that require effective public health action [2]. The task of syndromic surveillance is an undertaking motivated by the notion of public health. Public health has been defined as the science and art of preventing disease, prolonging life and promoting human health through organized efforts and informed choices of society, organizations, public and private, communities and individuals [3]. In this sense, the concept of health encompasses physical, emotional and social well-being. Historically, public health practitioners have used data from multiple sources for measuring the burden of diseases and other health outcomes, preventing and controlling diseases and guiding healthcare activities. Emergency department attendances or general practitioner (GP, family doctor) consultations are some of the sources traditionally used to track specific syndromes such as influenza-like illnesses (ILI). With the proliferation of the internet and the advent of modern technology, potential new data sources present themselves. In recent years, researchers have recognized that social media platforms, such as Twitter and Facebook, could also provide data about national-level health and behaviour [4]. Among these social media platforms, Twitter offers a unique and potentially powerful data source due to its ease of access, real-time nature and richness in detail.
In this paper, we look towards Twitter with the aim of investigating and assessing its utility as a public health tool by performing a scoping review on the subject. While we seek to review the literature of public health research making use of Twitter, our interest in such literature is limited to research concerning the monitoring, detection and forecasting of public health conditions. We are not interested in social science research investigating the use of Twitter for recruitment or public awareness and dissemination of public health information. We are similarly not interested in research concerned with opinion mining to understand public opinion on public health issues. A scoping review such as ours is pertinent as there exist no broad and recent evidence-based reviews on the use of Twitter data for health research purposes. Wargon et al. [5] performed a systematic review on syndromic surveillance models used in forecasting emergency department visits; however, only 9 studies were found and none of them made use of Twitter or any social media. Subsequently, Charles-Smith et al. [6] carried out a systematic review of the use of social media (not limited to Twitter) specifically for disease surveillance and outbreak management. Sinnenberg et al. performed another systematic review looking at Twitter as a tool for health research [7]. Their systematic review encompassed research in both the sciences and social sciences. We seek to carry out a scoping review in order to map the broad area of Twitter for public health research as well as to produce an updated review containing more recent studies carried out since the above reviews were published. Hence, our research question is: "What is known from the existing literature about the use of Twitter data in the context of monitoring, detection and forecasting of public health conditions?".
We are particularly interested in the type of conditions/illnesses being studied; in the sources of data being used; in the data analysis techniques being applied; and in the geographical and time trends of such studies.

We deliver a summary of what has been done so far, which will enable researchers to quickly and efficiently understand this field in terms of the volume, nature and characteristics of the primary research undertaken and any gaps in research that may need prompt attention. Such evidence is particularly necessary in new but fast moving areas of research such as analysis of Twitter data for health applications.

A scoping review methodology was chosen to achieve our goal of investigating the state of Twitter applications in the field of public health research, our research question. The scoping review is defined by Arksey and O'Malley [8] as a study that aims "to map rapidly the key concepts underpinning a research area and the main sources and types of evidence available, and can be undertaken as stand-alone projects in their own right, especially where an area is complex". For our scoping review, we made use of the Arksey and O'Malley framework, which adopts a rigorous process of transparency, enabling replication of the search strategy and increasing the reliability of the study findings. As Arksey and O'Malley [8] explain, the method consists of the following stages: identifying the research question; identifying relevant studies; study selection; charting the data; and collating, summarizing and reporting the results (i.e. analysis). We elaborate on the specific application of the method to our scenario next.

To gain a broad coverage of the available literature, the general terms "Twitter" and "Public Health" were used as search keywords. We chose these two keywords because "Twitter" covers every discussion of the Twitter platform and, used together with "Public Health", covers all mentions of Twitter in a health context. As our work is multidisciplinary in that it spans multiple fields, we conducted our search in both health and Information Technology (IT) databases. First, we performed a literature search in the health/medical database PubMed. Next, we searched the IT databases IEEE Xplore and the ACM Digital Library. Finally, we searched a general database that indexes both fields, Scopus. Our searches were refined such that we only included research articles which were peer-reviewed and in English. We also limited our search to results published between January 2009 and March 2019, the latter being when the search was carried out. We started our search from 2009 because of the highly influential Google Flu Trends paper published that year, which inspired and kickstarted the use of social media as a data source for public health research [9].

Table 1 (exclusion criteria):
·Review articles and other articles not reporting an original contribution.
·Articles not focused on our above definition of public health but rather concerned with public health in the context of recruitment and outreach, public awareness and communication, information dissemination or opinion mining.
·Articles which do not make known the statistical or machine learning technique being used.
·Articles which are works in progress or otherwise do not contain the full text, such as conference abstracts.

In accordance with best practice for systematic reviews and meta-analysis, we applied the guidelines for Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) [10] to select studies for inclusion in the analysis. The flowchart for PRISMA that corresponds to our review is shown in fig 1. 754 research articles were returned by our search and 1 paper was added from the bibliographic listings of relevant retrieved papers. Of these 755 articles, we found 550 to be unique. We then drew up a list of criteria for inclusion and exclusion of articles in our review similar to those used by Shatte et al. [11]. These criteria are shown in table 1. In short, articles were included if all the following criteria were met: (i) the article reported on a method or application of Twitter data to address a public health issue; (ii) the article evaluated the performance of the statistical or machine learning technique used in drawing utility from the Twitter data; (iii) the article was published in a peer-reviewed publication and (iv) the article was available in English. Articles were excluded if any of the following criteria were met: (i) the article did not report an original contribution (e.g. review papers or articles commenting or speculating on the state or future of such research); (ii) the article was focused on the use of Twitter for public health in the context of recruitment and outreach, public awareness and communication, information dissemination or opinion mining; (iii) the article did not make known the statistical or machine learning technique being used; (iv) the full text of the article was not available (e.g. conference abstracts). Guided by our inclusion and exclusion criteria, we identified and selected 92 articles to be included for the review.
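The selection logic described above (all inclusion criteria met, no exclusion criterion met, after removing duplicates) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text: the `Article` record, its field names and the title-based de-duplication key are assumptions introduced here, not the procedure or tooling actually used in the review.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical screening record; field names are illustrative only.
    title: str
    addresses_health_with_twitter: bool   # inclusion (i)
    evaluates_technique: bool             # inclusion (ii)
    peer_reviewed: bool                   # inclusion (iii)
    in_english: bool                      # inclusion (iv)
    original_contribution: bool           # exclusion (i) if False
    outreach_or_opinion_focus: bool       # exclusion (ii) if True
    technique_stated: bool                # exclusion (iii) if False
    full_text_available: bool             # exclusion (iv) if False


def deduplicate(articles):
    """Keep the first article for each normalised title (assumed key)."""
    seen, unique = set(), []
    for article in articles:
        key = article.title.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


def is_included(a):
    """Included only if every inclusion criterion holds and no exclusion criterion does."""
    meets_all_inclusion = (
        a.addresses_health_with_twitter
        and a.evaluates_technique
        and a.peer_reviewed
        and a.in_english
    )
    hits_any_exclusion = (
        not a.original_contribution
        or a.outreach_or_opinion_focus
        or not a.technique_stated
        or not a.full_text_available
    )
    return meets_all_inclusion and not hits_any_exclusion


def screen(articles):
    return [a for a in deduplicate(articles) if is_included(a)]
```

Applied to a list of retrieved records, `screen` returns the subset that would proceed to data charting; the counts reported in the text (755 retrieved, 550 unique, 92 included) correspond to the sizes of the input, the de-duplicated list and the final output.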

The focus of our review was to get an exploratory map of the key problems and concepts being tackled in the public health space through the use of Twitter and the techniques being used. To this effect, for each article in our review, data was collected on (i) the aim of the research; (ii) the disease or illness of focus; (iii) sources of data for the study; (iv) statistical or machine learning algorithms and methods used; (v) the country for which the study was carried out; and (vi) the year in which the study was carried out. To analyse the collected information, we used a narrative review synthesis to capture the broad range of research studying Twitter for public health in our scoping review. [...] for a breakdown of study activity by country.

The use of Twitter data was evident for a wide variety of diseases and health conditions. We observed a range of applications dealing with physical health and illnesses (n = 82) [e.g. influenza-like illnesses (ILIs), adverse drug events and reactions, sexually transmitted diseases, food-borne illnesses], mental health (n = 6) [e.g. suicide and depression], natural disasters and environmental issues (n = 5) [e.g. earthquakes, heat waves, air pollution] and social issues (n = 8) [e.g. drug abuse, smoking, alcoholism]. We examined the subjects of the studies for trends in Twitter applications. We analyzed and plotted the three most studied diseases for each year. Fig 4 shows the result of this analysis. Taking a closer look at the diseases, conditions and public health phenomena studied using Twitter data, we observed ILIs to be the most common. The next most common subjects of public health research using Twitter were drug abuse and adverse drug events and/or reactions (ADE/R). Furthermore, we observed a general rise in the quantity of research into the use of Twitter for public health. Research activity appears to have peaked in 2016 but seems to be on the rise again from 2018. As this scoping review looks at studies up until March 2019, the data for 2019 [...]

The average number of Tweets used in the reviewed studies was roughly twenty thousand. A closer look at the research towards Twitter use for public health revealed that the SVM was a popular tool in this research field. We hypothesize that this is due to the SVM's popularity and strength in text classification problems [12]. We also analyzed the surveyed studies to find out which statistical or machine learning algorithms were popular, as well as if and how this might have [...] After this, Bayesian learning seemed to be the method of choice, followed by the SVM. From 2018, the widespread popularity of deep learning appears to have made its way into public health research with Twitter data, as it has become the dominant method used since then.

Through the synthesis of the data obtained from the reviewed articles, we broadly identified 6 different ways in which Twitter data is used for public health research. The identified domains were: (i) surveillance (n = 41); (ii) event detection (n = 38); (iii) pharmacovigilance (n = 19); (iv) forecasting (n = 15); (v) disease tracking (n = 12) and (vi) geographic identification (n = 7). Note that these domains were not always mutually exclusive. Surveillance includes articles aiming to monitor some status over a period of time. Event detection includes articles that aim to discover and/or identify a health-related event from Twitter data. Pharmacovigilance includes articles which were concerned with public drug consumption and reactions to said drugs. Forecasting includes articles which aim to predict the trends for health-related events. Disease tracking includes articles attempting to observe or predict the spread of diseases in the public through Twitter. Geographic identification includes articles whose aim is [...] We were interested in examining the trends, if any, in the public health application domains studied over the years. We constructed a bubble trend chart from the reviewed papers. This chart, included in fig 6, illustrates the research activity in each domain for each year, with the size of the bubble representing the number of articles for a given year and public health domain. It shows that there does appear to be a trend in activity for different public health domains. In 2011, there is little to moderate activity across the board. In the years following that, we see research in some domains drop on and off the map, and some growing steadily in size. Event detection, surveillance and pharmacovigilance appear to have seen steady increases in activity, leading the other domains. However, since 2016, research in those three domains has reduced slightly, with some focus switching to the other domains.
The data for the year 2019 is not particularly informative, as the scoping review was only carried out in the first quarter of 2019. We were also interested in the different techniques applied across different public health research domains. We computed a matrix of the application domains against the techniques applied and visualised it as a heatmap. This heatmap is shown in fig 7. Darker colours in the heatmap indicate higher activity for that cell. Supervised learning appears to see a lot of use across the board. Deep learning and natural language processing also see a fair amount of use, particularly in event detection, pharmacovigilance and surveillance. Unsupervised learning seems to see some use in surveillance and event detection. On the other hand, semi-supervised learning appears to see the least use across the board.
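A domain-by-technique matrix of the kind described above can be built by counting, for every article, each (domain, technique) pair it was charted under. The sketch below is a minimal illustration: the article records, domain names and technique names are hypothetical placeholders, not the data charted in this review, and the heatmap rendering itself is left out.

```python
from collections import Counter


def domain_technique_matrix(articles, domains, techniques):
    """Return rows (one per domain) of article counts per technique."""
    counts = Counter()
    for article in articles:
        # Sets ensure an article counts at most once per (domain, technique) cell.
        for domain in set(article["domains"]):
            for technique in set(article["techniques"]):
                counts[(domain, technique)] += 1
    return [[counts[(d, t)] for t in techniques] for d in domains]


# Hypothetical charted articles, for illustration only.
ARTICLES = [
    {"domains": ["surveillance"], "techniques": ["supervised"]},
    {"domains": ["surveillance", "forecasting"],
     "techniques": ["supervised", "deep learning"]},
    {"domains": ["event detection"], "techniques": ["deep learning"]},
]
DOMAINS = ["surveillance", "event detection", "forecasting"]
TECHNIQUES = ["supervised", "deep learning", "semi-supervised"]

MATRIX = domain_technique_matrix(ARTICLES, DOMAINS, TECHNIQUES)
for domain, row in zip(DOMAINS, MATRIX):
    print(f"{domain:16s}", row)
```

Because an article may belong to several domains and use several techniques, the cell totals can exceed the number of articles; this matches the note above that the domains are not mutually exclusive.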

The reviewed articles were found to exist within one or more of these domains. These domains are discussed in more detail below.

Surveillance was the most popular research domain, with around 43% of the reviewed articles represented. Research on surveillance focused on employing machine learning in order to utilize Twitter as an alternative or augmentative resource to traditional health surveillance systems. Naturally, the surveillance domain encompasses the field of syndromic surveillance [13, 14, 15]. However, it is broad and also includes additional applications such as the tracking of vaccination efforts [16] and monitoring of environmental conditions [17, 18], as well as natural disaster reporting and alarming [19]. That being said, the most common application was the syndromic surveillance of influenza-like illnesses (ILIs). Besides ILIs, other diseases and conditions that were studied include dengue, HIV, gastroenteritis, Ebola, diarrhoea and allergies. Due to the extensive research carried out in this area, a wide range of techniques were used. For example, supervised learning applied in the form of k-Nearest Neighbours (kNN) was used to monitor allergy trends and occurrences [20]. Unsupervised learning was used in the form of Density-based Spatial Clustering of Applications with Noise (DBSCAN) clustering in order to exploit the spatial and temporal properties of the Twitter stream for dengue surveillance [21]. Semi-supervised learning was used in the form of transductive SVMs for the surveillance of ILIs, gastroenteritis, diarrhoea and vomiting [22].

Event detection was another popular domain, which saw around 40% of the reviewed articles represented. Research in this domain sought to automatically detect events and describe the magnitude and trend of disease, as well as the impact of control measures. Examples of applications in this domain are automatically detecting drug abuse within the population [55], depression and suicide [26], Ebola [56] and, most common of all, ILI [57]. Such research tends to be fairly recent, with the mode publication year being 2016. The statistical and machine learning techniques used were typically supervised, with most studies employing either classification or regression to make the predictions necessary for detection. For example, SVMs were used to detect mentions of "dabbing", a method of marijuana consumption that involves inhaling vapors from heating marijuana concentrates [58]. CNNs were used to detect harmful algal blooms from pictures posted on Twitter [59]. Additionally, stepwise regression was used to detect depression from Tweets in order to explore the effect of climate and seasonality on mood [60].

Clustering [75], Lexicon Analysis [57], [76], [35], Deep Learning (RNN) [36], Logistic Regression [77], Gaussian Process [78], Deep Learning (CNN) [36], Outlier Detection [46], Bayesian Inference [57], [35], Fasttext [36], ARIMA (Autoregressive Integrated Moving Average) [22], GloVe [36], FP-Growth [37], Trap Model [79], Support Vector Machine [37], [80], [77], Shallow MLP [81], TSVM [22], Word2Vec [75], Regression [
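To make the supervised text classification used in event detection concrete, the following sketch trains a linear SVM on bag-of-words features with stochastic subgradient descent on the regularised hinge loss. It is a toy illustration, not the method of any reviewed study: the tweets and labels are invented, the tokeniser is a plain whitespace split, the bias term is omitted for brevity, and the hyperparameters (`lam`, `lr`, `epochs`) are arbitrary assumptions.

```python
import random


def tokenize(text):
    return text.lower().split()


def build_vocab(texts):
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50, seed=0):
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss (no bias term)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1.0 - lr * lam) for wj in w]      # L2 shrinkage
            if margin < 1.0:                             # hinge loss is active
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(text, vocab, w):
    score = sum(wj * xj for wj, xj in zip(w, vectorize(text, vocab)))
    return 1 if score > 0 else -1


# Invented examples: +1 = health-relevant mention, -1 = not relevant.
TWEETS = [
    ("stuck in bed with the flu and a fever", 1),
    ("this cough and fever will not go away", 1),
    ("flu season got me cough all night", 1),
    ("saturday night fever at the concert", -1),
    ("that new song is sick", -1),
    ("what a night at the concert", -1),
]
VOCAB = build_vocab(text for text, _ in TWEETS)
X = [vectorize(text, VOCAB) for text, _ in TWEETS]
Y = [label for _, label in TWEETS]
W = train_linear_svm(X, Y)

print(predict("flu and cough again", VOCAB, W))
```

Words such as "fever" or "sick" appear in both relevant and irrelevant tweets, which is why a learned classifier is generally preferred over plain keyword matching; in practice a library implementation and a far larger labelled corpus would be used.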

Research in pharmacovigilance focused mainly on adverse drug reactions and events, but also investigated recreational drug use and abuse. Usually, when studying the use of Twitter to detect adverse drug reactions and events, articles searched for a range of names obtained from a thesaurus of drugs and events, such as the Medline Plus Drug Information [83]. However, other such studies focused on a drug for a particular disease such as HIV [63]. In addition, studies also investigated drug habits and their effects on the population. For example, one article studied the use of e-cigarettes and their utility for smoking cessation [62]. Another article studied the variability of alcoholism with time [84]. A number of the pharmacovigilance studies utilized sentiment analysis, usually a form of supervised text classification, to aid in their efforts [63, 83, 28]. In fact, most of the studies made use of supervised learning in the form of text classification, using mostly SVMs and decision trees. Of the 19 articles in this domain, three made use of deep learning [28, 85, 86], one employed a semi-supervised multi-instance learning approach [86] and three used unsupervised natural language processing [28, 87, 66].

Forecasting research studies the prediction of public health trends, as well as means of nowcasting, which is the prediction of the present state of public health. It can be seen as a part of the syndromic surveillance effort, aimed at predicting epidemics in order to improve crisis response. Research in this domain is focused predominantly on ILIs. Around 67% of the reviewed articles in this domain studied ILI. However, other diseases such as dengue, gastroenteritis, cancer and asthma were also studied [53, 22, 23, 95]. While a mix of statistics and machine learning is used in this domain, there is a heavier focus on statistics. In fact, most studies made use of statistical techniques like regression and time series analysis. For example, dynamic regression was used to predict influenza trends in Boston, USA [96]. AutoRegressive Integrated Moving Average (ARIMA) was used to forecast influenza cases on a city level in Chongqing, China, as well as for predicting gastroenteritis in the UK [97, 22]. Partial differential equations were used to forecast influenza cases on a regional level across the USA [44]. Deep learning was also used to aid in the forecasting problem of predicting influenza cases [40] and in the creation of SENTINEL, a software system capable of nowcasting diseases being monitored by the US Centers for Disease Control and Prevention (CDC) [98]. Unsupervised learning was used in the form of topic modelling in a study aiming to predict health transition trends without any a priori diseases [51].

[...] [36], Deep Learning (MLP) [40], Fasttext [36], Deep Learning (CNN) [36], ARIMA (Autoregressive Integrated Moving Average) [22], [97], GloVe [36], Temporal Topic Model [14], Dynamic Regression [96], TSVM [22], Partial Differential Equation [44], Simple Statistical Analysis [23], Autoregressive Moving Average (ARMA) [45]; Boston Public Health Commission, Public Health England, Pan American Health Organization (PAHO), Chinese CDC, CDC
General Health; 1; Temporal Ailment Topic Aspect Model (TM-ATAM) [51]; CDC
Dengue; Simple Statistical Analysis [53]; Brazilian Official Dengue case data
Diarrhoea; TSVM [22], ARIMA (Autoregressive Integrated Moving Average) [22]; Public Health England
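As a rough illustration of the time-series forecasting discussed above, the sketch below fits a first-order autoregressive model, AR(1), which is the special case ARIMA(1, 0, 0), by ordinary least squares and rolls it forward. It is a deliberately simplified stand-in for the ARIMA and dynamic regression models used in the reviewed studies; the function names and the weekly counts are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast(series, c, phi, steps):
    """Iterate the fitted recurrence forward from the last observation."""
    predictions, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        predictions.append(last)
    return predictions


# Hypothetical weekly counts of symptom-related tweets.
weekly_counts = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_ar1(weekly_counts)
print(forecast(weekly_counts, c, phi, steps=3))
```

A full ARIMA model adds differencing and moving-average terms, and studies of this kind typically validate forecasts against official case data such as the sources listed above.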

Disease tracking is a domain that seeks to support epidemiology by offering insight into the spread of infectious diseases. Research in this domain is primarily interested in understanding the way in which diseases spread through a population. It aims not only to gain a better understanding of the spread of diseases, but also to keep track of the public health state during recognized outbreaks and mass gatherings which could be a breeding ground for disease. For example, one study investigated and proposed a means of tracking flu transmission in China using Twitter [39]. Another study retrospectively tracked the spread of measles during the 2015 outbreak [101]. Additionally, there was a study to detect the occurrence and spread of disease symptoms which could signify a potential outbreak at a number of British music festivals and a religious event in Mecca, Saudi Arabia [50]. Most studies in this domain made use of machine learning methods, leaning towards supervised learning. In particular, regression learning proved popular, as two studies utilized dynamic regression and support vector regression to track the spread of influenza [96, 100]. Another study proposed a Gaussian mixture regression approach to estimating the geographic origin of a tweet for use during an outbreak [102]. There were also some studies which used statistical analysis to obtain impressive results. One such study made use of the TSIR (time-series Susceptible-Infected-Recovered) model to understand human mobility and the spread of the dengue virus in Lahore, Pakistan [103]. While it was rare, one study made use of semi-supervised learning and deep learning to simulate influenza epidemics.

Gaussian Mixture Regression (GMR) [102]; Map data
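The compartmental idea behind the Susceptible-Infected-Recovered family of models mentioned above can be sketched with a deterministic discrete-time simulation. This is a textbook-style simplification and not the TSIR model fitted in the cited study; the transmission rate `beta`, recovery rate `gamma` and population figures are assumed values chosen only for illustration.

```python
def simulate_sir(population, initial_infected, beta, gamma, steps):
    """Discrete-time SIR model; returns a list of (S, I, R) tuples."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        # New infections scale with contacts between S and I, capped by S.
        new_infections = min(s, beta * s * i / population)
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Assumed parameters: 1000 people, 10 initially infected.
for step, (s, i, r) in enumerate(simulate_sir(1000, 10, beta=0.4, gamma=0.1, steps=5)):
    print(step, round(s, 1), round(i, 1), round(r, 1))
```

In a tracking setting, the infected series would be compared against, or partly estimated from, counts of symptom-reporting tweets, with mobility information used to couple several locations.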

Geographic identification is a small domain which involves the extraction of geographical information from Twitter data and typically sees little use alone. Rather, it is used in conjunction with other domains to improve the efficacy of solutions or provide added benefit. It is most often used with surveillance and disease tracking. Methods used in geographic identification are typically based on unsupervised learning. For example, DBSCAN clustering was used to monitor and track obesity levels within the population [54], as well as to track the spread of the dengue virus [21]. Another study utilized hot spot analysis to examine spatial patterns of depression on Twitter. Some supervised learning, typically in the form of classification, is also used in geographic identification. Here, a classifier is used to predict the location of a tweet based on some features of the tweet, usually its word collocations. As an example, one study in the review made use of a random forest classifier to predict which city and province a tweet came from, given that it was determined to be from Canada (according to the Twitter API) [105]. While geographic identification in itself is not of major use to the field of public health, when combined with other identified public health research domains, it offers improvements in the specificity and granularity of their results.

[...] [102], HDBSCAN (Clustering) [106]; Map data
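The density-based clustering referred to above can be illustrated with a compact, from-scratch DBSCAN over two-dimensional points. This is a sketch under simplifying assumptions: the points are treated as planar coordinates with Euclidean distance (real tweet locations would usually need a geographic distance), neighbourhoods are found by brute force, and the coordinates and the `eps` and `min_pts` values are invented.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # core point: keep growing the cluster
    return labels


# Invented planar coordinates of geotagged tweets.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
LABELS = dbscan(POINTS, eps=1.5, min_pts=3)
print(LABELS)
```

Dense groups of tweets become clusters without the number of clusters being fixed in advance, and isolated tweets are flagged as noise, which is part of what makes this family of methods attractive for locating hot spots.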

This review has compiled and analysed the published literature on the use of Twitter data for public health, highlighting popular and current research and applications. In terms of research undertaken so far, three findings were produced from the review. First, we identified the key application domains being studied: (i) surveillance; (ii) event detection; (iii) pharmacovigilance; (iv) forecasting; (v) disease tracking and (vi) geographic identification. Studies were found to predominantly be concerned with surveillance, event detection and pharmacovigilance. Next, the conditions and diseases being tackled using Twitter data were identified. We discovered a wide range of conditions to which Twitter data is being applied, including infectious diseases, mental health problems, environmental issues and social issues. Finally, we mapped out the statistical and machine learning algorithms and approaches being used to process and analyse Twitter data for public health purposes. In doing so, we observed trends in these approaches. Bayesian learning and SVMs appear to be popular algorithms of choice; however, in the past two years the focus seems to have shifted towards deep learning.

Our findings so far will enable researchers working in health data to identify relevant studies in different application areas, tackling different diseases or conditions, and will also provide evidence of analysis techniques that have been applied in each context. This will enable faster development of new applications, which is an important contribution of our research given the growth in the use of Twitter around the world, and particularly in Low and Middle Income Countries (LMIC). The use of Twitter in a health context can present new practical and affordable solutions for implementing disease monitoring and surveillance in countries with weak health systems.

While research toward using Twitter for public health has been extensive, our study has also identified some gaps for future researchers to fill. The identification of gaps is an important deliverable of a scoping review and hence a contribution of our work.

In terms of diseases tackled so far, understandably, studies are focused on infectious diseases because of their global importance. In particular, the reviewed research focused heavily on the surveillance and detection of influenza. However, we have identified significant scope to explore the use of Twitter data in other infectious diseases. Some such studies are beginning to take place (e.g. dengue or Ebola) but much more work is expected in the light of recent outbreaks. Outbreaks are often fast-moving situations and research needs to progress very quickly, so our findings will facilitate such endeavours. Whereas we may not expect Twitter data to be of use for the study of sexually transmitted diseases (STDs), as such a study would rely on Twitter users reporting what may be quite sensitive information, other infectious diseases such as cholera could be studied. Furthermore, we have also identified the potential utility of Twitter and social media for public health in the context of non-infectious diseases, such as asthma or celiac disease, as little work has so far been reported in the literature, yet those diseases can represent a large health burden. An additional area of application may be the occurrence of positive health states/outcomes. Our review did not identify any articles that used Twitter for this, although it might be a result of the limitations of our scoping methodology.

In terms of analysis techniques employed so far, there was wide application of supervised learning techniques. This is somewhat understandable as the most popular application domains were surveillance and detection, which are related to the supervised learning tasks of classification and prediction. The average number of Tweets used in the reviewed studies was roughly twenty thousand. This suggests that most of the reviewed articles had large amounts of labelled Twitter data available to them, which lends itself to supervised learning tasks. Unfortunately, such labelling could constitute a sizeable effort, so we have identified the use of unsupervised learning, and particularly semi-supervised learning, as another potential area for new exploration. Such approaches would reduce the amount of labelled Twitter data required by also taking advantage of the unlabelled data. Some articles are already starting to emerge [91, 104] but they have mostly focused only on ILI so far.

Furthermore, in terms of application areas, despite the rich potential for success from using Twitter data for public health identified in the literature, there were few articles describing active Twitter-based systems and/or their evaluation in an operational context for routine public health practice. This may suggest that it is somewhat difficult to translate research using Twitter for public health into practice. We believe the bulk of this challenge might come from the ethical issues involved and the lack of an ethical framework for the integration of social media into surveillance systems. Hence, the development of robust ethical frameworks could be an important area for future work. That being said, public health institutions around the world may already be using Twitter as such a tool and simply not reporting their efforts.

It is also important to note that this review had some limitations. Constraints in the search methodology, such as the use of broad search terms and the exclusion of works-in-progress, may have resulted in some relevant studies being missed. However, this is a common limitation of scoping reviews, as they are intended to map topics broadly, achieving a good balance of breadth and depth in a relatively short time-frame [107].

This review makes an important contribution by giving an overview of the use of Twitter data in the context of monitoring, detection and forecasting of public health conditions. We provide an analysis of the existing literature in the field, including the types of conditions being monitored, the data analysis techniques being used and the application areas most commonly found. We also analysed time trends to understand how research in this area is evolving. Such information will be useful in aiding researchers, clinicians and policy makers in understanding the modern landscape of public health applications for social media.

To conclude, research into the application of Twitter data for public health has uncovered interesting and inspiring advances, especially in recent years, and has identified gaps in knowledge, thus allowing targeted research in the future. Overall, we see that Twitter data has been used to aid public health efforts concerned with surveillance, event detection, pharmacovigilance, forecasting, disease tracking and geographic identification, demonstrating positive results. We have uncovered the need to evaluate the use of Twitter in less studied epidemiological diseases and other non-epidemiological conditions. We also uncovered scope to apply semi-supervised algorithms to the task in hand to reduce labelling efforts. Furthermore, we have identified the need for a robust framework, including ethics, to translate research into an operational context and produce working systems.

With the richness of Twitter as a data source, its semi-real-time nature, the take-up of mobile devices in low- and middle-income countries (LMIC) that give access to such platforms, and the development of machine learning tools and their increasing accessibility, we expect to see more interesting ideas and applications of Twitter to public health.

Twitter is a very popular microblogging platform with over 300 million active users.

We analyse the literature to understand Twitter's capability to provide a useful tool for public health.

We reviewed almost a thousand research studies and found that Twitter can be used for surveillance, event detection, pharmacovigilance, disease tracking and forecasting.

Twitter is mostly used in the context of flu, drug abuse, depression and dengue.

Our analysis of the literature presents the modern landscape of public health applications using social media data in combination with Machine Learning approaches.

The world health report 2007 - a safer future: global public health security in the 21st century

Assessment of syndromic surveillance in europe

The untilled fields of public health

Evaluating social media's capacity to develop engaged audiences in health promotion settings: use of twitter metrics as a case study

A systematic review of models for forecasting the number of emergency department visits

Using social media for actionable disease surveillance and outbreak management: a systematic literature review

Twitter as a tool for health research: a systematic review

Scoping studies: towards a methodological framework

Detecting influenza epidemics using search engine query data

Preferred reporting items for systematic reviews and meta-analyses: the prisma statement

Machine learning in mental health: a scoping review of methods and applications

Text categorization with support vector machines: Learning with many relevant features

Real-time processing of social media with sentinel: a syndromic surveillance system incorporating deep learning for health classification

Syndromic surveillance of flu on twitter using weakly supervised temporal topic models

Syndromic surveillance of infectious diseases meets molecular epidemiology in a workflow and phylogeographic application

Digital immunization surveillance: Monitoring flu vaccination rates using online social networks

Feasibility of using social media to monitor outdoor air pollution in london, england

Social media responses to heat waves

Earthquake reporting system development by tweet analysis with approach earthquake alarm systems

Public health allergy surveillance using microblogs

Dengue surveillance based on a computational model of spatio-temporal locality of twitter

DEFENDER: Detecting and forecasting epidemics using novel data-analytics for enhanced response

Real-time disease surveillance using twitter data

How to exploit twitter for public health monitoring?

Comparing twitter data to routine data sources in public health surveillance for the 2015 pan/parapan american games: an ecological study

Using social media to monitor mental health discussions - evidence from twitter

Analyzing social media to characterize local HIV at-risk populations

Mining pre-exposure prophylaxis trends in social media

Using social media as a tool to predict syphilis

Towards early discovery of salient health threats: A social media emotion classification technique

Tracking twitter for epidemic intelligence

Deploying nemesis: Preventing foodborne illness by data mining social media

Using real-time social media technologies to monitor levels of perceived stress and emotional state in college students: A web-based questionnaire study

Investigating the relationship between social media content and real-time observations for urban air quality and public health

Trust filter for disease surveillance: Identity

Real-time processing of social media with SENTINEL: A syndromic surveillance system incorporating deep learning for health classification

Health-related hypothesis generation using social media data

Evaluating google, twitter, and wikipedia as tools for influenza surveillance using bayesian change point analysis: a comparative analysis

Detecting flu transmission by social sensor in china

Forecasting influenza levels using real-time social media streams

Mining twitter data for influenza detection and surveillance

Using social media to perform local influenza surveillance in an inner-city hospital: a retrospective observational study

Applying GIS and machine learning methods to twitter data for multiscale surveillance of influenza

Regional level influenza study with geo-tagged twitter data

Predicting flu trends using twitter data

Distance-based outliers method for detecting disease outbreaks using social media

Flu gone viral: Syndromic surveillance of flu on twitter using temporal topic models

A framework for detecting public health trends with twitter

Estimating county health statistics with twitter

Detecting disease outbreaks in mass gatherings using internet data

Health monitoring on social media over time

Intelligent dengue infoveillance using gated recurrent neural learning and cross-label frequencies

Dengue prediction by the web: Tweets are a useful tool for estimating and forecasting dengue at country and city level

Social network data mining using natural language processing and density based clustering

Detection of illicit online sales of fentanyls via twitter

Classifying information from microblogs during epidemics

Hybrid classification for tweets related to infection with influenza

Drugs or dancing? using real-time machine learning to classify streamed "dabbing" homograph tweets

A deep learning paradigm for detection of harmful algal blooms

Effect of climate and seasonality on depressed mood among twitter users

Social media sensing framework for population health

Text classification for automatic detection of e-cigarette use and use for smoking cessation from twitter: a feasibility pilot

Identifying adverse effects of HIV drug treatment and associated sentiments using twitter

Mining social media streams to improve public health allergy surveillance

Enabling real-time drug abuse detection in tweets

Twitter-based detection of illegal online sale of prescription opioid

Applying multiple data collection tools to quantify human papillomavirus vaccine communication on twitter

On infectious intestinal disease surveillance using social media content

Adverse event detection by integrating twitter data and VAERS

GIS analysis of depression among twitter users

Social media, big data, and public health informatics: Ruminating behavior of depression revealed through twitter

Tweeting back: predicting new cases of back pain with mass social media data

Health department use of social media to identify foodborne illness-chicago, illinois

Public health surveillance of dental pain via twitter

From social media to public health surveillance: Word embedding based clustering method for twitter classification

An unsupervised machine learning model for discovering latent infectious diseases using social media data

National and local influenza surveillance through twitter: An analysis of the 2012-2013 influenza epidemic

The added value of online user-generated content in traditional methods for influenza surveillance

Twitter-based influenza detection after flu peak via tweets with indirect information: Text mining study

Identification of keywords from twitter and web blog posts to detect influenza epidemics in korea

Prediction of influenza-like illness based on the improved artificial tree algorithm and artificial neural network

Ontology-based automatic identification of public health-related turkish tweets

Efficient adverse drug event extraction using twitter sentiment analysis

Temporal variability of problem drinking on twitter

Utilizing different word representation methods for twitter data in adverse drug reactions extraction

Semi-supervised recurrent neural network for adverse drug reaction mention extraction

Semantic network analysis of vaccine sentiment in online social media

Epidemiology from tweets: Estimating misuse of prescription opioids in the USA from social media

Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts

Pharmacovigilance on twitter? mining tweets for adverse drug reactions

Semi-supervised multi-instance interpretable models for flu shot adverse event detection

Towards large-scale twitter mining for drug-related adverse events

T-recs: Time-aware twitter-based drug recommender system

Enhancing seasonal influenza surveillance: Topic analysis of widely used medicinal drugs using twitter data

Predicting asthma-related emergency department visits using big data

Accurate influenza monitoring and forecasting using novel internet data streams: A case study in the boston metropolis

Citywide influenza forecasting based on multi-source data

Real-time processing of social media with sentinel: A syndromic surveillance system incorporating deep learning for health classification

Tracing out various diseases by analyzing twitter data applying data mining techniques

The use of twitter to track levels of disease activity and public concern in the u.s. during the influenza a h1n1 pandemic

Tweeting about measles during stages of an outbreak: A semantic network approach to the framing of an emerging infectious disease

Conditional density estimation of tweet location: A feature-dependent approach

Inferences about spatiotemporal variation in dengue virus transmission are sensitive to assumptions about human mobility: a case study using geolocated tweets from lahore, pakistan

Simnest: Social media nested epidemic simulation via online semi-supervised deep learning

Context prediction in the social web using applied machine learning: A study of canadian tweeters

Mining location information from users' spatio-temporal data

A scoping review of scoping reviews: advancing the approach and enhancing the consistency