key: cord-0116842-jjg476zn
authors: Dodds, P. S.; Minot, J. R.; Arnold, M. V.; Alshaabi, T.; Adams, J. L.; Dewhurst, D. R.; Reagan, A. J.; Danforth, C. M.
title: Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series
date: 2020-08-25
journal: nan
DOI: nan
sha: 8468495ec73f5b8ff30df2f0a77ee3cf813530b6
doc_id: 116842
cord_uid: jjg476zn
* peter.dodds@uvm.edu

Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems, which may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10% of day-scale word usage frequency time series for Twitter, collected in real time for a set of roughly 10,000 frequently used words over more than 10 years, come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched different kinds of language identification algorithms on or off, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing 'bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizable.

The successful collection, cleaning, and storage of data through time requires a stability of data sources, measurement instruments, and data storage taxonomy [1-8]. Of course, such stability has hardly been the norm for any developing area of measurement. Indeed, consider, over the full arc of science, the measuring and recording of time itself: thousands of years of effort led to the establishment of a settled calendar, with its quadricentennial leap-year exception to an exception to an exception [9, 10]. Accurate clocks first appeared with chronometers in the 1600s [11], and timekeeping's achievements are now perhaps best manifested by the Global Positioning System (GPS), which requires general relativity.

For internet data, sources go through episodic upgrades as formats are reconfigured and expanded. In the case of Twitter, our focus here, just a few of the features that have been added include: retweets as formalized entities, images and video, local time, and tweet and user language. The data object behind any given tweet, whose format began as XML and changed to JSON, has correspondingly grown in size, and the format has evolved somewhat biologically. The JSON for a "quote tweet" contains simplified JSON for the retweeted tweet. And the expansion from 140 to 280 characters was accomplished not by expanding an existing entry field but by adding a second one which must be combined with the old one for "long tweets". Data providers and APIs have also changed, most recently to GNIP as the data provider, with a completely different JSON schema. Over time, and not without setbacks, Twitter has become an important global social media service.
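Returning to the 140-to-280 character expansion mentioned above, the minimal sketch below shows how a consumer of the streaming data might recover the full text of a long tweet from its two fields. The field names (text, truncated, extended_tweet, full_text) reflect our understanding of the JSON format of that era and should be treated as assumptions rather than a specification.

import json

def full_tweet_text(raw_json: str) -> str:
    # For long tweets, the original 'text' field remains truncated and the
    # complete message lives in a second, later-added field.
    tweet = json.loads(raw_json)
    if tweet.get("truncated") and "extended_tweet" in tweet:
        return tweet["extended_tweet"]["full_text"]
    return tweet.get("text", "")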
Amplifying and reflecting real-world stories, Twitter is globally entrained with politics and news, sports, music, and culture, and also performs as a distributed sensor system for natural disasters and emergencies [12-28]. Like any scientific enterprise, empirical research involving Twitter, and social media in general, depends fundamentally on the quality of data [7]. Because Twitter's platform now sprawls across time and language, great care must be taken to ensure data integrity.

In this short report, we describe: (1) how we uncovered anomalies in word usage time series derived from Twitter (Sec. II), and (2) one approach to identifying and removing corrupted time series (Sec. III). We offer concluding thoughts in Sec. IV. We emphasize that we are not attempting to clean individual time series, a common statistical practice, but rather we are cleaning ensembles of time series by removing problematic, unsalvageable time series. Our work would be suitable for any many-component complex system where abundances of components are recorded over time. Our approach is intended for ensembles of time series which can only be taken as they are, i.e., they cannot be rebuilt from more primary data sets.

The work we present here has inspired a ground-up reidentification of language for our Twitter data set [30], which in turn has led to the building of our n-gram time series for Twitter project, Storywrangler [31, 32], and the revision and expansion of our Hedonometer instrument [29, 33, 34] (more below). Our work is also connected to our studies of how the COVID-19 pandemic has been discussed across languages on Twitter [35, 36], as well as story turbulence and chronopathy in connection with Trump [37].

Fig. 2 caption: Dutch words were especially susceptible to being misclassified as English, giving rise to corrupted time series. We expand the time series for 'niet' in the two regions shaded in gray in panel A and present them in panels B and C. The jumps in the time series in panel B appear to be due to Twitter putting in place a series of language identification algorithms (which we do not attempt to reverse engineer in any way). The second jump in panel B seems to be due to the initial algorithm being switched off. The time series for 'niet' stays roughly two orders of magnitude lower for over three years before one last major adjustment in late 2016, shown in panel C.

Throughout, we do not attempt to reverse engineer any of Twitter's proprietary algorithms, but rather contend with derived data and changing formats only. Neither do we suggest that Twitter is at fault for changing its language identification methods, or indeed any aspect of its service, over time. We also acknowledge that some data artifacts may have been introduced by our own struggle with the complexities of consistently processing formats that have changed many times.

The instigation of our work here came from first noticing, in June of 2018, that our Hedonometer's happiness time series for English Twitter [29, 33, 34] appeared to show increasing turbulence from the year 2016 on. While a weekly cycle had always been a feature of our measure of Twitter's day-scale happiness (Saturday typically the happiest day, Tuesday the least), its strength appeared to be waning. Deciding that this observation deserved further investigation, we began to conceive of ways to measure lexical and story turbulence [37, 38].
Our Hedonometer functions by averaging individual, offline-crowd-sourced happiness scores of words. At that point in time, we were using a "lexical lens" of 10,222 words to create a single score for each day [29]. In brief, our method ultimately derived from Osgood et al.'s work on the measurement of meaning [1]. Through semantic differentials, Osgood et al. found that valence (happiness-sadness) was the first dimension of the experience of meaning, followed by excitement and dominance. Using a double-Likert scale, we improved upon earlier efforts to score individual words [39], drawing on the most common words used over various time periods of Twitter, Google Books, the New York Times, and music lyrics [29]. We scored 10,222 words using Amazon's Mechanical Turk crowd-sourcing service, calling the resulting data set labMT (language assessment by Mechanical Turk). To run the Hedonometer, we created a usage frequency distribution for this set of 10,222 labMT words, doing so for each day (according to Coordinated Universal Time) using tweets identified as English by Twitter.

For an initial attempt to quantify turbulence on Twitter, we set the Hedonometer itself aside and focused on the underlying labMT word frequency distributions. We used Jensen-Shannon divergence (JSD) to compare frequency distributions between dates over different time scales, with the distributions normalized as probabilities (or rates). Our choice of Jensen-Shannon divergence was not crucial, but rather something to try, and we later developed alternate kinds of divergences (see Refs. [40, 41]). In Fig. 1A-C, we show three JSD time series representing comparisons between a date and (A) the previous day, (B) the same day of the week one week earlier, and (C) the same date one year earlier. We first plotted just the panels in Fig. 1A and Fig. 1B, and saw that these JSD time series, after trending down from 2009 through 2011, were both increasing from 2012 on, in agreement with our visual observations of the Hedonometer.

In seeking to further develop our analysis of lexical turbulence, we then examined JSD over longer time scales between dates, including the year scale of Fig. 1C. And it was here that we first clearly saw there were problems with our word distributions. In late 2012, through 2013, and into 2014, we see striking jumps in year-scale JSD. We see more isolated jumps at the ends of 2015, 2016, and 2017. Because we are comparing across years, we expect the anomalous patterns to appear twice with a year's separation: once for a problematic date looking back a year, and then again a year ahead, looking back at the same problematic date.

We were able to say something immediately about what these anomalies are not. They are not due to isolated corrupted dates, something we would have to contend with in collecting any form of streaming data, as we would see these as spikes in the JSD. Some aspect of the distributions was being switched and maintained. Nor are the changes somehow volume dependent, as Fig. 1D makes clear. While we do have some inconsistencies and changes in the volume of labMT words collected over time, they do not line up with the jumps in the year-scale JSD time series. While Twitter is ever-changing in content, we nevertheless expect to find reasonable consistencies in the aggregate word usage patterns we derive. Upon visual inspection of individual word frequency time series for Twitter around the dates of the jumps in the year-scale JSD time series, we find some corresponding peculiar jump sequences.
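To make this comparison concrete, the following minimal sketch (not the authors' code) computes the Jensen-Shannon divergence between normalized labMT word frequency distributions for pairs of dates separated by a year; the file name, column names, and overall data layout are assumptions made for illustration.

import numpy as np
import pandas as pd

def jsd(p, q, base=2):
    # Jensen-Shannon divergence between two probability vectors defined
    # over the same word list.
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical daily counts table: one row per (date, word) with a raw count.
counts = pd.read_csv("labmt_daily_counts.csv", parse_dates=["date"])
daily = counts.pivot_table(index="word", columns="date", values="count", fill_value=0)

# Year-scale series: compare each date with the same date one year earlier.
lag = pd.Timedelta(days=365)
jsd_year = {
    d: jsd(daily[d - lag], daily[d])
    for d in daily.columns
    if (d - lag) in daily.columns
}

The 1-day and 1-week scale series of Figs. 1A and 1B follow by replacing the lag with 1 and 7 days.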
(In the following section, we develop a systematic approach to identifying such anomalous time series.) For an individual example, in Fig. 2A we show how the normalized usage frequency for the Dutch word 'niet' (English: 'not/no') exhibits a number of sharp jumps (shaded regions). The word usage rate for 'niet' increases or drops over several orders of magnitude around certain dates. Expanding the shaded regions of Fig. 2A, Fig. 2B shows four jumps occurring at the end of 2012 and in 2013, and Fig. 2C shows one in late 2016. We have, then, the suggestion that individual tweets (and hence words) are being differentially classified by a sequence of language identification algorithms employed by Twitter. Overall, from Fig. 2, the example word 'niet' seems to be initially identified as coming from English tweets, then, after several months of algorithms switching on or off, appears to have been excluded from English for several years until the end of 2016, or to appear so due to a change in the tweet distribution system provided by GNIP. For the Hedonometer, for which these time series were prepared, we had accepted tweets for processing unless either the tweet was identified as being in a language other than English or the user as a speaker of a language other than English (in other words, we kept "not not-English" tweets). We note that we had not noticed any of the year-scale JSD artifacts in our Hedonometer signal, which is itself a day-scale average.

Word usage distributions are of course determined within the context of all words for each day. Given the behavior of year-scale JSD in Fig. 1C, we must expect the time series of more words to follow the specific form of 'niet'. We should also expect that these corrupted time series would in turn corrupt the time series of basic English function words (e.g., 'the'). Clearly we do not want to involve poorly sampled time series in any of our analyses. And because we have observed that some words follow the 'niet' pattern while the majority track well (i.e., largely continuously if noisily, and with jumps that have historical explanations), we can hope to remove this particular set of poorly sampled words. We are thus able to overcome Twitter's hidden shifts in algorithmic classification, at least in the most essential task of extracting basic word usage frequency time series.

We construct a specialized method for identifying corrupted time series as follows. For the five jumps overall for 'niet' in Fig. 2, we notice that the adjacent and interstitial time periods are relatively quiescent. Observing that similar patterns hold for other words, we construct a "jump statistic" to measure the degree to which a word's time series locally tracks the shapes in Fig. 2B and Fig. 2C. For the four jumps in the first time period of change (Fig. 2B), we choose five similar-length time ranges within which we expect words to be relatively similar in abundance on a logarithmic scale. Again referring to the behavior of 'niet', we expect the transitions of corrupted words between these time periods to be down, up, down, and down. For the second time period (Fig. 2C), we bound the one jump with two periods: 2016-10-15 to 2016-12-04 (51 days), and 2016-12-11 to 2017-01-30 (51 days). We expect corrupted words to jump up across this single transition. For each word w in our set of 10,222 words, we construct a jump statistic J by averaging differences of the logarithms of normalized frequency P_{w,d} over all possible pairs of dates across each transition point.
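One plausible way to write down the statistic just described is given below; the paper's original equation is not reproduced in this text, so this is a reconstruction from the prose rather than the authors' exact definition:

J_w = \sum_{i} s_i \, \frac{1}{|A_i|\,|B_i|} \sum_{d \in A_i} \sum_{d' \in B_i} \left[ \log_{10} P_{w,d'} - \log_{10} P_{w,d} \right],

where A_i and B_i are the date ranges immediately before and after transition i, and s_i = +1 or -1 is the expected direction of the jump for a corrupted word, the sign factor introduced in the next paragraph.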
We incorporate the expected transition direction for corrupted time series by multiplying by +1 (up) or -1 (down), as appropriate. By using sums of differences of logarithms, we are equivalently computing ratios of normalized frequencies and taking their geometric mean. A simpler estimate might be to take the average probability of a word in each region and sum the signed differences across the transition points. However, comparing each pair of dates around each transition point generates a distribution of values contributing to J, allowing us to estimate other statistics, such as a variance.

We compute a variance for each word by creating a distribution of values for each component of J around each transition point (one value for each pair of dates). For example, the first two time periods, of 51 days each, give us 2601 possible date pairs. We use these to estimate variances for the individual jumps. We then sum variances over all five transition points to obtain a variance for J, which we will denote simply by σ^2.

We compute J and σ^2 for each word w. We first sort words by descending values of J, and the main plot in Fig. 3A shows these values of J for all 10,222 labMT words. Annotated disks along the curve give example words. We see that for positive values of J, the words that track with the corrupted form are non-English words ('zijn', 'kalo', 'gak', etc.) and come from a range of languages. We also find corrupted time series for common words that tend to be used across languages, such as "hahaha". Visually, it appears that many of the words (∼90%) have values of J close to 0 (between -1/2 and 1/2, say). These are non-corrupted words (e.g., 'coke', 'britain', and 'varying'). We will firm up our measure of closeness below. Some words go strongly against the trend of word corruption (J < −1), with 'clinton' and 'hillary' being prominent examples. Twitter changed its language identification algorithm about a month after the 2016 US presidential election, and Clinton's loss led to her name dropping in prevalence, running counter to the upward jump expected for corrupted word time series (Fig. 2C).

Now, having J > 0 is too severe a condition for determining whether or not a word is corrupted. In the insets of Fig. 3B and Fig. 3C, we employ our distributions of J scores to craft a better criterion. In Fig. 3B, we show the first 1500 words, ordered again by decreasing J, but now with the range J − 2σ to J + 2σ shaded. We observe words with 0 < J < 1 whose (notional) 95% confidence interval covers 0. Evidently, we would not want to exclude these words, mistaking them for being corrupted simply because J > 0. We instead take as our criterion for a time series being corrupted that J − 2σ > 0. In Fig. 3C, we re-order words so that they are descending according to the lower limit of their 95% confidence interval, J − 2σ. We preserve the example labeled words from Fig. 3B to show how they move around. With this criterion, we find that the time series of 9,030 of our 10,222 words are relatively unaffected by the five major changes in Twitter's language detection algorithm we have identified. We deem 1,192 words to be sufficiently problematic that we should exclude them.

With these words removed, we return to our JSD calculations and examine how the year-scale JSD now behaves. We find that the jumps that appeared to be due to Twitter's language detection algorithm changes have all been eliminated. However, one last peculiar structure remains, due to anomalous word frequency changes in 2015 and 2016.
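As an illustrative sketch only (not the authors' code), the routine below computes a signed jump statistic and its summed variance for one word's daily frequency series and then applies the J − 2σ > 0 criterion. The 2016-2017 windows follow the dates given above; the 2012-2013 windows are not reproduced in the text, so they are left as a placeholder.

import numpy as np
import pandas as pd

def window(start, end):
    return pd.date_range(start, end, freq="D")

# Each transition: (dates before, dates after, expected sign for corrupted words).
transitions = [
    # ... four transitions bounding the five 2012-2013 windows, with expected
    #     directions down (-1), up (+1), down (-1), down (-1) ...
    (window("2016-10-15", "2016-12-04"), window("2016-12-11", "2017-01-30"), +1),
]

def jump_statistic(p_w, transitions):
    # p_w: pandas Series of one word's normalized daily frequency, indexed by date.
    means, variances = [], []
    for before, after, sign in transitions:
        b = p_w.reindex(before).dropna()
        a = p_w.reindex(after).dropna()
        b, a = b[b > 0], a[a > 0]  # avoid taking the log of zero for unobserved days
        # Signed log10 differences for all pairs of dates across the transition point.
        diffs = sign * (np.log10(a.values)[None, :] - np.log10(b.values)[:, None]).ravel()
        means.append(diffs.mean())
        variances.append(diffs.var(ddof=1))
    J = float(np.sum(means))            # components summed over transition points
    sigma2 = float(np.sum(variances))   # summed variance, as described in the text
    return J, sigma2

# Flag a word as corrupted when the lower limit of its notional 95% CI exceeds 0:
# J, sigma2 = jump_statistic(p_niet, transitions)
# corrupted = J - 2 * np.sqrt(sigma2) > 0

Whether the per-transition components are summed or averaged only rescales J; summing is used here to match the summed variance described in the text. The two 51-day windows above give 51 × 51 = 2601 date pairs for that transition, matching the count quoted above.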
We were able to determine that two words, "weather" and "channel", were unusually prominent during this time, per their time series in Fig. 4. We are unsure exactly why this artifact appeared for our labMT data set. We note that in our Twitter n-grams project, Storywrangler, we do not see any anomalous behavior for "weather", "channel", or "weather channel" in English [31, 32]. (For Storywrangler, whose development was directly motivated by the findings of the present paper, we used FastText for language identification of tweets [30].) Finally, in Fig. 5, we show the year-scale JSD time series for our labMT data set with corrupted words removed. Compared with Fig. 1C, we now see a noisy time series more in keeping with the 1-day and 1-week time scale JSD time series in Figs. 1A and 1B. While we cannot be sure that there are no other problems with our labMT word list, we have at least been able to systematically contend with the time series corruptions induced by changes in Twitter's language detection algorithms.

We have shown that certain kinds of time series for individual words on Twitter may be functionally corrupted due to changes in how Twitter has deployed language detection algorithms over the last decade, coupled with the difficulties of constantly needing to recognize and adapt to data format changes. In the absence of the ability to rebuild these problematic time series from original primary data, we have demonstrated how a systematic, if bespoke, method can be developed to generate a 'clean' ensemble of time series. We repeat that we do not clean individual time series but rather remove them entirely from an ensemble.

Anomalies within ensembles of interrelated time series may in general be difficult to discern. While pursuing other research directions may have uncovered the same time series problems (our original research interest concerned lexical turbulence [37, 38]), measuring the divergence between Zipf distributions for days proved powerful here. Our stumbling upon aberrant time series did not hinge on Jensen-Shannon divergence, which is just one of many divergences that would have worked; evidence of time series problems only arose, however, when we looked beyond short time scales. We believe our findings should elicit some measure of concern, as they suggest that existing work based on language-specific time series derived from Twitter may need to be re-examined. More generally, our work supports the very reasonable concern any researcher might have about the long-term integrity of data collected on the fly from social media and other internet services. Indeed, our investigations have led us to rebuild our Twitter database, resulting in important upgrades for our happiness measurement instrument, Hedonometer, and the development of our Twitter n-gram viewer, Storywrangler.
The Measurement of Meaning
The Measurement of Values
Precision measurement and the genesis of physics teaching laboratories in Victorian Britain
The Mismeasure of Man (W. W. Norton & Company)
CODATA recommended values of the fundamental physical constants
Making Natural Knowledge: Constructivism and the History of Science
Tampering with Twitter's sample API
Measurement schmeasurement: Questionable measurement practices and how to avoid them
Foucault's Pendulum
Questioning the Millennium: A Rationalist's Guide to a Precisely Arbitrary Countdown
Longitude: The True Story of a Lone Genius Who Solved the Greatest Scientific Problem of His Time
Earthquake shakes Twitter users: Real-time event detection by social sensors
Tracking the flu pandemic by monitoring the social web
Towards detecting influenza epidemics by analyzing Twitter messages
Does the early bird move the polls? The use of the social media tool 'Twitter' by US politicians and its impact on public opinion
What do the average twitterers say: A Twitter model for public opinion analysis in the face of major political events
Twitter mood predicts the stock market
Time-critical social mobilization
Harnessing the crowdsourcing power of social media for disaster relief
Predictive analysis on Twitter: Techniques and applications
Fake news on Twitter during the 2016 US presidential election
Reclaiming stigmatized narratives: The networked disclosure landscape of #MeToo
SemEval-2016 task 4: Sentiment analysis in Twitter
An automated pipeline for the discovery of conspiracy and conspiracy theory narrative frameworks: Bridgegate, Pizzagate and storytelling on the web
Evaluating the fake news problem at the scale of the information ecosystem
Online social networks and offline protest
Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter
The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
Human language reveals a universal positivity bias
How the world's collective attention is being paid to a pandemic: COVID-19 related n-gram time series for 24 languages on Twitter
Divergent modes of online collective attention to the COVID-19 pandemic are associated with future caseload variance
Computational timeline reconstruction of the stories surrounding Trump: Story turbulence, narrative control, and collective chronopathy
Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not
Affective norms for English words (ANEW): Stimuli, instruction manual and affective ratings
Allotaxonometry and rank-turbulence divergence: A universal instrument for comparing complex systems
Probability-turbulence divergence: A tunable allotaxonometric instrument for comparing heavy-tailed categorical distributions

The authors are grateful for the computing resources provided by the Vermont Advanced Computing Core, which was supported in part by NSF award No. OAC-1827314, and for financial support from the Massachusetts Mutual Life Insurance Company and Google.