key: cord-0466709-ivutg6rk authors: Toger, Marina; Shuttleworth, Ian; Osth, John title: How average is average? Temporal patterns in human behaviour as measured by mobile phone data -- or why chose Thursdays date: 2020-04-30 journal: nan DOI: nan sha: 8d79b8aea0889559a145168aae3ffb2758cc2fbf doc_id: 466709 cord_uid: ivutg6rk Mobile phone data -- with file sizes scaling into terabytes -- easily overwhelm the computational capacity available to some researchers. Moreover, for ethical reasons, data access is often granted only to particular subsets, restricting analyses to cover single days, weeks, or geographical areas. Consequently, it is frequently impossible to set a particular analysis or event in its context and know how typical it is, compared to other days, weeks or months. This is important for academic referees questioning research on mobile phone data and for the analysts in deciding how to sample, how much data to process, and which events are anomalous. All these issues require an understanding of variability in Big Data to answer the question of how average is average? This paper provides a method, using a large mobile phone dataset, to answer these basic but necessary questions. We show that file size is a robust proxy for the activity level of phone users by profiling the temporal variability of the data at an hourly, daily and monthly level. We then apply time-series analysis to isolate temporal periodicity. Finally, we discuss confidence limits to anomalous events in the data. We recommend an analytical approach to mobile phone data selection which suggests that ideally data should be sampled across days, across working weeks, and across the year, to obtain a representative average. However, where this is impossible, the temporal variability is such that specific weekdays' data can provide a fair picture of other days in their general structure. 
As the sun rises on any given day, people go to work or education or engage in other daily activities, whilst others stay at home, collectively making up the vibrant buzz and spatial patterns of society. Modelling this complex behaviour demands large amounts of data, which recent advances in data collection and processing have made feasible. The advent of Big Data, whether from mobile phones (1), travel cards (2), or social media (3), enables mapping of daily patterns to understand such phenomena as segregation (4) (5), the spread of diseases such as Covid-19 (6) (7), human mobility (8) (9), and urban dynamics (10). The difference in temporal resolution from traditional data sources such as census data or population registers means that what was measured at decennial or annual intervals can now be measured in real time. As a result, an increasing number of societal studies use finer temporal resolutions, but there are, to our knowledge, no studies indicating which days or hours are representative of societal activities in general. There is also a growing interest in the ethical implications of the collection and use of Big Data and much discussion about how to visualise and analyse these large datasets (11) (12). We address these concerns by asking 'how normal is normal?'. Understanding the periodicity and typical patterns in the data is important because one response to ethical and privacy concerns is to allow analysis of only a subset of the data, and one quick answer to the large size of Big Data is similarly to subsample the data, whether for a day, a week, a month or a particular place.
These seemingly trivial questions are vital to our understanding of complexity and variability in Big Data. They determine whether it is sufficient to analyse data for a day or a week to improve our understanding of societal processes such as segregation, whether all data should be analysed, and when, if at all, additional computing time yields decreasing returns. Furthermore, to assess the impact of shock events such as extreme weather and so get a better grasp of social resilience, it is important to study the natural variability of the dataset at different temporal scales and periodicities. After all, to understand how anomalous an event is, it is necessary to have a benchmark against which to judge it. The conceptual framework used to evaluate behaviour as revealed by our mobile phone dataset is shown in supplementary Figure S1. Previous studies have looked at variability in mobile phone usage data. For instance, (13) assumed weekly periodicity in Call Detail Records (CDR) when selecting a benchmark in their study of emergency-related human behaviour in a European country. Here we examine the data to determine the extent of weekly, monthly and hourly periodicity rather than assuming it is there. Variation in the number of users and in daily travel distances showed weekly, monthly and seasonal regularities in population movement around the holidays in connection with the earthquake in Haiti; that analysis (14) required opening and processing the trajectories of individuals. A comparison (15) of the load volume on electrical infrastructure with phone activity in Senegal, undertaken so that phone activity can function as a proxy for the level of development and electrification, also required processing of the files. A visual analytic approach to detecting anomalies in call activity in Senegal (16) illustrated the temporal changes at weekends and religious holidays. These studies showed significant and detectable variability in mobile phone activity based on CDR.
Our Network Detail Records (NDR) dataset comprises calls, SMS, MMS, data connections, and also silent handovers, and thus has more frequent entries per user than the more commonly used CDR. Given the structure of our dataset, the activity level is proportional to the number of observations (rows) in the dataset and thus to the file size (see supplementary Figure S2). We propose a convenient estimate of variability in activity based on file sizes, obtained without opening the files, together with a measurement of the deviation between the expected and observed periodic temporal signature of mobile phone activity. Moreover, mobile technology penetration in Sweden (17) is 98%, so phone activity is likely to be representative of behaviour across all demographic groups.

Two different time lags were used for the autocorrelation analysis for hours (Figure 1a) and days (Figure 1b). The repetitive pattern in the day-lag autocorrelation is easily detectable (Figure 1b). The thin black lines depict a time lag that grows by one hour for each step on the x-axis (a) and by one day for each step on the x-axis (b). Thus, the far right of the thin black line corresponds to a day's distance (hours, a) and to a month's distance (days, b). The thin black line in (b) indicates that the seventh, fourteenth, twenty-first days (and so on) are more correlated than the days in between, pointing to weekday-specific file sizes. For the red lines, correlations are conducted using the same hour (a) or the same day (b), at a lag distance that grows by a day (for hours) or by a week (for days) for each step on the x-axis. The far-right part of the red lines is located at a distance of around two-thirds of a month (hours, a) and around seven months (days, b). The bold blue line stretches over a longer time span (the lag step is one week) and describes the annual repetitive pattern, in which the lowest correlation value for any specific weekday is found at a lag of 26 weeks (six months).
The strong repetitive hourly, daily, weekly and monthly patterns in the data indicate that file size can be used as a proxy for activity at the temporal scales illustrated here. Multilevel modelling (MLM) is used to (1) estimate file size variance in a three-level hierarchical model and (2) estimate the impact of events and weather on levels of activity. Table 1 shows the results of the empty model with no fixed effects and with standardised file size as the dependent variable. It shows that most of the variation is between hours (83.9%), followed by months (7.9%) and days (4.3%). An additional 3.9% of the variance cannot be explained but might be related to other external factors such as weather and national events. Table 2 presents the results of the full MLM with fixed effects. The random effects parameters indicate that the hour-level variation drops to 81.1%, while no change is detectable at the day and month-year (MY) levels. The fixed effects explain around 2.3% of the variation (note that the cumulative percentage reaches 97.7%). This may sound small, but it is in line with expectations, since the majority of phone usage is connected to regularly recurring events such as workdays and weekends, night-rest and active hours. Of the fixed effects, it is clear that religious and secular holidays significantly reduce the file size (usage of phones), while major sports and TV media events increase it. Hours with major transport breakdowns or out-of-the-ordinary weather have small effects, although some are statistically significant. In Table 3, the third, restricted model is specified: an alternative empty model that excludes festivals and national events, leaving only 'normal' days. The results indicate that taking away unusual days strengthens the already clear temporality observed in the data; the variation attributable to hours increases to 84.2% and that for months to 8.0%, although that for days remains the same.
In the final analytical stage, the difference between observed file sizes and the values predicted by all three of the multilevel models for hours and days is analysed using the Root Mean Square Deviation (RMSD). This shows which hours and which days are easier to predict. As a baseline, the three models produce RMSD values of ~0.1877 for the empty model, ~0.1834 for the full model and ~0.1737 for the restricted empty model. The variation is largest for the empty model, with a slight reduction when introducing fixed effects (full model), and smallest when all days with events are removed from the analysis (restricted model). It turns out that Thursday, and not Tuesday, is the most 'normal' weekday, and Saturday is the most normal weekend day. Decomposing the files into weekdays (Figure 2) pinpoints Mondays as the days with the greatest RMSD value, indicating that their file size is difficult to predict (i.e. phone usage varies more amongst Mondays than amongst other days). The opposite situation is found for Saturdays, which tend to have low RMSD values. There is an interesting deviation for Sundays between the models where all hours are kept in (empty and full) and the model where only days without events such as holidays are modelled (restricted). See supplementary Figure S3 for a graph of the relationship between standardized predicted and observed file size. The greatest RMSD values for hours (Figure 3) are found at hours 1 (01:00-02:00), 5 and 6 (05:00-07:00). Early afternoon hours also have relatively high values. The least varying hour at night is 2 (02:00-03:00), in the daytime 11 (11:00-12:00), and in the evening between 19 and 20 (19:00-21:00). In the supplementary material (Figure S4), combined day-hour graphs are available in which the RMSD for all hours across all days is shown.

The mobile phone dataset analysed in the paper reveals temporal complexity in human behaviour at temporal scales of hours, days, weeks and months.
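The per-weekday and per-hour RMSD comparison described above can be illustrated with a minimal sketch. The function names and the numbers are hypothetical and are not taken from the study's data; the sketch only assumes that each observation carries a group label (a weekday or an hour), an observed standardised file size, and a model prediction.

```python
import math
from collections import defaultdict


def rmsd(observed, predicted):
    """Root mean square deviation between paired observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmsd_by_group(groups, observed, predicted):
    """Compute the RMSD separately for each group label (e.g. weekday or hour)."""
    buckets = defaultdict(lambda: ([], []))
    for g, o, p in zip(groups, observed, predicted):
        buckets[g][0].append(o)
        buckets[g][1].append(p)
    return {g: rmsd(obs, pred) for g, (obs, pred) in buckets.items()}


# Hypothetical standardised file sizes and model predictions (not study data).
weekday = ["Mon", "Mon", "Thu", "Thu"]
observed = [0.5, -0.3, 0.2, 0.1]
predicted = [0.1, 0.1, 0.2, 0.0]

result = rmsd_by_group(weekday, observed, predicted)
```

In this toy input the Monday predictions miss by a wider margin than the Thursday ones, so the Monday RMSD comes out larger, mirroring the kind of contrast reported in Figure 2.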
The diurnal patterns shown are much as might be expected, but the variability by day and month indicates the importance of considering these aspects. The method demonstrated here, estimating the activity of mobile phone users from file size, offers a quick and convenient way to extract patterns from datasets that are otherwise unwieldy, thereby enabling generalisations that set samples in their context. Our model (see Figure S3) closely predicts observed file sizes (and thus levels of activity and phone usage). It is noteworthy that the empty model performs as well in some regards as the full model, which has explanatory terms for national holidays, events and extreme weather; this indicates the importance of hourly, daily and monthly temporal beats in the data, overlain with random variation, as the main drivers of activity. Swedes are therefore largely creatures of habit (but with a certain amount of random variation thrown in). Despite this, national holidays, other events, and weather do have some influence on behaviour, although this is small compared to the rolling cycle of the day, week, month and year. The daily and hourly patterns can be explained by how human activities are organised. The difficulty in predicting activity on Mondays and Fridays arises from their position at the start and end of the working week and their consequent variability across seasons and sensitivity to holidays and other events. Thus, mid-week days are more 'average', although Thursday performs better than Tuesday, contrary to what one might expect. This also explains the volatility of the hours between 05:00 and 07:00 and between 22:00 and 00:00, where behaviour will vary for the same reasons; other times of the day are less variable. The hourly variability therefore means that most care should be taken in selecting which hours to sample; this is the largest source of variation in our data.
However, month is also important. Whereas neighbouring months are similar, the decrease in similarity as the temporal lag increases shows that, if possible, data should be extracted from more than one month, and ideally from months half a year apart, if the objective is to capture the full variability of human behaviour. There is a small number of high RMSDs. These are not correlated with any known extreme event. Closer inspection of these extreme file sizes (see supplementary Figures S5 and S6), which are more than 2 standard deviations from the series average, indicates that these rare occurrences are usually in the early hours of the morning and on weekdays, when file sizes are normally small. Their cause is unknown but can most probably be attributed to technical issues such as system updating and maintenance (the providers are reluctant to discuss these issues openly). This is important to note because these and similar technical issues may be present in other mobile phone datasets and in similar big datasets. If these extreme cases are dropped as anomalies, then the model performs even better, since there is less random variation. The decision whether to drop them depends on the objective of the study. If the focus is on human behaviour, then outliers related to data integrity should be dropped. However, if the data itself is under scrutiny, then it is open to discussion whether these cases should stay part of the analysis. Therefore, we kept them in. It is likely that societies similar to Sweden will show the same temporal patterns of behaviour, but that there will be increasing divergence as social difference increases by level of economic development, religion and world area (1). Therefore, these results cannot be generalised to all national contexts.
However, the method of using mobile phone NDR file sizes as proxies for activity and human behaviour can be generalised, assuming that data providers release the essential data of file size by hour, day, week and month so as to permit the analysis of variation undertaken here.

The phone data

The data were obtained from a major Swedish mobile phone provider. At the time of analysis the data series covers more than 500 days. Over time, an archive of compressed data ranging on a byte scale between terabytes (10^12) and petabytes (10^15) has been registered (the exact size cannot be revealed due to an agreement with the data provider). In a decompressed and analytically manageable format, the volume of the dataset becomes considerably bigger. The data are NDR, which comprise calls, SMS and MMS messages, data uploads and downloads, and silent handovers. They record geographical movements at a small temporal scale of five-minute intervals nested within hours, days, weeks and months. As such, they provide a proxy for human activity and behaviour. For this study, the lowest, five-minute level of the data is ignored, as this requires processing of the files. Instead, we concentrate on file size changes over hours, days, weeks and months (Figure 4). No personal information about phone owners is known and the individual records are anonymised. The hourly files number 12,000 and rising. Due to ethical, legal and computational restrictions, it is often only possible to analyse a fraction of the data. Data subsamples are unpacked and transferred to a database called MIND; non-analysed data are registered but moved in their compressed format to a repository 1. The total number of phone events (rows) is highly correlated with standardized file sizes (R² = 0.9971, see supplementary Figure S1). The database is available in the supplementary material, including the standardized hourly file size and all the explanatory variables.
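As a rough illustration of how file sizes can be read without opening the files, the sketch below collects sizes from file-system metadata and converts them to z-scores. The directory layout, the file names and the use of the population standard deviation are assumptions made for this example; the text does not describe how the authors stored the hourly files or computed the standardisation.

```python
import os
import statistics
import tempfile


def file_sizes(directory):
    """Return {file name: size in bytes} from metadata only; no file is opened."""
    sizes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                sizes[entry.name] = entry.stat().st_size
    return sizes


def standardise(values):
    """Convert raw sizes to z-scores (assumption: population standard deviation)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Demo with three hypothetical hourly files of 100, 200 and 300 bytes.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_bytes in [("h00.dat", 100), ("h01.dat", 200), ("h02.dat", 300)]:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(b"x" * n_bytes)
    sizes = file_sizes(tmp)

names = sorted(sizes)
z_scores = standardise([sizes[n] for n in names])
```

Because only directory metadata is read, the cost of this step does not grow with the content of each file, which is the practical point of using file size as the dependent variable.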
Precipitation and temperature are used to describe the weather during the studied period. Both variables are available for download with a temporal resolution of hours, which makes the weather data easy to integrate with the MIND data. However, geography poses a problem. Since the MIND data are derived from users in all of Sweden, the weather data should represent the weather situation valid for a large share of the individuals. The challenge is to generate representative weather data for a relatively big country with considerable differences in temperature and precipitation. We have chosen to draw data from Stockholm and Malmö, both of which are located in the populous south. The two cities are ranked number one and three in population size and are located close to the remaining bigger cities in Sweden. The online data are available from the Swedish meteorological surveys 2. The two variables were temperature (in degrees centigrade, hourly; Air temp Malmö and Stockholm) and precipitation (mm per hour; precipitation Malmö and Stockholm). However, since the rate and magnitude of change may also contribute to the model, additional variables have been created: which were added because people might be reacting to the size of a change rather than its direction.

Various regular calendar events as well as unplanned events affect the behaviour of populations. Big enough events may affect the entire country, contributing to changes in how people spend their time and in their spatial and digital activity. To test whether, and which, events affect the size of hour-files, we have generated a set of dummy variables, coded 1 for the hours (or full days) in which they are observed, including the following: secular holiday, religious holiday, sports, TV, media, and weather & transport (i.e. a complete stop in traffic between main national nodes or extreme weather affecting a large proportion of the population) 3.
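The coding of explanatory variables described above can be sketched as follows. The derived weather variables are not listed in the text, so the hour-to-hour change and its absolute value are assumptions chosen to reflect the stated idea that people may react to the size of a change rather than its direction. The timestamps, temperatures and event window are hypothetical.

```python
from datetime import datetime, timedelta


def hourly_change(values):
    """Signed change from the previous hour; the first hour has no predecessor."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


def change_magnitude(values):
    """Absolute size of the hour-to-hour change, ignoring its direction."""
    return [None if d is None else abs(d) for d in hourly_change(values)]


def event_dummy(timestamps, start, end):
    """Dummy coded 1 for hours in the half-open window [start, end), else 0."""
    return [1 if start <= t < end else 0 for t in timestamps]


# Six hypothetical hours with hypothetical air temperatures (degrees centigrade).
hours = [datetime(2000, 1, 1, 0) + timedelta(hours=h) for h in range(6)]
temps = [2.0, 1.5, 1.5, -0.5, 0.0, 3.0]

delta = hourly_change(temps)
magnitude = change_magnitude(temps)
# Hypothetical event covering the third and fourth hours.
holiday = event_dummy(hours, hours[2], hours[4])
```

Each resulting list has one entry per hour-file, so the columns can be joined to the standardised file sizes as fixed-effect predictors.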
The analysis relies on the relationship between activity, whether mobility or phone usage, as measured by the number of rows in a file, and the file's size on disk. This avoids the daunting need to open every file to measure and count each activity. The robustness of this approach is demonstrated by the high correlation between file size and count of activities (supplementary Figure S2). The dependent variable for the analysis, aggregated to hours, is thus simply file size. Temporal patterns in the data are described and then analysed using established methodologies. Autocorrelation across time is used to assess hourly and daily time lags in activity. We conduct an autocorrelation test of the variation in file size by correlating the sizes of hour (and day) files with those of lagged hour (and day) files, as specified in equation 1. The result $r_h$ is the correlation between any hour or day and the hour or day at time lag $h$; it ranges between perfectly correlated, random and perfectly negatively correlated ($r_h$ values of 1, 0 and -1 respectively). By plotting $r_h$ for a series of values where the time lag $h$ increases by one hour (or by one day for the day series), we depict changes in autocorrelation as the time lag increases, in the shape of a correlogram. The correlogram illustrates the trend in autocorrelation and is used to test our hypothesis that the size of files follows a pattern that is strongly related to the repetitive nature of days and hours. Multilevel models (MLM) are used because the dependent variable is ordered in time in a nested hierarchy of hours, days and months. MLM is used to estimate variance across temporal levels, accounting for weather and special events. The temporal categories make it possible to detect how much of the file-size variation can be assigned to hours, days or months, and also to see how much of the variation, and in which temporal category, can be explained by the fixed effects. The fixed effects are the event and weather dummy variables listed above.
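The lagged correlation and correlogram described in this section can be sketched in a few lines. This is a minimal illustration that assumes a plain Pearson correlation between the series and a copy of itself shifted by the lag; the exact estimator behind equation 1 is not reproduced here, and the daily series below is synthetic rather than study data.

```python
import math


def autocorr(series, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps (lag >= 1)."""
    x = series[:-lag]
    y = series[lag:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic daily file sizes (hypothetical): busier weekdays, quieter weekends, 20 weeks.
weekly_profile = [1.0, 1.1, 1.1, 1.0, 0.9, 0.4, 0.3]
daily = [weekly_profile[d % 7] for d in range(140)]

# Correlogram: one correlation value per lag, here for lags of 1 to 14 days.
correlogram = {lag: autocorr(daily, lag) for lag in range(1, 15)}
```

For a series with a weekly rhythm, the correlogram peaks at lags of 7 and 14 days and dips in between, which is the signature the thin black line in Figure 1b is read for.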
Three different multi-level analyses are employed in this study. First, we run the multi-level regression (the empty model) using no explanatory variables, but with three temporal levels: hour, day and a combined month-year. The empty model is formulated as specified in equation 2.

$$y_{ijk} = \beta_{0ijk} + v_{k} + u_{jk} + e_{ijk} \quad \text{(eq. 2)}$$

Here $y_{ijk}$ refers to the file size z-score, and the subscripts $i$, $j$ and $k$ represent the hour, day and month-year levels respectively. The variation (or random errors) at the different levels is expressed as $v_{k}$ (month-year), $u_{jk}$ (day) and $e_{ijk}$ (hour). The empty model enables estimating the percentage of variation attributable to hours, days and month-years, and also enables assessing to what extent the introduction of fixed effects for events and weather in the full model reduces variation at different temporal levels. The second and third multi-level analyses differ from each other in two ways: (1) the second (full) model makes use of all listed event and weather variables and (2) the third (restricted) model is an empty model (no fixed effects) but contains only the hours and days that are not associated with specific events. Therefore, the restricted model contains fewer cases. The full multi-level model can be expressed as specified in equation 3.

$$y_{ijk} = \beta_{0ijk} + \beta_{1} x_{1ijk} + \cdots + \beta_{n} x_{nijk} + v_{k} + u_{jk} + e_{ijk} \quad \text{(eq. 3)}$$

Here $\beta_{0ijk}$ represents the intercept and $\beta_{n} x_{nijk}$ the time-specific regression coefficient between a fixed effect predictor and the dependent variable. Predicted file sizes from these MLM are compared with the observed file sizes using the Root Mean Square Deviation (RMSD) so as to measure the scale of temporal variation. By comparing the deviation between estimated and observed size of the hour-files, we trace which hours and weekdays tend to vary more or less from the average. By repeating the RMSD test for each hour, day and combined weekday and hour, we detect which hours and days are best suited for selection of a representative subsample. The RMSD can be expressed as in equation 4.
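A standard form of the RMSD is given below for orientation. The notation is an assumption, since only the name of the measure is stated above: $\hat{y}_{t}$ denotes the file size predicted by a multilevel model for hour-file $t$, $y_{t}$ the observed standardised file size, and $n$ the number of hour-files in the group being evaluated (all files, one weekday, or one hour of the day).

```latex
\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( \hat{y}_{t} - y_{t} \right)^{2}}
```

Because the file sizes are z-scores, the RMSD is expressed in standard deviations of file size, so a value such as ~0.19 means that predictions typically miss by about a fifth of a standard deviation.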
References

1. A survey of results on mobile phone datasets analysis
2. Variability in regularity: Mining temporal mobility patterns in London, Singapore and Beijing using smart-card data
3. How to Draw a Neighborhood? The Potential of Big Data, Regionalization, and Community Detection for Understanding the Heterogeneous Nature of Urban Neighborhoods
4. New perspectives on ethnic segregation over time and space. A domains approach
5. Spatial and temporal patterns of economic segregation in Sweden's metropolitan areas: A mobility approach
6. Response to COVID-19 in Taiwan: big data analytics, new technology, and proactive testing
7. Mobile phone data and COVID-19: Missing an opportunity? arXiv preprint
8. Activity-based human mobility patterns inferred from mobile phone data: A case study of Singapore
9. Human mobility characterization from cellular network data
10. Simulating human mobility patterns in urban areas
11. A general survey of privacy-preserving data mining models and algorithms
12. Mobility, data mining and privacy: Geographic knowledge discovery
13. Collective Response of Human Populations to Large-Scale Emergencies
14. Predictability of population displacement after the 2010 Haiti earthquake
15. Using mobile phone data for electricity infrastructure planning
16. Data for development reloaded: visual matrix techniques for the exploration and analysis of massive mobile phone data
17. IIS survey Swedes and internet 2017

Acknowledgements

We thank the mobile phone data provider for the data. The work of MT and JÖ was performed with the support of Formas Research Council funding.