Common Misconceptions about Population Data
Peter Christen and Rainer Schnell
2021-12-20

Databases covering all individuals of a population are increasingly used for research studies in domains ranging from public health to the social sciences. There is also growing interest by governments and businesses to use population data to support data-driven decision making. The massive size of such databases is often mistaken as a guarantee for valid inferences on the population of interest. However, population data have characteristics that make them challenging to use, including various assumptions being made about how such data were collected and what types of processing have been applied to them. Furthermore, the full potential of population data can often only be unlocked when such data are linked to other databases, a process that adds fresh challenges. This article discusses a diverse range of misconceptions about population data that we believe anybody who works with such data needs to be aware of. Many of these misconceptions are not well documented in scientific publications but only discussed anecdotally among researchers and practitioners. We conclude with a set of recommendations for inference when using population data.

There is a shift in many domains of science towards the use of large and complex databases that cover whole populations to replace, or at least enrich, traditional data collection methods such as surveys or experiments [13, 28, 33, 68]. This kind of data is now seen as a crucial strategic resource for research in various disciplines [23, 50]. Similarly, governments and businesses increasingly recognise the value large population databases can provide to improve decision making processes [3, 12, 69]. The monetary value of personal data can be seen in the tremendous success of today's largest companies, whose business models are based on the use of data about people at the level of nearly complete populations [46], and where the value of an individual user has been estimated to be up to $40 [55].

While misconceptions about personal data in the commercial sector might lead to lost customers and reduced revenue due to wrong inference, within government initiatives they can lead to severe and costly mismanagement of programs, for example in the public health or social security domains, as well as to a more general loss of public trust in governments [14, 49, 54]. In the context of research, assumptions and misconceptions about population data and how these are used for research studies, potentially after being linked with other data, can lead to wrong outcomes that result in conclusions with severe negative impact in the real world [12, 34]. For example, in October 2021 the German government reported that the official vaccination rate had been underestimated by 3.5 million people [7], which was most likely due to unreported vaccinations by physicians working within large organisations. The databases of these physicians were added to the digital vaccination monitoring system of the supervising federal health agency several months after the databases of residential physicians.
While it is generally acknowledged that bad data quality can cost organisations millions if not billions in lost revenue, loss of reputation, or losses due to bad decision making [5, 58], here we focus on one specific but commonly required type of data: personal data about individuals covering (nearly) whole populations. Following a recent definition of Population Data Science (succinctly "the science of data about people") [61], we define population data as data about people at the level of a population. The focus on populations is important, as it refers to the scale and complexity of the data being considered. Both challenge the (manual) data quality assessment and the data processing steps that are normally conducted on the much smaller data sets used in traditional medical studies or social science surveys. Personal data include personally identifiable information (PII) [65], such as the names, addresses, or dates of birth of people. Most administrative and operational data collected by governments and businesses can be categorised as personal data, including data about people's education, their electronic health records and shopping habits, and their social security, financial, and taxation data [41].

A crucial aspect of population data is that they are not primarily collected for research, but rather for operational or administrative purposes [39, 50, 51, 61]. As a result, researchers have much less control over such data and their quality, potentially only limited ways to learn about the data's provenance, and even fewer opportunities to edit and clean such data to make them fit for the purpose of a specific research study [8]. Quoting Brown et al. [15], "in science, three things matter: the data, the methods used to collect the data (which give them their probative value), and the logic connecting the data and methods to conclusions". When both the data and their collection are outside the control of a researcher, conducting proper science can become challenging.

A further common assumption is that due to their large size and because they are readily available, population databases allow valid inference at no or little extra cost. One example is the increasing use of medical databases for public health research, compiled from the electronic health records of patients in hospitals or doctor's clinics, and potentially linked with databases that contain other personal information such as education or employment details of these patients. While analysing such linked data can provide exciting new insights into the effects of people's social status upon their health, not considering any potential bias or other limitations of such data can lead to wrong conclusions [10, 26, 43, 78, 83]. Another example is the digital footprints or traces individuals leave behind in their daily lives, such as location data and electronic communication [1]. These are based on self-selected subsets of a population and are neither population covering nor a random sample [13]. Therefore, handling trace data as population data could result in biased estimates [24, 51].

Most of the discussion about the use of personal data for research has been about privacy and confidentiality [2, 17, 21, 30, 47]. Much less consideration has been given to how data quality and assumptions about personal data can influence the outcomes of a research study.
Concerning administrative data, Hand writes [39]: "One of the lessons that we have learnt from data mining practice over the past 20 years is that most of the unusual structures in large data sets arise from data errors, rather than anything of intrinsic interest". The same can be said about population covering databases of personal data. Interestingly, not much has been written about the characteristics of personal data and how they differ from other types of data, such as scientific or financial data. During the writing of this article, we conducted extensive literature research to find scientific publications, government reports, or technical white papers that describe experiences or challenges when dealing with population databases. The number of publications we identified was small, indicating that many challenges encountered and lessons learnt are not being shared, even though these would be of high value to both researchers and practitioners who work with population data. Therefore, we mainly draw on our several decades of experience from diverse disciplines, working with large real-world population databases and in collaborative projects with both private and public sector organisations across multiple continents.

With this article, we aim to improve the handling of population covering databases. By describing common misconceptions about personal data, we hope to show how researchers can identify and avoid the resulting traps. We group these misconceptions into five categories: sampling, measuring, capturing, processing, and linking. We do not discuss misconceptions related to the analysis of population data; how to prevent various pitfalls in statistical data analysis and machine learning is the topic of extensive discussions elsewhere [38, 44, 72].

Before we describe misconceptions in these categories, some clarifications of the kind of data to be discussed are needed. Population data are observational data that most commonly occur in administrative or operational databases as held by government or business organisations [39]. In their most general form [8], each entity (person) in a population database is represented by one or more records (one row in a spreadsheet, text file, or database table). Each such record consists of a set of attributes (also known as columns, fields, variables or features), where each of these attributes can contain values from a specific domain. For example, years of birth are made of numbers, while first names generally consist of letters and possibly some special characters [62].

The attributes that represent entities in population data can be categorised into identifiers and microdata [21]. The first category can either be an entity identifier (such as a social security number) that is unique for each person in a population [19], or a group of attributes that, when combined, can be used to uniquely identify individuals. Such attributes are generally known as quasi-identifiers (QIDs), and include names, addresses, date and place of birth, and so on [21]. While single such attributes by themselves generally do not allow the identification of entities (there might be multiple individuals named 'Nick Kennedy' who live in 'London'), when enough QIDs are combined, they become unique (there might be only one 'Nick Kennedy' in 'London' who was born on 19 August 1992).
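To make this concrete, the following is a minimal sketch (in Python, with made-up records and hypothetical column names) of how one might profile how quickly combinations of QIDs become unique in a given database.

```python
# Minimal sketch (made-up records, hypothetical column names): the more QIDs
# are combined, the larger the share of value combinations held by one record.
import pandas as pd

records = pd.DataFrame({
    "first_name": ["Nick", "Nick", "Anna", "Anna"],
    "city":       ["London", "London", "Berlin", "Berlin"],
    "birth_date": ["1992-08-19", "1987-03-02", "1992-08-19", "1992-08-19"],
})

for qids in (["first_name"],
             ["first_name", "city"],
             ["first_name", "city", "birth_date"]):
    group_sizes = records.groupby(qids).size()
    unique_share = (group_sizes == 1).mean()  # share of QID combinations matching one record
    print(qids, f"-> {unique_share:.0%} of value combinations identify a single record")
```

Such a profiling step can help assess which QID combinations are identifying enough to be useful for linkage, and which are too identifying to be released with microdata.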
QIDs are generally not used in research studies, or if they are then only in limited forms (such as age or gender) that do not allow the identification of individuals [27, 29]. Many of the misconceptions about population data are, however, about QIDs and their meaning, how they are captured and processed, and how they are used to link and aggregate population databases to transform them into a format that is of use for a research study. The second component of personal data is known as micro- or payload data [21], and they include the data of interest for a research study, such as the medical, education, financial, or location details of individuals. Much of these data are highly sensitive if they are connected to QID values because, combined, they can reveal sensitive personal details about an individual [21]. Research into data anonymisation and disclosure control methods [27, 29] addresses how sensitive microdata can be made available to researchers in anonymised form.

It is commonly recognised that isolated population databases are of limited use when trying to investigate and solve today's complex challenges, such as the relationships between genetic and environmental aspects of diseases, or how a pandemic spreads through a population [8, 61, 73]. Therefore, in many projects that are based on population data, record or data linkage [42, 45] is employed to identify and link all records that correspond to the same entity (person) across diverse databases. Linked records at the level of individuals, rather than aggregated data, are generally required to allow the development of accurate predictive models.

There is widespread confusion about the techniques available for linking data sets. From the viewpoint of statistics or computer science [19, 45], linking records about persons requires the identification of a specific person who is represented in all or some of the databases being linked. If no identification of a specific person is possible, linking will be impossible. Record linkage techniques have a long history going back to the 1950s [66] and have traditionally been employed in the domains of public health and national statistics [45]. In more recent times such techniques have seen much wider use in domains ranging from business intelligence to social science research. Within governments, record linkage is, for example, being used to find welfare fraudsters [14, 21] and in national security to identify terrorism suspects [18]. Modern record linkage techniques are based on sophisticated statistical or machine learning based approaches [25, 45] and are capable of producing high quality linked data sets. Record linkage, like data integration more generally, is however a complex and challenging process [4, 19]. Assumptions about how the databases being linked have been collected and processed, and about the actual linkage methods being employed, can result in linked data sets of questionable quality and potentially wrong decisions being made from such data [10, 26, 43].

While many of the misconceptions about population data seem obvious, they are often not taken into consideration when such data are captured, processed, linked, and then analysed in research studies. Many data issues are due to humans being involved in the processes that generate population data, including the mistakes and choices people make, changing requirements, novel computing and data entry systems, limited resources and time, as well as decision making influenced by political and economic considerations.
Furthermore, policy and decision makers are often not aware of the issues we discuss here, and they assume any kind of question can be answered with highly accurate and unbiased results when using administrative or operational databases that cover a whole population [8]. While the literature on data quality and data cleaning is broad [5, 52, 58, 70], given the widespread use of personal data at the level of populations it is surprising that little published work seems to discuss data quality aspects specific to personal data [21, 76, 81]. The reasons why misconceptions such as the ones we discuss here are not described in publications include that projects based on population data are generally covered by privacy regulations, such as the European General Data Protection Regulation (GDPR) or the US Health Insurance Portability and Accountability Act (HIPAA), as well as by additional confidentiality agreements [21]. The processes and methods employed are often seen as commercial in confidence and covered by intellectual property agreements, where in some cases researchers are not even allowed to disclose what kinds of data they are working with. Furthermore, data centric aspects are generally not discussed in publications where the focus is on presenting the results obtained in a research study rather than the steps taken to obtain these results.

Sampling is the process of selecting entities (people) from an actual population to be included in a population database. The choice of how individuals are selected can result in a variety of misconceptions.

(1) A population database contains all individuals in a population. This assumption is unlikely to hold for any population database. Even government databases that are supposed to cover whole populations, such as taxation or census databases, very likely have subpopulations that are under-represented or absent, such as children, residents without a fixed address, or temporary residents. For example, individuals who do not have a national health identifier number and therefore are not eligible for government health services (such as tourists or international students) might be missing from population health databases. As the COVID-19 pandemic has shown, there are also always individuals who refuse to participate in government services for personal reasons, which can be influenced by their ethnicity, religion, or political beliefs.

When a population database is generated by linking data from different organisations, linking might only be possible between parts of a population. For example, if linking is restricted to public hospitals and certain private hospitals, not all patients will be included in the resulting linked data set. If private hospitals refuse to participate in a certain study, this will likely introduce bias, since patients with higher incomes are more likely to go to exclusive private hospitals. In highly segmented health care systems such as Germany's, restricting an analysis to patients from only the statutory health insurance system will most certainly cause biased estimates.

The digital divide [24], the division between people who have access to and use digital services and media and those who do not, likely also results in biased population databases. Younger and more affluent individuals are more likely to be included in databases collected using digital services compared to older people and those with a lower socio-economic status or who are migrants.
A wrong conclusion from this misconception is that any question can be answered using a population database. Because it is assumed all individuals in a population are represented in that database, the illusion is that it will be possible to identify any subpopulation of interest and explore this group of people, even if that group is very small [39]. However, any data quality issues in the records representing a subpopulation can have severe consequences, especially when analysing small groups.

(2) Records in a database are within the scope of interest. Individuals might have left the population of interest for a research study because the criteria for inclusion in a population database are no longer met. For example, a person may have died (although that might be of interest in itself) or have left the geographical area of a study. Because many organisations are notified of the death of an individual, or of a change of address, only some time after the event (in some cases never), it is likely that at any point in time a population database contains records that are outside the scope of interest. Including records about these people can affect research studies as well as operational systems. In the case of deceased individuals, it can also result in reputational damage to an organisation if a deceased person is asked to pay bills or is paid social security benefits. Keeping a population database up-to-date, both with regard to including only relevant individuals and to keeping their personal details updated, can be highly challenging and likely requires linking it with other databases that provide details about recently deceased individuals and those who have moved away. A record linkage based approach using a database of residents to estimate the true population is, for example, used in Estonia [79].

(3) Population databases are complete. Population covering databases might be assumed to contain information on all or at least most individuals in a population. This misconception has caused a considerable amount of misinvested research money and time, since many population databases are sparse matrices containing different pieces of information for different subsets of a population. The sparseness is due to the way many large databases are generated, by merging different individual databases that each cover only a part of the whole population. The resulting sparse patterns of populated cells of the matrix may prevent many, if not all, statistical analyses of the data. A typical example of this kind of data is consumer data used to predict credit scores or consumer types. Although sufficient for their initial purpose, using such data for statistical purposes such as non-response adjustments for surveys has in some cases been disappointing [77], because the sparsity of the auxiliary data may require different adjustment models for each pattern of available auxiliary data.

(4) The population covered in a database is well defined. The reasons why records about individuals are included (or not) in a database are crucial to understanding the population covered by that database. Some population databases can be based on mandatory inclusion of individuals (think of government taxation or health databases) while others are based on voluntary or self-selected inclusion (think of medical research databases where patients can opt in or opt out when asked to provide their details [21]).
A population database to be used for research is often extracted from an operational database (that is being dynamically updated), and the definitions and rules used to extract records about certain individuals might not be known to the data scientists who are processing and linking the database, and even less so to the researchers who will be analysing it [8]. Furthermore, the definitions used to extract data might differ between organisations or change over time, or they might contain minor mistakes such as ill-defined age or date ranges. For example, COVID-19 cases could be included in a database based on the date of symptom onset, sample collection, or diagnosis [4], where each of these results in different numbers of records being added. Such differing mechanisms and any resulting bias might not be known or taken into account when research is conducted based on such population data.

(5) Records in a population database are unique. A database may erroneously contain multiple records that refer to the same person. Data entry validation checks may prevent exact duplicates; however, fuzzy or approximate duplicates [19] might be missed by (automatic) checks. In real-world settings, the same person can therefore be registered multiple times in different institutions and not be identified as the same individual. An example is people in a social security database who should have one record but were registered multiple times because they changed names or addresses and might have forgotten their previous registration (and their unique identifier number), or because they might be interested in multiple registrations to obtain social benefits more than once. In both hospital admission and social security databases we have seen individual cases with dozens of records. Some duplicates are very difficult to find, for example a woman who changes her surname and address when she gets married, so that only her first name, gender, and place and date of birth stay the same. The flipside is that several people with highly similar personal details, such as twins who only have different first names, might not be recognised as distinct individuals but instead be treated as duplicates due to their highly similar QID values.

Duplicate records are possible even if a database contains entity identifiers (such as social security numbers, patient numbers, or voter identifiers) that should prevent multiple records for the same entity from being added to a database. The misconception here is that such identifiers are always provided correctly by an individual, which due to human behaviour and errors is not always the case. Duplicates have been identified in applications where high data quality is crucial, such as voter registration databases [67]. Unique entity identifiers need to be robust (include a checksum), stable over time, complete (present for all records in a database), assigned in a globally unique way, and carefully validated during data entry [19].

The process of measuring the characteristics of people can result in a variety of misconceptions.

(6) Population data are always of value. Both private and public sector organisations increasingly make databases publicly available to facilitate their analysis by researchers. However, many of these databases either lack the metadata or context needed for them to be of use, or they are aggregated or anonymised due to privacy and confidentiality concerns.
A main reason for this is that past experience has shown how sensitive personal information about individuals can be re-identified even from supposedly anonymised databases [75]. Population data without context are unlikely to be of use for research studies. A database of QID values (such as names and addresses) without any (or only limited) microdata is, by itself, of little value for research. Having the educational level of individuals in a database only becomes useful if this database can be linked with other data at the level of individuals. Furthermore, due to the dynamic nature of people's lives, population data become out of date and therefore need to be updated on a regular basis. Without adequate metadata, useful detailed microdata, and regular updates, many publicly available databases are of little value for research.

(7) Records always refer to real people. A surprisingly large number of real-world databases contain records of people who never existed [20]. In many cases, these records are due to training and exercises for data entry personnel, or to the testing of software systems. If these records are not deleted, they remain in a database. Although in many cases easily identified by a human ('Tony Test' living in 'Testville'), these records are difficult to detect by data cleaning algorithms since they were designed to have the characteristics of real people and often contain high amounts of variation and errors [22]. In population databases collected from social media platforms, complete records might furthermore correspond to fake users or even AI (artificial intelligence) bots that generate human-like content [8, 51].

(8) Errors in personal data are not intentional. There are social, cultural, as well as personal reasons why individuals may decide to provide incorrect personal details. These include fear of surveillance by governments, trying to prevent unsolicited advertisements from businesses, or simply the desire to keep sensitive personal data private [16]. Fear of data breaches and of how personal data are being (mis)used by organisations clearly contributes to the reluctance of individuals to provide their details unless deemed necessary [49, 75]. In domains such as policing and criminal justice, faked QID values such as name aliases occur commonly as criminals try to hide their actual identities [57]. Incorrectly provided data might only be modified slightly from a correct value (such as a date of birth a few days off), be changed completely (a different occupation given), not be provided at all (no value entered if an input field is not mandatory), or be made up (such as a telephone number in the form of '1234 5678'). The decision to provide incorrect personal data, or to withhold them, is highly dependent upon where the data are being collected. It is less likely for an individual to provide wrong or faked values on an official government form or when opening a bank account compared to when ordering a book in an online store or signing up for a social media platform.

(9) Certain personal details are fixed. While some personal details, such as names and addresses, are known to change over time for many individuals, it is often assumed that others are fixed at birth. These include ethnic and gender identification, as well as place and country of birth. Therefore these are seen as suitable QID values, for example for record linkage or longitudinal data analysis. In most current databases, ethnic identification is self-reported.
The available categories depend upon how a society values different subpopulations. Socially influenced attributes such as race are sometimes changed by individuals over time; a recent example is the 'Black Lives Matter' movement, after which many people became more proud to identify as being of colour. Gender (the socially constructed characteristics of women and men) is an example where until recently most databases only allowed two or possibly three values (male, female, or unknown). Increasingly, non-binary gender values are being officially recognised. As a result, not only might an individual's actual gender value change over time, but new gender 'categories' that were not available in the past can now appear in databases.

It is even possible for values fixed at birth to change in the real world. An example is the Eastern German city of Chemnitz, which from 1953 until 1990 was named Karl-Marx-Stadt. Individuals born during that period have a country of birth (German Democratic Republic) and a place of birth that both no longer exist. If they change their name and address details, and possibly even their gender, then records referring to such an individual collected at different points in time will be almost impossible to identify and link correctly.

(10) There is only one valid spelling of a name. Names are a key component of personal data as collected in many population databases. Unlike normal text, where there is a single correct spelling for a given word, for many names there are multiple variations (such as 'Gail', 'Gayle', and 'Gale'), and all of them are valid [19, 56]. When data are entered, for example over the telephone, different name variations might be recorded for the same individual. Furthermore, people are sometimes given or use nicknames (such as 'Tash' for 'Natasha'). There are many different cultural aspects of names, including different name structures, ambiguous transliterations from non-Roman into the Roman alphabet, or name changes over time for religious reasons, to name a few [19]. Such name variations are a known problem when names are used to identify individuals or when names are compared between databases for the purpose of record linkage. Working with names can therefore be a challenging undertaking that requires expertise in the cultural and ethnic aspects of names as encountered in a population database. Overall, there are many misconceptions about (personal) names [62].

(11) Missing data have no meaning. Missing (empty) data are common in many databases [40]. When occurring in QID or microdata attributes, they can lead to problems for data processing, linking, and analysis [21, 40]. Missing data can occur at the level of missing records (no information is available about certain entities), missing attributes (QID or microdata for all entities in a database are not available), or missing QID or microdata values for individual records (specific missing attribute values) [8]. There are different categories of missing data [40, 59]. In some cases a missing value does not contain any valuable information, in others it can be the only correct value (children under a certain age should not have an occupation), or it can have multiple interpretations. A missing value for a question about religion in a census, for example, can mean that an individual does not have a religion or that they chose not to disclose it.
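The following minimal sketch (with made-up census-style responses) illustrates how the interpretation chosen for such missing values directly changes an estimate.

```python
# Minimal sketch (made-up responses): the estimated share of people with a
# religious affiliation depends on how missing answers are interpreted.
import pandas as pd

religion = pd.Series(["catholic", "none", None, "muslim", None, "none"])

answered = religion.dropna()
share_excluding_missing = (answered != "none").mean()               # drop missing answers
share_missing_as_none = (religion.fillna("none") != "none").mean()  # treat missing as "no religion"

print(f"excluding missing answers:        {share_excluding_missing:.0%}")  # 50%
print(f"missing counted as 'no religion': {share_missing_as_none:.0%}")    # 33%
```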
Missing data can also occur in settings where resources are limited and data entries therefore have to be prioritised, such as in emergency departments [4]. Care must therefore be taken when considering missing data. Removing attributes or even records with missing values, or imputing missing values [59], can result in errors being introduced into population databases that can lead to bias and errors in a research study.

(12) Codes are unique and stable over time. Values in population databases are often encoded using classification systems, such as occupation classifications or the International Classification of Diseases (ICD), the latter currently in its eleventh revision. It is often assumed that such codes are fixed over time and unique, in that a certain item, such as a disease or occupation, is only assigned one code, and that this assignment does not change. However, most coding systems are revised over time, with new codes being added, outdated and unused codes removed, and whole groups of codes being recoded (including codes being swapped). As a result, it is possible that a database contains codes which are no longer valid. An example is the codes of the Australian Pharmaceutical Benefits Scheme (PBS) [63], where the antidepressant Venlafaxine had the code N06AE06 until 1995, when it was given the code N06AA22, which was then changed to N06AX16 in 1999. The assumption, therefore, that codes are stable and unique and can be used to, for example, group or categorise records of patients for a research study can lead to wrong analysis results unless care is taken about the meaning of these codes. This can be an issue (even if a single database is analysed) if records have been collected over a period of time, as is the case for longitudinal research studies. Once databases are linked, the challenge of different codings can be exacerbated if the source databases were collected at different points in time.

(13) Data definitions are unambiguous. Many population databases contain categorical attributes that are based on some definition of who to include in or exclude from a certain population. A recent example is the definitions of deaths or hospitalisations due to COVID-19 infections [34], where different US states used varying definitions that resulted in databases that could not be used for comparative analysis. Another example illustrating this misconception is the definition of a live birth as used in the divided Western and Eastern parts of Germany in the 20th century [74]. In the former eastern German Democratic Republic, a live birth was defined by two indications, heartbeat and respiration. A newborn who only had one of these two after neonatal measures were applied was still counted as a stillbirth. On the other hand, the western Federal Republic of Germany followed the definition of the World Health Organisation, and this newborn would have been counted as a live birth who subsequently died. These different definitions resulted in much lower reported infant mortality in Eastern compared to Western Germany. In another example, a recent study [36] compared data about fatal police violence from the US National Vital Statistics System (NVSS) to three non-governmental, open-source databases. Because of differences in the definitions of police violence, the NVSS under-reported deaths attributable to police violence by more than 50% between 1980 and 2018.

As with coding systems, data definitions can also change over time. Unless relevant metadata describing such changes are available, it can be challenging to identify any changed definitions because the change might only have subtle effects on the characteristics of the population of interest for a research study.
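For changed codes, at least, records can be mapped to current codes before analysis. The following is a minimal sketch (not taken from any actual system) in which superseded codes, such as those in the Venlafaxine example above, are followed forward to their current equivalents; the mapping table and function name are illustrative only, and a real study would need an authoritative, versioned code concordance.

```python
# Minimal sketch: follow chains of superseded codes to the current code before
# grouping records. The mapping below mirrors the Venlafaxine example in the
# text; in practice an authoritative, versioned concordance would be required.
CODE_SUCCESSORS = {
    "N06AE06": "N06AA22",   # code used until 1995
    "N06AA22": "N06AX16",   # code used until 1999
}

def current_code(code: str) -> str:
    """Return the most recent code in a chain of superseded codes."""
    seen = set()
    while code in CODE_SUCCESSORS and code not in seen:
        seen.add(code)
        code = CODE_SUCCESSORS[code]
    return code

# Records coded in 1994 and in 2000 now refer to the same drug:
assert current_code("N06AE06") == "N06AX16"
assert current_code("N06AX16") == "N06AX16"
```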
(14) Temporal data aspects do not matter. Given the dynamic nature of personal details, the time and date when population data are measured and included into a database can be crucial, because differences in data lag can lead to inconsistent data that are not suitable for research studies [13, 34]. If it takes different organisations different amounts of time to measure data about the same events, then these data are clearly not comparable, resulting in misreporting of, for example, COVID-19 deaths [4] or vaccination rates [7, 13]. Daily, weekly, monthly, or seasonal aspects can influence data measurements, as can events such as public holidays and religious festivities which likely only affect certain subpopulations. Data corrections are not uncommon, especially in applications where there is an urgent need to provide initial data as quickly as possible, for example to better understand a global pandemic [4, 13]. Later updates of the data might not be considered, leading to wrong conclusions in research studies.

When databases collected at different points in time are being linked, the aspects of people moving, changing their names, or both (for example due to getting married) need to be carefully considered [21]. Having two records from two databases collected two years apart that have the same name but different addresses does not necessarily mean they cannot refer to the same person, while two records with the same address and the same name in these two databases can refer to two different individuals. Settings where such cases are not uncommon include prisons and student accommodation that house large numbers of individuals.

(15) The meaning of data is always known. It is not uncommon for population databases to contain attributes that are not (well) documented. These can include codes without known meaning, irrelevant sequence numbers, or temporary values that have been added at some point in time for some specific purpose. If no documentation is available, database managers are generally reluctant to remove such attributes. As a result, spurious patterns might be detected if such attributes are included in a data analysis.

There are many different ways in which (personal) data can be captured, including manual typing from handwritten or typed documents, scanning followed by optical character recognition, or speech followed by automatic speech recognition [21]. Increasingly, biometric data are captured via sensors such as cameras or fingerprint readers. Each of these data capturing methods can introduce specific data quality issues, such as typing or scanning errors. There are, however, some common misconceptions about the capturing of population data.

(16) All records in a database were captured using the same process. Since population covering databases are large and often collected over long periods of time and wide geographical areas, records are often generated or entered by a large number of staff, giving rise to different interpretations of data entry rules. For example, if an input field requires a mandatory value, humans will enter all kinds of unstandardised indicators for missingness, ranging from single symbols (like '-' or '.') and acronyms ('NA' or 'MD') to texts explaining the missing data (like 'unknown'). If a database is compiled by independent organisations (as, for example, in the German system of official statistics), these different interpretations of data entry rules will create a need for standardisation before analysis.
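A minimal sketch of such a standardisation step is shown below; the list of placeholder strings is an assumption and would have to be compiled by profiling the actual database.

```python
# Minimal sketch: map unstandardised missingness indicators (an assumed,
# incomplete list) to a single missing-value marker before any analysis.
import numpy as np
import pandas as pd

MISSING_INDICATORS = {"-", ".", "na", "n/a", "md", "unknown", ""}

def standardise_missing(value):
    """Return NaN for known placeholder strings, otherwise the stripped value."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    return np.nan if text.lower() in MISSING_INDICATORS else text

occupations = pd.Series(["teacher", "-", "NA", "unknown", "nurse", None])
print(occupations.map(standardise_missing).isna().sum())  # 4 values treated as missing
```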
Manual data entry such as typing can furthermore lead to different error characteristics between data entry personnel [56]. There are regional differences in how dates are written (day/month/year versus month/day/year), how numbers are formatted (900.00 in the UK versus 900,00 in Germany), or how addresses are structured. Furthermore, data can be captured with different temporal and/or spatial resolution, such as only providing postcodes or city names versus detailed street addresses. Different locations can also have the same name, such as Berlin the German city versus Berlin the German state. As a result, the characteristics of both QID and microdata values can differ between subsets of records in a population database, making their comparison and analysis challenging [4]. Changing regulations with regard to what needs to be captured in a certain database will also result in data of different content, format, and structure, where potentially new QID attributes are added to records while others are removed or left empty. Landline telephone numbers, for example, are increasingly no longer the main point of contact.

(17) Attribute values are correct and valid. Any data values, whether captured by some form of sensor or manually entered into a database, can be subject to errors arising from equipment malfunction, human data entry, or cognitive mistakes (such as confusion about the data required or difficulties recalling correct information) [8, 52]. In the medical domain, manual typing errors, wrong interpretations of forms (think of handwritten prescriptions by doctors), entering values into the wrong input fields, or mistakes in interpreting instructions (when prescribing drugs) are common, with rates ranging from 2 to 514 mistakes per 1,000 prescriptions having been reported [82]. While data validation and data cleaning tools can easily detect values that are outside the range or domain of what is valid (such as 31 February) [52], unless some form of external validation is possible, it is generally not feasible to ascertain the correctness of any given value. Whether a patient is really 42 years old (a valid human age) can only be validated if it is possible to link the person's health record to authoritative information (likely from an external database) about her true age. Furthermore, while individual QID values might each be valid for a given record, they might contradict each other. For example, a record with first name 'John' and gender 'f' likely contains one QID value that is erroneous. Such contradictions can be identified and corrected using appropriate edit constraints [45].

(18) Values are in their correct attributes. The assumption here is that data entry personnel always enter values into the correct attribute, such as first and last names. This will likely not always be the case, for example for Asian names, many of which can be used both as first and last names. Similar cases can also occur with certain Western names, for example both 'Paul' and 'Thomas' are used as first and last names.

(19) Data validation rules produce correct data. In order to ensure data of high quality, many data management systems contain rules that need to be fulfilled when data are being captured. For example, registering a new patient in a hospital requires both a valid address and a date of birth. In some cases, such as in emergency admissions, not all of this information will be known.
As a result of data entry checks, default values are often used, a common example being 1 January for individuals with unknown dates of birth. While these are valid values, such defaults, if not properly handled, can result in skewed data distributions that adversely affect research studies. Data entry personnel might also apply ad-hoc rules to bypass data entry requirements and to ensure that entered records pass any data validation steps.

(20) All relevant data have been collected. Because the primary purpose of most population databases is not their use for research studies (rather, research is a secondary use of such data), not all information that is of importance for a given study might be available for all records in a database, or it might only be available for subsets of records. This can, for example, be due to changes in data entry requirements over time, or data might have been withheld by the owner due to confidentiality concerns or for commercial reasons, or only be provided in aggregated or anonymised form. Data that are not available are known as dark data [40], data we do not know about but that could be of interest to a research study. As a result, certain required or desired information might be missing for a given research study, making a database less useful or requiring the use of alternative, less suitable data for that study.

It has to be kept in mind that the information collected in population data (both administrative and operational databases) is about what people do [39]. This is unlike survey data, where specific questions about attitudes, beliefs, expectations or intentions are commonly asked with the aim of trying to understand the behaviour of people. Factual information about what people actually do can provide different answers compared to questions about what they claim to do, while inferring people's beliefs from their behaviour might not be possible.

It is rare for population databases to be used for analysis without any processing being conducted. The organisations that collect data about individuals, and those that further aggregate, link or otherwise integrate such data, are all likely to apply various, slightly different forms of processing to their data [4]. Processing can include data cleaning and standardisation, parsing of free text values, transformations (lossy or lossless), numerical normalisation, recoding into categories, imputation of missing values, and data aggregation [19]. The use of database management systems and data analysis software can furthermore result in data being reformatted internally before being stored and later extracted for further processing and analysis. Each component of a data pipeline can therefore result in both explicit (user-applied) and implicit (internal to a system) processing being conducted, leading to various misconceptions.

(21) Data processing can be fully automated. Much of the processing of population data has to be conducted in an iterative fashion, where data exploration and profiling lead to a better understanding of a database, which in turn helps to apply appropriate data processing techniques [52]. Commonly known as data wrangling [71], this process requires manual exploration, programming of data specific functionalities, domain expertise with regard to the provenance and content of a database, as well as some understanding of the final use of a database.
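As a minimal example of such a profiling step (with made-up dates of birth), a simple frequency check can surface the 1 January default mentioned above.

```python
# Minimal profiling sketch (made-up dates of birth): a spike of birth dates on
# 1 January often indicates a data entry default rather than real birthdays.
import pandas as pd

dates_of_birth = pd.to_datetime(pd.Series(
    ["1980-01-01", "1992-08-19", "1975-01-01", "1963-05-30",
     "1988-01-01", "2001-01-01", "1970-11-02", "1999-01-01"]
))

day_of_year_share = dates_of_birth.dt.strftime("%m-%d").value_counts(normalize=True)
print(day_of_year_share.head())
# '01-01' accounts for over 60% of these records, far above the roughly 0.3%
# expected for any single day if birthdays were spread evenly over the year.
```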
Data processing is often the most time-consuming and resource intensive step of the overall data analytics pipeline, commonly requiring substantial domain as well as data expertise [71]. In national statistical agencies, it has been reported that as much as 40% of resources are used on data processing [8]. Time and resource constraints might mean not all desired data processing can be accomplished. Manual editing and evaluation is also not possible on large and complex population databases, and therefore compromises have to be made between the data quality and the timeliness of a database made available for research [8].

(22) Data processing is always correct. No data processing is always 100% correct, especially when data cleaning is applied to free-format textual attributes such as personal names and addresses. Often there is no single correct value for a given ambiguous input value. For example, within a street address, the abbreviation 'St' can stand for either 'Street' or 'Saint' (as used in a town name like 'Saint Marys'). For data processing steps such as geocoding (assigning geographical coordinates to addresses), incorrect locations can be matched to ambiguous or incomplete addresses, resulting in certain addresses being located far away from their actual position [53]. Given that data processing commonly involves human effort, mistakes in the use and configuration of software can lead to incorrect data processing, as can software bugs or the use of different or outdated software versions. It has been reported [31] that on 2 October 2020 a total of 15,841 positive COVID cases (around 20%) in England were missed because daily cases were recorded using an old Microsoft Excel file format that allowed a maximum of 65,536 rows. Software features such as auto-completion and automatic spelling correction can furthermore lead to the wrong correction of unusual but valid values that are not available in a dictionary.

(23) Aggregated data are sufficient for research. Highly aggregated data, for example at the level of states, counties, or large geographical units, are hardly of use for scientific research aiming at causal statements. Although results based on aggregated data might seem to be interesting, the number of possible alternative explanations for the same set of facts based on aggregated data is usually so large that no definitive conclusions are possible. A major problem here is the ecological fallacy, the mistake of assuming that a relationship observed at the aggregate level also holds for individuals [32]. For example, if increased mortality rates are observed in regions where vaccination rates are high, the false conclusion would be that vaccinated people have a higher probability of dying. But the reverse might actually be true: people observing other people dying might be more willing to get vaccinated. For a review of methodological problems of research based on aggregate data in sociology see [11]. How data are aggregated depends on how aggregation functions are defined and interpreted [35]. Weekly counts can, for example, be summed Monday to Sunday or alternatively Sunday to Saturday. Data that are aggregated inconsistently, or are available at different levels of aggregation, will likely not be of use for research studies (or only after additional data processing has been conducted).
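The following minimal sketch (with made-up daily counts) shows how the same data aggregated under two different week definitions produces different weekly totals.

```python
# Minimal sketch (made-up daily counts): the same daily data aggregated with
# two different week definitions produces different weekly totals.
import pandas as pd

daily = pd.Series(
    range(1, 15),  # 14 days of made-up counts
    index=pd.date_range("2021-11-01", periods=14, freq="D"),  # 2021-11-01 is a Monday
)

weeks_mon_to_sun = daily.resample("W-SUN").sum()  # weeks running Monday to Sunday
weeks_sun_to_sat = daily.resample("W-SAT").sum()  # weeks running Sunday to Saturday

print(weeks_mon_to_sun.tolist())  # [28, 77]
print(weeks_sun_to_sat.tolist())  # [21, 70, 14]
```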
(24) Metadata are correct, complete, and up-to-date. Metadata (also known as data dictionaries) describe a database: how it has been created and populated, and how its content has been captured and processed. Metadata include aspects such as the source, ownership, and provenance of a database, licensing and access limitations, and descriptions of all attributes including their domains and any coding used. Ideally, metadata also include summary statistics about relevant data quality dimensions (including accuracy, validity, completeness, timeliness, consistency, and uniqueness) [5, 21], as well as descriptions of any data cleaning, imputation, editing, processing, transformation, aggregation, and linkage that was conducted on the source databases to obtain a given population database. Relevant documentation, including who conducted any data processing using what software, as well as a revision history of that documentation, is crucial to understanding the actual structure, content, and quality of the database at hand. Unfortunately, metadata are often not available, or they are incomplete, out of date, need to be purchased, or can only be obtained through time-consuming approval processes [50]. A lack of metadata can lead to misunderstandings during data analysis, wasted time, misreporting of results, or can make a population database altogether useless [34].

The process of linking population databases is an increasing requirement due to the need to combine data about different aspects of people's lives to facilitate data analysis that is not possible with only a single database [61]. Unless common unique entity identifiers are available in all the databases to be linked, record linkage methods have to rely upon the QID values of individuals, such as people's names, addresses, and other personal details, to find records that refer to the same person [19, 42, 45]. These QID values, however, can contain errors, can be missing, and can change over time, resulting in incorrect linkage results even when advanced linkage methods are employed [25, 26]. The challenging process of linking databases can therefore be the source of various misconceptions about the obtained linked data set.

(25) A linked data set corresponds to an actual population. Due to data quality issues and the record linkage technique(s) employed, a linked data set likely contains wrong links (Type I errors, where two records referring to two different individuals were wrongly linked) while some true links have been missed (Type II errors, where two records referring to the same person were not linked) [21, 26]. The performance of most record linkage techniques can be controlled through parameters, allowing a trade-off between these two types of errors. As a result, any linked data set is only an approximation of the actual underlying population that it is supposed to represent. By changing the parameter settings of a linkage technique it is possible to generate multiple linked data sets with different error characteristics, each providing a possible approximation of the actual underlying population of interest.

(26) A linked data set contains no duplicates. When linking databases, pairs or groups of records that refer to the same individual might not be linked correctly. One reason for this to occur is if a wrong entity identifier has been assigned to an individual, as has been reported in voter databases [67]. Another reason is if many of the QID values of an individual have changed over time, such as both their name and address details, resulting in two records that are not similar to each other.
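The following minimal sketch (using Python's standard difflib module as a stand-in for the dedicated approximate string comparators used in real linkage systems, with made-up records and a made-up threshold) illustrates both effects: a pair of records for the same person whose name and address have changed scores low, while two distinct but highly similar individuals score high, so no single threshold separates true from false links.

```python
# Minimal sketch: difflib stands in for proper approximate string comparators
# (such as Jaro-Winkler); records and the classification threshold are made up.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_similarity(rec1: dict, rec2: dict) -> float:
    """Average similarity over the compared QID attributes."""
    return sum(similarity(rec1[k], rec2[k]) for k in rec1) / len(rec1)

# The same person before and after marriage and a move:
same_person = ({"name": "Natasha Smith", "address": "12 High Street London"},
               {"name": "Tash Kennedy",  "address": "3 Mill Lane Leeds"})
# Two different people (twins) with highly similar details:
two_people = ({"name": "Ann Meyer",  "address": "5 Park Road York"},
              {"name": "Anna Meyer", "address": "5 Park Road York"})

threshold = 0.8
for label, (rec1, rec2) in [("same person, changed QIDs", same_person),
                            ("two people, similar QIDs", two_people)]:
    sim = pair_similarity(rec1, rec2)
    decision = "link" if sim >= threshold else "no link"
    print(f"{label}: similarity {sim:.2f} -> {decision}")
# Here the true pair scores lower than the two distinct twins, so no threshold
# setting avoids both the missed link (Type II) and the wrong link (Type I).
```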
Therefore, many linked data sets do contain more than one record for some individuals in a population.

(27) A linked data set is unbiased. Many linkage errors do not occur randomly [50]. Rather, the rates of these errors depend upon the characteristics of the actual QID values of individuals, which can differ across subpopulations. Examples include name structures that differ from the traditional Western standard of first, middle, and last name formats [60], or different rates of mobility (address changes) for younger versus older people. As a result, there can be structural bias in linked data sets in subpopulations defined by ethnic categories, age, or gender (for example, if women are more likely than men to change their names when they get married) [10]. Recent work has also shown that even small amounts of linkage error can result in large effects on false negative (Type II) error rates in research studies. This is especially the case with the small sample sizes that can occur for the rare effects that researchers often seek to identify via record linkage from large population databases [78]. Such errors can even render research studies that are based on a small linked subpopulation useless [8]. If the aim of a study is to analyse certain subpopulations, or to compare, for example, health aspects between subpopulations, then a careful assessment of the potential bias introduced via record linkage is of crucial importance.

(28) Attribute values in linked records are correct. Once records have been linked across databases, they need to be fused (merged) to generate one record per entity in a population. This process of data fusion often requires decisions to be made about how to resolve inconsistencies or impute missing values [9]. Even if the links made between records are correct, these decisions can introduce further errors into both QID values and microdata. For example, assume three records that refer to the same person have been linked, each recording a different salary value. Should the average, median, minimum, maximum, or the most recent salary value be used for the fused record of this individual? How data fusion is conducted needs to be discussed with the researchers who will be analysing a linked data set because, depending upon the fusion operation applied, substantially different outcomes can be obtained.

(29) Linkage error rates are independent of database size. Because the QID values used for linking can be shared by multiple individuals, potentially by thousands in the case of city and town names and popular first and last names, as the databases being linked become larger the number of highly similar record pairs increases, and correctly classifying them becomes more challenging. Generalising linkage quality results obtained on small data sets in published studies to much larger, population-sized real-world databases can therefore be dangerous.

(30) Record linkage techniques are suitable for databases of any size. Many researchers who develop record linkage methods and algorithms, especially in the computer science and statistical domains, do not have access to large real-world databases due to the sensitive nature of population data [21]. As a result, novel linkage methods are often developed and evaluated on small publicly available benchmark data sets. Error rates for linkages obtained on such data sets can provide evidence of the superiority of a novel technique over previous methods.
However, there is no guarantee that such a new technique will produce comparably high quality linkage results on larger databases.

(31) Databases reflect the conditions of people at the same time. Data updates on individuals often occur at different points in time, usually when an event such as a medical condition occurs, or when a data error is detected during a triggered data transaction such as a payment. In the German Social Security database, for example, education is entered at first data entry for a given person and not regularly updated. Therefore, highly trained professionals might have a record stating a low educational level because they were still pupils at the time of their first paid job. Assuming the QID values of all records in the databases being linked are up-to-date might therefore not be correct, and outdated information can lead to wrong linkage results [21]. Data corrections and updates can furthermore occur when wrong historical data are discovered and errors are rectified [13]. Unless it is possible to re-conduct a linkage, which is unlikely for many research studies due to the efforts and costs involved in such a process, a linked data set might contain errors which have influenced the conclusions of the original study.

As we have shown, the much hyped promise of Big Data being the new oil [48, 80] requires careful consideration when personal data at the level of populations are used for research studies or for decision making in governments and businesses. We have discussed various misconceptions that are intrinsic to personal data, and that can occur due to the way such data are sampled, measured, and captured before being processed and potentially linked with other databases. As the use of population data based on administrative or operational databases is increasing in many domains [39], researchers will potentially have less and less control over the quality of the data they are using for their studies and over any processing done on these data [15]. They will likely also have only limited information about the provenance and other metadata that are crucial to fully understand the characteristics and quality of their data. Furthermore, because population data are commonly sourced from organisations other than the one where they are being analysed [8], these limitations are unlikely to improve, and as a result population data might not be fit for the purpose of a research study.

There are no (simple) technical solutions to detect and correct many of the misconceptions we have described here. What is required is heightened awareness by anybody who works with population data. With this in mind, we now provide a set of recommendations which we hope will help researchers and practitioners who are working with population data to recognise and overcome misconceptions such as the ones we have discussed in this paper.

• If at all possible, data scientists as well as researchers should aim to get involved in the collection, processing, and linking of any data they plan to use for their research studies. This involves discussions with data owners about what data to collect and in what format, how to ensure the high quality of these data, and how to ensure that adequate metadata are generated.
(31) Databases reflect the conditions of people at the same time. Updates to the data about an individual often occur at different points in time, usually when an event such as a medical condition occurs, or when a data error is detected during a triggered transaction such as a payment. In the German Social Security database, for example, education is recorded at the first data entry for a given person and is not regularly updated. Therefore, highly trained professionals might have a record stating a low educational level because they were still pupils at the time of their first paid job. Assuming that the QID values of all records in the databases being linked are up-to-date might therefore not be correct, and outdated information can lead to wrong linkage results [21]. Data corrections and updates can furthermore occur when wrong historical data are discovered and errors are rectified [13]. Unless it is possible to re-conduct a linkage, which is unlikely for many research studies due to the effort and costs involved in such a process, a linked data set might contain errors that have influenced the conclusions of the original study.
As we have shown, the much-hyped promise of Big Data being the new oil [48, 80] requires careful consideration when personal data at the level of populations are used for research studies or for decision making in governments and businesses. We have discussed various misconceptions that are intrinsic to personal data, and that can occur due to the way such data are sampled, measured, and captured before being processed and potentially linked with other databases. As the use of population data based on administrative or operational databases is increasing in many domains [39], researchers will potentially have less and less control over the quality of the data they are using for their studies and over any processing done on these data [15]. They will also likely have only limited information about the provenance and other metadata that are crucial to fully understand the characteristics and quality of their data. Furthermore, because population data are commonly sourced from organisations other than the one where they are being analysed [8], these limitations are unlikely to improve, and as a result population data might not be fit for the purpose of a research study. There are no (simple) technical solutions to detect and correct many of the misconceptions we have described here. What is required is heightened awareness by anybody who works with population data. With this in mind, we now provide a set of recommendations which we hope will help researchers and practitioners who work with population data to recognise and overcome misconceptions such as the ones we have discussed in this paper.
• If at all possible, data scientists as well as researchers should aim to get involved in the collection, processing, and linking of any data they plan to use for their research studies. This involves discussions with data owners about what data to collect and in what format, how to ensure high quality of these data, and how to ensure that adequate metadata are generated.
• While extensive statistical methodologies for dealing with uncertainties in surveys have been developed over decades, there is a lack of correspondingly rigorous methods that can be employed on large-scale population databases to deal with bias and quality issues in data that have not primarily been collected for research purposes. The Big Data paradox [64], the illusion that large databases automatically mean valid results, requires new statistical techniques to be developed. Similarly, while certain data quality issues can be identified (and potentially corrected) automatically [52], novel data exploration methods need to be developed to identify more subtle data issues where traditional methods have failed. Advanced machine learning methods such as clustering or deep learning might provide insights into this challenge.
• A crucial aspect is to have detailed metadata about a population database, including how it was sampled, measured, and captured, and any processing applied to it. All relevant data definitions need to be described. Detailed data profiling and exploration should be conducted by researchers before a population database is analysed, in order to identify any unexpected characteristics in the data (a minimal profiling sketch is given after this list of recommendations). Information about all sources and types of uncertainty in a population database should be included in its metadata [39].
• Existing guidelines and checklists, such as RECORD (REporting of studies Conducted using Observational Routinely-collected health Data) [6] and GUILD (GUidance for Information about Linking Data sets) [37], should be applied and adapted to other research domains. Frameworks such as the Total Survey Error (TSE) and the Big Data Total Error (BDTE) [8] can be adapted for population data to better characterise errors in such data. Furthermore, data management principles such as FAIR (Findable, Accessible, Interoperable, Reusable) [84] should be adhered to, although in some situations the sensitive nature of personal data might limit or even prevent such principles from being applied.
• Data scientists and IT personnel who process and link population data need to work in close collaboration with the researchers who will conduct the actual analysis of the data they obtain. Forming multi-disciplinary teams with members skilled in data science, statistics, domain expertise, as well as the 'business' aspects of research [50], is crucial for successful projects that rely upon population data. Data science (or inference more generally) is often viewed as a (linear) pipeline of individual steps (data acquisition, processing, integration, and analysis), where each step might be conducted by a different team of experts, potentially located in different departments within the same organisation, or even in different organisations. The danger with data being passed from team to team without proper interaction is that data issues encountered along the way are not passed along with the data. Close interaction between data and domain experts means that the data science pipeline becomes an iterative endeavour where data might have to be reacquired, reprocessed, and relinked until they are fit for analysis.
• Cross-disciplinary training should be aimed at improving complementary skills [50]. Having data scientists with expertise in statistics and computer science who also have domain-specific expertise will be of high value in any project involving population data. It is equally crucial for any scientist, no matter their domain, to understand how modern data processing, record linkage, and data analytics methods work, and how these methods might introduce bias and errors into the data they use for their research studies. Training in data wrangling methods as well as in data quality issues should be part of any degree that deals with data, including statistics, quantitative social science, computer science, and public health.
• While transparency is crucial for the progress of science [15, 84], the nature of population data - being personal data about a large number of individuals - can prevent such data (at least in their original sensitive form) from being made publicly available. As a result, it can be difficult for other researchers to fully understand how conclusions were drawn in a study and whether the data used were appropriate for that study. Guidelines and principles such as FAIR [84] might be difficult to comply with in situations where sensitive personal data are being used [21]. In such situations, at least the metadata and any software used in the research study should be made publicly available in an online open repository.
• The lack of publications that describe the practical challenges of dealing with population data contributes to the misconceptions we have described in this paper. We therefore encourage increased publication of data issues and the sharing of experiences with the scientific community about lessons learnt as well as best-practice approaches being implemented in organisations and research studies.
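As a starting point for the data profiling and exploration recommended above, the following minimal Python sketch (assuming the pandas library and a hypothetical file 'population_extract.csv' with illustrative column names) computes missingness rates, cardinalities, and simple frequency tables; real profiling of a population database would go considerably further.

# A minimal data profiling sketch; the aim is to surface unexpected
# characteristics before any analysis, not to clean the data.
import pandas as pd

df = pd.read_csv("population_extract.csv", dtype=str)   # keep raw values as strings

profile = pd.DataFrame({
    "missing_rate": df.isna().mean(),              # share of missing values per column
    "distinct_values": df.nunique(dropna=True),    # cardinality per column
    "most_frequent": df.mode(dropna=True).iloc[0], # a first hint at default or placeholder values
})
print(profile)

# Frequency tables can reveal placeholder values (such as a default birth date)
# or implausible spikes that the available metadata may not document.
for column in ("date_of_birth", "postcode"):       # hypothetical QID columns
    print(df[column].value_counts(dropna=False).head(10))

Even such simple summaries can surface placeholder values, duplicated identifiers, or implausible frequency spikes that the available metadata do not document, and they can be shared as part of a study's documentation without releasing the sensitive data themselves.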
We have discussed some aspects of modern scientific data processes that are rarely considered when population data are used for research studies or for decision making. Since good data management is a key aspect of good science [84], it is important for researchers who use population data to be aware of the underlying assumptions concerning this kind of data in order to prevent misleading conclusions and poor real-world decisions. Population data are an important resource for scientific research and evidence-based policies, but misconceptions about this type of data are prevalent. We hope the exposition given here will help researchers in the design and analysis of research projects using population data, making such data the new oil of the Big Data era.
References
The origins of personal data and its implications for governance. Available at SSRN 2510927
Privacy and human behavior in the age of information
Beyond prediction: Using big data for policy problems
Challenges in reported covid-19 data: best practices and recommendations for future epidemics
Data and Information Quality. Data-Centric Systems and Applications
The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement
Germany's vaccination rate could be higher than previously thought
Errors and inference
Data fusion
Data linkage: a powerful research tool with potential problems
Macrocomparative research methods
The Promise and Peril of Big Data
Unrepresentative big surveys significantly overestimated US vaccine uptake
Beyond the bubble that is robodebt: How governments that lose integrity threaten democracy
Issues with data and analyses: Errors, underlying themes, and potential solutions
Obfuscation: A User's Guide for Privacy and Protest
Personal data: Thinking inside the box
Terrorism Informatics: Knowledge Management and Data Mining for Homeland Security
Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection
Automatic discovery of abnormal values in large textual databases
Linking Sensitive Data
Flexible and extensible generation and corruption of personal data
The role of administrative data in the big data revolution in social science research
The Digital Divide
Magellan: toward building ecosystems of entity matching solutions
Reflections on modern methods: linkage error bias
Statistical Confidentiality: Principles and Practice
Economics in the age of big data
The Anonymisation Decision-making Framework 2nd Edition: European Practitioners' Guide
The end of privacy
Measuring the scientific effectiveness of contact tracing: Evidence from a natural experiment
Statistics of ecological fallacy
Big Data and Social Science
The challenges of data usage for the united states' covid-19 response
De-identification of personal information
Police Violence US Subnational Collaborators and others. Fatal police violence by race and state in the usa, 1980-2019: a network meta-regression
GUILD: Guidance for information about linking data sets
Classifier technology and the illusion of progress
Statistical challenges of administrative and transaction data
Dark Data: Why What You Don't Know Matters
Challenges in administrative data linkage for research. Big Data and Society
Methodological Developments in Data Linkage
Evaluating bias due to data linkage error in electronic healthcare records
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Data Quality and Record Linkage Techniques
Facebook algorithms and personal data
Data, privacy, and the greater good
Data is the new oil
User data privacy: Facebook, Cambridge Analytica, and privacy protection
Routinely collected data as a strategic resource for research: priorities for methods and workforce
Digital trace data: Modes of data collection, applications, and errors at a glance
A taxonomy of dirty data
Geocoding error, spatial uncertainty, and implications for exposure assessment and environmental epidemiology
Public attitudes towards algorithmic personalization and use of personal data online: evidence from germany, great britain, and the united states
The war over the value of personal data
Techniques for automatically correcting words in text
Evaluating health outcomes of criminal justice populations using record linkage: the importance of aliases
Journey to Data Quality
Statistical Analysis with Missing Data
Ethnic bias in data linkage
A position statement on population data science: The science of data about people
Falsehoods programmers believe about names. Kalzumeus Software
The Australian Pharmaceutical Benefits Scheme data collection: a practical guide for researchers
Statistical paradises and paradoxes in big data (i): Law of large populations, big data paradox, and the 2016 US presidential election
Myths and fallacies of personally identifiable information
Automatic linkage of vital records
Generating realistic test datasets for duplicate detection at scale using historical voter data
Data science and its relationship to big data and data-driven decision making
Data Science for Business: What you need to know about Data Mining and Data-Analytic Thinking
Data cleaning: Problems and current approaches
Principles of Data Wrangling: Practical Techniques for Data Preparation
Three pitfalls to avoid in machine learning
Developmental impacts of the covid-19 pandemic on young children: A conceptual model for research with integrated administrative data systems
About mortality data for East Germany
Predictive Analytics: The Power to Predict who Will Click
Assessing the quality of administrative data for research: a framework from the manitoba centre for health policy
The report of the international workshop on using multi-level data from sample frames, auxiliary databases, paradata and related sources to detect and adjust for nonresponse bias in surveys
Dude, where's my treatment effect? errors in administrative data linking and the destruction of statistical power in randomized experiments
Residency testing. Estimating the true population size of Estonia
Data is the new oil of the digital economy
Toward a complete data valuation process. challenges of personal data
Medication errors: prescribing faults and prescription errors
How precision medicine and screening with big data could increase overdiagnosis
The FAIR guiding principles for scientific data management and stewardship
Acknowledgements
We would like to thank S. Redlich and J. Reinhold for their comments. P. Christen would like to acknowledge the support of the University of Leipzig and ScaDS.AI, Germany, where parts of this work were conducted while he was funded by the Leibniz Visiting Professorship; and the support of the Scottish Centre for Administrative Data Research (SCADR), Edinburgh.