key: cord-0295818-mljmvsyq
authors: Mawji, A.; Longstaff, H.; Trawin, J.; Dunsmuir, D.; Komugisha, C.; Novakowski, S. K.; Wiens, M. O.; Akech, S.; Tagoola, A.; Kissoon, N.; Ansermino, M. M.
title: A proposed de-identification framework for a cohort of children presenting at a health facility in Uganda
date: 2022-04-02
journal: nan
DOI: 10.1101/2022.03.29.22273138
sha: d4f773b950c05b4822c6c4b2af21eb403a0e9cc9
doc_id: 295818
cord_uid: mljmvsyq

Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains reluctance to openly share raw datasets, in part due to concerns regarding research participant confidentiality and privacy. Statistical data de-identification is an approach that can be used to preserve privacy and facilitate open data sharing. We have proposed a standardized framework for the de-identification of data generated from cohort studies in children in a low- and middle-income country. Variables were labeled as direct and quasi-identifiers based on the conditions of replicability, distinguishability, and knowability, with consensus from two independent evaluators. Direct identifiers were removed from the dataset, while a statistical risk-based de-identification approach using the k-anonymity model was applied to quasi-identifiers. A qualitative assessment of the level of privacy invasion associated with dataset disclosure was used to determine an acceptable re-identification risk threshold and the corresponding k-anonymity requirement. A de-identification model using generalization, followed by suppression, was applied in a logical stepwise fashion to achieve k-anonymity. The utility of the de-identified data was demonstrated using a typical clinical regression example. The de-identified dataset was published on the Pediatric Sepsis Data CoLaboratory Dataverse, which provides moderated data access. Researchers are faced with many challenges when providing access to clinical data. We provide a standardized de-identification framework that can be adapted and refined based on specific contexts and risks. This process will be combined with moderated access to foster coordination and collaboration in the clinical research community.

De-identification of the Smart Triage dataset followed a six-step framework (Fig 1). Functional definitions of key terms used in the framework are outlined in the S1 Appendix.

Informed consent was provided by a parent or guardian prior to enrollment. Assent was required from children above eight years of age. Consent was obtained to make de-identified data available to other researchers. There were 241 health-related variables, including clinical signs and symptoms and anthropometric and sociodemographic information, collected from 1764 participants (S2 Appendix). This dataset was generated to inform the development of a rapid triage model for children presenting to the emergency department at health facilities in low- and middle-income countries.

Step 1: Select data release model
A de-identified dataset may be released publicly, semi-publicly, or non-publicly.14
The release model plays an important role in determining the amount of de-identification required, as certain models offer more privacy protection than others.

In a public data release, the dataset is available for anyone to download or use without any conditions.14 This model provides the greatest availability and the least protection. On the other hand, a non-public data release limits dataset availability to a select number of recipients. As a condition of receiving the data, recipients must agree to terms and conditions regarding the privacy and security of the data. This model provides the least availability and the highest protection. In a semi-public data release, the dataset is available to anyone for download; however, as a condition of receiving the data, the recipient must register with the organization releasing the dataset and agree to restrictions regarding the processing and sharing of the data.

We selected a semi-public release for the de-identified Smart Triage dataset, published on the Pediatric Sepsis Data CoLaboratory (Sepsis CoLab) Dataverse, a platform that allows international data sharing among members with built-in access control.19 Data collaborators must register with the CoLab and sign a memorandum of understanding, submit a project proposal detailing what they plan to do with the data, and sign a terms-of-use agreement.

Step 2: Classify the variables
The second step involves determining which of the collected variables in the dataset contain identifying information. Classification of variables as identifiers was established after consideration of three conditions: replicability, distinguishability, and knowability (Table 1).20 Variables that fulfilled at least one of the three conditions were further classified as direct or quasi-identifiers using a decision tree (Fig 2). Variables that uniquely identified an individual, or that are classified by HIPAA as direct identifiers, were treated as direct identifiers.15

Table 1. Conditions used to classify variables as identifiers.
Replicability: The variable must be sufficiently stable over time so that its values occur consistently in relation to the data subject. If a field value is not replicable, it will be challenging for an adversary to use that information to re-identify an individual.
Distinguishability: The variable must have sufficient variability to distinguish among individuals in a dataset.
Knowability: An adversary must know the identifiers about the data subject in order to re-identify them. This assumes the adversary is an acquaintance of the data subject. If a variable is not knowable by an adversary, it cannot be used to launch a re-identification attack on the data.

Agreement between the two independent evaluators was assessed using Cohen's kappa coefficient.21 If the value was above 0.8, consensus was assumed, and the two investigators met and resolved the classifications on which they had disagreements. If the kappa threshold was not achieved, the process was to be repeated with the full group of investigators.
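As a minimal illustration of this agreement check, the base R sketch below computes Cohen's kappa for two hypothetical raters' classifications; the labels are invented for illustration, and the study's actual classifications are given in the appendices.

```r
# Minimal sketch: inter-rater agreement on variable classification.
# The labels below are hypothetical, not the study's actual classifications.
rater1 <- c("direct", "quasi", "quasi", "none", "quasi", "none", "direct", "quasi")
rater2 <- c("direct", "quasi", "none",  "none", "quasi", "none", "direct", "quasi")

cohens_kappa <- function(r1, r2) {
  lv  <- union(r1, r2)                           # shared label set for both raters
  tab <- table(factor(r1, levels = lv), factor(r2, levels = lv))
  n   <- sum(tab)
  po  <- sum(diag(tab)) / n                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

k <- cohens_kappa(rater1, rater2)
k > 0.8  # TRUE: resolve remaining disagreements in a consensus meeting;
         # FALSE: repeat classification with the full group of investigators
```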
Step 3: Determine the re-identification risk threshold
For a dataset to be considered de-identified, the data risk must be reduced so that it is less than or equal to the re-identification risk threshold. Determining an acceptable re-identification risk threshold required an assessment of the extent to which release of the dataset would invade an individual's privacy.14 Three factors were considered to rank the level of potential privacy invasion as low, medium, or high: (1) the sensitivity of the data (the greater the sensitivity, the greater the invasion of privacy); (2) the potential injury to patients from an inappropriate disclosure (the greater the potential for injury, the greater the invasion of privacy); and (3) the appropriateness of consent for disclosing the data (the less appropriate the consent, the greater the invasion of privacy).20 The rank of potential privacy invasion was translated to a re-identification risk threshold needed to ensure an acceptable level of risk (Table 2).14

Step 4: Quantify data risk
The k-anonymity model was used to measure data risk. Each set of records with indistinguishable quasi-identifiers is called an equivalence class, of size k. The model assumes an upper bound of 1/k on the probability of re-identification for each individual record.16,22,23 This probability applies under two conditions: (1) the adversary (an individual attempting to use the data for a nefarious purpose) knows someone in the real world and is trying to find the record that matches that individual, or (2) the adversary has selected a record in the dataset and is trying to find the identity of that person in the real world.20 A dataset is k-anonymous if each of its records cannot be distinguished from at least k-1 other records, and the k value required to consider a dataset de-identified can be derived by taking the reciprocal of the re-identification risk threshold (Table 2). In a semi-public release model, the data risk is equal to the maximum re-identification risk across all records, under the assumption that there will be a re-identification attack.14
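To make the risk calculation concrete, the sketch below (base R, with simulated data standing in for the Smart Triage quasi-identifiers; all column names are illustrative) computes equivalence class sizes, the maximum per-record re-identification risk, and the number of records violating a chosen k.

```r
# Sketch of Step 4 on simulated data; column names are illustrative only.
set.seed(1)
dat <- data.frame(
  age_months = sample(c("[0,12)", "[12,24)", "[24,36)", "[36,60)", "60+"),
                      500, replace = TRUE),
  sex        = sample(c("male", "female"), 500, replace = TRUE),
  district   = sample(paste0("district_", 1:6), 500, replace = TRUE)
)

quasi_ids <- c("age_months", "sex", "district")

# Records sharing values on all quasi-identifiers form one equivalence class.
class_key  <- interaction(dat[quasi_ids], drop = TRUE)
class_size <- ave(rep(1, nrow(dat)), class_key, FUN = sum)

max(1 / class_size)  # data risk: maximum per-record re-identification risk
ceiling(1 / 0.075)   # smallest k implied by the 0.075 threshold (14; the study used 15)
sum(class_size < 15) # number of records violating 15-anonymity
```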
Step 5: De-identify the data
All de-identification procedures were performed using R (3.5.1).24

Preparing quasi-identifiers

De-identification was conducted using the R package sdcMicro, which enables statistical disclosure control methods to be applied to the data to decrease its re-identification risk.25 A cycle that applied generalization, followed by suppression, was used to determine the optimal de-identification model (Fig 3); the numbers assigned to quasi-identifiers in Fig 3 correspond to their relative importance ranking. The first generalization hierarchy was applied to the applicable variables in a stepwise fashion, with the quasi-identifier of lowest importance ranking de-identified first. At each step, the number of records violating k-anonymity was evaluated. If k-anonymity violations remained after application of the first generalization hierarchy, suppression was applied to meet the requirement. The suppression algorithm was based on the importance ranking of the variables, to prioritize preservation of the modelling variables. Acceptable suppression limits were set at 5% for modelling variables and 10% for supplementary variables. If the number of suppressed values for a given variable exceeded its suppression limit, the process was undone and repeated using generalization hierarchy two and, if needed, hierarchy three. The model requiring the least generalization for suppressions to remain within these limits was considered optimal.
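A sketch of this cycle using sdcMicro is shown below. The data frame, hierarchy break points, and importance ranks are illustrative assumptions rather than the study's actual choices; the calls mirror the generalize, check, suppress loop described above.

```r
# Sketch of the generalize-then-suppress cycle with sdcMicro.
# All variable names, bins, and ranks below are illustrative.
library(sdcMicro)

set.seed(42)
dat <- data.frame(
  age_months     = sample(1:180, 500, replace = TRUE),
  sex            = factor(sample(c("male", "female"), 500, replace = TRUE)),
  district       = factor(sample(paste0("district_", 1:8), 500, replace = TRUE)),
  length_of_stay = sample(0:14, 500, replace = TRUE)
)

sdc <- createSdcObj(dat, keyVars = c("age_months", "sex", "district",
                                     "length_of_stay"))

# Generalization: recode age into coarser bins (one possible hierarchy).
sdc <- globalRecode(sdc, column = "age_months",
                    breaks = c(0, 12, 24, 36, 60, 181),
                    labels = c("[0,12)", "[12,24)", "[24,36)", "[36,60)", "60+"))
print(sdc, type = "kAnon")  # records still violating k-anonymity after recoding

# Suppression: local suppression to reach 15-anonymity. 'importance' ranks the
# quasi-identifiers so that higher-priority (modelling) variables lose fewer
# values; see ?localSuppression for the exact ranking convention.
sdc <- localSuppression(sdc, k = 15, importance = c(1, 2, 4, 3))

deidentified <- extractManipData(sdc)          # de-identified microdata
mean(is.na(deidentified$district))             # compare against suppression limits
```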
Step 6: Test data utility
The purpose of this dataset is to develop rapid triage models based on the need for hospital admission; assessment therefore focused on evaluating model integrity. Missing data and the distribution of response values among quasi-identifiers were compared before and after suppression. Univariate logistic regression with 10-fold cross-validation was applied to the quasi-identifiers labelled as predictor variables, on both the original and de-identified datasets. The outcome measure for modelling was defined as a positive admission outcome.

RESULTS

Data risk and re-identification risk threshold
Data were collected from children presenting with an acute illness. Health information from these children can be considered sensitive; however, the risk of potential injury from disclosure was considered low. Given the absence of electronic health records within LMIC healthcare systems, adversaries would have very limited public data sources. At Jinja Hospital in particular, patient information is hand-recorded in notebooks and stored only for admitted children, making data sharing with public registries unfeasible. Additionally, informed consent approving disclosure of de-identified health information for the purpose of data sharing was obtained prior to enrollment. In consideration of the above, the level of privacy invasion was ranked as medium, which corresponds to a re-identification risk threshold of 0.075 and a requirement of 15-anonymity (Table 2).

Many quasi-identifiers (N=24) were removed because they had too few participant responses to confer indistinguishability in the dataset (S3 Appendix; Table 3). Some variables were merged with another variable or removed because the information was captured elsewhere (N=11). For example, the date of admission recorded by the study team and the date of admission reported by the caregiver were merged into a single variable. Further, the month of birth and year of birth variables were removed, as this information was captured in the calculated age variable. Variables were also removed if they contained more than 100 unique response values with few counts each (N=7), if the information was not reliable (N=2), or if it was not relevant for data analysis (N=1). Eight remaining quasi-identifiers required further de-identification (Table 4). Generalization hierarchies were created for the four categorical quasi-identifiers (Table 5); for example, age in months was generalized into bins such as [12, 24), [24, 36), [36, 60), and 60+, with a similar hierarchy for length of stay.

Results indicated that the optimal model for de-identification of this dataset used the second generalization hierarchy, followed by suppression. After applying the second generalization hierarchy, 384 (21.9%) records remained in violation of 15-anonymity (Table 6). The suppression required to address these violations was well within the 5% limit for modelling variables; however, 'district', with 272 (15.5%) suppressions, exceeded the 10% limit for supplementary variables (Table 7). This variable continued to exceed the limit even after application of the third generalization hierarchy, which required 235 (13.4%) suppressions. Since the suppression limits for modelling variables were met using the second generalization hierarchy, it was deemed impractical to generalize variables further in exchange for a small reduction in the number of required suppressions (Table 7).

In addition, the distribution of response values for modelling variables pre- and post-suppression was within 1% (Table 8). Considerable data loss was evident for the supplementary variables.

Univariate logistic regression demonstrated that the associations between the three predictor variables and the outcome were similar pre- and post-de-identification (Table 9). In both datasets, males were significantly less likely to have a positive admission outcome than females, and the pre- and post-de-identification odds ratios had overlapping confidence intervals. Length of stay was not significantly associated with the positive admission outcome in either dataset. Finally, in both the original and de-identified datasets, increasing age was associated with a decrease in the odds of a positive admission outcome. This association was significant in the original dataset, but only for children aged three and older in the de-identified dataset, where age was transformed into categorical bins.

Table 9. Results from univariate logistic regression pre- and post-de-identification.
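The comparison behind Table 9 can be outlined in code. The sketch below uses simulated stand-ins for the original and de-identified datasets and plain univariate fits with Wald confidence intervals; the study additionally used 10-fold cross-validation, omitted here for brevity.

```r
# Sketch of the Step 6 utility check; the data are simulated stand-ins.
set.seed(7)
original <- data.frame(
  admitted = rbinom(400, 1, 0.3),
  sex      = factor(sample(c("female", "male"), 400, replace = TRUE))
)
deidentified <- original
deidentified$sex[sample(400, 12)] <- NA  # mimic a few suppressed values

fit_pre  <- glm(admitted ~ sex, family = binomial, data = original)
fit_post <- glm(admitted ~ sex, family = binomial, data = deidentified)

# Odds ratios with Wald 95% CIs; overlapping intervals pre- versus post-
# de-identification suggest the association survived the transformation.
exp(cbind(OR = coef(fit_pre),  confint.default(fit_pre)))
exp(cbind(OR = coef(fit_post), confint.default(fit_post)))
```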
DISCUSSION

We have proposed a standardized framework for the de-identification of data generated from cohort studies in children in a LMIC. In an effort to balance protection of patient privacy with data integrity, direct identifiers were removed from the dataset and a statistical risk-based de-identification model was applied to the quasi-identifiers, as recommended for mitigating the re-identification risk associated with them.14 Two widely accepted privacy models are k-anonymity and differential privacy.26 K-anonymity is used to prevent re-identification of individuals through record-linking attacks, while differential privacy provides a probabilistic guarantee that the inclusion of an individual in a dataset will not alter the outcome of a query on that dataset.27 There are many trade-offs between these techniques.28,29 In the case of record-level data release, applying differential privacy would require adding a large amount of noise to obtain a meaningful privacy guarantee, and the analytical utility of the output would consequently be poor.30,31 Thus, k-anonymity was deemed preferable for this microdata set.

Age was found to be significantly associated with the admission outcome in the original dataset, but only for children aged three and older in the de-identified dataset. This resulted from the transformation of age from a continuous variable in the original dataset to a categorical variable in the de-identified dataset. Such transformations are generally not advised, given the risk that variables lose predictive value and that associations with the modelling outcome are distorted.34 Generalization was nonetheless applied in the interest of protecting patient privacy and achieving k-anonymity with a reasonable amount of suppression. The supplementary variables were generously suppressed to maximize the integrity of the modelling variables. This was acceptable because it allowed at least some supplementary information to be retained without compromising the development of prediction models. Analyses requiring the full detail of these variables can be accommodated by contacting the investigators so that the original dataset can be used for such analyses.

The generalization of age was a particular concern. The physiology of children changes rapidly following birth, with rapid transitions occurring in the first year or two of life.35 The loss of age in months for these children is potentially limiting. This granularity was lost to generalization, which was required because of the small number of children in the dataset of the same age. The generalization of date of birth to age at the time of triage had already provided a significant degree of risk reduction that was not considered. While age is typically treated as a quasi-identifier, in a study with a duration of many months or years, converting the date of admission and date of birth to an interval provides generalization, so long as the dates themselves are suppressed, because the date of admission could fall anywhere within the study period. The same is true of other time intervals, such as length of stay.
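This point can be made concrete with a small sketch (field names are illustrative): deriving age at triage from date of birth and date of presentation, then dropping both dates, removes two replicable quasi-identifiers while leaving the exact presentation date unknowable within the study period.

```r
# Sketch: deriving age at triage and dropping the underlying dates.
# Field names are illustrative.
visits <- data.frame(
  date_of_birth = as.Date(c("2019-06-15", "2016-01-02")),
  triage_date   = as.Date(c("2021-02-03", "2021-07-19"))
)

days_per_month <- 365.25 / 12  # average month length
visits$age_months <- floor(
  as.numeric(visits$triage_date - visits$date_of_birth) / days_per_month
)

# Retain the derived interval; suppress the exact dates.
visits$date_of_birth <- NULL
visits$triage_date   <- NULL
visits
```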
Another limitation was the difficulty of assessing data utility. In a complex dataset generated to develop prediction models, where variables are of varying types and held at different levels of importance, there are no broadly accepted metrics for judging the results.36 Additionally, no existing de-identification algorithm provides both perfect privacy protection and perfect analytic utility. Selection of an acceptable re-identification risk threshold, balancing the probability of re-identification against the amount of distortion applied to the data, was therefore based on a subjective assessment of privacy risk. It is impossible to anticipate all potential unethical future uses of data that could lead to significant harms, such as discrimination or stigmatization of individuals or groups. It is for this reason that the standardized approach proposed here should be understood as one tool for managing privacy risks within a broader, holistic data governance framework.

CONCLUSION
We provide a standardized de-identification framework that can be adapted and refined based on specific contexts and risks. This process will be combined with moderated access to foster coordination and collaboration in the clinical research community.

ACKNOWLEDGEMENTS
We also thank WALIMU, the administration and staff of Jinja Regional Referral Hospital, and the participants and their caregivers.

REFERENCES
Canada. Data guidelines.
Open data.
Who Owns the Data? Open Data for Healthcare. Front Public Health.
What drives and inhibits researchers to share and use open research data? A systematic literature review to analyze factors influencing open research data adoption.
Sharing detailed research data is associated with increased citation rate. PLoS One.
Transparency of COVID-19 vaccine trials: decisions without data.
The Dataverse Project [Internet]. Open source research data repository software for your research data.
A global clinical research data sharing platform.
The what, why, and how of born-open data.
What drives academic data sharing? PLoS One.
British Columbia's Office of the Human Rights Commissioner. Disaggregated Demographic Data Collection in British Columbia: The Grandmother Perspective.
De-identification Guidelines for Structured Data.
Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
Efficient and effective pruning strategies for health data de-identification.
Smart triage: triage and management of sepsis in children using the point-of-care Pediatric Rapid Sepsis Trigger (PRST) tool. BMC Health Serv Res.
Scholars Portal Dataverse. Pediatric Sepsis Data CoLab.
Committee on Strategies for Responsible Sharing of Clinical Trial Data; Board on Health Sciences Policy. Appendix B, Concepts and Methods for De-identifying Clinical Trial Data.
A coefficient of agreement for nominal scales.
Less than five is less than ideal: replacing the "less than 5 cell size" rule with a risk-based data disclosure protocol in a public health setting.
k-Anonymity: a model for protecting privacy.
R: A Language and Environment for Statistical Computing.
Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro.
Calibrating Noise to Sensitivity in Private Data Analysis.
Disclosure metrics born from statistical evaluation of data utility. UNECE 2021: Expert Meeting on Statistical Data Confidentiality.
Information preserving regression-based tools for statistical disclosure control.
The future of statistical disclosure control.
Fool's gold: an illustrated critique of differential privacy.
Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.
Introduction to statistical disclosure control (SDC).
De-identifying a public use microdata file from the Canadian national discharge abstract database.
Dichotomizing continuous predictors in multiple regression: a bad idea.
Predictive Performance of Physiology-Based Pharmacokinetic Dose Estimates for Pediatric Trials: Evaluation With 10 Bayer Small-Molecule Compounds in Children.
Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review.