De-identification Guidance

Prepared by the Portage Network COVID-19 Working Group on behalf of the Canadian Association of Research Libraries (CARL)

Kristi Thompson (Western University)
Erin Clary (Portage)
Lucia Costanzo (University of Guelph)
Beth Knazook (Portage)
Nick Rochlin (University of British Columbia)
Felicity Tayler (University of Ottawa)
Jane Fry (Carleton University)
Chantal Ripp (University of Ottawa)
Kathy Szigeti (University of Waterloo)
Qian Zhang (University of Waterloo)
Roger Reka (University of Windsor)
Minglu Wang (York University)
Rebecca Dickson (COPPUL)
Mark Leggott (RDC-DRC)
Melanie Parlette-Stewart (Portage)

September 2020
Portage Network / Canadian Association of Research Libraries
portage@carl-abrc.ca

Table of Contents

De-identification Guidance
Identify and Remove Direct Identifiers
   How do I remove this information?
Identify and Evaluate Indirect or Quasi-Identifiers Based on Perceived Risk and Utility
   How do I figure out what combination of quasi-identifiers is a problem?
   How do I assess the sensitivity of non-identifying variables in a dataset?
Considerations for Qualitative Data De-identification
Brief Considerations for Social Media, Medical Images, and Genomics Data
   Data collected from social media or social networking platforms (e.g., Twitter, Facebook)
   Medical Images
   Genomics data and other biomedical samples
Appendix 1: Code for Checking K-Anonymity
Appendix 2: Free de-identification software packages
Appendix 3: Fee-based services for de-identification
Resources
References

De-identification Guidance

The guidance below is intended to help you minimize disclosure risk when sharing data collected from human participants. If you use any of the following techniques to anonymize your data, please include this information in your documentation and README file.[1] For transparency, it should be clear how the dataset was modified to protect study participants.
Before proceeding, please note that not all human participant data needs to be de-identified, or stripped of direct and indirect identifiers. Review your consent form and prepare your data to share only what participants have agreed to share. If you are unsure whether you need to de-identify your data, see the Portage help guide Can I Share My Data? and consult with your institution's Research Ethics Board.[2] For help selecting a repository for your data, see Portage's Recommended Repositories for COVID-19 Research Data guide, or consult with librarians at your institution to see if further support is available.[3]

[1] Learn more about creating appropriate documentation for depositing your datasets in the Portage COVID-19 Working Group's "Documentation and Supporting Materials Required for Deposit," September 25, 2020, https://doi.org/10.5281/zenodo.4042034.
[2] Portage COVID-19 Working Group, "Can I Share My Data?" September 25, 2020, https://doi.org/10.5281/zenodo.4041661.
[3] Portage COVID-19 Working Group, "Recommended Repositories," September 25, 2020, https://doi.org/10.5281/zenodo.4042037.

Identify and Remove Direct Identifiers

Direct identifiers are those that place study participants at immediate risk of being re-identified. Unless study participants gave explicit consent, direct identifiers must be removed from any published version of your dataset. The following list is based on various sources, including guidance from major international funding agencies, the US Health Insurance Portability and Accountability Act (HIPAA), and the British Medical Journal; see "Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers" and the list of 18 items considered identifiers under HIPAA.[4]

[4] Iain Hrynaszkiewicz, Melissa L. Norton, Andrew J. Vickers, and Douglas G. Altman, "Preparing Raw Clinical Data for Publication: Guidance for Journal Editors, Authors, and Peer Reviewers," BMJ 340 (January 29, 2010): c181, https://www.bmj.com/content/340/bmj.c181; and Steve Alder, "What is Considered PHI Under HIPAA Rules?" HIPAA Journal (December 28, 2017), https://www.hipaajournal.com/considered-phi-hipaa/.

Direct identifiers are:
● Names or initials, as well as names of relatives or household members
● Addresses, and small-area geographic identifiers such as postal codes / ZIP codes
● Telephone numbers
● Electronic identifiers such as web addresses, email addresses, social media handles, or IP addresses of individual computers
● Unique identifying numbers such as hospital IDs, Social Insurance Numbers, clinical trial record numbers, account numbers, and certificate or license numbers
● Exact dates relating to individually linked events, such as birth or marriage, hospital admission or discharge, or a medical procedure
● Multimedia data: unaltered photographs, audio, or video of individuals
● Biometric identifiers, including finger or voice prints and iris or retinal images
● Human genomic data, unless the risk was explained and study participants consented to share the data or to secondary use of the data
● Age information for individuals over 89 years old

How do I remove this information?

Removing direct identifiers from your data is relatively straightforward. You may either record this personal information in a separate document, spreadsheet, or database and link it to the other data points via a series of codes that can be removed before publishing, or delete the identifying data points entirely at the end of the project. Refer to your consent forms to determine how to proceed. If you are unsure whether data can simply be unlinked or must be destroyed, consult your local Research Ethics Board.
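As a minimal illustration of the split-file approach described above (our sketch, not part of the original guidance; the data frame and column names are hypothetical), the following base R code moves direct identifiers into a separate key file linked to the shareable data by a random code:

# Minimal sketch, assuming a data frame `df` whose columns `name` and
# `email` are direct identifiers; all names here are hypothetical.
df <- data.frame(
  name  = c("A. Smith", "B. Jones"),
  email = c("a@example.org", "b@example.org"),
  score = c(12, 9)
)

# Assign each participant a random, non-meaningful linking code.
set.seed(42)  # for a reproducible demo only
df$link_code <- sample(100000:999999, nrow(df))

direct_ids <- c("name", "email")

# Key file: direct identifiers plus the linking code. Store it securely
# and separately, and destroy it if that is what participants consented to.
key_file <- df[, c("link_code", direct_ids)]

# Shareable file: everything except the direct identifiers.
deid <- df[, setdiff(names(df), direct_ids)]

write.csv(key_file, "participant_key.csv", row.names = FALSE)   # keep private
write.csv(deid, "deidentified_data.csv", row.names = FALSE)     # candidate for sharing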
Identify and Evaluate Indirect or Quasi-Identifiers Based on Perceived Risk and Utility

Indirect or quasi-identifiers are characteristics (such as demographic information) relating to individuals that could be linked with other data sources to violate the confidentiality of those individuals. Quasi-identifiers may not be identifying on their own but can be disclosive in combination. For instance, identifying a participant's home community size within an overall limited geographic study area may allow someone to infer that participant's location more precisely. A variable should be considered a quasi-identifier if someone could plausibly match it to information from another source. See the International Household Survey Network's Anonymization Principles and the Information and Privacy Commissioner of Ontario's De-identification Guidelines for Structured Data.[5]

[5] "Anonymization Principles," International Household Survey Network, accessed August 4, 2020, https://ihsn.org/node/137; and Information and Privacy Commissioner of Ontario, De-identification Guidelines for Structured Data, June 8, 2016, https://www.ipc.on.ca/privacy-organizations/de-identification-centre/.

A list of potential quasi-identifiers:
● Geographic identifiers (census geography, town name, urban/rural indicator) of home, place of birth, place of treatment, place of schooling, or other geography linked to individuals
● Sex, gender identity, or sexual orientation
● Ethnic background, race, visible minority status, or Indigenous status
● Immigration status
● Membership in organizations
● Use of specific social networks or services
● Socioeconomic data, such as occupation or place of work, income, or education
● Household and family composition, marital status, and number of children / pregnancies
● Criminal records and other information that may link to public records
● Generalized dates linked to individuals, e.g., age, graduation year, immigration year
● Some full-sentence responses
   ○ Note: these must be checked individually. For instance, the comment "The library should be open longer" is not identifying; however, a comment like "As chair of a research group that uses the library, ..." is potentially identifying.
● Some medical information (e.g., permanent disabilities or rare medical conditions) may be identifying; temporary illness or injury is less likely to be. The test is whether the information can be found elsewhere and therefore could be used to re-identify the person.

How do I figure out what combination of quasi-identifiers is a problem?
1. Observe the possible combinations

A good first step is to look at the demographic variables in the dataset and imagine describing an individual to a friend using only the values of those variables. Is there any likelihood that the person would be recognizable? For example: "I'm thinking of a person living in Toronto who is female, married, has a university degree, is between the ages of 40 and 55, and has an income of between 60 and 75 thousand dollars." Even if there is only one such person in the dataset, this is likely not enough information to create risk UNLESS contextual information about the dataset narrows things down further. For instance, if your data are limited to a specific, narrow group of individuals, such as the referees for the Ontario Hockey Association, the quasi-identifiers listed above may be enough to uniquely identify an individual. Quasi-identifiers need to be evaluated in the context of what is known, or what may reasonably be inferred, about the survey population.

2. Assess these combinations mathematically

K-anonymity is a mathematical approach to demonstrating that a dataset has been anonymized, where k is an integer, selected by the researcher, giving the minimum number of records that must share the same values across all quasi-identifiers.[6] Within your dataset, a set of records with identical values across the quasi-identifiers is called an equivalence class. To achieve k-anonymity, it must not be possible to distinguish one record from the other records in its equivalence class. For example, if you choose a k value of 5, each record in your dataset must have exactly the same quasi-identifier values as at least 4 other records.

K-anonymity only precisely estimates risk if a dataset is a complete sample of some population; it considerably overestimates risk when a dataset is a subsample of a population. When determining the appropriate k value to use, consider:
● A lower k value of 3 may be sufficient for datasets that contain small samples from a large population.
● A higher (more conservative) k value should be used if a dataset is a complete sample of a population.

Keep in mind that a dataset that is a complete sample of a known population may have additional risk factors. Imagine that all the respondents in a particular equivalence class answered a question the same way: you would then know how each person in that equivalence class answered, even without knowing which line of data is theirs. Respondents to surveys are generally told that their responses will be kept confidential, not merely that no one will know which line of data contains their specific answers. A k-anonymous dataset that is a complete sample may not fulfill that promise.

The code in Appendix 1 can be used with your preferred statistical software package to create equivalence classes based on the quasi-identifiers in the dataset and to list them by size. If any equivalence class has fewer members than the value of k you selected, use the data reduction techniques below to further reduce dataset risk.

[6] Khaled El Emam and Fida Kamal Dankar, "Protecting Privacy Using k-Anonymity," Journal of the American Medical Informatics Association 15, no. 5 (September 2008): 627–637, https://doi.org/10.1197/jamia.M2716.
For more on k-anonymity, see the International Household Survey Network (IHSN)'s Measuring the Disclosure Risk, and section 2.2.2, "Guaranteed anonymisation," of the UK Anonymisation Network's Anonymisation Decision-Making Framework.[7]

3. Use data reduction techniques to address dataset risk

Univariate frequencies and bivariate crosstabs can be used to identify small[8] categories of quasi-identifiers. Once you have identified these small groups, data reduction techniques can be used to mitigate risk. There are three simple types of data reduction you may wish to consider (a short R sketch follows this list):

1. The simplest is to drop risky variables from the dataset entirely. This is an option for variables with relatively high risk that are not considered to be of high research value. (For example, in some datasets geography may be considered less important than ethnicity or language.)
2. The second is global re-coding: aggregating the observed values into a defined set of classes, such as transforming a variable holding years of age into ten-year age categories, or top-coding a high income category to "$100,000 and above."
3. A third option, for unusual cases, is local suppression. For example, a very young married respondent might have their marital status set to 'missing' as an alternative to globally re-coding the otherwise non-risky age variable into a larger group.

After each round of data reduction, repeat the test for k-anonymity described above and check the equivalence classes until every group is at least as large as your selected value of k. For more information, including more complex types of data reduction, see section 2.5, "Anonymisation solutions," of the UK Anonymisation Network's Anonymisation Decision-Making Framework.[9]

[7] "Measuring the Disclosure Risk," International Household Survey Network, accessed August 4, 2020, https://ihsn.org/anonymization-risk-measure; and Mark Elliot, Elaine Mackey, Kieron O'Hara, and Caroline Tudor, The Anonymisation Decision-Making Framework (UK Anonymisation Network (UKAN), University of Manchester, 2016), https://ukanon.net/ukan-resources/ukan-decision-making-framework/.
[8] 'Small' is relative; as a first pass, consider groups smaller than 5% of the dataset or containing fewer than 20 cases.
[9] Elliot, Mackey, O'Hara, and Tudor, The Anonymisation Decision-Making Framework.
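To make the second and third options concrete, here is a minimal base R sketch of global re-coding and local suppression (our illustration, not from the original guidance; the data frame and its columns are hypothetical):

# Minimal sketch of global re-coding and local suppression in base R;
# the data frame and column names are hypothetical.
df <- data.frame(
  age            = c(19, 23, 41, 67, 88),
  income         = c(35000, 52000, 80000, 120000, 250000),
  marital_status = c("married", "single", "married", "widowed", "married")
)

# Global re-coding: collapse exact ages into ten-year bands.
df$age_band <- cut(df$age, breaks = seq(10, 90, by = 10), right = FALSE)

# Global re-coding with top-coding: cap the highest income category.
df$income_cat <- cut(df$income,
                     breaks = c(0, 50000, 100000, Inf), right = FALSE,
                     labels = c("under $50,000", "$50,000-$99,999",
                                "$100,000 and above"))

# Local suppression: blank out a single risky value, here the marital
# status of the unusually young married respondent.
df$marital_status[df$age < 20 & df$marital_status == "married"] <- NA

# Univariate frequencies to spot any remaining small categories.
table(df$age_band)
table(df$income_cat, useNA = "ifany")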
How do I assess the sensitivity of non-identifying variables in a dataset?

Non-identifying information includes survey responses and measurements that are not likely to be recognizable as coming from specific individuals. Examples include opinions, rankings, scales, or temporary measures such as resting heart rate after meditation or the number of times an individual ate breakfast in a week. It is nonetheless possible for non-identifying information to be highly sensitive. Information that could be used to stigmatize or discriminate against an individual, such as a criminal record, sexual practices, illicit drug use, mental health and psychological well-being, and other sensitive medical information, increases the risk of the dataset and should be considered when deciding whether to release the data at all. You may wish to remove or modify these variables to create a less sensitive version of the data.

Considerations for Qualitative Data De-identification

Qualitative data describe qualities or characteristics that can be observed, but not necessarily measured. This type of data is collected through interviews, surveys, or observations, and may take the form of transcripts, notes, video and audio recordings, images, and text documents. As with quantitative data, direct identifiers may appear in the form of names, date and place of birth, other locations, and even photos. These direct identifiers can be used along with indirect or quasi-identifiers, such as medical, education, financial, and employment information, to trace or determine an individual's identity.

The process for removing identifying information from a video recording, audio interview, or oral transcript is very different from that used to de-identify a medical record. For one, it is harder to do programmatically: extremely detailed field notes or audiovisual information often require someone to read or watch the content thoroughly.

General Advice

● Avoid asking for identifying information in the first place.
   ○ It is easier to edit the information at the point of capture than to remove it after it has been recorded.
   ○ If you require identifying information at the research stage, try to capture it within the first few minutes of an interview or recording, so that it is easy to edit out quickly. Alternatively, transcribe the information in a separate document that can be removed from a person's file.
● Make de-identification a part of the process of informed consent.
   ○ Ensure that study participants are aware of your planned use of the data, and of the fact that their information may be anonymized to protect them. Make it clear in your consent forms how extensively the data will be de-identified (i.e., what elements will be replaced or removed). While direct identifiers may be eliminated (name, address, birthday, etc.), other subtle clues to participants' identities may remain within the recording or transcript.
   ○ Agree in advance with participants on which types of identifying information may be revealed in an interview. (For example, a participant may not wish to mention an employer's name.) This is easier than removing information after the fact.
   ○ Keep in mind that not all data needs to be de-identified or anonymized. In some circumstances, you may be recording deeply personal accounts and should be mindful of a participant's right to have their story told in their own words. Some participants may have a personal interest in staying identified.

De-identification Guidance

● Use pseudonyms and change identifying details to protect anonymity (see the sketch following this list).
   ○ If changing the person's name, location of residence, or occupation can be done without compromising the dataset, this can help to protect their anonymity. Be advised that this could affect the utility of the dataset, as it may alter a future researcher's perception of the interviewee's socio-economic status or behaviour.
● If necessary, remove blocks of sensitive text or edit out portions of audio-visual data.
   ○ Some portion of the research may need to be redacted. Be wary of search-and-replace techniques, as it is easy to replace the wrong piece of information.
   ○ Voices in audio recordings may need to be masked by altering pitch.
   ○ Faces in visual data may need to be pixelated.
● Restrict access.
   ○ This is not the preferred option, but some datasets will not remain useful if all identifiers are removed. It may be possible to allow researchers seeking secondary access to request that queries be performed by the original research team, who can then share results if they are non-disclosive or can be appropriately de-identified.
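As a minimal illustration of pseudonym substitution (our sketch; the transcript and the name-to-pseudonym table are hypothetical), the following base R code applies a manually reviewed mapping, matching whole words only to reduce the chance of replacing the wrong text:

# Minimal sketch of pseudonym substitution in an interview transcript;
# the transcript text and the name-to-pseudonym table are hypothetical.
transcript <- "Maria said the clinic on Elm Street was busy. Maria's sister agreed."

pseudonyms <- c(
  "Maria"      = "Participant 07",
  "Elm Street" = "[street name]"
)

# Replace whole words only (\\b marks a word boundary), so a pattern for
# "Maria" does not also rewrite a name like "Marianne".
for (real_name in names(pseudonyms)) {
  pattern <- paste0("\\b", real_name, "\\b")
  transcript <- gsub(pattern, pseudonyms[[real_name]], transcript)
}

transcript
# "Participant 07 said the clinic on [street name] was busy. Participant 07's sister agreed."

Automated substitution cannot catch indirect clues to identity, so the edited transcript still needs human review.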
For more information, see the UK Data Service's guide to anonymisation of qualitative data.[10]

[10] "Anonymisation: Qualitative Data," UK Data Service, last modified June 30, 2020, https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/qualitative.aspx.

Brief Considerations for Social Media, Medical Images, and Genomics Data

Data collected from social media or social networking platforms (e.g., Twitter, Facebook)

Although information on social networking sites may be free to access or view, it does not automatically follow that it is free to redistribute. Many platforms have terms of use that you will need to abide by, and the people who use the platform may have an expectation of privacy that must be respected. Some platforms require users to register before content is visible, and others may have terms that prohibit data collection, data scraping, or republishing content elsewhere.

Here are some questions to consider before you deposit social media data:
● Could the topic you are studying be considered sensitive?
● Could your data lead to stigmatization of, or discrimination against, the content author?
● Is the study population vulnerable?
● What expectation of privacy might the individual users of this platform have?
● Is it possible or reasonable to obtain informed consent?
● Can or should the data be anonymized?
● Do the platform's terms of use allow you to redistribute content?

For example, Twitter allows the content author to maintain control over their tweets; under Twitter's policies, only numeric Tweet IDs and User IDs should be redistributed.[11] If you have weighed the questions above and decide to deposit your dataset, the Tweets must first be 'dehydrated' (distilled down to just the Tweet ID) using a tool such as DocNow's twarc.[12] Any secondary use of the data would then require an end-user to 'rehydrate' the Tweet IDs using the Twitter REST API or an external tool such as DocNow's Hydrator.[13] Content will not be returned for tweets that have since been deleted.

[11] "Developer Terms: More About Restricted Uses of the Twitter APIs," Twitter, accessed August 4, 2020, https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases.
[12] Documenting the Now, "DocNow/twarc," GitHub, accessed August 4, 2020, https://github.com/docnow/twarc.
[13] Documenting the Now, "DocNow/hydrator," GitHub, accessed August 4, 2020, https://github.com/DocNow/hydrator.
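As a minimal sketch of what dehydration amounts to (our illustration; DocNow's twarc performs this directly on the raw JSON, and the column name below is a hypothetical placeholder):

# Minimal sketch of 'dehydration' in R, assuming collected tweets have been
# flattened to a data frame with a hypothetical `id_str` column holding the
# numeric Tweet ID as text.
tweets <- data.frame(
  id_str = c("1278347293471234048", "1278348000123456789"),
  text   = c("example tweet one", "example tweet two"),
  stringsAsFactors = FALSE
)

# Keep Tweet IDs only; the text and user fields are not deposited.
tweet_ids <- tweets["id_str"]

# Write one ID per line with no header, the plain-text format that
# hydration tools such as DocNow's Hydrator typically expect.
write.table(tweet_ids, "tweet_ids.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE)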
The following resources provide more in-depth guidance:
● Zeffiro and Brodeur, Social Media Research Data Ethics and Management (slides from a workshop presented at McMaster University).[14]
● Ryerson University Research Ethics Board's Guidelines for Research Involving Social Media.[15]
● Mannheimer and Hull, Sharing Selves: Developing an Ethical Framework for Curating Social Media Data.[16]
● North Carolina State University's Social Media Archives Toolkit, which contains guidance on the legal and ethical implications of sharing social media data, and an annotated bibliography with further resources.[17]

[14] Andrea Zeffiro and Jay Brodeur, "Social Media Research Data Ethics and Management," workshop presented April 5, 2018, Sherman Centre for Digital Scholarship, McMaster University, http://hdl.handle.net/11375/25327.
[15] Ryerson University Research Ethics Board, "Guidelines for Research Involving Social Media," Ryerson University, November 2017, https://www.ryerson.ca/content/dam/research/documents/ethics/guidelines-for-research-involving-social-media.pdf.
[16] Sara Mannheimer and Elizabeth Hull, "Sharing Selves: Developing an Ethical Framework for Curating Social Media Data," International Journal of Digital Curation 12, no. 2 (April 18, 2018), https://doi.org/10.2218/ijdc.v12i2.518.
[17] "Social Media Archives Toolkit," North Carolina State University Libraries, accessed August 4, 2020, https://www.lib.ncsu.edu/social-media-archives-toolkit.

Medical Images

Before you archive medical images, remove any direct identifiers you do not have explicit consent to share, such as names, patient IDs, and exact dates, from the image header or embedded metadata, and black out any pixels in the image that contain identifying information. Neuroimages must also be defaced using a tool such as PyDeface.[18]

[18] Some repositories may be able to assist you or recommend tools for defacing. For example, the International Neuroimaging Data-Sharing Initiative (INDI) can help researchers who plan to share their data on the INDI platform; see the INDI Data Contribution Guide, accessed August 31, 2020, http://fcon_1000.projects.nitrc.org/indi/indi_data_contribution_guide.pdf. See also Omer Faruk Gulban, Dylan Nielson, Russ Poldrack, John Lee, Chris Gorgolewski, Vanessasaurus, and Satrajit Ghosh, "Poldracklab/pydeface: V2.0.0," October 31, 2019, http://doi.org/10.5281/zenodo.3524401.

The following resources provide more guidance for de-identifying DICOM files:
● The Cancer Imaging Archive (TCIA) De-identification Overview.[19]
   ○ See specifically "Table 1 - DICOM Tags Modified or Removed at the source site" for a list of DICOM tags deemed to be unsafe.
● The Radiological Society of North America (RSNA) International COVID-19 Open Radiology Database (RICORD) De-identification Protocol.[20]
● The DICOM standard itself provides important guidance for de-identifying header information. Specifically, DICOM Part 15: Security and System Management Profiles, Appendix E: Attribute Confidentiality Profiles may be useful.[21]
[19] Justin Kirby, "Submission and De-identification Overview," The Cancer Imaging Archive (TCIA), University of Arkansas for Medical Sciences, April 27, 2020, https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview.
[20] "RSNA International COVID-19 Open Radiology Database (RICORD) De-identification Protocol," Radiological Society of North America, accessed August 10, 2020, https://www.rsna.org/-/media/Files/RSNA/Covid-19/RICORD/RSNA-Covid-19-Deidentification-Protocol.pdf.
[21] Medical Imaging & Technology Alliance, DICOM Standards Committee, "DICOM Part 15: Security and System Management Profiles," DICOM Standard (Arlington, VA: National Electrical Manufacturers Association), accessed August 4, 2020, https://www.dicomstandard.org/current/.

   ○ These profiles attempt to balance the need to protect privacy with the need to retain information so the data remain useful.
   ○ If it is necessary to retain identifiers, your REB application should ideally have referenced the profile you intend to use, and your consent form should clearly state what information will be shared.

De-identification of DICOM files may be done programmatically, using software to strip identifiers from the header.
● TCIA recommends the Clinical Trial Processor (CTP) software developed by RSNA.[22]
● RSNA's COVID-19 Open Radiology Database (RICORD) recommends another RSNA program called Anonymizer, and has published instructions on how to install and use it.[23] Anonymizer implements RICORD's de-identification protocol.[24]
● There are many other non-commercial options available, such as the DicomCleaner™ tool.[25]
● As with all de-identification software, results may vary, and you should confirm that identifying information has been removed before you share your images. Note that:
   ○ Vendors or end-users may not always have used DICOM elements in a way that conforms to the standard.
   ○ Private elements or private tags may have been used to store personal information, and the use of these tags may not be well defined in the vendor documentation.

[22] "Clinical Trial Processor (CTP)," Radiological Society of North America, Medical Imaging Resource Community (MIRC), accessed August 4, 2020, https://www.rsna.org/research/imaging-research-tools.
[23] "RSNA COVID-19 DICOM Data Anonymizer," Radiological Society of North America, International COVID-19 Open Radiology Database, accessed August 10, 2020, https://www.rsna.org/-/media/Files/RSNA/Covid-19/RICORD/RSNA-Anonymizer-Program-Instructions.pdf.
[24] "RSNA International COVID-19 Open Radiology Database (RICORD) De-identification Protocol," Radiological Society of North America, International COVID-19 Open Radiology Database.
[25] David A. Clunie, "DicomCleaner™," PixelMed Publishing™, accessed July 16, 2020, http://www.dclunie.com/pixelmed/software/webstart/DicomCleanerUsage.html.

Genomics data and other biomedical samples

Because each person's DNA sequence is unique, human biological materials can never be truly anonymous. Before you archive or biobank these data, review your consent form. Ideally, the consent process will have:
● provided participants with information about how their data will be used, analyzed, stored, and shared;
● identified what information will be stored alongside the data;
● communicated what level of privacy or confidentiality a participant may expect, and who may have access to the data;
● indicated whether the data or samples will be stored inside or outside of Canada;
● acknowledged whether there is a possibility that the data will be used for commercial purposes; and
● clearly explained the risks of disclosure.

Further information is available in TCPS 2 (2018), Chapter 12: Human Biological Materials Including Materials Related to Human Reproduction (sections A and D specifically), and Chapter 13: Human Genetic Research.[26] See also Thorogood (2018), "Canada: will privacy rules continue to favour open science?"[27]

The NIH Privacy in Genomics webpage provides a concise overview of some of the benefits and risks of sharing genetic information.[28]
For an example of how genetic information was used to identify study participants, see "Identifying Personal Genomes by Surname Inference," or a summary of the study in the 2013 Nature editorial on genetic privacy.[29] For further information on ethics and consent in genomics, see the Global Alliance for Genomics and Health's Regulatory & Ethics Toolkit resources, such as the Data Privacy and Security Policy and the Consent Policy.[30]

[26] Government of Canada (Canadian Institutes of Health Research, the Natural Sciences and Engineering Research Council of Canada, and the Social Sciences and Humanities Research Council), Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans, December 2018, https://ethics.gc.ca/eng/policy-politique_tcps2-eptc2_2018.html.
[27] Adrian Thorogood, "Canada: Will Privacy Rules Continue to Favour Open Science?" Human Genetics 137 (July 16, 2018): 595–602, https://doi.org/10.1007/s00439-018-1922-z.
[28] "Privacy in Genomics," National Human Genome Research Institute, February 24, 2020, https://www.genome.gov/about-genomics/policy-issues/Privacy.
[29] Melissa Gymrek, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich, "Identifying Personal Genomes by Surname Inference," Science 339, no. 6117 (January 18, 2013): 321–324, https://doi.org/10.1126/science.1229566; and "Genetic privacy" [editorial], Nature 493 (January 24, 2013): 451, https://doi.org/10.1038/493451a.
[30] Global Alliance for Genomics & Health, Genomic Toolkit: Regulatory & Ethics Toolkit (Toronto, ON: Global Alliance for Genomics and Health), accessed July 20, 2020, https://www.ga4gh.org/genomic-data-toolkit/regulatory-ethics-toolkit/.

Appendix 1: Code for Checking K-Anonymity

-- Stata --

* Stata code for checking k-anonymity
* Kristi Thompson, May 2020

* Create the equivalence groups.
egen equivalence_group = group(var1 var2 var3 var4 var5)

* Create a variable counting the cases in each equivalence group.
sort equivalence_group
by equivalence_group: gen equivalence_size = _N

* List the ID numbers of equivalence groups containing fewer than 3 cases
* (i.e., groups that violate k-anonymity for k = 3).
tab equivalence_group if equivalence_size < 3, sort

* List the values of the quasi-identifiers for each small equivalence class.
list var1 var2 var3 var4 var5 if equivalence_group == 1

-- R --

# R code for checking k-anonymity
# Carolyn Sullivan and Kristi Thompson, May 2020

# Install plyr, a useful data manipulation package.
install.packages("plyr")

# Load the library.
library('plyr')

# Point to the data file (CSV format).
datafile <- " location of the data file - csv format - "

# Read the csv file.
df <- read.csv(datafile)

# Figure out what equivalence classes there are, and how many cases
# fall in each equivalence class.
dfunique <- ddply(df, .(var1, var2, var3, var4, var5), nrow)
dfunique <- dfunique[order(dfunique$V1), ]
View(dfunique)

Appendix B of the UK Anonymisation Network's Anonymisation Decision-Making Framework has code for doing this in SPSS.[31]

[31] Elliot, Mackey, O'Hara, and Tudor, The Anonymisation Decision-Making Framework.

Appendix 2: Free de-identification software packages

Many of these tools take a hierarchical approach to de-identifying data: you pre-define possible generalizations for the quasi-identifiers in the dataset, and the program searches for possible solutions and recommends a set of generalizations that best meets your anonymization goals. For datasets with a large number of quasi-identifiers, or cases where several datasets with similar quasi-identifiers need to be de-identified, this can be a useful approach. For smaller datasets, it may be more straightforward to work in a statistical package. The software packages included here all have some usability issues and fairly steep learning curves; Amnesia and the graphical user interface to sdcMicro may be the most user-friendly.

Recommended tools:

● Amnesia
   ○ This software has both online and desktop versions; however, uploading sensitive data to a third-party website is not generally recommended. If possible, install the software locally (Windows or Linux only).
   ○ Amnesia supports k-anonymity and kᵐ-anonymity (a slightly more flexible approach to anonymity when the number of quasi-identifiers in a dataset is very high, as it allows combinations of up to m quasi-identifiers to appear at least k times in the published data).
   ○ A few limitations: there is currently no way to specify missing values, and the documentation could be more thorough; for instance, defining hierarchies is not straightforward.
   ○ This software may work best for clinical data, or other non-survey data.

● sdcMicro
   ○ An R package for statistical disclosure control (microdata anonymization). It can read many data types (e.g., csv, sav, dta, sas7bdat, xlsx) and runs on Windows, Linux, and Mac operating systems. Implements mu-Argus code.
   ○ A graphical user interface is available, and a vignette with guidance called 'Using the interactive GUI - sdcApp' is linked from the sdcMicro landing page in the CRAN repository.[32]
   ○ Please be aware that large datasets take time to load, and computation time for large or complex datasets may be lengthy.[33]
   ○ The Statistical Disclosure Control for Microdata practice guide's section on SDC with sdcMicro in R may be helpful if you need further guidance installing and using the sdcMicro package; see also Benschop's sdcMicro GUI manual documentation.[34] A minimal sketch of the package's basic workflow follows this list.
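A minimal sketch of a basic sdcMicro workflow in R (our illustration under assumed data; the file and variable names are hypothetical placeholders, and you should confirm the details against the package documentation):

# Minimal sdcMicro sketch; `survey.csv` and the column names are hypothetical.
library(sdcMicro)

df <- read.csv("survey.csv", stringsAsFactors = TRUE)

# Declare the quasi-identifiers (key variables) for risk assessment.
sdc <- createSdcObj(dat = df,
                    keyVars = c("age_band", "sex", "region", "education"))

# Inspect the risk summary, including counts of k-anonymity violations.
print(sdc)

# Apply local suppression until every equivalence class has at least k = 3 records.
sdc <- localSuppression(sdc, k = 3)

# Extract the anonymized data for further checks and sharing.
deid <- extractManipData(sdc)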
[32] "Using the interactive GUI - sdcApp," The Comprehensive R Archive Network (CRAN), accessed August 31, 2020, https://cran.r-project.org/web/packages/sdcMicro/vignettes/sdcMicro.html.
[33] "Computation time," in SDC with sdcMicro in R: Setting Up Your Data and more, SDC Practice Guide, 2019, https://sdcpractice.readthedocs.io/en/latest/sdcMicro.html#computation-time.
[34] "Statistical Disclosure Control (SDC): An Introduction," SDC Practice Guide, 2019, https://sdcpractice.readthedocs.io/en/latest/SDC_intro.html; and Thijs Benschop and Matthew Welch, Statistical Disclosure Control for Microdata: A Practice Guide for sdcMicro, International Household Survey Network, accessed August 10, 2020, https://sdcpractice.readthedocs.io/en/latest/index.html.

Other tools that may be useful:

● ARX
   ○ Open-source anonymization tool for Windows, Linux, and Mac. Provides support for SQL databases, xlsx, and csv files, and has a graphical user interface.
   ○ Supports various privacy models, including k-anonymity and the variants ℓ-diversity, t-closeness, β-likeness, and more.[35]
   ○ Allows end-users to categorize, top- and bottom-code, generalize, and transform data in more complex ways.
   ○ Large datasets take time to load, and computation time for large or complex datasets may be lengthy.

● mu-Argus
   ○ Software for applying statistical disclosure control techniques. The program takes a hierarchical approach to de-identifying data.
   ○ The JAR file should be executable in Windows or Mac OS.
   ○ A tester found that getting data loaded and correctly defined was a bit of a challenge, and advised that the program could use better documentation on setting up hierarchies.

● The University of Texas at Dallas Anonymization Toolbox
   ○ The toolbox currently supports 6 anonymization methods and 3 privacy definitions, including k-anonymity, ℓ-diversity, and t-closeness.
   ○ Algorithms can either be applied directly to a dataset or be used as library functions inside other applications.
   ○ This is a set of Java routines; data curators who prefer to do their statistical programming in Java may find it useful.

[35] "Privacy Models," ARX - Data Anonymization Tool, accessed August 31, 2020, https://arx.deidentifier.org/overview/privacy-criteria/.

Appendix 3: Fee-based services for de-identification

A few fee-based services that researchers may opt to use for de-identification are included below:

● d-wise (American and European offices)
   ○ Offers free anonymization services to anyone working on a COVID-19 vaccine.[36]
   ○ Offers free anonymization services to researchers who deposit individual participant-level data from COVID-19 clinical trials in Vivli.[37]

● Inter-university Consortium for Political and Social Research (ICPSR) (archive headquartered at the University of Michigan)
   ○ If you wish ICPSR to conduct disclosure analysis of your data, you will need to purchase the Professional Curation package. Cost is based on the number of variables and the complexity of the data. Contact ICPSR Acquisitions at deposit@icpsr.umich.edu for additional information (information obtained from the OpenICPSR FAQ, Pricing and Sensitive Data sections).[38]

● Privacy Analytics (Ottawa-based company)
   ○ Privacy Analytics can review datasets as part of their Data Privacy Validation Services.[39]
   ○ Their methodology is based on the HIPAA Expert Determination de-identification standard.
   ○ To find out more about their services, fill in the form at the bottom of their "Certification" webpage.[40]

[36] "d-wise Offers Free Transparency Services Accelerating COVID-19 Vaccine Research," Cision PRWeb, March 10, 2020, https://www.prweb.com/releases/d_wise_offers_free_transparency_services_accelerating_covis_19_vaccine_research/prweb16970368.htm.
[37] "d-wise offers anonymization services available on Vivli COVID-19 portal," Center for Global Clinical Research Data, April 13, 2020, https://www.prweb.com/releases/d_wise_offers_free_transparency_services_accelerating_covis_19_vaccine_research/prweb16970368.htm.
[38] "FAQs," OpenICPSR, accessed August 31, 2020, https://www.openicpsr.org/openicpsr/faqs.
[39] "Clinical Trial Transparency Services," Privacy Analytics, accessed August 31, 2020, https://privacy-analytics.com/clinical-trial-transparency/ctt-services/.
[40] "Double-check your data and leverage it with confidence," Privacy Analytics, accessed August 31, 2020, https://privacy-analytics.com/health-data-privacy/health-data-services/expert-data-opinion-services/.

Resources

1. Amnesia: https://amnesia.openaire.eu/
2. ARX: https://arx.deidentifier.org/overview/
3. d-wise: https://www.d-wise.com/de-identification-services
4. mu-Argus: https://github.com/sdcTools/muargus
5. Inter-university Consortium for Political and Social Research (ICPSR): https://www.openicpsr.org/openicpsr/
6. Privacy Analytics: https://privacy-analytics.com/services/certification/
7. sdcMicro: https://cran.r-project.org/web/packages/sdcMicro/index.html
8. The University of Texas at Dallas Anonymization Toolbox: http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php

References

1. Alder, Steve. "What is Considered PHI Under HIPAA Rules?" HIPAA Journal, December 28, 2017. https://www.hipaajournal.com/considered-phi-hipaa/.
2. Benschop, Thijs, and Matthew Welch. Statistical Disclosure Control for Microdata: A Practice Guide for sdcMicro. International Household Survey Network. Accessed August 10, 2020. https://sdcpractice.readthedocs.io/en/latest/index.html.
3. Clunie, David A. "DicomCleaner™." PixelMed Publishing™. Accessed July 16, 2020. http://www.dclunie.com/pixelmed/software/webstart/DicomCleanerUsage.html.
4. Documenting the Now. "DocNow/hydrator." GitHub. Accessed August 4, 2020. https://github.com/DocNow/hydrator.
5. Documenting the Now. "DocNow/twarc." GitHub. Accessed August 4, 2020. https://github.com/docnow/twarc.
6. El Emam, Khaled, and Fida Kamal Dankar. "Protecting Privacy Using k-Anonymity." Journal of the American Medical Informatics Association 15, no. 5 (September 2008): 627–637. https://doi.org/10.1197/jamia.M2716.
7. Elliot, Mark, Elaine Mackey, Kieron O'Hara, and Caroline Tudor. The Anonymisation Decision-Making Framework. UK Anonymisation Network (UKAN), University of Manchester, 2016. https://ukanon.net/ukan-resources/ukan-decision-making-framework/.
8. "Genetic privacy" [editorial]. Nature 493 (January 24, 2013): 451. https://doi.org/10.1038/493451a.
9. Global Alliance for Genomics & Health. Genomic Toolkit: Regulatory & Ethics Toolkit. Toronto, ON: Global Alliance for Genomics and Health. Accessed July 20, 2020. https://www.ga4gh.org/genomic-data-toolkit/regulatory-ethics-toolkit/.
10. Government of Canada (Canadian Institutes of Health Research, the Natural Sciences and Engineering Research Council of Canada, and the Social Sciences and Humanities Research Council). Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans. December 2018. https://ethics.gc.ca/eng/policy-politique_tcps2-eptc2_2018.html.
11. Gulban, Omer Faruk, Dylan Nielson, Russ Poldrack, John Lee, Chris Gorgolewski, Vanessasaurus, and Satrajit Ghosh. "Poldracklab/pydeface: V2.0.0." Zenodo. October 31, 2019. http://doi.org/10.5281/zenodo.3524401.
12. Gymrek, Melissa, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich. "Identifying Personal Genomes by Surname Inference." Science 339, no. 6117 (January 18, 2013): 321–324. https://doi.org/10.1126/science.1229566.
13. Hrynaszkiewicz, Iain, Melissa L. Norton, Andrew J. Vickers, and Douglas G. Altman.
"Preparing Raw Clinical Data for Publication: Guidance for Journal Editors, Authors, and Peer Reviewers." BMJ 340 (January 29, 2010): c181. https://www.bmj.com/content/340/bmj.c181.
14. Information and Privacy Commissioner of Ontario. De-identification Guidelines for Structured Data. June 8, 2016. https://www.ipc.on.ca/privacy-organizations/de-identification-centre/.
15. International Household Survey Network. "Anonymization Principles." Accessed August 4, 2020. https://ihsn.org/node/137.
16. International Household Survey Network. "Measuring the Disclosure Risk." Accessed August 4, 2020. https://ihsn.org/anonymization-risk-measure.
17. International Neuroimaging Data-Sharing Initiative (INDI). Data Contribution Guide. Accessed August 4, 2020. http://fcon_1000.projects.nitrc.org/indi/indi_data_contribution_guide.pdf.
18. Kirby, Justin. "Submission and De-identification Overview." The Cancer Imaging Archive (TCIA), University of Arkansas for Medical Sciences. April 27, 2020. https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview.
19. Mannheimer, Sara, and Elizabeth Hull. "Sharing Selves: Developing an Ethical Framework for Curating Social Media Data." International Journal of Digital Curation 12, no. 2 (April 18, 2018). https://doi.org/10.2218/ijdc.v12i2.518.
20. Medical Imaging & Technology Alliance, DICOM Standards Committee. "DICOM Part 15: Security and System Management Profiles." In DICOM Standard. Arlington, VA: National Electrical Manufacturers Association. Accessed August 4, 2020. https://www.dicomstandard.org/current/.
21. Moore, Stephen M., David R. Maffitt, Kirk E. Smith, Justin S. Kirby, Kenneth W. Clark, John B. Freymann, Bruce A. Vendt, Lawrence R. Tarbox, and Fred W. Prior. "De-identification of Medical Images with Retention of Scientific Research Value." RadioGraphics 35, no. 3 (May 13, 2015). https://doi.org/10.1148/rg.2015140244.
22. National Human Genome Research Institute. "Privacy in Genomics." February 24, 2020. https://www.genome.gov/about-genomics/policy-issues/Privacy.
23. North Carolina State University Libraries. "Social Media Archives Toolkit." Accessed August 4, 2020. https://www.lib.ncsu.edu/social-media-archives-toolkit.
24. Portage COVID-19 Working Group. "Can I Share My Data?" September 25, 2020. https://doi.org/10.5281/zenodo.4041661.
25. Portage COVID-19 Working Group. "Documentation and Supporting Materials Required for Deposit." September 25, 2020. https://doi.org/10.5281/zenodo.4042034.
26. Portage COVID-19 Working Group. "Recommended Repositories." September 25, 2020. https://doi.org/10.5281/zenodo.4042037.
27. Radiological Society of North America, International COVID-19 Open Radiology Database. "RSNA International COVID-19 Open Radiology Database (RICORD) De-identification Protocol." Accessed August 10, 2020. https://www.rsna.org/-/media/Files/RSNA/Covid-19/RICORD/RSNA-Covid-19-Deidentification-Protocol.pdf.
28. Radiological Society of North America, International COVID-19 Open Radiology Database. "RSNA COVID-19 DICOM Data Anonymizer." Accessed August 10, 2020. https://www.rsna.org/-/media/Files/RSNA/Covid-19/RICORD/RSNA-Anonymizer-Program-Instructions.pdf.
29. Radiological Society of North America, Medical Imaging Resource Community (MIRC). "Clinical Trial Processor (CTP)." Accessed August 4, 2020. https://www.rsna.org/research/imaging-research-tools.
30. Ryerson University Research Ethics Board. "Guidelines for Research Involving Social Media." Ryerson University. November 2017. https://www.ryerson.ca/content/dam/research/documents/ethics/guidelines-for-research-involving-social-media.pdf.
31. Thorogood, Adrian. "Canada: Will Privacy Rules Continue to Favour Open Science?" Human Genetics 137 (July 16, 2018): 595–602. https://doi.org/10.1007/s00439-018-1922-z.
32. Twitter. "Developer Terms: More About Restricted Uses of the Twitter APIs." Accessed August 4, 2020. https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases.
33. UK Data Service. "Anonymisation: Qualitative Data." Last modified June 30, 2020. https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/qualitative.aspx.
34. Zeffiro, Andrea, and Jay Brodeur. "Social Media Research Data Ethics and Management." Workshop presented April 5, 2018, Sherman Centre for Digital Scholarship, McMaster University. http://hdl.handle.net/11375/25327.