Anonymising and sharing individual patient data
BMJ 2015; 350 doi: https://doi.org/10.1136/bmj.h1139 (Published 20 March 2015) Cite this as: BMJ 2015;350:h1139
- Khaled El Emam, associate professor in pediatrics, Canada research chair in electronic health information1 2,
- Sam Rodgers, lead general practitioner3,
- Bradley Malin, vice chair for research4
- 1Children’s Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
- 2Faculty of Medicine and School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa
- 3Earls Court Health and Wellbeing Centre, London, UK
- 4Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA
- Correspondence to: K El Emam
There is increasing pressure to share individual patient data for secondary purposes such as research.1 2 3 For example, research funding agencies are strongly encouraging recipients of funds to share data collected by their projects.4 5 6 Sharing individual patient data for health research is expected to bring several benefits: it ensures accountability in results and that reported study results are valid, it allows researchers to build on the work of others more efficiently and to perform individual patient data meta-analyses to summarise evidence, and it decreases the burden on research subjects through the reuse of existing data.7 In many instances, however, patient privacy concerns have been perceived as a key barrier to making individual patient data available.3 8
There are two legal mechanisms that would permit data custodians to share patient data for secondary purposes (unless there is an exemption in the law): (a) consent and (b) anonymisation. If the data was originally collected in a medical context, then consent for unanticipated secondary analyses is often not obtained in advance. It is not always practical to go back and obtain consent from a large number of patients, and there is evidence of systematic consent bias whereby consenters and non-consenters differ on important characteristics.9 10 11 As a consequence, it is challenging to rely on consent as the primary mechanism for sharing data. With respect to the second option, there is evidence that many research ethics boards will permit the sharing of patient data without consent for research purposes if it is anonymised.12 (The term “de-identification” is more commonly used in North America while “anonymisation” is more commonly used in Europe; for this article, we treat the terms as equivalent.)
Many jurisdictions, including those in North America and Europe, do not designate anonymised health data as personal information.7 Therefore, such data would no longer be covered by privacy laws, allowing it to be used and disclosed for any secondary purpose. However, there is an expectation that the anonymised data will be used only for purposes that are legitimate, in a manner that would not surprise the patients, and not in a discriminatory or stigmatising manner. This expectation has been made explicit in the EU context,13 and falls under a privacy ethics framework outside the European Union.14
When sharing patient data for secondary purposes it is important to be mindful of patient trust. While patients are supportive of the use of their data for research,7 they often expect that the data will be adequately anonymised. Trust is important because there is evidence that patients adopt privacy protective behaviors, such as lying and not seeking care, when they have concerns about how their health information may be shared.15
Definitions of anonymity in privacy laws and regulations do not provide an operational method to follow for anonymising health information. Even the concept of anonymous or non-identifiable data is ambiguous. For example, the European Data Protection Directive 95/46/EC states that “‘personal data’ shall mean any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity”; and the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule of 1996 in the US notes that “Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.” This ambiguity contributes to heterogeneity and inconsistency in actual anonymisation practices for health data.
The subdiscipline of statistics known as disclosure control has developed a substantial body of knowledge around anonymisation techniques.16 17 In this article we describe the key concepts and principles behind the anonymisation of health data in an effort to find a common language and mitigate current inconsistencies. As a running example, we will use the Ontario (Canada) birth registry dataset (known as BORN) to illustrate various points. BORN is a population registry of all births in the province. The data is collected from hospitals, clinics, midwives, and the provincial newborn screening laboratory and stored in a data warehouse. The data is then used and disclosed for research and public health purposes.18
From a technical perspective, ensuring anonymity equates to ensuring that the probability of assigning a correct identity to a record in a dataset is very small. This probability can be conditional on other factors, such as the skills required and resources available to an adversary seeking to re-identify a record.7 When data is shared, it is not possible to ensure that the probability of re-identification is zero, but it is possible to ensure that the probability is very small.
Existing standards and guidelines tend to divide the variables in a dataset into two groups: direct identifiers and quasi-identifiers. The direct identifiers are features that permit direct recognition or communication with the corresponding individuals, such as personal names, email addresses, telephone numbers, and social insurance numbers. Quasi-identifiers are features that can indirectly identify individuals, such as their date of birth, death, or clinic visit, residence postal code, and ethnicity. Quasi-identifiers include demographics and socioeconomic information. Both types of variables must be addressed during anonymisation.
In the case of the BORN registry, variables such as the mother’s name and health insurance number are designated as direct identifiers. These variables are removed before data arrives at the registry. Sometimes there are unique identifiers that need to be retained to allow linking of all of the records that belong to the same mother (for example, to track multiple births) such as a medical record number. Because a medical record number is often considered a patient identifier as well, it is converted to a pseudonym. The data is then called “pseudonymised.” Pseudonymous data is still considered personal information under the European Data Protection Directive 95/46/EC19 and should not be treated as anonymous.
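The pseudonymisation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by the BORN registry: the keyed hash (HMAC-SHA256), the key, the field names, and the record numbers are all assumptions made for the example. The point it shows is that the same medical record number always maps to the same pseudonym, so records for one mother can still be linked, while the original number is no longer present in the shared data.

```python
import hashlib
import hmac


def pseudonymise(record_number: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for a record number using a keyed hash."""
    digest = hmac.new(secret_key, record_number.encode("utf-8"), hashlib.sha256)
    # Keep a shortened hex string; the length is an arbitrary choice here.
    return digest.hexdigest()[:16]


# Hypothetical key, held only by the data custodian.
key = b"example-key-held-only-by-the-custodian"

# Hypothetical records: two births linked to the same mother, one to another.
births = [
    {"mrn": "MRN-000123", "baby_sex": "F"},
    {"mrn": "MRN-000123", "baby_sex": "M"},
    {"mrn": "MRN-000456", "baby_sex": "F"},
]

# Replace the record number with its pseudonym before sharing.
pseudonymised = [
    {"mother_id": pseudonymise(b["mrn"], key), "baby_sex": b["baby_sex"]}
    for b in births
]
```

As the article stresses, data treated this way is pseudonymous rather than anonymous: the quasi-identifiers that remain in each record still need to be addressed.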
To date, all known successful re-identification attacks (excluding genetic data) were performed on pseudonymous data.20 Adversaries performing such an attack attempt to determine the identity of individuals in a dataset that has been shared. Known re-identification attacks are performed almost exclusively by researchers and the media.20
The motives of the media are believed to be to show that shared data is unsafe (which makes for a good story) or to contact individuals and their families for a story. Academics perform these attacks to publish new computational algorithms for attacking databases and also to show weaknesses in available databases. In general, such “white hats” get recognition for finding weaknesses in systems and databases. We consider two examples below.
An example of a media initiated re-identification attack is when a national Canadian broadcaster re-identified an individual in the adverse drug event database from Health Canada. The purpose was to report on the adverse events associated with a drug, and they wanted to interview the family of the deceased individual who was re-identified.21 The re-identification attack used publicly available obituaries to match on age, location, and date of death to determine the identity of the 26 year old woman who had died while taking the drug in question.
A recent example of a successful re-identification attack by a team of a reporter and an academic was performed on a hospital discharge database. The department of health in Washington state in the United States was sharing pseudonymised data with few restrictions on who could access the data and what the data recipient could do with it. In this attack, the adversaries used information from newspaper articles about vehicle accidents and reports involving hospitalisations of famous people in the media to re-identify individuals in the hospital discharge database.22 23 This was accomplished by combining the discharge data with publicly available phone number directories and voter registration lists. Specifically, in this attack the adversary leveraged knowledge about the date of admission, the injury code, the age of the patient, which hospital was visited, the ZIP code of the patient, whether it was a weekend admission, as well as the gender and race of the patient. This amounted to 11 quasi-identifiers that were leveraged to attack the database.
In both of the above cases the successful re-identification attack used quasi-identifiers. It is therefore important to protect the quasi-identifiers as well as the direct identifiers.
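The common mechanism in both attacks, matching released records against public information on shared quasi-identifiers, can be sketched as follows. The records, names, and field names are invented for illustration and do not come from either incident; the function simply reports released records for which exactly one public record agrees on every quasi-identifier.

```python
def link_records(released, public, quasi_identifiers):
    """Pair each released record with a name when exactly one public record
    matches it on all of the given quasi-identifiers."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in released:
        key = tuple(row[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match is a candidate re-identification
            matches.append((row, candidates[0]))
    return matches


# Hypothetical released records: no names, but quasi-identifiers remain.
released = [
    {"age": 26, "region": "Region A", "date_of_death": "2010-03-01"},
    {"age": 61, "region": "Region B", "date_of_death": "2010-04-12"},
]

# Hypothetical public records, for example transcribed from obituaries.
public = [
    {"name": "Person One", "age": 26, "region": "Region A", "date_of_death": "2010-03-01"},
    {"name": "Person Two", "age": 61, "region": "Region B", "date_of_death": "2010-04-12"},
    {"name": "Person Three", "age": 61, "region": "Region B", "date_of_death": "2010-04-12"},
]

matches = link_records(released, public, ["age", "region", "date_of_death"])
```

In this toy data only the first released record is linked to a single person; the second matches two public records, so the match is ambiguous. Reducing the precision of quasi-identifiers works by making more matches ambiguous in this way.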
Types of data sharing
There are three general ways to share data for secondary purposes: public, quasi-public, and non-public.
Public data has the fewest restrictions placed on it. Such public data is available, typically online, for anyone to download either free or for a nominal fee. Many national statistical agencies release census and national survey data as public data. Some of this survey data includes health information. There are also publicly available clinical trials data from the International Stroke Trial24 and data posted in the Dryad online open access data repository.25 26
Non-public data has the most restrictions placed on it. In this case the data recipient would need to sign a full contract that, in addition to the above specifications, includes a prescriptive set of security and privacy controls that the data recipient needs to have in place, such as encrypting their computers and providing privacy training to the analysts who will work with the data. The data custodian may also reserve the right to audit recipients to ensure that they comply with all of the conditions.
The data needs to be anonymised in all of the three cases above. However, the acceptable probability of re-identification would vary. For a public data release the probability needs to be quite low because there are no other controls that can be put in place. However, for non-public data a higher probability would be acceptable because other security, privacy, and contractual controls would be put in place. This balancing of controls to manage the risk is illustrated in the figure.
The above distinctions mean that the same data can be sufficiently anonymised in different ways depending on the context of the data release. Accounting for the context of the data release when deciding on how to anonymise is consistent with existing best practices and regulatory guidance.29 30 31
The mechanism of data release can also vary. For example, individual patient data may be provided to a researcher for download, or the researcher may get access to the individual patient data through a portal that does not allow any data to be downloaded. In the latter case all of the analysis must happen on the portal itself. Some data custodians require the researcher to be physically present in a secure room in order to access individual patient data. Each of these mechanisms has a different set of controls imposed on the researcher, and therefore the acceptable probability of re-identification would be set accordingly.
Measuring the probability of re-identification
The balancing described above is premised on the ability to measure the probability of re-identification. Several metrics have been developed for measuring the probability of re-identification.7 These can be applied for datasets over a large population or for samples derived from the population. The BORN registry is an example of a population dataset because it includes all births in Ontario. In that case, the probability of re-identification can be directly measured from the data. A sample dataset could be, for example, a clinical trial with diabetic patients (because only a subset of all patients with diabetes will participate in that trial). In the case of the clinical trial dataset, the probability of re-identification would have to be estimated from the data.
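For a population dataset, one common way to measure this probability directly is to group records that share the same quasi-identifier values and take one divided by the group size as each record's probability of re-identification. The sketch below follows that approach; it is an illustrative assumption, not necessarily the metric behind the figures reported in this article, and the records and field names are invented. It does not cover sample datasets, where the probability has to be estimated.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (maximum, average) probability of re-identification, taking each
    record's probability as 1 / (number of records sharing its quasi-identifier values)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    per_record = [1.0 / class_sizes[k] for k in keys]
    return max(per_record), sum(per_record) / len(per_record)


# Hypothetical population of four births.
records = [
    {"baby_dob": "2010-01-05", "baby_sex": "F"},
    {"baby_dob": "2010-01-05", "baby_sex": "F"},
    {"baby_dob": "2010-01-05", "baby_sex": "M"},
    {"baby_dob": "2010-01-06", "baby_sex": "M"},
]

max_risk, avg_risk = reidentification_risk(records, ["baby_dob", "baby_sex"])
```

Here two records are unique on date of birth and sex, so the maximum probability is 1.0, and the average over all four records is 0.75.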
To start with, the probability of re-identification will depend on two factors: (a) which quasi-identifiers are included in the shared dataset and (b) the extent to which the data has been perturbed (or modified).
In the BORN registry, variables such as the baby’s date of birth and sex and the mother’s date of birth and postal code are designated quasi-identifiers. They could also be discovered by an adversary for various reasons: births are commonly announced, residence information is available from sources such as the Whitepages (Canadian and US telephone and address directories), and basic demographics are generally available from a variety of public resources.32 We can illustrate how the probability of re-identification is affected by the selected quasi-identifiers.
Table 1 shows the probability of re-identification for different combinations of quasi-identifiers in BORN. The dataset we use has 919 710 births from 2005 to 2011. This probability will vary depending on which quasi-identifiers are included in the released data. In general, the more quasi-identifiers that are included in the released data, the greater the probability of re-identification. Some quasi-identifiers have a substantial impact, such as the Canadian six-digit postal code, followed by the mother’s date of birth, whereas other quasi-identifiers have little to no impact (such as the baby’s sex). The inclusion of all four quasi-identifiers leads to a high probability of re-identification because at that level of detail almost all births are unique.
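The effect of the choice of quasi-identifiers can be reproduced on toy data. The sketch below computes an average probability of re-identification (the number of distinct quasi-identifier combinations divided by the number of records) for every subset of three hypothetical fields. The data and field names are invented and the metric is an assumption for illustration; the results bear no relation to the BORN figures.

```python
from collections import Counter
from itertools import combinations


def average_risk(records, quasi_identifiers):
    """Average of 1 / class size over all records, which equals the number of
    distinct quasi-identifier combinations divided by the number of records."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return len(classes) / len(records)


# Hypothetical records with three quasi-identifiers.
records = [
    {"baby_sex": "F", "mother_yob": 1980, "postal": "K1A"},
    {"baby_sex": "M", "mother_yob": 1980, "postal": "K1A"},
    {"baby_sex": "F", "mother_yob": 1980, "postal": "K1A"},
    {"baby_sex": "M", "mother_yob": 1985, "postal": "K2B"},
    {"baby_sex": "F", "mother_yob": 1985, "postal": "K2B"},
    {"baby_sex": "M", "mother_yob": 1990, "postal": "K2B"},
]

fields = ["baby_sex", "mother_yob", "postal"]

# Risk for every non-empty subset of the fields.
risk_by_combo = {
    combo: average_risk(records, combo)
    for size in range(1, len(fields) + 1)
    for combo in combinations(fields, size)
}
```

Adding a quasi-identifier can only split existing groups of matching records, never merge them, so under this measure the risk for a larger set of fields is never lower than for any of its subsets.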
Data transformations and data quality
If the probability of re-identification is deemed to be too high, then various perturbation techniques can be applied to reduce it.14 For example, if all quasi-identifiers in table 1 need to be shared without perturbation, it is almost certain that re-identification can happen.
One of the simplest ways to perturb the data is to reduce the precision of data fields through generalisation. This approach is used quite often in practice. As an illustration, it is natural for a date of birth to be generalised into a month and year of birth. Generalisation is, in many instances, considered to be an acceptable strategy for protection because it is consistent with how the data will be analysed. For example, if the analysis only requires the year of birth of the mother, then generalising the mother’s date of birth in BORN will reduce the probability of re-identification and will be consistent with the intended analysis.
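Generalisation of this kind is straightforward to express in code. The following sketch reduces a baby’s date of birth to month and year, a mother’s date of birth to year, and a postal code to its first three characters. The field names, the example record, and the particular choice of generalisations are assumptions for illustration only.

```python
from datetime import date


def date_to_month(d: date) -> str:
    """Generalise a full date to year and month, e.g. '2009-07'."""
    return f"{d.year:04d}-{d.month:02d}"


def date_to_year(d: date) -> str:
    """Generalise a full date to the year only."""
    return str(d.year)


def truncate_postal_code(code: str, keep: int = 3) -> str:
    """Keep only the first `keep` characters of a postal code."""
    return code.replace(" ", "").upper()[:keep]


def generalise_record(record: dict) -> dict:
    """Apply one fixed set of generalisations to a single record."""
    return {
        "baby_dob": date_to_month(record["baby_dob"]),
        "mother_dob": date_to_year(record["mother_dob"]),
        "postal_code": truncate_postal_code(record["postal_code"]),
        "baby_sex": record["baby_sex"],
    }


# Hypothetical record.
original = {
    "baby_dob": date(2009, 7, 14),
    "mother_dob": date(1981, 2, 3),
    "postal_code": "K1A 0B1",
    "baby_sex": "F",
}

generalised = generalise_record(original)
```

Whether a given level of generalisation is acceptable depends on the planned analysis, as the text notes; the code only shows the mechanics.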
Table 2 depicts the probability of re-identification after various generalisations were applied to the BORN quasi-identifiers. Simple changes to the data can result in substantial reductions in the probability of re-identification. Which generalisation should be chosen is determined using a combination of two methods: (a) a data analyst subjectively judges whether a particular generalisation would affect the ability to analyse the data, and (b) formal metrics are applied to evaluate data utility, such as the entropy in the resulting records.10
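The article does not specify how entropy is computed as a utility metric. As a simplified illustration, the sketch below uses the plain Shannon entropy of a single column, which falls as generalisation collapses distinct values together. The postal codes are invented, and real utility metrics used in disclosure control are typically more elaborate.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical three-character postal codes.
postal_codes = ["K1A", "K1B", "K2C", "K2D"]

entropy_three_chars = shannon_entropy(postal_codes)
entropy_two_chars = shannon_entropy([p[:2] for p in postal_codes])
entropy_one_char = shannon_entropy([p[:1] for p in postal_codes])
```

With these four codes the entropy drops from 2 bits at three characters to 1 bit at two characters and to 0 bits at one character, at which point the column no longer distinguishes any records.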
In table 2, scenario S1 reduces the precision of the mother’s date of birth to a year and the postal code to the first three characters, but the probability of re-identification remains quite high. By contrast, scenarios S5 and S6 have the lowest probability of re-identification, but the postal code is truncated to the first character only. This precludes most meaningful geospatial analysis. The lowest probability that maintains location information is reached with scenario S8, with the baby’s date of birth converted to quarter and year and the mother’s age categorised as ≤19, 20-30, 30-40, or >40 years. However, the changes in S8 reduce the utility of the data because details around the exact age of the infant at certain time points cannot be calculated, and geospatial analysis is still limited by the three character postal code.
Better methods of perturbation can be used than simple generalisation. These computational methods can reduce the amount of distortion to the data (such as allowing more granularity than the three character postal code) and produce higher data quality.14 33
In practice, when there are many quasi-identifiers in a dataset, simple techniques such as generalising the values for all the records in the same way are unlikely to produce datasets that are analytically useful. With just the four quasi-identifiers in table 2, the acceptable generalisations were already approaching the limits of data utility. However, as mentioned earlier, recent re-identification attacks leveraged as many as 11 quasi-identifiers.22 23 To maintain the utility of the data, more sophisticated methods can be applied that retain details in dates and geospatial information during the anonymisation process.14
When to stop
A practical question that the data custodian needs to answer is how much generalisation is enough? For instance, are all the solutions in table 2 that are below a probability of re-identification of 0.2 acceptable from a risk perspective? There are precedents (regulatory, legal, and practical) going back decades for what is an acceptable probability of re-identification for public and non-public data releases.7 These precedents provide a range of possible acceptable thresholds that can justifiably be used. In general, they vary from an acceptable probability of 0.33 to 0.05.7
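A threshold turns the question of when to stop into a simple rule: try candidate generalisations from least to most distorting and stop at the first one whose measured probability of re-identification is at or below the threshold. The sketch below illustrates this with invented data and hypothetical candidate generalisations; the probability is taken as one divided by the size of the smallest group of matching records, which is an assumption about the metric.

```python
from collections import Counter


def max_risk(rows):
    """Maximum probability of re-identification: 1 / smallest group of identical rows."""
    return 1.0 / min(Counter(rows).values())


def first_acceptable(records, candidates, threshold):
    """Return the name of the first candidate generalisation whose maximum
    risk is at or below the threshold, or None if no candidate qualifies."""
    for name, transform in candidates:
        if max_risk([transform(r) for r in records]) <= threshold:
            return name
    return None


# Hypothetical records: (mother's year of birth, six-character postal code).
records = [
    (1980, "K1A0B1"), (1980, "K1A0B2"), (1981, "K1A0B3"), (1982, "K1A0B4"),
    (1990, "K2B1C1"), (1991, "K2B1C2"), (1990, "K2B1C3"), (1992, "K2B1C4"),
]

# Candidates ordered from least to most generalised (hypothetical choices).
candidates = [
    ("no generalisation", lambda r: r),
    ("three-character postal code", lambda r: (r[0], r[1][:3])),
    ("decade of birth, three-character postal code",
     lambda r: (r[0] // 10 * 10, r[1][:3])),
]

chosen_at_033 = first_acceptable(records, candidates, threshold=0.33)
chosen_at_005 = first_acceptable(records, candidates, threshold=0.05)
```

With a threshold of 0.33 the search stops at the third candidate, where each group contains four records. With a threshold of 0.05 none of the candidates is sufficient for so small a dataset, and the function returns None.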
There are instances where anonymisation schemes do not include risk measurement or the setting of thresholds to ensure that the probability of re-identification is acceptable.34 35 36 For example, these schemes provide a fixed list of quasi-identifiers that should be removed from the dataset. These approaches cannot provide assurance that the probability of re-identification is small for any single dataset because the actual quasi-identifiers may differ from the list. Moreover, their application may result in datasets being excessively perturbed. Therefore, such approaches would not be appropriate for complex datasets. Knowing when to stop perturbing the data is important to balance privacy protection and data utility.
Methods for measuring the risk of re-identification can be used to decide how much to anonymise health data for different types of data release. Perturbation that retains sufficient data quality requires data-centric methods rather than simplistic rules regarding how to generalise fields. Anonymisation methods cannot ensure that the risk of re-identification is zero, but this is not the threshold that is expected by privacy laws and regulations in any jurisdiction. Strong precedents exist for choosing suitable probability thresholds for anonymising data. There is a need for anonymisation standards that can provide operational guidance to data custodians and promote consistency in the applications of anonymisation.
Is it necessary to obtain patient consent to anonymise health data or to share anonymised data?
In most jurisdictions, including the European Union, anonymisation is considered a permitted use.13 This means that it is not necessary to obtain patient consent to anonymise the data.
Can data on rare diseases be anonymised?
The presence of a rare disease does not necessarily make it impossible to anonymise. If the dataset is a sample from the population of patients with that disease, then the probability of re-identification may still be small. If the rare disease is not visible then that reduces the likelihood that an adversary would know that someone has that disease.37
Will advances in technology and the greater availability of data increase the risk of re-identification?
Anonymisation is typically time limited to account for changes in technology and the availability of other data that can be used to re-identify individuals. This time limit is typically 18–24 months. After that time has elapsed, the risk of re-identification needs to be re-evaluated to determine if circumstances make the originally anonymised data high risk. This is possible to achieve for non-public datasets where permission to use a dataset is time limited and the data use agreement stipulates a re-assessment of re-identification risk. For public data, the initial anonymisation needs to be more stringent to be applicable for a longer period since it is not possible to “call back” a public dataset.
Anonymising clinical trials data
Regulators such as the European Medicines Agency are planning to make data from clinical trials more generally available.38 39 Initially, the contents of clinical study reports will be made available under a two-track process, with broad public access through a portal and the ability to download for a narrower set of identified users.40 In a second phase, individual patient data will be made available. However, the agency cannot collect individual patient data only for the purpose of sharing it and needs to formulate policies on how to use the individual patient data for scientific review as well. This has resulted in some delays in formulating a policy for sharing individual patient data.
In anticipation of individual patient data being made available by regulators, or the requirement by them to do so, manufacturers have already started putting in place policies and infrastructure for sharing individual patient data.41 Recent examples include:
The Immport Immunology Database and Analysis Portal48
Furthermore, some pharmaceutical companies are creating their own company-specific portals to facilitate the sharing of their own datasets, and these are typically accessible through their corporate websites.
Given that trial participants are often from multiple sites across the world, anonymisation practices for the data must meet the regulatory requirements globally. This means that the burden of evidence that the probability of re-identification is acceptably small is not trivial because regulators in different jurisdictions do not use the same standards. Organisations such as the European Medicines Agency could help address such gaps by providing or recommending robust and scalable methods that can provide quantitative anonymity assurances while producing high quality data.
Contributors: KEE’s research over the past 10 years has been focused on privacy issues related to electronic health information, particularly the policy and technological obstacles and solutions to data anonymisation and the sharing of health data for secondary purposes. He has published three books on this topic covering policy, legal, and methodology issues, as well as a series of case studies to show how to anonymise health data. SR was formerly the clinical chief information officer for NHS Central London CCG. He was the clinical lead on implementing a single primary care IT system for Central London CCG and contributed to the development of the data storage and analysis platform which will house patient data in the North West London area. BM conducts research on methods and software tools to anonymise health data for secondary purposes. He contributed to the development of the US HIPAA de-identification guidelines.
Funding: BM’s work on this article was funded by grants R01LM009989 (National Library of Medicine, National Institutes of Health) and U01HG006385 (National Human Genome Research Institute, National Institutes of Health).
Competing interests: The authors have read and understood the BMJ Group policy on declaration of interests and declare the following interests: KEE and BM have financial interests in Privacy Analytics, a University of Ottawa and Children’s Hospital of Eastern Ontario spin-off company which develops anonymisation software for the health sector.
Ethical approval: The analysis of the BORN registry data described in this article was approved by the Research Ethics Board of the Children’s Hospital of Eastern Ontario Research Institute.