key: cord-1020326-237zu9ak
authors: Thorogood, Adrian
title: Policy-Aware Data Lakes: A Flexible Approach to Achieve Legal Interoperability for Global Research Collaborations
date: 2020-08-19
journal: J Law Biosci
DOI: 10.1093/jlb/lsaa065
sha: 4efb04650499aa04230f10f775e883ebee56ac9a
doc_id: 1020326
cord_uid: 237zu9ak

Abstract: A popular model for global scientific repositories is the data commons, which pools or connects many datasets alongside supporting infrastructure. A data commons must establish legal interoperability between datasets to ensure researchers can aggregate and re-use them. This is usually achieved by establishing a shared governance structure. Unfortunately, governance often takes years to negotiate, and involves a trade-off between data inclusion and data availability. It can also be difficult for repositories to modify governance structures in response to changing scientific priorities, data sharing practices, or legal frameworks. This problem has been laid bare by the sudden shock of the COVID-19 pandemic. This paper proposes a rapid and flexible strategy for scientific repositories to achieve legal interoperability: the policy-aware data lake. This strategy draws on the technical concepts of modularity, metadata, and data lakes. Datasets are treated as independent modules, which can be subject to distinctive legal requirements. Each module must, however, be described using standard legal metadata. This allows legally compatible datasets to be rapidly combined and made available on a just-in-time basis to certain researchers for certain purposes. Global scientific repositories increasingly need such flexibility to manage scientific, organizational, and legal complexity, and to improve their responsiveness to global pandemics.

Health research involving Big Data approaches, or the training of artificial intelligence and machine learning algorithms (AI/ML), depends on access to numerous data sources.
1 Especially in cases like the current global COVID-19 pandemic, researchers need timely access to numerous data sources from around the globe. Unfortunately, in the absence of dedicated transborder data sharing collaborations and supporting infrastructure, scientific data aggregation, especially in times of crisis, tends to be left to researchers and research organizations. Before any analysis can take place, researchers bear the burden of finding, negotiating access to, and curating fragmented data sources, often at great cost and delay. 2 Particularly during public health emergencies, rapid international data sharing requires "appropriate infrastructure … such as repositories and information technology platforms." 3 A popular model for global repositories is the transborder data commons, a scientific resource that pools or connects many datasets and associated infrastructure. A key challenge for establishing a transborder data commons, however, is defining a "clear governance structure that … adheres to national and international ethical and legal requirements." 4 Diverse legal requirements may be associated with scientific datasets, including copyright or database rights, data privacy laws, health research norms, or contractual requirements to provide data generators with academic credit or intellectual property rights in downstream discoveries. Requirements may differ significantly across national and regional legal frameworks. To bring together datasets from around the world associated with different legal requirements, a transborder data commons must develop a shared governance structure that establishes legal interoperability between datasets. Interoperability generally is characterized by the ability to meaningfully exchange data. 5 Datasets are legally interoperable where associated legal requirements are sufficiently compatible to allow for their exchange, aggregation, and re-use.
6 The challenge of establishing a shared governance model is often underestimated. Scientific communities often spend years negotiating governance, delaying data sharing and research. Once established, it may be difficult if not impossible to re-negotiate governance to accommodate valuable new contributions, or to respond to changing circumstances. Furthermore, where datasets are subject to diverse legal requirements, establishing shared governance often involves significant compromise, with trade-offs between data inclusivity (what data resources are included in the commons) and data availability (how broadly the commons can be accessed and re-used by researchers). This paper introduces a novel, alternative model for structuring transborder research projects: the policy-aware data lake. This approach is inspired by the technical concepts of modularity, metadata, and data lakes. Under this approach, dataset contributors do not have to agree up-front to a single set of legal requirements. Instead, they are free to articulate distinct legal requirements for each contributed dataset, and to modify these requirements over time. Data contributors are required, however, to describe the legal requirements applying to their dataset using an agreed-upon menu of legal terms. In other words, all the datasets in a policy-aware data lake must be labelled with standard legal metadata. A range of legal requirements may be associated with scientific data, which may stem from copyright or database rights, data privacy laws, health research regulations, or contractual terms.
13 Launching transborder scientific resources is particularly challenging where they deal with regulated data, such as personal data protected by data privacy laws, which may be subject to multiple, potentially divergent legal definitions and requirements across countries. Jorge Contreras and Jerry Reichman warn that "failure to account for legal and policy issues at the outset of a large transborder data-sharing project can lead to undue resource expenditures and data-sharing structures that may offer fewer benefits than hoped." 14 The International Cancer Genome Consortium (now called the 25K Initiative) is a successful example of a transborder data commons, with a central data access process and data use policy for its large collection of cancer datasets. A shared governance structure often requires years of negotiation within a global scientific community. 20 In order to participate in negotiating a shared governance model, potential contributors must first determine what legal requirements apply to their datasets. Articulating these requirements can be challenging even for sophisticated contributors, in light of rapidly evolving data sharing practices and scientific techniques, and associated legal uncertainty. This step can further delay negotiations. Once a governance structure is established, it can be hard to re-negotiate, and a scientific community can find itself locked in to a static set of rules. In the face of diverse legal requirements, negotiating a homogenous governance structure can involve significant compromise. A transborder data commons typically has to make a key trade-off between data inclusion and data availability (see Figure 1).
On the one hand, if the data commons establishes a permissive governance model, datasets subject to more restrictive legal requirements are excluded (unless it is possible to re-negotiate these requirements locally). On the other hand, if the data commons establishes a restrictive governance model, more datasets can be included in the commons, but the overall availability of data for research is curtailed. Disagreements within a scientific community over the right balance to strike can often prolong negotiations over governance. The Human Cell Atlas (HCA) illustrates the trade-off between data availability and inclusivity, primarily because of regional differences in data privacy law standards regarding data identifiability, consent to data processing, and associated safeguards. Once a governance structure is firmly established, a data commons must also maintain legal interoperability over time. This is typically done through complex, lengthy processes of compliance assessment and due diligence by data contributors at the local level. 25 Potential new data contributors must assess whether the model covers their local legal requirements. At this stage, the governance model is usually take-it-or-leave-it; new contributors have little ability to influence it. This can result in the exclusion of scientifically valuable datasets, perhaps for legal reasons unforeseen at the time the governance structure was established. These processes raise concerns about murky and unaccountable decision-making. Datasets may be illegally released, which can present legal and reputational risks for the data commons. While legal compliance is primarily the responsibility of contributors, a data commons can provide guidance, compliance assessment tools, or due diligence processes to support responsible contribution.
For example, the HCA established "core consent elements" for public (open) sharing of raw RNA sequence data 26 , and an associated consent template for prospective research studies allowing data to be deposited in the HCA. 27 The core consent elements are also being integrated into an assessment tool (forthcoming), to help potential submitters holding already-collected data. Borrowing from technology and data science concepts, this article proposes a promising alternative to the data commons model for transborder scientific resources: the policy-aware data lake. Recall that a transborder data commons achieves legal interoperability upfront by establishing a common legal data governance structure through extensive negotiation and compliance assessment processes. A policy-aware data lake, by contrast, is characterized by a modular approach to achieving legal interoperability (see Table 1 for a comparison). The flexibility inherent in this modular approach reduces the need for prolonged, upfront negotiations over legal data governance before datasets can start to be pooled or otherwise connected, and in turn made available to researchers. Speed, flexibility, and scalability are highly desirable when seeking to build transborder resources, especially in response to global public health emergencies. The concept of a policy-aware data lake draws on three technology and data science concepts: Modularity. Modularity is "the degree to which a system's components may be separated and recombined, often with the benefit of flexibility and variety in use." 34 The concept is used in the design of complex systems, from industries to software, by breaking down the system into modules, which are "units in a larger system that are structurally independent of one another, but work together." 35 Modules have freedom with respect to their internal design as long as they respect certain design rules which allow them to interact with other modules.
Modularity deals with complexity in two ways: 1) through abstraction, which hides the internal complexity of each module; and 2) through interfaces, which govern how modules interact. The legal complexity presented by transborder projects can be addressed through analogous strategies. A policy-aware data lake is defined by three essential characteristics: Modularity/flexibility: scientifically relevant datasets can be contributed to a policy-aware data lake, even if they are subject to quite different legal requirements. 41 This addresses the data commons problem of data inclusion (see Figure 2). This flexibility is essential in the era of Big Data, where researchers seek to link diverse datasets together, from diverse sources, subject to diverse legal requirements. By providing this flexibility and independence to data contributors, a policy-aware data lake can be launched and scaled up quickly. Modularity therefore alleviates the problems of prolonged negotiations and compliance assessments encountered by the data commons model. Modularity also helps to optimize the legal availability of datasets for research (see Figure 2). Not all researchers will need to use all datasets for all research purposes. A data commons defines in advance the extent to which its entire catalogue of data will be legally available, which may be inefficient and ineffective. In a policy-aware data lake, different modules (datasets) can be combined into various subsets. Each subset of modules is legally interoperable, meaning the legal requirements associated with each dataset in the subset are sufficiently compatible to permit re-use for certain research purposes. Some subsets will have fewer datasets but more permissive requirements (see Figure 2, Blue Arrow). Other subsets will have many datasets, but more restrictive requirements (see Figure 2, Red Arrow).
Essentially, the modules of a policy-aware data lake can be reconfigured into various, smaller data commons, which can each be legally made available to certain researchers for certain purposes. Modularity also enables legal data governance to evolve "along with the data sharing zeitgeist." 42 Recall that a transborder data commons cannot easily change its governance structure once established. Furthermore, during public health emergencies, data stewards may be granted exceptional, broad legal authorization to share and use regulated data for research, for example, personal data governed by data privacy laws. 45 These policy and legal changes may also be temporary, suddenly reverting to the status quo once the emergency has passed. A transborder data commons is ill-equipped to adapt its shared legal data governance structure to change, whether slow or sudden. Policy-aware data lakes, by contrast, offer greater flexibility, providing individual data contributors ongoing freedom to unilaterally modify the legal governance associated with their datasets. Legal metadata. Datasets subject to fragmented legal requirements and access processes cannot be meaningfully aggregated and re-used. How does a policy-aware data lake overcome this problem? Because a policy-aware data lake does not establish legal interoperability between datasets at the outset, it must be able to do so rapidly at a later point in time. This is where the policy-awareness aspect of the data lake comes into play: each data contribution must be associated with high-quality legal metadata. Legal metadata is simply data that describes the legal requirements associated with a dataset. For example, a policy-aware data lake may allow data contributors to define the scope of research purposes for which a dataset is legally permitted to be used. Contributors may be provided with a menu of options to choose from, to allow them to comply with their local legal requirements.
Options might include: any scientific research; health or biomedical research; diabetes-specific research only; and/or non-commercial research only. In terms of modularity, one can think of legal metadata as an interface describing how different datasets (modules) interact, i.e., how they can be used, compared, or combined. Legal metadata is also a form of abstraction, which hides the complexity of the legal context from which legal requirements arise. A design rule of a policy-aware data lake is that each module must be described with high-quality legal metadata. In other words, datasets must be explicitly and accurately labelled with legal requirements selected from a standard menu. Rapid legal interoperability assessment: a policy-aware data lake does not establish legal interoperability when datasets are initially contributed. It therefore needs a rapid mechanism to determine what modules (i.e., datasets) can be legally combined and made available to certain researchers for certain purposes. This determination would need to be made "just-in-time" in response to a data access request for a specific research purpose. Practically speaking, a policy-aware data lake might function as follows. First, a researcher submits a request to access all scientifically relevant datasets that are legally available for his or her context and purposes. The data lake then compares the nature of the access request against the legal metadata of its constituent datasets. Finally, the policy-aware data lake aggregates the legally available datasets and provides the researcher access. While a data commons establishes legal interoperability at the ingress phase (deposit of datasets), a data lake establishes legal interoperability at the egress stage (provision of access to datasets). This can only occur if each dataset is described by high-quality legal metadata.
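The three-step flow just described (request, compare against legal metadata, aggregate) can be sketched in a few lines of code. The purpose labels mirror the examples above, but the data structures, breadth ordering, and dataset names are illustrative assumptions, not features of any existing platform:

```python
from dataclasses import dataclass

# Illustrative legal-metadata vocabulary (hypothetical, not a real standard).
ANY_RESEARCH = "any scientific research"
HEALTH_RESEARCH = "health or biomedical research"
DIABETES_ONLY = "diabetes-specific research only"

# Hypothetical breadth ordering: a dataset labelled with a broad purpose
# also permits the narrower purposes listed here.
BROADER_THAN = {
    ANY_RESEARCH: {HEALTH_RESEARCH, DIABETES_ONLY},
    HEALTH_RESEARCH: {DIABETES_ONLY},
    DIABETES_ONLY: set(),
}

@dataclass
class Dataset:
    name: str
    permitted_purpose: str          # label chosen from the standard menu
    non_commercial_only: bool = False

def legally_available(datasets, requested_purpose, commercial):
    """Return the names of datasets whose legal metadata permits the request."""
    subset = []
    for ds in datasets:
        purpose_ok = (ds.permitted_purpose == requested_purpose
                      or requested_purpose in BROADER_THAN.get(ds.permitted_purpose, set()))
        commercial_ok = not (commercial and ds.non_commercial_only)
        if purpose_ok and commercial_ok:
            subset.append(ds.name)
    return subset

# A small, hypothetical data lake.
lake = [
    Dataset("cohort_A", ANY_RESEARCH),
    Dataset("cohort_B", HEALTH_RESEARCH, non_commercial_only=True),
    Dataset("cohort_C", DIABETES_ONLY),
]

# Non-commercial diabetes study: all three datasets are legally compatible.
print(legally_available(lake, DIABETES_ONLY, commercial=False))
```

A real implementation would draw its labels from a maintained standard rather than an ad hoc dictionary, but the core operation is the same: filter the lake down to the legally interoperable subset for a single request.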
Machine-readable metadata may also be desirable to carry out this matching process quickly, as the number of datasets and the diversity of legal requirements scale. Policy-aware data lakes are modular, encouraging inclusion of diverse datasets from around the world, even if they are subject to different legal requirements. High-quality legal metadata is a kind of legal interface, describing how different datasets can be legally combined and re-used. A policy-aware data lake can be reconfigured into various, legally interoperable subsets in real time. This ensures the legal availability of data is optimized for different research contexts and purposes. By providing flexibility, and by reducing the trade-off between data inclusion and availability, policy-aware data lakes can avoid the delays, compromise, and governance lock-in encountered by the data commons model. I now turn to some existing examples of policy-aware data lakes, before discussing their challenges and limitations. There are already scientific resources exhibiting some characteristics of policy-aware data lakes. The United States' dbGaP is a central repository for genomic and health-related data from studies funded by the National Institutes of Health. 47 While the goal of the repository is to maximize the sharing and broad re-use of datasets among the genomics community, the repository has long recognized that datasets from certain research projects may come with distinctive data use limitations. 48 When researchers deposit datasets into dbGaP, they are asked to specify any data use limitations according to a standard list. 49 These data use limitations are then enforced by dbGaP's data access committees when researchers seek access to data. The Broad Institute is piloting a software system called DUOS that can help dbGaP's data access committees determine whether data access requests comply with the data use limitations for requested datasets.
50 DUOS is based on the Data Use Ontology, a standard ontology of data use terms maintained by the Global Alliance for Genomics and Health (GA4GH). 51 dbGaP reflects many of the characteristics of a policy-aware data lake. It accepts datasets subject to distinctive data use limitations. Standard data use metadata for each dataset is captured during the submission process. This data use metadata is reliable, as contributors know that the metadata will be acted on by a data access committee. And with the implementation of the DUOS system, dbGaP will be able to automatically determine what subsets of its data resources are ethically "available" for a particular access request. A policy-aware data lake that crosses borders is a different beast altogether. Such a data lake requires a globally accepted legal metadata standard, able to express legal requirements emanating from diverse legal frameworks. The GA4GH Data Use Ontology is an emerging global standard that can be mapped to at least some legal requirements. 52 One transborder project resembling a policy-aware data lake is euCanSHare. This project "is a joint EU-Canada project to establish a cross-border data sharing and multi-cohort cardiovascular research platform …[that] integrates more than 35 Canadian and European cohorts making up over 1 million records… ." 53 The project is seeking to establish a governance structure that respects open science tenets but also complies with diverse applicable legal frameworks. One of its proposals is to develop a data access portal to facilitate access to the project's multiple research resources. This portal is built on the existing infrastructure of the European Genome-Phenome Archive (EGA), which allows research projects to deposit and manage their data using central infrastructure.
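Part of what makes an ontology useful for this kind of matching is that its terms are hierarchical: a dataset restricted to research on a broad disease category can be checked against a narrower requested purpose by walking the hierarchy. The toy hierarchy, term names, and matching rule below are illustrative assumptions, not actual Data Use Ontology terms:

```python
# Toy disease hierarchy (child -> parent), standing in for the structured
# terminology a real ontology would supply. Terms here are hypothetical.
PARENT = {
    "breast cancer": "cancer",
    "lung cancer": "cancer",
    "cancer": "disease",
    "type 2 diabetes": "diabetes",
    "diabetes": "disease",
}

def is_a(term, ancestor):
    """True if `term` equals `ancestor` or is a descendant of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False

def request_permitted(dataset_disease_restriction, requested_disease):
    # A dataset restricted to research on a disease also permits research
    # on any more specific form of that disease.
    return is_a(requested_disease, dataset_disease_restriction)

print(request_permitted("cancer", "breast cancer"))    # True
print(request_permitted("cancer", "type 2 diabetes"))  # False
```

This subsumption check is what distinguishes an ontology from a flat controlled vocabulary: the structure itself encodes which requests fall within a restriction.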
54 The contributing research projects would remain responsible for establishing and enforcing their own access policies, through the establishment of a local data access committee, though the EGA can provide centralized infrastructure for granting and managing dataset access credentials. 55 One of the deliverables of this project is to code the consent forms and associated documents of contributing projects into machine-readable data use profiles. While the access portal of euCanSHare will permit contributing projects to establish their own local data access policies and procedures, it still aims to ensure data use terms are expressed in a standard, machine-readable format. 56 The metadata framework euCanSHare is using to represent legal metadata is the Automatable Discovery and Access Matrix. 57 This metadata model was also developed under the auspices of the GA4GH, and is similar to the GA4GH Data Use Ontology. These data use profiles can then be fed into a search engine, allowing researchers to find datasets across the consortium that are ethically and legally available for their research purposes and context. The consortium will also explore the extent to which a computable approach can improve the ability to both automate and document researcher access to multiple datasets. 58 These examples reveal concrete differences between how one constructs a data commons and how one constructs a policy-aware data lake. A data commons begins with upfront negotiations over a shared legal governance structure, based on an ex ante vision of the resource's purpose. Datasets can only be contributed to the data commons if they comply with the existing governance structure. Researchers then typically request access to the data commons as a whole. A policy-aware data lake, by contrast, does not make up-front decisions about what datasets can be included or excluded.
Instead, it begins by simply mapping the legal requirements associated with each dataset to a standard menu of terms. Researchers then request access to the subset of datasets that are legally available for their proposed purpose. A data commons may be a better model for transborder projects where datasets are subject to relatively homogenous legal requirements, and where a scientific community has a clear, shared vision for how the resource will be used. A policy-aware data lake may be preferred where datasets are subject to more diverse legal requirements, and where the purposes of a transborder resource are likely to evolve over time. While policy-aware data lakes offer potential advantages of flexibility, speed, and scalability, they also come with new challenges. If not designed and implemented carefully, policy-aware data lakes can degrade into legally fragmented data swamps. A data swamp is a collection of superficially pooled or connected data resources providing a mere aura of aggregation, but little meaningful opportunity for researchers to aggregate, access, and re-use data. This is analogous to the scientific data management context, where data lakes, due to a lack of upfront data curation to ensure scientific quality and interoperability, end up being scientifically useless. 59 Policy-aware data lakes may be susceptible to such degradation, because they do not establish legal interoperability upfront. The majority of datasets may end up being subject to conflicting or highly restrictive legal requirements. A scientific community may put significant effort into pooling or connecting datasets before it becomes clear that there is no real prospect of legally compliant data aggregation and re-use. To avoid this problem, policy-aware data lakes could incorporate some of the harmonization processes used to establish legal interoperability for a data commons (discussed above).
Scientific communities could, for example, negotiate minimum legal availability standards for contributions to ensure included datasets have a reasonable prospect of being aggregated with other datasets and re-used. Contributors could also be encouraged to establish the most permissive legal profile possible for their data before tagging datasets with legal metadata. Indeed, contributors may be able to modify some legal requirements by re-negotiating agreements, consents, or approvals. Contributors could also be encouraged to avoid excessively conservative interpretations of local legal requirements. 60 These harmonization processes would, however, be far more lightweight than in the data commons context. Admittedly, there are many problems of legal interoperability that cannot simply be resolved by negotiation between private parties; the success of policy-aware data lakes will also depend on the continued evolution of background data sharing policy 61 and international legislative harmonization. 62 The frictionless vision presented here of a modular and scalable transborder data sharing resource depends on the existence of a global standard for expressing legal metadata. Establishing such a standard is complicated both conceptually and procedurally. Conceptually, a legal metadata standard would, at a minimum, consist of a controlled vocabulary of terms describing the different legal permissions, restrictions, and requirements that may apply to the release and use of scientific data. 63 A slightly more complex legal metadata standard could take the form of an ontology, which is a structured and hierarchical terminology. Some requirements are also interdependent: for example, under the European General Data Protection Regulation, personal data may be transferred to countries outside of Europe, but in some cases this may only be permitted on the condition that standard contractual clauses are in place. 64 To address this, a legal metadata standard may need to go further and include embedded logic to capture such interdependencies.
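A minimal sketch of what such embedded logic might look like, using the cross-border transfer example above: the field names, values, and evaluation rule are hypothetical, not drawn from any existing metadata standard:

```python
# Legal metadata with embedded logic: a permission that only holds when
# another condition is satisfied (here, a cross-border transfer that
# requires standard contractual clauses). Field names are illustrative.
dataset_metadata = {
    "name": "cohort_EU",
    "transfer_outside_europe": "conditional",  # "yes" | "no" | "conditional"
    "transfer_condition": "standard_contractual_clauses",
}

def transfer_allowed(metadata, destination_in_europe, safeguards):
    """Evaluate the embedded condition for a proposed transfer."""
    if destination_in_europe:
        return True
    rule = metadata["transfer_outside_europe"]
    if rule == "yes":
        return True
    if rule == "conditional":
        # The permission depends on whether the required safeguard exists.
        return metadata["transfer_condition"] in safeguards
    return False  # rule == "no"

# Transfer to a non-European recipient is allowed only once the required
# safeguard is in place.
print(transfer_allowed(dataset_metadata, False, set()))
print(transfer_allowed(dataset_metadata, False, {"standard_contractual_clauses"}))
```

The point is that the metadata carries not just a static label but a condition that can be evaluated against the circumstances of a particular access request.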
65 An example of such a standard including a basic logic is the HL7 FHIR Consent standard, which allows healthcare providers to capture, communicate, and enforce a patient's privacy consent directive, an agreement determining how the patient's health information will be accessed, used, and disclosed. 66 Legitimate consensus-building processes are also needed to establish a global legal metadata standard. Admittedly, this may require higher levels of international collaboration than negotiating a shared legal data governance structure for a data commons. Developing any international technical standard requires high levels of coordination and collaboration, an effort that also needs to be sustained over time to maintain the standard. 67 Modularity will become an important governance principle for the era of Big Data, in order to handle the growing scientific, organizational, and legal complexity of international research systems. Scientific opportunities to combine data across jurisdictions, sectors, and contexts now far outpace the ability of transborder projects to negotiate a shared legal data governance structure. These opportunities will also continue to outpace lawmakers' efforts to harmonize legal requirements across countries, as well as data stewards' efforts to renegotiate agreements, consents, and other private-order sources of legal requirements associated with data. The COVID-19 pandemic has brought home the pressing need for infrastructure that supports international research collaboration. The data commons model will remain an important, perhaps ideal, model for transborder data sharing. Policy-aware data lakes present a promising new alternative for projects dealing with significant legal heterogeneity, or prioritizing speed and flexibility. Pilots are needed to identify this model's limits, and to demonstrate its potential to optimize responsible data aggregation and re-use.
Figure legend: Black line - limit of uses permitted by the commons' governance structure; Red arrow - uncontroversial scientific project (3 datasets available); Green arrow - somewhat controversial scientific project (3 datasets available); Blue arrow - controversial scientific project (0 datasets available); Red stripes - datasets subject to restrictive legal requirements that must be excluded from the commons; Blue waves - legally permitted uses prohibited by the governance structure.

Table 1. Comparison of the data commons and policy-aware data lake models.

Condition for including a dataset
Data commons: The shared legal data governance structure must respect the legal requirements associated with the dataset.
Policy-aware data lake: The legal requirements associated with the dataset provide a reasonable likelihood of aggregation and re-use.

Legal availability of data for re-use
Data commons: All datasets within the commons are available for research uses that respect the legal requirements of the most restrictive dataset included in the commons.
Policy-aware data lake: All datasets legally available for a proposed research use.

Adaptability of legal governance
Data commons: Limited; the common legal data governance structure must be re-negotiated.
Policy-aware data lake: Data contributors are free at any time to update their legal metadata.

Data flows across borders
Both models: Yes, though some datasets may be subject to distinct data localization rules.

Shared infrastructure
Data commons: Spectrum from centralized to fully distributed.
Policy-aware data lake: Spectrum from centralized to fully distributed.

References:
Data Use Under the NIH GWAS Data Sharing Policy and Future Directions
Broad Data Use Oversight System
Global Alliance for Genomics and Health, Data Use Ontology
Ontology for Consent Codes and Data Use Requirements
Deliverable D3.1 - Data Management Plan
The Genomic Commons, 19 ANNUAL REVIEW OF
Legal Interoperability as a Tool for Combatting Fragmentation
National Institutes of Health, supra note 46
Supra note 23
Big Data Analytics Need Standards to Thrive: What Standards are and why they Matter, 209 CIGI Papers
Creative Commons, When We Share

Acknowledgements: I would like to thank Genome Canada, Genome Quebec, and the Canadian Institutes of Health Research for funding support. I would also like to thank Professor Bartha Knoppers and Alexander Bernier for their invaluable feedback on earlier drafts of this paper.