key: cord-1052563-hgkjnvby
authors: Nicholson, Nicholas; Perego, Andrea
title: Interoperability of population-based patient registries
date: 2020-06-13
journal: J Biomed Inform X
DOI: 10.1016/j.yjbinx.2020.100074
sha: 93ba82570b454b3b7a8950aeb4866aea3311a549
doc_id: 1052563
cord_uid: hgkjnvby

Enabling full interoperability within and between population-based patient-registry domains would open up access to a rich and unique source of health data for secondary data usage. Previous attempts to tackle patient-registry interoperability have met with varying degrees of success, but a unifying solution remains elusive. The purpose of this paper is to show by practical example how a solution is attainable via the implementation of an existing framework based on the concept of federated, semantic metadata registries. One important feature motivating the use of this framework is that it can be implemented gradually and independently within each patient-registry domain. By employing linked open data principles, the framework extends the ISO/IEC 11179 standard to provide both syntactic and semantic interoperability of data elements with the means of specifying automated extraction scripts for retrieval of data from different registry content models. The examples provided address the domain of European population-based cancer registries to demonstrate the feasibility of the approach. One of the examples shows how quick gains are derivable by allowing retrieval of aggregated core data sets. The other examples show how aggregated full sets of data and record-level data might also be retrieved from each local registry. An infrastructure of patient-registry domains adhering to the principles of the framework would provide the semantic contexts and inter-linkage of data necessary for automated search and retrieval of registry data. It would thereby also lay the foundation for making registry data serviceable to artificial intelligence (AI) applications.

Whereas no consistent definition exists for the term patient registry (possibly reflecting the many different purposes for which registries are used [1]-[3]), an important qualifier is attached to the definition of population-based registries [4].

Within the domain of cancer, the principal aim of population-based cancer registries is to record all new cancer cases arising in a defined population with emphasis on epidemiology and public health practice [5] . Population-based cancer registries provide information on the cancer burden for healthcare planning and evaluation purposes, and also provide valuable data for studies on prevention, early detection/screening, and cancer-related healthcare. As an example, it is of general public interest to know the risk (and its evolution over time) of developing or dying from a particular cancer. Such information is obtained by epidemiologists and used by public health planners to effect changes in healthcare practice [6] .

The concepts of population-based cancer registries are equally applicable to other disease paradigms and we therefore introduce the encompassing term Population-Based Patient Registries (PBPR).

PBPRs attempt to capture all cases related to a specific illness/condition within a defined population, which is important for removing sources of selection bias from epidemiological studies. PBPRs are therefore instrumental in the planning and evaluation of disease control programmes as well as in the effectiveness of patient healthcare measures.

Data collection in a PBPR is a time-consuming and labour-intensive undertaking requiring access to a number of different data sources that include hospital discharge records, clinical records, pathology reports, and death certificates, all of which may use different encoding schemes for disease.

Painstaking commitment is required to ensure the quality of the registry's data.

Data quality can be compromised through such things as undiagnosed cases, uncertainty of diagnosis, under-reporting of cases, and inaccurate application of codes [7] . PBPRs are different from clinical registries, where the focus is on clinical care and hospital administration. PBPRs collect fewer variables and the variables they do collect are at less granular detail for purposes of comparison at population levels. Consequently PBPRs and clinical registries are used for quite different purposes, which serves to explain the conflicts that can arise between clinical demands for prognostic precision and epidemiological demands for comparability and completeness [8] .

Encouraging secondary data usage of PBPRs would further the symbiosis between data usefulness and data quality. It has been observed that using population-based registry data not only reduces the time and costs otherwise spent on epidemiological studies but leads to increased validity of results [9] .

Moreover, linkage of registry data with other types of data, such as environmental, socioeconomic or dietary/lifestyle data covering the same populations, can stimulate more specific and targeted research based on observed correlations.

In this paper we present a means, based on an existing metadata framework, by which this linkage could be achieved at a technical level both for aggregated data sets and for individual record-level data.

Attempts to link health data are soon frustrated by the need to align different systems used for the various operations of collecting, recording, describing, and classifying information. Nowhere is this more apparent than in the area of electronic health records (EHR).

Considerable effort has been expended over many years in the drive towards data standards that allow interoperability between disparate EHR systems.

Data standards are needed at many different levels, including address protocols, message formats, document architecture, management of document sharing processes, and healthcare terminology [10] .

Two examples of widely used message format standards include Digital Imaging and Communications in Medicine (DICOM) and Health Level Seven version 2 (HL7 v2). As well as facilitating interoperability by ensuring common encoding specifications, they also provide transport-packaging mechanisms for documents conforming to document architecture standards, such as HL7 Clinical Document Architecture (CDA).

Integrating the Healthcare Enterprise (IHE) is an initiative between healthcare professionals and industry to improve healthcare information sharing. Part of this work involves defining integration profiles that provide precise definitions of how standards can be implemented to meet specific clinical needs [11].

Examples of terminology standards include the International Statistical Classification of Diseases (in various revisions: ICD-10, ICD-11) and Systematized Nomenclature of Medicine (SNOMED). ICD is a medical classification list maintained by the World Health Organisation (WHO) and defines the universe of diseases, disorders, injuries, and other related health conditions in a comprehensive, hierarchical fashion [16]. [...] distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services, including electronic health records [18]. It provides a useful tool for mapping between ICD-O-3 codes (International Classification of Diseases for Oncology, 3rd edition) and SNOMED CT terms [19].

Most of the focus regarding data-interchange standards has been on electronic health records and the need to access health data related to a particular patient or group of patients. While this work is of direct importance to PBPRs in the collection and submission of data to registries, less attention has been paid to the exchange of aggregated data sets that are important for epidemiological studies. A survey of European PBPRs indicated that whereas respondents were most familiar with the HL7 standards, they were not necessarily using them beyond collection of primary data for the reason that the standards were not appropriate to their specific data structure and needed information [20] .

Epidemiology is not so much concerned with accessing particular individual cases as it is with selecting complete sets of patient cases sharing certain commonalities from a known population. In this regard, aggregation of cases from a PBPR, or several PBPRs, is particularly pertinent.

PBPRs collect certain information on individual patients in a defined population who have been diagnosed with a given condition. Apart from general information such as date of birth, date of diagnosis, and sex, the data variables collected by PBPRs are dependent upon the disease domain and can vary between registries depending on national or regional health policies as well as the resources available to the registry. The most important variables in a specific healthcare domain form what is called the common or core data set.

The variables in the core data set are generally the most harmonised since they are compared across regional and national boundaries.

Examples of core data sets associated with European-based PBPRs in a selection of healthcare domains are: cystic fibrosis [21], cardiovascular disease [22], congenital anomalies [23], diabetes [24], rare diseases [25], and cancer [26]. As a specific example, the variables in the European population-based cancer-registry core data set capture information concerning the tumour such as: topography (tumour location); morphology (tumour form/structure); behaviour (whether the tumour is benign/in situ/malignant/uncertain); grade (the degree of the abnormality of the tumour cells); basis of diagnosis (how the tumour was diagnosed); and stage (the state of progression of the tumour at diagnosis). The latter is generally described by the globally recognised TNM Classification of Malignant Tumours standard [27].

Whereas PBPRs hold specific information on patients in defined populations, the focus of their work is not so much at the individual level. Epidemiology requires individuals' information in order to identify the relevant cohorts of patients for a particular study. Once the cohorts have been identified, the personal identifiers are removed and results are provided as aggregated data.

Data may be aggregated in a number of ways. One example is by age group, whereby the number of cases (incidence, mortality, etc.) is aggregated in predefined age ranges. In the case of rare occurrences of a specific type of disease, where the number of cases is low, data may also need to be aggregated across wider geographical areas to avoid potential identification of individuals; this is particularly the situation encountered with rare-disease registries (RDRs).
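As a minimal sketch of aggregation by age group, the following Python fragment counts cases in five-year age bands. The function names, the band width default, and the open-ended top band starting at 85 years are assumptions made for illustration and are not taken from any registry protocol.

```python
from collections import Counter


def age_band(age, width=5, top=85):
    """Return a band label such as '55-59'; ages at or above `top` share one open band."""
    if age >= top:
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def aggregate_by_age(ages, width=5, top=85):
    """Count cases per age band; only the counts are returned, not the individual ages."""
    return Counter(age_band(a, width, top) for a in ages)


# Illustrative ages at diagnosis (invented values).
counts = aggregate_by_age([57, 55, 59, 61, 90])
print(dict(counts))
```

Only the band counts leave the function, which is the property that makes aggregated data sets suitable for wider release.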

Indicators (such as incidence, mortality, survival, and prevalence) derived from the core data set provide the means of comparing the disease burden between different populations. Considerable effort is expended in ensuring the accuracy and comparability of the indicators and underlying data and therefore it is important to allow maximum re-use where possible.

It is often not a straightforward matter even to find the core data sets. In addition, the descriptions of the associated variables are not necessarily defined in rigorous and unambiguous terms. Addressing these two aspects alone would bring an immediate breakthrough in the possibility of mapping heterogeneous data sets along common fields of aggregation.

In order to appreciate more fully the data interoperability needs for effective secondary use of PBPRs, they can be considered as a number of distinct use cases:

The first use case relates to the need of collecting harmonised data (corresponding to the core data sets discussed in section 3) from a number of PBPRs. This use case is an example of processes already in operation to gather datasets for comparison at European Union (EU) level for monitoring the burden of disease in different healthcare domains. Currently this is undertaken via some central entity issuing a call for data to the participating registries. The call specifies a common data protocol to which registries are expected to adhere in submitting their data. The collection entity validates the data against a harmonised data validation protocol, which may require a number of iterations in the data submission. Once all the data sets have been validated, the data is aggregated and made publicly available. The process is not optimal for three main reasons. Firstly, it introduces significant time delays on top of those already incurred by the registries themselves in collecting data from the primary sources; when the aggregated data sets are finally made available they may be several years out of date. These delays compromise the value of the data for timely feedback into healthcare planning processes. Secondly, it is demanding of resources: the iterations are relatively manual and require a significant number of communication workflows. Thirdly, data sets are thereby duplicated, leading eventually to data integrity and versioning issues.

The second use case is an amplification of the first use case but would allow maximum reuse of the registry data by access to the registry's aggregated full data set. Access to the core data set has been described in section 4.1, but this forms only a subset of the data stored by the registry. Whereas variables outside the core data set may not be standardised or harmonised, if they were described following standard metadata concepts their meaning would be clearer for data users to analyse them in an appropriate way with less danger of making false assumptions. Indeed a rich source of untapped data resides in the variables of the full data sets. This use case therefore introduces the notion of a registry's aggregated full data set.

The third use case concerns the situation in which access is needed to individual record level data. Such a scenario may be faced in research studies that need to select a set of case records across several registries and then perform the aggregation. This use case is trickier regarding the data privacy requirements since registries would in this case need to release individual record-level data, albeit to another registry. This use case is primarily needed in the case of rare diseases where one registry may not have a sufficient number of cases to undertake a particular study or for high-resolution studies in the absence of solutions for use case #4 described in section 4.4.

The fourth use case concerns a registry service that could be termed aggregation on demand. As discussed in section 3, the focus of epidemiology is not the individual per se, but groups of individuals sharing a common condition. The normal procedure follows a request to registries interested in participating in a given research project. Depending on data privacy agreements, the study proceeds with individual case records and the results are published in terms of aggregated data with no reference to specific individuals. If instead a means were available of aggregating data according to a study-dependent aggregation protocol, there may be no need for registries and studies to set up data-protection agreements and protocols that add time delays and costs to projects. This use case would require a standard way of specifying aggregation protocols that could be simply applied to registry data.
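One way of picturing a standard aggregation protocol is as a small declarative object that names the grouping fields and is applied by the registry to its own records. The sketch below is hypothetical: the `AggregationProtocol` class, the field names, and the sample records are all invented for illustration and do not correspond to an existing specification.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregationProtocol:
    """Hypothetical study-defined protocol: the fields by which cases are grouped."""
    group_by: tuple


def apply_protocol(records, protocol):
    """Return case counts per combination of the grouping fields.

    Fields not named in the protocol (including patient identifiers)
    never appear in the result.
    """
    return Counter(tuple(rec[f] for f in protocol.group_by) for rec in records)


# Invented record-level data held locally by a registry.
records = [
    {"pid": "a1", "sex": "M", "site": "C34", "year": 2016},
    {"pid": "b2", "sex": "M", "site": "C34", "year": 2016},
    {"pid": "c3", "sex": "F", "site": "C50", "year": 2016},
]
protocol = AggregationProtocol(group_by=("sex", "site", "year"))
result = apply_protocol(records, protocol)
print(result)
```

Under this arrangement the study supplies only the protocol and receives only counts, so the record-level data stay within the registry.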

The fifth use case also concerns high-resolution studies. Analysis of aggregated data sets may suggest correlations or patterns, the statistical validity of which may require investigation in greater detail via high-resolution studies. These studies generally require more specific data and it is therefore necessary to identify the individual data subjects constituting the aggregated set of interest. It is not a straightforward process to trace this information back to the primary data sources and the exercise is costly both in time and resources. Having the possibility to trace back automatically to the original primary data source, given a (pseudonymised) patient identifier held in the registry, would greatly facilitate these sorts of studies.

The current limitations regarding access to patient registry data (even within a given patient domain) are widely apparent [3], [7] and constitute a first major challenge without regard to the more complex one of linking data between heterogeneous registries covering different patient domains.

The difficulty of understanding which PBPR data sources are available and what sorts of data they hold, coupled with the cost in both time and resources of making that data available in the format required, greatly compromises the secondary-data usage of PBPRs. Even if the data were readily available, it is not a straightforward matter for researchers to know how the associated variables relate to the research study in mind or even how the data can be used appropriately. The registry is normally actively involved in the study to ensure the data is used correctly, but this also serves to limit the number of studies ongoing at any one time.

Previous attempts have been made at EU level to address some of the underlying needs [28], particularly within the field of rare diseases in which the problem of interoperability is more acute on account of the widely different types of diseases classified within the same overall patient domain [29]. These efforts, however, remain largely focused within each specific patient-registry domain and although the ensuing solutions may ease access to European-harmonised data on a thematic disease level, the use of different metadata methodologies coupled with different data-registration and data-discovery mechanisms will still present a challenge for the inter-linkage of data between registry domains.

Without any overarching strategy and overall coordination across patient registry domains, much effort will continue to be duplicated in terms of reinvention of solutions that have at their basis shared and common requirements. It would be worthwhile to find some way of uniting these efforts towards a common and scalable methodology. With a common framework, initiatives would converge more rapidly towards greater data interoperability with consequently greater scope for secondary data usage.

In view of the fact that the data sets may stretch back many years, it is not feasible to require fundamental changes to already existing data representations or individual registry infrastructures. A more practical solution would be one that encouraged, wherever possible, mapping of local registry data structures to common metadata constructs and the means of retrieving data on the basis of those mappings. Given the number of entities involved and the autonomy of those entities, any solution should as far as possible also be standards-agnostic; it would not be realistic to advocate the use of any one standard and the framework should ideally be able to work with the different standards in place.

A framework with the potential of meeting many of these requirements already exists and its capacity of interfacing with different healthcare standards in the field of electronic health records has been demonstrated [30]. The semantic MDR framework was initially proposed for enabling data interlinkage between different EHR formats with a primary focus on mapping the individual common data elements (CDEs) to standard CDE models. The framework has also been successfully applied to secondary use of EHRs for post-marketing surveillance [32]. Combining the concepts of metadata [...] [30].

The versatility of the model would make it amenable to any application requiring standardised data exchange and, arguably, its applicability is constrained more by implementation-based decisions within the given domain than by the technological constraints themselves. This paper presents a proposal for an implementation of the framework to address the particular data interoperability issues in the field of PBPRs.

By way of a specific example, METeOR defines one of its OCs as "Person with cancer". Associated with this OC are a number of Properties, one of which is "Primary site of cancer". The associated DEC is "Person with cancer -primary site of cancer" which encapsulates the concept of a person with cancer having a primary site of the tumour. In METeOR, the fields "ANN{.N[N]}" within the data elements' VDs refer to the format of the codes (one alphabetical character followed by two numeric characters with an optional decimal point followed by one or two numeric characters). The fact that the codes for these two schemes for classifying cancer-type are expressed in exactly the same format underlines the need to make unambiguous distinction between the VDs, as supported by the ISO/IEC 11179.
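The point about identical code formats can be made concrete with a short sketch. The regular expression below encodes the format "ANN{.N[N]}" as described in the text; the restriction to upper-case letters, the labels "scheme-A" and "scheme-B", and the sample code "C34.1" are assumptions chosen for illustration only.

```python
import re

# Format "ANN{.N[N]}": one letter, two digits, then optionally a decimal
# point followed by one or two digits. Upper case is assumed here.
CODE_FORMAT = re.compile(r"[A-Z][0-9]{2}(\.[0-9]{1,2})?")


def matches_format(code):
    """True if the whole string conforms to the ANN{.N[N]} format."""
    return CODE_FORMAT.fullmatch(code) is not None


def make_value(value_domain, code):
    """Pair a code with its value domain so that identical strings stay distinguishable."""
    if not matches_format(code):
        raise ValueError(f"{code!r} does not match ANN{{.N[N]}}")
    return (value_domain, code)


# The same string is well-formed under both (hypothetical) schemes ...
a = make_value("scheme-A", "C34.1")
b = make_value("scheme-B", "C34.1")
# ... so only the explicit value domain tells the two values apart.
print(a == b)
```

The format check alone cannot say which classification a code belongs to, which is why the value domain has to travel with the value.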

A further important principle of ISO/IEC 11179 concerns the use of classification schemes, which provide the means for developing metadata with enhanced semantic descriptions. OCs, Properties, VDs, DECs, and Data Elements are all classifiable components, and this aspect is an integral part of the philosophy underlying the federated semantic MDR framework.

By way of illustration, Figure 1 

The CDEs are generally abstract data element definitions but the underlying data that have been described by these CDEs can be retrieved through the framework. This is achieved by setting up an extraction-specification CS with an extraction-specification script that is executed on a data server. The semantic MDR framework supports three types of extraction specifications: XPath, SPARQL, and SQL. The mechanism is described more fully in [30] and examples are provided in section 7.
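The idea of an extraction specification, a typed script attached to a CDE and executed against a data source, can be sketched as follows. This is not the framework's implementation: the `ExtractionSpec` class, the dispatcher, and the sample table and XML document are hypothetical, and only the SQL and XPath types are shown because they can be run with the Python standard library (SPARQL would need an RDF library).

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionSpec:
    """Hypothetical container for an extraction script and its type."""
    kind: str    # "sql" or "xpath" in this sketch
    script: str


def run_sql(spec, connection):
    """Execute the SQL script on an open database connection."""
    return connection.execute(spec.script).fetchall()


def run_xpath(spec, xml_text):
    """Evaluate the path (ElementTree's XPath subset) on an XML document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(spec.script)]


def execute(spec, source):
    """Dispatch on the specification type."""
    handlers = {"sql": run_sql, "xpath": run_xpath}
    if spec.kind not in handlers:
        raise ValueError(f"unsupported extraction type: {spec.kind}")
    return handlers[spec.kind](spec, source)


# Invented local data in two different content models.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (site TEXT, n INTEGER)")
conn.executemany("INSERT INTO cases VALUES (?, ?)", [("C34", 12), ("C50", 20)])
sql_rows = execute(ExtractionSpec("sql", "SELECT site, n FROM cases ORDER BY site"), conn)

xml_doc = "<cases><case><site>C34</site></case><case><site>C50</site></case></cases>"
xpath_values = execute(ExtractionSpec("xpath", "./case/site"), xml_doc)
print(sql_rows, xpath_values)
```

The caller needs to know only the CDE and its specification; the script hides the differences between the local content models.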

As an example of linkage with a terminology system, Figure 2

The ingenuity of the framework leads to a number of advantages; namely, that:

• it can be implemented gradually in a well-staged approach;

• it requires no fundamental change to the local data: the underlying principle is to map local metadata to standard metadata descriptions or common dictionaries without enforcing compliance of metadata to any particular standard;

• it is scalable across many different patient-registry domains and can moreover be implemented in each domain independently of the other domains;

• not only are CDEs automatically registered and therefore findable, but they can be reused in an interoperable way via the semantic mapping descriptions to harmonised metadata standards.

Furthermore, via these mappings and their associated extraction specifications, local data elements described by otherwise non-standard CDEs are readily accessible. The semantic MDR framework therefore intrinsically supports all four of the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles for scientific data management and stewardship [42].

In order to illustrate how the semantic metadata framework can be implemented to address the use cases identified in section 4, the practical example of population-based cancer registries will be considered.

The aggregated core data set of the European Network of Cancer Registries (ENCR) consists of five main variables: indicator type (incidence/mortality), sex at birth, cancer site, historical year, and number of cases broken down into five-year age ranges. Within a cancer registry's (CR) tabularised aggregated data set, these variables equate to the column names (cf. Figure 3).

(ECIS) [43] , but requires some effort to extract in its entirety. Furthermore, the metadata is described in different places [26] , [44] , and not in machine-readable terms, nor do the majority of variables link to more generic metadata terms thereby rendering cross-linkage of data difficult between different PBPR domains.

In order to make the aggregated data set accessible via the semantic MDR framework, it is first of all transcribed in RDF, and maintained in a triple store.

RDF is used since it provides a convenient means for the data to be searched and accessed via a SPARQL end point.

A triple store stores data as triples, each consisting of a subject, a predicate, and an object. RDF provides a standard model for data interchange on the Web and allows structured and semi-structured data to be shared across different applications. To transcribe the CR aggregated data in terms of RDF, the column names translate to the predicates of the RDF triples, with the row identifier (or primary key) forming the subject and the data values in the columns forming the objects. The predicates are essentially the concatenated references to the metadata concepts of the ISO/IEC 11179 model described in section 5. In Figure 4 for example, the predicate encr:personWithCancer_TumourPrimarySite_TumourCodeECISv1 is the URI associated with the CDE similar to that described earlier with OC of "Person with cancer" and Property of "primary site of cancer". In this CDE however, the VD "TumourCodeECISv1" would have the list of possible values as defined currently by ECIS [44].
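The transcription rule (row identifier as subject, column name as predicate, cell value as object) can be sketched in a few lines of Python that emit N-Triples. The namespace URI, the column names, and the year predicate are hypothetical; only the cancer-site predicate name is taken from the text.

```python
# Hypothetical namespace standing in for the "encr:" prefix.
ENCR = "http://example.org/encr#"

# Mapping from local column names (invented) to CDE-based predicate names.
COLUMN_TO_PREDICATE = {
    "cancer_site": "personWithCancer_TumourPrimarySite_TumourCodeECISv1",  # from the text
    "year": "hypotheticalYearPredicate",  # placeholder, not a real CDE name
}


def row_to_ntriples(row_id, row):
    """Turn one table row into N-Triples lines: subject = row, predicate = column."""
    subject = f"<{ENCR}row/{row_id}>"
    lines = []
    for column, value in row.items():
        predicate = f"<{ENCR}{COLUMN_TO_PREDICATE[column]}>"
        # Escape backslashes and quotes so the value is a valid plain literal.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} {predicate} "{escaped}" .')
    return lines


triples = row_to_ntriples(1, {"cancer_site": "C34", "year": 2016})
print("\n".join(triples))
```

Each row of the aggregated table therefore yields as many triples as it has columns, all sharing the same subject.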

A tentative full RDF definition of the subjects, predicates, and objects used in this PBPR example is provided in [46] . 

The remaining five predicates in Figure 4 refer to the other CDEs needed for describing the data associated with the ENCR aggregated core data set, 

ISO/IEC 11179 defines a data element as a "unit of data that is considered in context to be indivisible", where context is defined as "the circumstance, purpose and perspective under which an object is defined or used" [31] . The standard provides the example of a telephone number that may be considered indivisible in one context but divisible in another (where telephone numbers need to be divided into country code, area code, and local number). To all intents and purposes, the core data set can be considered indivisible for the purpose of making it searchable and accessible as a whole. Not only are the core data sets standardised and harmonised within any given PBPR domain, but an individual field within the aggregated data set is not meaningful without explicit reference to the values of all the other fields in the same row. The way in which the core data set will eventually be used forms a higher-level context outside our immediate concern and since our interest is in allowing access to and retrieval of the core data sets as a whole within each PBPR domain, we specifically define a CDE to represent a registry's aggregated core data set.

This marks a slight departure from the aim of [30] where the purpose was to extract the value of specific CDEs associated with the EHRs of specific patients.

The alternatives (none of which precludes the above) would be: (a) to specify CDEs representing each individual record within the aggregated core data set, e.g. the number of incident, male (at birth), lung cancer cases aggregated within the age range 55-59 years diagnosed in the year 2016. This however would require a major initial outlay of effort in defining the whole set of CDEs (for each individual tumour type, age bracket, indicator type, sex, and year) as well as complicating the task of reconstituting the entire aggregated core data set for users requiring it; or (b) to provide a user interface for selecting the particular fields of interest within the data set and, on the basis of these choices, to run a script on the data set to return the relevant result. This option would also require more initial effort, although such an interface would also be useful in accessing individual record data (discussed in Section 7.3).

Following a similar semantic-MDR schematic representation to that provided in [30] , Figure 5 illustrates the definition of a proposed CDE for the ENCR aggregated core data set with semantic links through the LOD cloud.

The CDE is the association of an OC that represents the concept of a population-based ENCR registry with a Property referring to the feature of an aggregated core data set, and a VD that specifies the form of the data element (in this case RDF-formatted text).

The Object Class of the CDE is annotated with a concept of a population-based patient registry (via an association with a classification scheme item) through the SKOS (Simple Knowledge Organization System) [47] semantic relation "broader" to indicate that a PBPR is related to a population-based ENCR registry but broader in context.

Figure 5: Semantic links of a CDE and its Property inside the ENCR semantic MDR. The Property (P) is annotated through the SKOS mapping property "exactMatch" to indicate that the CR core data set is an aggregated core data set of a population-based patient registry. The Object Class (OC) is annotated through the SKOS semantic relation "broader" to indicate that a PBPR is related to an ENCR registry but is broader in context. The CDE has an "Extraction Specification", which in this example is a SPARQL script that is defined in Figure 6.

Figure 6: SPARQL script to retrieve an aggregate core data set from local CRs. The <<localRDFfile>> tag is a generic tag that is overwritten by the URI of the RDF graph containing the local data set. The latter is returned after searching the federated semantic MDR framework for links to the "ENCRreg.AggregatedCoreDataSet.Text" CDE.

The CDE also has an extraction specification specified with a SPARQL script (described in Figure 6 ) that can be used to retrieve ENCR-conformant aggregated core data sets from local CR MDRs.
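The substitution of the generic <<localRDFfile>> tag can be illustrated with a short sketch. The SPARQL text below is a hypothetical stand-in and not the script of Figure 6; the graph URI is invented, and the sketch assumes that the tag is replaced by the URI wrapped in angle brackets.

```python
PLACEHOLDER = "<<localRDFfile>>"

# Hypothetical script used only to show the substitution step.
SCRIPT_TEMPLATE = """SELECT ?row ?predicate ?value
FROM <<localRDFfile>>
WHERE { ?row ?predicate ?value }"""


def bind_local_graph(template, graph_uri):
    """Overwrite the generic tag with the URI of the local RDF graph."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no <<localRDFfile>> tag")
    # Reject characters that cannot appear inside an IRI reference.
    if any(ch in graph_uri for ch in '<>" '):
        raise ValueError("graph URI contains characters not allowed in an IRI reference")
    return template.replace(PLACEHOLDER, f"<{graph_uri}>")


# Invented graph URI, as might be returned by a search of the federated MDRs.
query = bind_local_graph(SCRIPT_TEMPLATE, "http://example.org/registry-x/core.rdf")
print(query)
```

The same template can then be bound in turn to the graph URI of each registry linked to the CDE.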

Executing the script in Figure 6 with [...] Object Class (OC) is annotated through the SKOS mapping property "exactMatch" to indicate that the local registry is an ENCR registry, which is itself mapped to the PBPR OC via the SKOS semantic relation "broader" (cf. Figure 5). The Property (P) is also annotated through the SKOS mapping property "exactMatch" to indicate that the CR aggregated full data set is an aggregated European harmonised full data set. The CDE has an "Extraction Specification" that, dependent on local decisions, could be a SPARQL or SQL script to return the data set in a way similar to that described for the aggregated core data set.

The core data sets, as useful as they are, hold only a fraction of the data potentially available. Access to the full set of a registry's data variables would not only provide users with much richer data sets but also serve, by wider use of the data, to accelerate the data-harmonisation process.

Due to the currently limited degree of harmonisation of the full variable sets for many PBPRs, the CDEs of the non-harmonised variables and the extraction specifications for the full data sets would need to be provided and maintained by the local registries until such time as the variables became harmonised. It is perhaps important to add that whereas the semantic linkages of the CDEs would provide comprehensive descriptions of the CDEs themselves as well as their potential relation to standard terminologies and dictionaries, this may still not be sufficient to provide users with a full understanding of the data paradigm and the interdependency of the data variables themselves. For a higher-level view, it may be necessary to provide a data model and/or ontology describing the data domain. The description of such models can however be integrated into the semantic MDR framework using the mechanism of the Classification Scheme described earlier. As long as the data model/ontology were accessible via URIs, the DEC (comprising the local registry OC and the aggregated full data set Property) for the aggregated full data set could be classified by CSIs that would point to the associated URIs. The higher-level descriptions would then be accessible along with all the other semantic links once the URI of the CDE for the aggregated full data set were dereferenced.

Access to all the data variables at record-level would allow the greatest use and value of patient-registry data, but requires explicit patient consent under the EU's recent General Data Protection Regulation (GDPR) [48]. The record-level data is considered sensitive data even though it is generally pseudonymised through the recoding of the patient-identity field.

It should be re-emphasised that the ultimate reason why epidemiological studies require access to individuals' data is to select the relevant cohorts of patients for testing a particular research hypothesis. Once the cohort is created, the analysis generally proceeds without further reference to individual patients. With this concept in mind, there are potentially two ways in which the data-sensitivity aspect of record-level data might be relaxed to allow data users more straightforward access to the underlying data:

1. By allowing users to specify the exact criteria for aggregating data. As an example, a user could ask the registry for the group of patients with survival less than a certain length of time suffering from a given cancer type. As long as the number of corresponding patients were greater than a pre-defined minimum to prevent possible identification, the returned data set would be an aggregated measure and therefore essentially anonymous. This is essentially the scenario described in section 4.4.

2. By limiting the number of data fields. Minimising the number of data fields complicates the task of identifying a particular individual. This could be accomplished using a SPARQL/SQL front-end allowing users to specify search criteria based on a number of specific predicate/column names up to a permissible maximum number. Notwithstanding, appropriate measures would have to be in place to prevent successive calls on the same data from reconstructing the complete set of variables for any given record.
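The first of these two options can be sketched as follows. This is only an illustration under stated assumptions: the table name `cases`, its columns, the threshold value `MIN_COHORT_SIZE`, and the demo records are all hypothetical and are not taken from any registry.

```python
import sqlite3

# Hypothetical disclosure threshold; a real value would be set by registry policy.
MIN_COHORT_SIZE = 5


def aggregated_count(conn, cancer_type, max_survival_months):
    """Return the cohort size, or None when the cohort is too small to release."""
    row = conn.execute(
        "SELECT COUNT(*) FROM cases "
        "WHERE cancer_type = ? AND survival_months < ?",
        (cancer_type, max_survival_months),
    ).fetchone()
    count = row[0]
    # Release the aggregate only if it exceeds the pre-defined minimum.
    return count if count > MIN_COHORT_SIZE else None


def demo_connection():
    """Build an in-memory table of made-up records, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cases "
        "(patient_id TEXT, cancer_type TEXT, survival_months INTEGER)"
    )
    rows = [(f"p{i}", "C50", 6 + i) for i in range(8)]  # survival 6..13 months
    rows += [("q1", "C61", 3), ("q2", "C61", 40)]
    conn.executemany("INSERT INTO cases VALUES (?, ?, ?)", rows)
    return conn
```

With the demo data, a request for `C50` cases surviving less than 12 months matches six records and is released, whereas the same request for `C61` matches a single record and is suppressed.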

Data-access procedures for both these scenarios could in principle be automated since the results returned are arguably anonymised data. Requests of this nature could possibly be constructed via a similar type of interface to that described in [32] albeit with extra functionality for handling service negotiation for data retrieval through a firewall.
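The field-limiting scenario, including a guard against successive calls, might look like the sketch below. It is a deliberately simple assumption rather than a complete disclosure-control method: the column names, the limit `MAX_FIELDS`, and the per-user cumulative budget are all hypothetical.

```python
# Hypothetical limit and column whitelist; real values are a policy decision.
MAX_FIELDS = 3
ALLOWED_FIELDS = {"cancer_type", "stage", "sex", "age_group", "survival_months"}


class FieldBudget:
    """Track which columns each user has already retrieved."""

    def __init__(self, max_fields=MAX_FIELDS):
        self.max_fields = max_fields
        self._released = {}  # user id -> set of columns released so far

    def authorise(self, user, fields):
        """Allow the request only if the user's cumulative set stays in budget."""
        requested = set(fields)
        unknown = requested - ALLOWED_FIELDS
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        cumulative = self._released.get(user, set()) | requested
        if len(cumulative) > self.max_fields:
            return False  # would let the user piece together too many variables
        self._released[user] = cumulative
        return True


def build_query(fields):
    """Build a SELECT over whitelisted columns of a hypothetical 'cases' table."""
    requested = set(fields)
    if not requested <= ALLOWED_FIELDS:
        raise ValueError("only whitelisted columns may be selected")
    return "SELECT " + ", ".join(sorted(requested)) + " FROM cases"
```

Counting fields cumulatively per user is one straightforward way of stopping repeated requests from rebuilding a full record; it does not address collusion between users, which would need additional measures.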

Use case 5 lies somewhat outside the immediate focus of this paper, but it addresses a topic that could bring significant cost-savings to high-resolution studies. Particularly relevant is the work of Sinaci and Laleci Erturkmen [30], in which they developed the semantic MDR framework for EHRs and demonstrated how it is possible to retrieve data for a particular patient held in various clinical systems.

IHE-RFD and FHIR could potentially also provide alternative solutions for this task. Work is in progress to assess the cost-benefit of implementing FHIR in the clinical registry world [49]. Whereas FHIR could greatly facilitate the task, it would require widespread uptake and even then may not solve the issue of data stretching back many years held in legacy systems.

The advantage of the semantic MDR framework is that it does not require adherence to any one single standard. It would however require effort to establish the mapping to the various clinical sources for each local registry. Nor could the mappings be easily duplicated in other registries given the wide variety of differences between the various regional and national health infrastructures, processes, and data constructs.

Within Europe, a number of initiatives in the context of the EU cross-border healthcare directive [50] have been undertaken to improve interoperability of patient registries.

One of the broader initiatives was the PARENT joint action [28]. PARENT also piloted a "registry of registries" (RoR), essentially foreseen as a web portal providing reliable and up-to-date information about European patient registries' metadata [51]. Furthermore, COEUS derives the predicates of the RDF triples from the column names of the underlying data sources and then maps them to the relevant predicates in a given ontology, which can introduce potential inconsistencies, as discussed in [61]. In contrast, the federated semantic MDR framework has these mappings already established within the local MDRs and linked at the different conceptual levels of a data element through the classification scheme associations, allowing greater flexibility in semantic searches. However, some of the shared similarities could potentially be used to provide the semantic mappings to access data elements between the frameworks. Full alignment would be possible by defining the CDEs in terms of ISO/IEC 11179 and then using the semantic MDR framework to link the CDEs to the associated semantic mappings created using COEUS.

Overcoming the barriers to secondary data usage of PBPRs is an important goal. PBPRs provide a rich source of summarised health data in well-defined populations stretching back many years.

Given the complex array of healthcare infrastructures and health data systems and the need still to interface with legacy systems, it is unlikely that any single health data standard will solve all the interoperability issues.

The intention of this work has been to show at a practical level how the federated semantic MDR framework might provide an elegant solution for addressing many aspects of the interoperability challenges facing PBPRs. The framework is able to operate across standards enabling data linkage between disparate systems. It can provide semantic linkage across heterogeneous PBPR domains and is not disruptive. It would also encourage federation of data and thereby remove the need for centralised data collection processes.

Its implementation is however not without certain obstacles.

In order to function across PBPR domains, it would be important to agree some Mapping also suffers the drawback of potential loss of information -for example when a data element of broader scope is mapped down to one of lesser scope.

The framework avoids such loss of information by using SKOS mapping concepts, in which the semantics of these broader or narrower relationships are retained, and it does not force a one-to-one mapping between data elements where none exists. The latter is an important aspect since it furnishes data users with a full picture of the differences between CDEs in different data sets, thereby giving them the information needed to make an informed decision on how to compare or integrate the data. It does, however, put the onus on the data users to perform the higher-level mapping needed to integrate non-harmonised CDEs from several data sets, which in practice may be difficult to accomplish. A dedicated application is likely to be needed to help marshal the data from the various sources. As an initial approach, if the central coordinating entities of the patient-registry domains store the aggregated core data sets as a type of proxy for the local registries, then the central semantic MDRs of the different PBPR domains could service all the requests to retrieve the individual core data sets. However, to unlock the full power of the registry data, each local registry would also need to set up its own semantic MDR and RESTful interface to handle requests on the local registry. As a result, all data sets would reside on the servers of the local patient registries without the need for any central-level repository; furthermore, automatic access to non-sensitive aggregated full variable-set data would then be possible.
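How SKOS mapping relations preserve the direction of a broader or narrower relationship can be shown with a small sketch. The three SKOS property URIs are those of the W3C SKOS vocabulary; the CDE URIs, the mapping triples, and the wording of the advice strings are hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EXACT_MATCH = SKOS + "exactMatch"
BROAD_MATCH = SKOS + "broadMatch"    # the object is broader than the subject
NARROW_MATCH = SKOS + "narrowMatch"  # the object is narrower than the subject

# Reading a mapping triple in the opposite direction flips broad and narrow.
INVERSE = {
    EXACT_MATCH: EXACT_MATCH,
    BROAD_MATCH: NARROW_MATCH,
    NARROW_MATCH: BROAD_MATCH,
}

# Hypothetical CDE URIs from three registries.
REG_A_TOPO = "http://example.org/registry-a/cde/topography-code"
REG_B_SITE = "http://example.org/registry-b/cde/cancer-site-group"
REG_C_TOPO = "http://example.org/registry-c/cde/topography"

MAPPINGS = [
    (REG_A_TOPO, BROAD_MATCH, REG_B_SITE),  # site groups are coarser than codes
    (REG_A_TOPO, EXACT_MATCH, REG_C_TOPO),
]


def mapping_relation(source, target, mappings):
    """Return the SKOS relation from source to target, or None if unmapped."""
    for subject, predicate, obj in mappings:
        if (subject, obj) == (source, target):
            return predicate
        if (subject, obj) == (target, source):
            return INVERSE[predicate]
    return None


def integration_advice(source, target, mappings):
    """Translate the mapping relation into a hint for the data user."""
    relation = mapping_relation(source, target, mappings)
    if relation == EXACT_MATCH:
        return "directly comparable"
    if relation == BROAD_MATCH:
        return "source values can be rolled up to the broader target"
    if relation == NARROW_MATCH:
        return "target is more specific; source cannot be refined without extra data"
    return "no mapping asserted"
```

Because the relation is kept rather than flattened to an equivalence, a user (or a marshalling application) can tell that registry A's codes may be aggregated to registry B's site groups but not the other way round.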

In view of the fact that the implementation of these steps will share many commonalities between registries and also for ease of rolling out such a framework encompassing all population-based registries, it would be worthwhile to prototype the whole concept on one patient-registry domain and, in so doing, create an implementation manual for other domains to follow.

The CR domain would serve as a good starting point. The CR domain is well established and comprises over 200 individual registries. Currently the core data sets are collected centrally and thereafter cleaned and aggregated prior to being made available on the ECIS website [43]. The aim is eventually to eliminate the central data-collection process altogether, thereby avoiding extra overheads and delays as well as the need for retaining copies of data sets with all the consequent maintenance and data-integrity issues. One of the hurdles to overcome before this becomes a reality, however, concerns the data-validation operation. Data validation is necessary at the central level to ensure all the data sets conform to a similar degree of data quality such that they can be compared more accurately. Work is in progress to provide open-source data-validation software tools based on a standard ENCR data model using an ontological approach [62], which would allow a federated approach also to the data-validation process. One acid test will be to ascertain whether these tools, in conjunction with the framework itself, provide the necessary robustness to devolve the current prerequisite central processes to the local level. In any case, self-regulation strongly motivates conformance to standard procedures, and such a process could well be encouraged via the allocation of data-quality stamps to distinguish between different degrees of quality of data sets. Where data is seen to be essential, more effort will be given to ensuring compliance with standard practices.

With a view to overcoming the issues facing PBPRs regarding secondary usage of data and data linkage across heterogeneous patient registries, the federated semantic MDR framework provides a powerful, versatile, and, more importantly, non-disruptive solution. The framework maps local metadata to standard metadata descriptions and links their components semantically via knowledge organisation system ontologies and terminology systems without enforcing compliance with any one common data model. In a world where data has long been collected and managed with local contexts in mind, this is a critical aspect towards allowing secondary data usage without requiring fundamental changes to existing data sets or local data-collection practices.

A major advantage of the framework lies in the fact that it is not a disruptive technology but rather provides the means, via the integration of a number of powerful tools and standards, to link data that would otherwise remain fragmented.

Whereas the implementation is not cost-neutral (a number of elements need to be established and thereafter maintained) and registries already contending with limited resources would undoubtedly require support, the following points must be borne in mind:

• Population-based patient registries contain valuable and important data stretching over many years. The value of the data may be gauged from all the previous initiatives and endeavours to make them interoperable. If a registry is established, it makes inherent sense to ensure the data it collects are interoperable with those of other registries to provide extra value;

• Not agreeing a common framework only postpones the problem of data inter-linkage to some future date. Data inter-linkage will always depend on semantic description of the data and, where this has not been considered at an early stage, it will have to be done later on and at potentially greater cost;

• The framework can be implemented in degrees without breaking any of the underlying data processes. Metadata is not changed but rather mapped to standard metadata descriptions. Moreover, focusing initially on access to core data sets held centrally by each PBPR domain allows a quick win that can be rolled forward gradually to extend accessibility to other data variables;

• The framework provides a clear model and set of procedures for guiding the establishment of new patient registries and patient-registry domains in order to make them compliant from the outset and thereby save future effort in making them interoperable;

• The issues raised within the PARENT guidelines [3] relating to registry data re-use (e.g. data compatibility and comparability; data exchange; mapping of classification codes; and data semantics) are all addressed by the functionality provided by the federated semantic MDR framework. In particular, metadata will be described in a formal manner, removing ambiguities and duplication of terms. Furthermore, it will be described in machine-readable terms and conform to the ISO/IEC 11179 metadata registry standard [31];

• Within the federated semantic MDR framework, any overarching registration function (such as PARENT's concept of an RoR) would be redundant: all the metadata is already registered and linked in the framework and can therefore be browsed on a patient-domain basis using a tool similar to the one described in [32];

• The framework would eventually negate the need for any central collections of data for data-validation and data-cleaning needs, thereby avoiding further resource-intensive operations;

• With such a framework in place, attention and resources can be directed to developing user-interface tools for facilitating browsing, searching, marshalling, and fusing data retrieved from multiple data sources.

Registries for Evaluating Patient Outcomes: A User's Guide

Categorizing the world of registries

Methodological guidelines and recommendations for efficient and rational governance of patient registries, PARENT joint action (cross-border Patient Registries iNiTiative)

What is a population-based registry?

National Institutes of Health, National Cancer Institute, SEER Training Modules, Types of Registries

Responding to the challenge of cancer in Europe. Institute of Public Health of the Republic of Slovenia

A framework for evaluation of secondary data sources for epidemiological research

Comparability of stage data in cancer registries in six countries: Lessons from the International Cancer Benchmarking Partnership

Data quality and quality control of a population-based cancer registry. Experience in Finland

Ch.4 Health Care Data Standards

Integrating the Healthcare Enterprise (IHE), IHE Profiles

HL7 International, C-CDA (HL7 CDA R2 Implementation Guide: Consolidated CDA Templates for Clinical Notes -US Realm)

The Relationship between FHIR and other HL7 Standards

Health Level Seven (HL7), Introducing HL7 FHIR Release 4

IHE Cross-Enterprise Document Sharing

Classification of diseases ICD

A review of medical terminology standards and structured reporting

Observational Health Data Sciences and Informatics (OHDSI) forum, International Classification of Diseases for Oncology

Addressing the Data Linking Challenges: Interviewing for Best Practices in Patient Registry Interoperability

Common data elements metadata, cystic fibrosis

Common data elements metadata

European Commission, Common data elements metadata, congenital anomalies

European best information through regional outcomes in diabetes (EUBIROD), BIRO data elements

Common data elements metadata, rare diseases

A proposal on cancer data quality checks: one common procedure for European cancer registries (version 1.1), JRC Technical Report. Publications Office of the European Union

Union for International Cancer Control's (UICC), TNM classification of malignant tumours

European Commission Consumers, Health, Agriculture, and Food Executive Agency (CHAFEA) Health Programmes Database, Cross-Border Patient Registries Initiative

A federated semantic meta-data registry framework for enabling interoperability across clinical research and care domains

ISO/IEC 11179: Information technology -Metadata Registries (MDR) Parts 1-7, ISO Standards. International Organization for Standardization

Postmarketing safety study tool: A web based, dynamic, and interoperable system for postmarketing drug surveillance studies

IHE Data Element Exchange (DEX) profile

METeOR: Metadata online registry

World Health Organization, International statistical classification of diseases and related health problems -10th revision (ICD-10 Version

International classification of diseases for oncology

RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation, World Wide Web Consortium

W3C SPARQL Working Group, 2013. SPARQL 1.1 Overview. W3C Recommendation. World Wide Web Consortium

The FAIR guiding principles for scientific data management and stewardship

European cancer information system (ECIS)

RDF 1.1 Turtle -Terse RDF Triple Language. W3C Recommendation. World Wide Web Consortium

RDF representation of the ECIS data set, v1, OSF

SKOS Simple Knowledge Organization System Reference, W3C Recommendation

Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)

Improving Interoperability between Registries and EHRs

Directive 2011/24/EU on the application of patients' rights in cross-border healthcare

Biobanking and biomolecular resources European research infrastructure (BBMRI-ERIC)

Minimum information about biobank data sharing (MIABIS) standards

An integrated platform connecting registries, biobanks and clinical bioinformatics for rare disease research. Project Fact Sheet. European Commission

European platform on rare disease registration

European best information through regional outcomes in diabetes (EUBIROD)

European cardiovascular indicators surveillance set (EUROCISS)

Europe against Cancer: Optimisation of the Use of Registries for Scientific Excellence in research

European Cystic Fibrosis Society patient registry (ECFSPR)

COEUS: "semantic web in a box" for biomedical applications

Linked registries: Connecting rare diseases patient registries through a semantic web layer

An ontology-based approach for developing a harmonised data-validation tool for European cancer registration