Guiding Principles for Findable, Accessible, Interoperable and Re-usable Data Publishing version b1.0


FAIR PRINCIPLES

1. Preamble

In the eScience ecosystem, the challenge of enabling optimal use of research data and methods is a complex one with multiple stakeholders: researchers wanting to share their data and interpretations; professional data publishers offering their services; software and tool-builders providing data analysis and processing services; funding agencies (private and public) increasingly concerned with proper Data Stewardship; and a Data Science community mining, integrating and analysing the output to advance discovery. Computational analysis to discover meaningful patterns in massive, interlinked datasets is rapidly becoming a routine research activity. Providing machine-readable data as the main substrate for Knowledge Discovery, so that these eScientific processes run smoothly and sustainably, is one of the Grand Challenges of eScience.

In January 2014, representatives of a range of these stakeholders came together at the request of the Netherlands eScience Center and the Dutch Techcentre for the Life Sciences (DTL) at the Lorentz Center in Leiden, The Netherlands, to think and debate about how to further enhance this ecosystem. From these discussions, the notion emerged that, through the definition and widespread support of a minimal set of community-agreed guiding principles and practices, data providers and data consumers - both machine and human - could more easily discover, access, interoperate, and sensibly re-use, with proper citation, the vast quantities of information being generated by contemporary data-intensive science. These simple principles and practices should enable a broad range of integrative and exploratory behaviors, and support a wide range of technology choices and implementations, just as the Internet Protocol (IP) provided a minimal layer - the "waist" of an hourglass - that enabled the creation of a vast array of data provision, consumption, and visualization tools on the Internet.

2. Context

It is important to note that this document is a general 'guide to the FAIRness of data', not a 'specification'. In compiling the FAIR guiding principles for this document, technical implementation choices have been consciously avoided. The minimal [FAIR Guiding Principles] are meant to help implementers of FAIR data environments check whether their particular implementation choices are indeed rendering the resulting data FAIR. In the explanatory notes and annexes we give some non-binding explanation and guidance on a FAIR view of data and on what constitutes a repository of FAIR data (a 'Data FAIRport').

3. FAIR for machines as well as people

In eScience, two clearly separated substrates for knowledge discovery can be distinguished:

  1. The actual data, which as a rule are beyond human intellectual capacity to analyse, and

  2. The 'Explicitome' (everything we have already made explicit in text, databases and any other format to date).

The essence of eScience is that either functionally interlinked existing data, or the combination of such data with newly generated, 'relatively small' datasets, leads to new insights. A crucial step is machine-assisted 'pattern recognition' in the data, followed by 'confirmational' human study of the Explicitome to rationalise patterns and determine testable hypotheses. Obviously this is a cyclical process by nature, but computational analysis of massive, originally dispersed and variable datasets is a crucial phase in any eScience process.

Recognizing this new grand challenge of contemporary science, the stakeholder group, at its inaugural meeting ['Jointly Designing a Data FAIRport'], coalesced around four desiderata that a modern data publishing environment should provide to support both manual and automated deposition, exploration, sharing, and use, by machines as well as humans.

These are summarized as the FAIR "Facets":

  • Data should be Findable
  • Data should be Accessible
  • Data should be Interoperable
  • Data should be Re-usable

These FAIR Facets are obviously related, but technically somewhat independent from one another, and may be implemented in any combination, incrementally, as data providers and FAIRports evolve to increasing degrees of FAIR-ness. As such, the barrier to entry for FAIR data producers, publishers and stewards is kept as low as possible, with providers being encouraged to gradually increase the number of FAIR Facets they comply with.

Therefore, the purpose of this document is neither to define nor to suggest any technological implementation for any of these facets, but rather to define the characteristics, norms, and practices that data resources, tools, and infrastructures should exhibit in order to be considered 'FAIR'. FAIR-ness can be achieved with a wide range of technologies and implementations.

 

FAIR Data Guiding Principles

For all parties involved in Data Stewardship, the facets of FAIRness described below provide incremental guidance on how they can benefit from moving toward the ultimate objective: having all concepts referred to in Data Objects (Metadata or the Data Elements themselves) unambiguously resolvable for machines, and thus also for humans.

By adopting all FAIR facets, Data Objects become fully Findable, Accessible, Interoperable, and Reusable.

Definitions

  • A Concept is any defined 'unit of thought' to which we refer in our digital formats [1]
  • A Data Object is defined for the purpose of the principles below as: an Identifiable Data Item with Data Elements + Metadata + an Identifier [2] (sketched in code below)
  • When we use the term (Meta) data here, we intend to indicate that the principle is true for Metadata as well as for the actual, collected Data Elements in the Data Object, but that the principle in question can be independently implemented for each of them [3].
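To make the Data Object definition concrete, the following minimal sketch (in Python) shows the three ingredients side by side: data elements, metadata, and an identifier. All field names and values are illustrative assumptions, not prescriptions of the principles.

    # Minimal sketch of a Data Object: identifier + metadata + data elements.
    from dataclasses import dataclass, field

    @dataclass
    class DataObject:
        identifier: str                                     # unique, persistent identifier
        metadata: dict = field(default_factory=dict)        # machine-actionable metadata
        data_elements: list = field(default_factory=list)   # the actual, collected data

    obj = DataObject(
        identifier="http://example.org/dataobject/42",      # hypothetical PID
        metadata={"title": "Example expression data"},
        data_elements=[{"gene": "BRCA1", "expression": 2.4}],
    )
    print(obj.identifier, obj.metadata["title"])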

FAIR Guiding Principles

1. To be Findable, any Data Object should be uniquely and persistently identifiable [4].
1.1. The same Data Object should be re-findable at any point in time; thus Data Objects should be persistent, with emphasis on their metadata [4 and JDDCP 4 and JDDCP 6].
1.2. A Data Object should minimally contain basic machine-actionable metadata that allows it to be distinguished from other Data Objects [see JDDCP 5].
1.3. Identifiers for any concept used in Data Objects should therefore be Unique and Persistent [5 and JDDCP 4 and JDDCP 6].
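As a non-binding illustration of principle 1.2, the check below tests whether an object carries an identifier and a minimal set of distinguishing metadata. The particular required fields are an assumption made for this example only; the principles do not prescribe any such set.

    # Hypothetical minimal metadata set for distinguishing Data Objects.
    REQUIRED_FIELDS = {"title", "creator", "date"}

    def is_findable(identifier: str, metadata: dict) -> bool:
        """True if the object has an identifier plus the assumed minimal metadata."""
        return bool(identifier) and REQUIRED_FIELDS.issubset(metadata)

    print(is_findable("http://example.org/dataobject/42",
                      {"title": "Expression data", "creator": "A. Researcher",
                       "date": "2014-01-01"}))   # True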

2. Data is Accessible in that it can always be obtained by machines and humans
2.1. Upon appropriate authorization [6]
2.2. Through a well-defined protocol [7 and JDDCP 5]
2.3. Thus, machines and humans alike will be able to judge the actual accessibility of each Data Object.
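A sketch of principle 2, assuming a Data Object whose identifier resolves over plain HTTP: access uses a well-defined, open protocol, and authorization, where required, is supplied through that same protocol. The endpoint and bearer token are placeholders, not a prescribed mechanism.

    import urllib.request

    request = urllib.request.Request(
        "http://example.org/dataobject/42",             # hypothetical resolvable PID
        headers={"Authorization": "Bearer <token>",     # authorization (principle 2.1)
                 "Accept": "application/json"},         # ask for a machine-readable form
    )
    with urllib.request.urlopen(request) as response:   # well-defined protocol (2.2)
        print(response.status, response.read()[:100])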

3. Data Objects can be Interoperable only if:
3.1. (Meta) data is machine-actionable [8]
3.2. (Meta) data formats utilize shared vocabularies and/or ontologies [9]
3.3. (Meta) data within the Data Object should thus be both syntactically parseable and semantically machine-accessible [10]
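For principle 3.2, the fragment below keys metadata by globally unique terms from a shared public vocabulary (Dublin Core is used purely as an example), so that a consumer can recognize fields regardless of which repository a record came from. The values are illustrative.

    DCT = "http://purl.org/dc/terms/"   # shared vocabulary namespace (example)

    metadata = {
        DCT + "title":   "Example expression data",
        DCT + "creator": "http://orcid.org/0000-0002-1825-0097",        # example ORCID
        DCT + "license": "http://creativecommons.org/licenses/by/4.0/",
    }
    # Globally unique keys allow like-with-like mapping across repositories.
    print(metadata[DCT + "title"])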

4. For Data Objects to be Re-usable, additional criteria are:
4.1. Data Objects should be compliant with principles 1-3
4.2. (Meta) data should be sufficiently well-described and rich that it can be automatically (or with minimal human effort) linked or integrated, like-with-like, with other data sources [11 and JDDCP 7 and JDDCP 8]
4.3. Published Data Objects should refer to their sources with rich enough metadata and provenance to enable proper citation [see JDDCP 1-3].
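To illustrate principle 4.3, the record below points back to its source with a provenance field and a human-readable citation. The PROV-style property name and all identifiers are assumptions for the example, not mandated terms.

    derived_object = {
        "identifier": "http://example.org/dataobject/43",   # new Data Object
        "http://www.w3.org/ns/prov#wasDerivedFrom":
            ["http://example.org/dataobject/42"],           # its source Data Object
        "citation": "A. Researcher (2014). Example expression data. "
                    "http://example.org/dataobject/42",
    }
    print(derived_object["citation"])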

JDDCP (Joint Declaration of Data Citation Principles)

RDA DFT (Data Foundation and Terminology)

 


Annex 1: Explanatory notes to FAIR Guiding Principles

[1] We follow the definitions and arguments of the Ogden/Richards triangle of reference and theory of meaning for concept, symbol and meaning definitions: see http://en.m.wikipedia.org/wiki/Triangle_of_reference. The Concept itself is not a Digital Object, but any symbol referring to it in computers is a Digital Object. Lingual words, URIs, URLs and any other identifiers are all symbols referring to the concept.

[2] See an exemplar view on Data Objects in Annex 2.

We propose the term 'Data Object' to refer to the combination of data elements + their metadata + a unique identifier. These objects are arbitrarily complex and may appear in many forms and syntaxes.

[3] We explicitly recognize that repositories of Data Objects with FAIR metadata, whose Data Elements as such are not (yet) FAIR (i.e. not machine-readable, for instance pictures, video or recorded text), are highly valuable, but they should be distinguished from repositories of fully machine-readable, highly curated data elements (the latter obviously also with FAIR metadata attached). So FAIR metadata is a must-have, and FAIR data elements are the 'ultimate goal'.

[4] Persistence is an organizational property; effectively, it is an obligation, formal or informal, that an organization guarantees that something will be maintained. As such, the organization's persistence policy should be explicit and public. We propose that FAIRports clearly state their persistence guarantees and seek replication and back-up of their resources wherever possible.

[5] There are ongoing and fierce debates about what exactly constitutes a 'persistent' identifier. The acronym PID is consciously avoided here as it may carry connotations of proprietary implementations. We propose to allow many identifiers in FAIR data publishing environments, as long as each identifier refers uniquely to only one concept and the publisher provides a clear policy and description of the maximum achievable guarantee for persistent resolution of the identifier to the correct location/meaning. Obviously, 'locally' used identifiers that cannot be mapped automatically to community-adopted and publicly shared identifier schemes are not FAIR. A data publisher choosing a 'proprietary' identifier scheme will need to provide appropriate and correct mappings to public identifiers to be considered FAIR.
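A minimal sketch of this mapping requirement: a provider using a 'proprietary' identifier scheme publishes a table from its local identifiers to community-adopted public ones. All identifiers below are invented for illustration.

    # Hypothetical local-to-public identifier mapping table.
    LOCAL_TO_PUBLIC = {
        "LOCALGENE:0001": "http://identifiers.org/hgnc/HGNC:1100",
        "LOCALGENE:0002": "http://identifiers.org/hgnc/HGNC:1101",
    }

    def to_public(local_id: str) -> str:
        """Resolve a local identifier to its public equivalent, if mapped."""
        try:
            return LOCAL_TO_PUBLIC[local_id]
        except KeyError:
            raise ValueError(f"{local_id} has no public mapping, so it is not FAIR")

    print(to_public("LOCALGENE:0001"))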

Organizations providing persistent identifiers (i.e. 'authorities') should clearly publish the policies that govern the persistence criteria of these identifiers. Such policies should be machine readable.
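No standard policy format is prescribed here; as one possible reading, a machine-readable persistence policy could be as simple as the JSON document below, with entirely assumed field names and placeholder URLs.

    import json

    policy = {
        "authority": "http://example.org",               # hypothetical identifier authority
        "identifier_scheme": "http://example.org/id/",
        "guaranteed_until": "2034-12-31",                # explicit, public guarantee
        "replication": ["http://mirror.example.net"],    # back-up locations
    }
    print(json.dumps(policy, indent=2))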

[6] Especially for commercial use of FAIR data, companies need a clear appreciation of, and legal position on, their ability to use data. Non-licensed data, although 'open' in the minds of most academics, will be avoided by most major companies due to legal risks. We appreciate that there are exceptions to full Open Access of data (for instance for patient privacy or intellectual property reasons). We therefore consider appropriate licensing of Data Objects (or even of individual data elements within them) to be key to FAIR data publishing.

Data Object licenses and conditions of use (academic and/or private/commercial) should be well described. Such licenses can be referred to with persistent identifiers and form part of the metadata in Data Objects. The FAIRport community will increasingly provide and recommend standard licenses to choose from, and strongly recommends publishing data in complete Open Access wherever possible. It is expected that most 'authorities' endorsing FAIRports will require that exceptions to Open Access be well-argued (see Annex 3).

(list of licenses) Jan Velterop/John Wilbanks.
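The fragment below sketches why licensing matters for automated re-use, per note [6]: a conservative harvesting pipeline (for instance at a company) ingests only Data Objects whose metadata carries an explicitly open license, and skips unlicensed data. The license list and field name are assumptions for the example.

    OPEN_LICENSES = {
        "http://creativecommons.org/licenses/by/4.0/",
        "http://creativecommons.org/publicdomain/zero/1.0/",
    }

    def may_harvest(metadata: dict) -> bool:
        """Conservative rule: harvest only explicitly open-licensed objects."""
        return metadata.get("license") in OPEN_LICENSES   # unlicensed data is skipped

    print(may_harvest({"license": "http://creativecommons.org/licenses/by/4.0/"}))  # True
    print(may_harvest({}))                                                          # False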

[7] Putting data 'on the web' is not enough. To be actually interoperable and reusable, Data Objects should not only be properly licensed; the methods to access and/or download them should also be well described, preferably fully automated, and based on well-established protocols.

[8] In eScience, machine-readability of data is essential. Machine-readable metadata is a conditio sine qua non for FAIRness. Having the actual data elements also machine-readable gives the Data Object a higher level of interoperability and makes functional interlinking and analysis in a broader context much easier, but it is not a pre-condition for FAIR data publishing. Some data elements, for instance images and 'raw data', cannot always be made machine-processable; being published with FAIR metadata is of very high value in its own right.

[9] When the use of community adopted and public terminology systems is not possible, for instance for reasons described in explanatory note 5, or because the Data Objects contain concepts that have not yet been described in any public vocabulary or ontology known to the provider, the provider should nevertheless try to create a term vocabulary of their own and publish it publicly and openly, preferably in a machine-readable form. The vocabulary or ontology that constrains each constrained data field should be unambiguously identified either by the field itself or by the associated Data Object metadata. For non-constrained fields, whenever possible the value-type of the field should be annotated using a publicly-accessible vocabulary or ontology. This annotation should be clear in the Data Object metadata.
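As one way to read this note, each constrained field can declare, in the Data Object metadata, the vocabulary or ontology that constrains it, while non-constrained fields are annotated with their value type. The field names and ontology URIs below are illustrative assumptions.

    field_annotations = {
        "tissue": {   # constrained field: declare the constraining ontology
            "constrained_by": "http://purl.obolibrary.org/obo/uberon.owl",
        },
        "comment": {  # non-constrained field: annotate its value type instead
            "value_type": "http://www.w3.org/2001/XMLSchema#string",
        },
    }
    print(field_annotations["tissue"]["constrained_by"])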

[10] Both the syntax and the semantics of the data models and formats used for (Meta) data in Data Objects should be easy for machines to identify and to use, parse or translate. As in the case of identifier schemes and vocabularies, a wide variety of data formats (ranging from URI-featuring spreadsheets, e.g. produced with RightField or OntoMaton, to rich RDF) can in principle be FAIR. Obviously, any parsing and translation protocol is error-prone, and the ideal situation is to restrict FAIR data publishing to as few community-adopted formats and standards as possible. However, if a provider can show that an alternative data model/format is unambiguously parsable to one of the community-adopted FAIR formats, there is no particular reason why such a format could not be considered FAIR. Some data types may simply not be 'capturable' in one of the existing formats, in which case perhaps only part of the data elements can be parsed. FAIRports will increasingly offer guidance and assistance in such cases.
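A toy example of 'unambiguously parsable to a community-adopted format': a URI-featuring spreadsheet (here a three-column CSV, with invented content) is mechanically converted into subject-predicate-object statements.

    import csv
    import io

    sheet = io.StringIO(
        "subject,predicate,object\n"
        "http://example.org/gene/1,http://purl.org/dc/terms/title,BRCA1\n"
    )
    triples = [(row["subject"], row["predicate"], row["object"])
               for row in csv.DictReader(sheet)]
    print(triples[0])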

[11] The metadata of a Data Object should be sufficiently rich that a machine or a human user, upon discovery, can make an informed choice about whether or not it is appropriate to use that Data Object in the context of their analysis. Metadata contained within the Data Object should inform the consumer about the license of the data elements; this metadata should be machine-readable to facilitate automated data harvesting while maintaining proper attribution. The metadata should likewise inform the consumer about any access-control policy, such that consumers can determine which components of the data they are allowed to access, and about the authentication protocol leading to access, if applicable.

Furthermore, in eScience, where pattern recognition in 'big', functionally linked or integrated data sets is becoming the norm, provenance is key. When a pattern emerges from the data analysis algorithms, rationalization and confirmational studies in the underlying data sources are a crucial next step. If the provenance linking the Data Elements to their original Data Object, and subsequently to the underlying resources (human-readable text, databases, raw data files etc.), is lost, researchers will not be able to track the evidence for what the pattern seems to suggest as a testable hypothesis.

Final note: We explicitly acknowledge that it is possible to implement any of these sub-facets without implementing all of them. Here we give some initial guidance on how to gradually improve the FAIR-ness of Data Objects.

Facet-I-syn: Metadata is provided in a format that can be parsed by a machine; i.e. that there is an open standard for the format against which reliable parsing code can be written

Metadata should refer to the schemata used

Facet-I-sem: Metadata takes advantage of shared controlled vocabularies or ontologies, allowing the mapping of metadata fields between disparate resources (regardless of their syntax in each of those repositories)

Metadata should refer to the vocabularies or ontologies used

Facet-I-data: Whenever possible, data should be provided in a format that can be parsed by a machine; i.e. that there is an open standard for the format against which reliable parsing code can be written

Data structures should be defined according to public, documented, and where possible machine readable, schemata.
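Bringing these Facets together, a metadata record that satisfies them could look like the sketch below: an openly specified syntax (JSON), an explicit reference to the schema it was written against, and references to the vocabularies it draws on. All URLs are hypothetical placeholders.

    import json

    record = json.loads("""{
      "schema": "http://example.org/schemas/dataobject-v1.json",
      "vocabularies": ["http://purl.org/dc/terms/"],
      "http://purl.org/dc/terms/title": "Example expression data"
    }""")
    print(record["schema"], record["vocabularies"][0])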


Annex 2: An exemplar modular view on Data and Data Objects

At the core of the FAIR data formatting and publishing process is a comprehensive view on what constitutes Data and how it is structured. The added-value (eScience) perspective of FAIR data is first and foremost 'FAIR for machines'. Human readability, as a 'derivative' of well-formatted and well-defined machine-readable data, is obviously crucial for final interpretation.

Actually, FAIR data will improve human readability: for instance, concept-denoting terms can be presented to human users in their own language, based on ARTA (Also Referred To As) tables that translate machine-resolvable identifiers into lingual terms.
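An ARTA table need not be complicated; a minimal sketch, with invented entries, is simply a map from a machine-resolvable identifier to labels in several human languages:

    ARTA = {  # Also Referred To As: identifier -> language-specific labels
        "http://example.org/concept/liver": {
            "en": "liver", "nl": "lever", "de": "Leber",
        },
    }
    print(ARTA["http://example.org/concept/liver"]["nl"])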

So we view data here initially in their 'digital format'. From that perspective, 'Data' and 'Metadata' differ only in 'what they represent' and 'what they are used for', not in their technical format. Finally, in eScience, 'software' dealing with the data is inseparable from the data itself, and for simplicity's sake we will treat 'code' as 'executable data' for the purpose of this brief document.

  • Data used by machines are intrinsically 'digital', and each Data Object (defined in the FAIR principles) is therefore a 'Digital Object' by nature.
  • One of the smallest Digital Objects in a FAIR data setting is a single Identifier referring to a concept (unit of thought), while the concept it denotes is in itself not a Digital Object [ref. the Ogden/Richards triangle; see FAIR principles, explanatory note 1].
  • Identifiers can be designed for computers or for people; in a FAIR data context we recommend minimally one machine-resolvable Persistent Identifier (PID) for each concept used in a Data Object.
  • Multiple PIDs and other IDs for the same concepts are a fact of life and thus accepted, but FAIR IDs must be guaranteed to map to only one concept.
  • Mapping Tables and Mapping Services to deal with multiple (P)IDs for concepts are thus accepted in FAIR data and should be provided where needed.
  • Data Elements are defined as the actual data, and are therefore practically, although not technically, distinct from their metadata.
  • One of the smallest possible 'Data Elements' is a single association between two concepts.
    • Each FAIR Data Object (even a simple assertion about a single association) should have a PID (for the Data Object as a whole) and a minimal set of metadata 'about' the actual Data Object.

    • Multiple identifiable data elements can share the same metadata and PID and form one FAIR Data Object (for instance a set of images, or a micro-array data set with hundreds of expression values for genes).

    • Individual Identifiable Data Elements can be separately used, integrated, cited and distributed as new Data Objects, with a new PID and carrying sufficient metadata from the original Data Object to be traceable back to it and citable in themselves or as 'derived from' the original, larger Data Object (see the sketch after this list).

    • Data Objects are thus 'modular' and 'recurrent' Digital Objects that can scale from a single association between two concepts to entire databases or workflows with many modules.

    • FAIR Data Objects can have rich or minimal, intrinsic and user-defined metadata (see picture 1), and they can have from one up to millions of separately identifiable data elements.
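The modularity described in this list can be sketched as follows (all identifiers invented): a Data Object holds several identifiable data elements under shared metadata and a shared PID, and one element is re-published as a new Data Object with its own PID while remaining traceable to its source.

    original = {
        "pid": "http://example.org/dataobject/100",
        "metadata": {"title": "Micro-array expression set"},
        "elements": {
            "http://example.org/element/1": {"gene": "BRCA1", "expression": 2.4},
            "http://example.org/element/2": {"gene": "TP53", "expression": 0.7},
        },
    }

    element_pid = "http://example.org/element/1"
    derived = {
        "pid": "http://example.org/dataobject/101",      # new PID for the new object
        "metadata": {**original["metadata"],             # carry sufficient source metadata
                     "derived_from": original["pid"]},   # traceable and citable
        "elements": {element_pid: original["elements"][element_pid]},
    }
    print(derived["metadata"]["derived_from"])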


Annex 3: What constitutes a data FAIRport? (BM, PG, MW)

As FAIR is not a trademark, we propose to leave the decision to 'endorse' repositories as FAIRports (metadata-only, or metadata + data, which can be separated) to 'authorities' such as ELIXIR nodes/the Hub, NIH or SciELO.

We propose to define a 'candidate' FAIRport as any machine-oriented data repository that:

  • Contains FAIR Data Objects (to be judged by the endorsing authority)
  • Provides these Data Objects under well-defined accessibility for Re-use
  • Has a full and open description of all technologies, controlled vocabularies and formats used.

We propose that Trusted Parties in each scientific discipline:

  • Define the 'authorities' for each 'semantic category' of concepts typically referred to in Data Objects in their discipline.
  • Define their minimal criteria to qualify Data Objects as FAIR
  • Review individual data FAIRports against these established criteria
  • Give a FAIR[Trusted party] stamp of approval to compliant FAIRports
  • Publish in Open Repositories (preferably FAIR themselves) what can be expected from FAIRports in their index and with their quality stamp.

We propose to consider the following 'levels' for FAIRports, or actually for the Data Objects contained in them (in other words, one FAIRport could contain Data Objects with different 'levels of FAIRness') (see figure).

Level 1: Each Data Object has a PID and intrinsic FAIR metadata (in essence 'static')

Level 2: Each Data Object has 'user-defined' (and updated) metadata giving rich provenance, in FAIR format, of the data: what happened to it, what it has been used for, what it can be used for, etc. This could also be seen as rich FAIR annotation.

Level 3: The Data Elements themselves in the Data Objects are 'technically' also FAIR, but not fully Open Access and not Reusable without restrictions (for instance patient data or proprietary data).

Level 4: The metadata as well as the data elements themselves are fully FAIR and completely public, under a well-defined license. (Non-licensed data considered 'public' by their owner will still be excluded from integration projects by, for instance, pharmaceutical companies.)

(Figure: Data as increasingly FAIR Digital Objects)


Annex 4: User Scenarios and links to sister initiatives

(adapted from Michel's and Jun's original contributions)

In data-driven science, researchers, but increasingly primarily machines, first of all need to find/discover data having features of interest, for which they will use links, metadata, and the actual data elements/contents.

Once found, machines need to be able to access/retrieve the data of interest (i.e. obtain a copy of the contents in some format). Next, for researchers to decide to give their computers the 'go-ahead' to re-use/analyze data of interest from the long-list retrieved from 'the web of data', they need easy access and easy workflow tools to (among others):

a. Access rich Metadata Information about the harvested Data Objects of interest

b. Answer a question using one dataset or a group of many

c. Aggregate datasets and perform a statistical analysis

d. Validate the correctness / authenticity of the data

e. Mirror/exchange data between repositories (sustainability through redundancy)

f. Repeat/reproduce data generation/analysis

g. Functionally link or integrate data in order to have a coherent view

h. Retrieve evidence at multiple levels to indicate support for a testable hypothesis

i. Cite entire Data Objects or individual data elements (where possible) for proper credit.

j. At any point in time, retrieve the 'cited data cluster' as it was at the time it was cited (for dynamically growing data sets, such as twitter feeds or patient blogs and side-effect records).

 

For all these eScience workflow steps (and many more could be imagined), the following features of proper data, as the main substrate for machine-assisted Knowledge Discovery, are needed (among others):

  • a richness of description (in machine readable format)
  • persistence (available when requested)
  • identifiers and citation schemes in place
  • accessibility - available in a variety of formats
  • interoperability - formats and standards/guidelines
  • prepared for functional interlinking and where needed integration
  • appropriate licensing of each data object
  • user control
  • reusability
  • provenance
  • quality measures
  • user-contributed content

The FAIR (Findable, Accessible, Interoperable and Re-usable) principles have been designed with these research workflow steps and concerns in mind:

  • to be findable (F) or discoverable, data and metadata should be richly described to enable attribute-based search.
  • to be broadly accessible (A), data and metadata should be retrievable in a variety of formats that are sensible to humans and machines, using persistent identifiers.
  • to be interoperable (I), the description of metadata elements should follow community guidelines that use an open, well-defined vocabulary.
  • to be reusable (R), the description of essential, recommended, and optional metadata elements should be machine-processable and verifiable; use should be easy, and data should be citable, to sustain data sharing and recognize the value of data.

(adapted from Jun, with reference to the JDDCP)

Data being FAIR is also a way of supporting the '7 Rs' that initially motivated the creation of Research Objects. The 7 Rs fit within the FAIR principles and the desired scientific and research activities in which Research Objects play a key role.

Reference: 7-R (v1): Why Linked Data is not enough for scientists (2012). DOI:10.1016/j.future.2011.08.004

  1. Reusable
  2. Repurposeable
  3. Repeatable
  4. Reproducible
  5. Replayable
  6. Referenceable
  7. Respectful

see also: http://www.scilogs.com/eresearch/more-rs-than-pirates/

We will elaborate on the implementation of the FAIR principles in sister activities that seek to support machine-friendly, high-quality and reproducible science, such as Research Objects, BioSharing, Force11, and FAIRdom (FAIR SB models). We see the FAIR principles as an overarching way to support many novel practices associated with eScience, data sharing and re-use, catering for data and the accompanying software, data capture practices in study design, multi-scale models, visualization, and proper data citation and altmetrics.