key: cord-0876241-aiywek57
authors: Alper, Brian S.; Flynn, Allen; Bray, Bruce E.; Conte, Marisa L.; Eldredge, Christina; Gold, Sigfried; Greenes, Robert A.; Haug, Peter; Jacoby, Kim; Koru, Gunes; McClay, James; Sainvil, Marc L.; Sottara, Davide; Tuttle, Mark; Visweswaran, Shyam; Yurk, Robin Ann
title: Categorizing metadata to help mobilize computable biomedical knowledge
date: 2021-05-09
journal: Learn Health Syst
DOI: 10.1002/lrh2.10271
sha: ed397d70f61632f15d5de241b498b63469f65ab6
doc_id: 876241
cord_uid: aiywek57

INTRODUCTION: Computable biomedical knowledge artifacts (CBKs) are digital objects conveying biomedical knowledge in machine‐interpretable structures. As more CBKs are produced and their complexity increases, the value obtained from sharing CBKs grows. Mobilizing CBKs and sharing them widely can only be achieved if the CBKs are findable, accessible, interoperable, reusable, and trustable (FAIR+T). To help mobilize CBKs, we describe our efforts to outline metadata categories to make CBKs FAIR+T. METHODS: We examined the literature regarding metadata with the potential to make digital artifacts FAIR+T. We also examined metadata available online today for actual CBKs of 12 different types. With iterative refinement, we came to a consensus on key categories of metadata that, when taken together, can make CBKs FAIR+T. We use subject‐predicate‐object triples to more clearly differentiate metadata categories. RESULTS: We defined 13 categories of CBK metadata most relevant to making CBKs FAIR+T. Eleven of these categories (type, domain, purpose, identification, location, CBK‐to‐CBK relationships, technical, authorization and rights management, provenance, evidential basis, and evidence from use metadata) are evident today where CBKs are stored online. Two additional categories (preservation and integrity metadata) were not evident in our examples. We provide a research agenda to guide further study and development of these and other metadata categories. CONCLUSION: A wide variety of metadata elements in various categories is needed to make CBKs FAIR+T. More work is needed to develop a common framework for CBK metadata that can make CBKs FAIR+T for all stakeholders.

Three key components of all DOs are content (in the form of a bit sequence), a unique identifier, and describable properties (eg, size in bits). 4, 24, 25 CBKs are often custom-built and incorporated into larger software applications in ways that make them difficult to identify, isolate, extract, and share. 26 However, we assume that all CBKs can be isolated and shared as independent DOs, depending on software design. 27 , 28 We further assume that isolating CBKs is a precursor to mobilizing them. Therefore, we do not consider applications (apps) or software services (APIs) that incorporate CBKs to be CBKs themselves. Instead, we view CBKs as the smaller DO components of apps and APIs that represent biomedical knowledge in concrete, machine-independent encodings or data structures. 4, 27 CBKs may either be standalone or be embedded within apps, APIs, information systems, or platforms.
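The three DO components named above can be sketched as a small data structure. This is a minimal illustration, not an implementation from the authors: the class name `DigitalObject`, the identifier format, and the example rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalObject:
    """A digital object: a unique identifier, content bits, and properties."""

    identifier: str  # unique identifier (this URN-like format is an assumption)
    content: bytes   # the bit sequence that carries the knowledge

    @property
    def size_in_bits(self) -> int:
        # A describable property derived directly from the content.
        return len(self.content) * 8


# Hypothetical CBK: a tiny rule serialized as JSON text.
rule = DigitalObject(
    identifier="urn:example:cbk:0001",
    content=b'{"if": "age >= 65", "then": "offer vaccine"}',
)
print(rule.identifier, rule.size_in_bits)
```

Keeping the object immutable mirrors the idea that a CBK can be isolated and shared as an independent unit, separate from any app or API that embeds it.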

We draw on multiple perspectives about different CBK types.

First, CBK types may reflect the structured machine-interpretable formats or languages used to represent their knowledge content (eg, JSON, propositional logic, or Python). 29 Second, CBKs may be distinguished by their place in a hierarchy of increasing CBK complexity, such as building on basic CBKs like terms and relationships and constructing increasingly complex composite CBKs such as decision trees, workflows, and plans. 29, 30 Third, it is clear from real-world examples that CBKs may also be typed according to their logic or purpose (eg, rule, predictive model, risk-scoring mechanism). To demonstrate and contextualize our ideas about different CBK types, we provide 12 examples of CBKs in our supplement (see Supplement).

In summary, we view CBKs as DOs that are concrete, distinct, shareable information content entities. 31, 32 Some CBKs represent and communicate knowledge as assertions with an evidential basis. In general, CBKs explicitly represent and convey biomedical knowledge that holds significance for an identified community. 1, 33 Their explicitness enables CBKs to be immediately processed or executed by digital computers. Because CBKs are increasingly important throughout biomedicine, there is a vast and diverse audience for this work to help mobilize CBKs. Mobilizing CBKs means making them available wherever they can be appropriately used to advance biomedical science and improve human health.

The members of the Mobilizing Computable Biomedical Knowledge [...]

To assist specifically with CBK findability and access, repositories for CBKs are emerging. Two examples of public CBK repositories are CDS Connect 35 and the Value Set Authority Center. 36 Other examples include the computable phenotype repository PheKB, 11 the Kipoi repository of predictive models for genomics, 14 and the DDMORE repository of computable models for pharmaceutics. 37 Some suggest that private software code repositories, such as GitHub, SourceForge, and Bitbucket, are suitable for hosting CBKs. 38 However, others point out that the policies governing these repositories may not fully support the long-term CBK sharing needs of biomedical scientists. 39, 40 We assume that, in the future, there will be many CBK repositories and CBK metadata registries supporting a robust CBK ecosystem.

There exist extensive prior bodies of work on metadata, for example, those described in Greenberg's 2017 overview entitled "Metadata and Digital Information." 41 Since the 1960s, metadata developments within and beyond the digital library community have significantly matured. 41 It is clear that different communities value metadata for different reasons, such as the library community emphasizing descriptive metadata for distinguishing information resources and the business community emphasizing machine processing of metadata to improve information systems. The purpose of this manuscript is to highlight categories of metadata to assist in greater sharing and dissemination of CBKs. We are not attempting here to provide a comprehensive framework for metadata formalism or to create a standard, such as ISO/IEC 11179-3:2013 which specifies the structure of a metadata registry in the form of a conceptual data model.

It is clear that specific metadata can support CBK sharing and use. 42 Much prior work focuses on making data sets FAIR. 4 Organizations and efforts like FORCE11, 42 [...]

We anticipate that the production of CBKs will continue to increase as it has since the 1970s. 48 Mobilizing the growing number of CBKs for optimal use requires them to be well organized and managed. This work significantly advances a metadata strategy to mobilize CBKs. Just as other classes of digital artifacts (eg, music and video files) have been mobilized in part by using rich metadata, further development of metadata for CBKs should enable them to be widely shared and appropriately used for research, education, health promotion, health care, population health, and public health. Outlining the metadata that can make CBKs FAIR+T is an initial step in a larger mobilization strategy.

Our goal is to engage both the many who have previously advanced our theory and practice in metadata usage and the many who are currently developing applications within specific domains, to facilitate development of a CBK metadata framework to help mobilize CBKs across a wide spectrum.

There are several unique aspects (individually or in combination) to our current effort. First, our focus is on specification of metadata for computable knowledge artifacts. Second, our description of metadata elements includes subject-predicate-object triples to enable clear definitions and reduce overlaps across metadata categories. Third, although we do not presume any specific application of our metadata categories, we are approaching this work with a primary focus of functional application and thus limiting attention to metadata that is mainly for a FAIR+T purpose. Even so, our current approach is purely conceptual and independent of any particular application and/or realization of the metadata, so it could be easily adapted in subsequent efforts to provide a reference framework for both existing and future implementations for a common meaning and purpose, enabling interoperability in the process. In particular for repositories, we envision ecosystems where the metadata records themselves are implemented as CBKs.

What categories of metadata hold the potential to make CBKs findable, accessible, interoperable, reusable, and trustable (FAIR+T)? 

We conducted a rapid environmental scan to identify key types of metadata specified in existing standards, for example, Dublin Core.

During the spring of 2020, we iteratively analyzed potential metadata categories by applying an evolving list of categories to a convenience sample of several real-world CBKs (see Supplement). Our example CBKs were all accessible online and came with metadata from their existing repositories.

For each candidate metadata category, we listed specific metadata elements from the category. Next, we attempted to identify prior published works about each candidate metadata category in our list.

During this phase, we also explored how metadata elements in each candidate category assist in making CBKs FAIR+T.

After several cycles of applying our CBK metadata categories list to these actual CBK examples, discussing the categories list and the CBK examples together, and refining our categories list further, we arrived at 15 candidate metadata categories for an initial draft of our CBK metadata list.

When deciding on which metadata categories to keep and which to combine or set aside, we gave preference to previously defined metadata categories over new categories. As part of our decision-making process, we clarified the scope of the metadata categories in our initial list by collaboratively drafting and revising a paragraph outlining each category's scope. We agreed upon a list of 11 metadata categories at this intermediate stage.

In advance of the MCBK Community's Annual Meeting at the end of June and the beginning of July 2020, we produced a draft document describing our initial metadata categories. This draft document conveyed our initial metadata categories list and described each category in detail. At the Annual Meeting, we convened the MCBK Community's Standards Workgroup and gathered feedback on our preliminary metadata categories list. We organized breakout sessions to discuss four metadata categories in particular (Biomedical Domain, Coverage, Purpose, and Type).

After the MCBK Community's Annual Meeting in 2020, we consolidated our meeting notes and the feedback we obtained from Standards Workgroup members about our preliminary metadata categories into a summary document. We circulated that summary document throughout our group of authors and discussed the feedback we received in detail. As a result, an updated but still unfinished list of metadata categories emerged by the end of August 2020.

We created our final list of CBK metadata categories using an iterative process. During this process, to address overlap, we developed and repeatedly applied a method of specifying subject-predicate-object triples for each metadata category. Making these triples explicit provided us with a needed mechanism to see, discuss, and address several significant problems of category overlap.
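One way to picture how explicit triples expose category overlap is the short sketch below. It is an illustration under stated assumptions rather than the authors' actual procedure: the category assignments, the predicate names (`has_type`, `has_format`, `has_purpose`), and the helper `overlapping_predicates` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical draft: category name -> example triples for one CBK.
draft = {
    "Type": [Triple("cbk:1", "has_type", "predictive model")],
    "Technical": [
        Triple("cbk:1", "has_type", "Python function"),
        Triple("cbk:1", "has_format", "JSON"),
    ],
    "Purpose": [Triple("cbk:1", "has_purpose", "risk scoring")],
}


def overlapping_predicates(categories):
    """Return each predicate that is claimed by more than one category."""
    seen = defaultdict(set)
    for category, triples in categories.items():
        for triple in triples:
            seen[triple.predicate].add(category)
    return {pred: sorted(cats) for pred, cats in seen.items() if len(cats) > 1}


print(overlapping_predicates(draft))
# → {'has_type': ['Technical', 'Type']}
```

A predicate that shows up under two categories is a concrete, discussable sign that the scopes of those categories need to be redrawn.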

Finally, we further clarified the scope of the metadata categories in our working list by drafting and revising a paragraph outlining each category's scope. Once our group decided upon a set of metadata categories for our final list, we examined and discussed the final list to generate a related CBK metadata research agenda focused on remaining issues and areas of ambiguity. This research agenda describes future work toward having sufficient metadata to make CBKs FAIR+T.

We generated a final list of 13 categories of CBK metadata elements with specific utility for making CBKs FAIR+T. In Table 1 , we classify each category according to the principle to which it most closely applies. We briefly summarize the elements included in each category, offer some example predicates, and complete Table 1 

In the list of metadata categories spanning metadata to make CBKs FAIR+T, we have combined authorization metadata together with rights management metadata. Our view is that authorization is an important and special class of rights, including the rights to view (or access), comment on, or modify CBKs.

Other rights related to CBKs may be specified as copyrights or through various software and other licenses. We also include metadata that assign specific responsibilities to individuals or organizations in this category and leave room for metadata about disclaimers too.

Rather than start from scratch, for our examples of preservation metadata, we draw on two predicates from the Preservation Metadata: Implementation Strategies (PREMIS) ontology. 67 These predicates are has_preservation_level and should_be_kept_until. According to PREMIS, achieving a preservation level of "medium" means two copies of a CBK are stored on different media types with a minimum of 150 km distance between the two stored copies, with separate checksums checked annually. Since long-term access to CBKs directly supports their reuse, we associate preservation metadata most strongly with reusability.

[...] we also incorporate metadata about evidence grades into this evidential metadata category.
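The "medium" preservation level described above can be sketched as a simple check. This is a minimal illustration that assumes the three criteria exactly as summarized in the text, not a reading of the PREMIS specification itself; `StoredCopy`, `meets_medium_level`, the identifier `cbk:1`, and all dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StoredCopy:
    media_type: str            # e.g. "disk" or "tape"
    last_checksum_check: date  # when this copy's checksum was last verified


def meets_medium_level(a, b, distance_km, today):
    """Check the 'medium' criteria as summarized in the text (an assumption)."""
    different_media = a.media_type != b.media_type
    far_enough = distance_km >= 150
    # "Checked annually" is approximated here as "within the last 365 days".
    checked_annually = all(
        (today - copy.last_checksum_check).days <= 365 for copy in (a, b)
    )
    return different_media and far_enough and checked_annually


disk = StoredCopy("disk", date(2020, 11, 1))
tape = StoredCopy("tape", date(2021, 1, 15))
today = date(2021, 5, 9)

triples = []
if meets_medium_level(disk, tape, distance_km=200, today=today):
    # The two predicates named in the text; the object values are made up.
    triples.append(("cbk:1", "has_preservation_level", "medium"))
    triples.append(("cbk:1", "should_be_kept_until", "2031-12-31"))
print(triples)
```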

Furthermore, following the work of Lehmann and Downs that specified desiderata for shareable CBKs, 2 we recognize the complexity of specifying aspects of the evidential basis of CBKs using metadata. We foresee the need for a substantial body of future work on evidential basis metadata for CBKs.

Here we make a small start by specifying several initial predicates 

To check the current availability of metadata from the 13 metadata categories, we identified 12 CBKs available online and examined the existing metadata for each CBK in light of the categories.

Summary information about the metadata we found by category is provided in Figure 1 . In addition, a Supplement with this paper provides more details about these 12 CBKs and their metadata.
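The kind of per-category tally behind such a summary can be sketched as follows. The 13 category names come from this paper; the three CBK names and the categories recorded for each are invented for illustration and do not reproduce the findings for the 12 examined CBKs.

```python
CATEGORIES = [
    "type", "domain", "purpose", "identification", "location",
    "CBK-to-CBK relationships", "technical",
    "authorization and rights management", "provenance",
    "evidential basis", "evidence from use", "preservation", "integrity",
]

# Hypothetical observations: CBK name -> categories with metadata found online.
observed = {
    "cbk-A": {"type", "domain", "identification", "location", "provenance"},
    "cbk-B": {"type", "purpose", "identification", "technical"},
    "cbk-C": {"type", "domain", "location", "evidential basis"},
}


def coverage(observations):
    """Count, for each category, how many CBKs carry metadata in it."""
    return {
        category: sum(category in found for found in observations.values())
        for category in CATEGORIES
    }


counts = coverage(observed)
missing = [category for category, n in counts.items() if n == 0]
print(counts)
print("not evident in any example:", missing)
```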

Another result is the research agenda for future CBK metadata research ( 

We envision a future in which CBKs are widely shared to support biomedical research, education, and improvement of individual and population health. A year of effort has resulted in a list of 13 metadata categories relevant for making CBKs FAIR+T. Having reviewed the metadata for a variety of actual CBKs, it seems likely that many CBK stakeholders will benefit from higher quality CBK metadata.

The list of categories should not be confused with a settled metadata framework, let alone a specification. Instead, we view this list of CBK metadata categories as the first step in a longer CBK metadata specification process. Next steps include gathering feedback toward achieving broad consensus for a draft CBK metadata framework and specification, including common elements and value sets for metadata in each category. We hope that by providing a list of potentially relevant metadata categories for making CBKs FAIR+T along with a research agenda, we have done enough to prompt further steps toward a common CBK metadata framework and future specification.

Metadata involve a variety of standards and models for their structure, syntax, content, and communication. 41 We make use of certain existing metadata standards and models to offer examples (eg, Dublin Core, RDF). We do not put forward any new standard or model. Instead, we offer guidance about the scope of CBK metadata for future standards and model development. Likewise, while we recognize the importance of the metadata generation process, we do not address metadata generation for CBKs. Instead, we limit our investigation to examining previously generated metadata about CBKs.

Our metadata categories list focuses primarily on the metadata needs and contributions of CBK producers and consumers (or users).

When the value of specific metadata elements is demonstrated, we expect CBK producers will provide a minimum set of metadata to support CBK consumers. Some of this metadata, such as persistent unique identifiers and access locations, could be generated automatically.
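As a rough sketch of what automatic generation might look like, the function below mints an identifier and an access location without human input. Everything here is an assumption for illustration: the function name, the predicate names `has_identifier` and `has_location`, and the repository URL are hypothetical, and a random UUID only becomes a persistent identifier if a repository or registry commits to keeping it resolvable.

```python
import uuid


def generate_id_and_location(base_url):
    """Mint an identifier and derive an access location for a new CBK."""
    cbk_uuid = uuid.uuid4()                  # random (version 4) UUID
    identifier = f"urn:uuid:{cbk_uuid}"      # URN form of the UUID
    location = f"{base_url.rstrip('/')}/{cbk_uuid}"
    return [
        (identifier, "has_identifier", identifier),
        (identifier, "has_location", location),
    ]


# Hypothetical repository address, used only for the example.
for triple in generate_id_and_location("https://repo.example.org/cbk/"):
    print(triple)
```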

The large scope of our metadata categories is a major concern.

The costs of generating and managing sufficient CBK metadata to make CBKs FAIR+T could be high, potentially limiting widespread CBK mobilization, sharing, and use. The barriers to creating such metadata are high. 43 Consequently, CBK producers and consumers need ways to minimize and recoup the costs of providing sufficient metadata. While producers need to supply most of the metadata to make CBKs FAIR, consumers must supply some metadata from their experience of CBK use to uphold trust. 5 The value of every metadata element in each category needs to be determined to justify costs. For the sample of 12 CBKs that we inspected, we did not find any integrity or preservation metadata (see Figure 1 ), and we found little technical metadata giving instructions for CBK use. These metadata may be costlier to produce than others.

Two categories of metadata in the list are tentative: the "Purpose" category and the "CBK-to-CBK Relationships" category. We believe both these categories need to be further refined. Studies are needed to test the hypotheses surfaced here, namely that metadata from the 13 categories can uphold the findability, accessibility, interoperability, reusability, and trustability of CBKs.

Of their six principles, five relate to metadata content. These five principles uphold software metadata for attribution, identifiers, persistence and preservation, accessibility, and version specificity. The metadata in our 13 categories address these principles. The authors of these five principles on software citation also discuss software types and distinguish between software that is accessible as source code and software that is only accessible as a service. Adding to these ideas, in mid-2020, a group allied with the Research Data Alliance published the paper "Towards FAIR Principles for Research Software." 71 As we do in this work, these authors also ground their efforts to make research software FAIR by invoking the notion of FAIR Digital Objects.

They stipulate that research software is not data and argue that making software FAIR will require a software-specific approach like the approach pioneered in this manuscript. 

The main limitations of this work are its consensus-based approach and the small number of real-world CBKs examined. Consensus among a small group is not predictive of consensus among a much larger group of stakeholders.

We had only enough input to work on metadata categories and did not specify the metadata elements in each category. We do not believe that one set of metadata elements will suffice to describe all CBKs. Our explorations show that many different types of CBKs already exist, and that their metadata vary by type. In addition, although complex hierarchical sets of metadata assertions are sometimes required (such as system specification for identifiers or codes), we limited our examples to simple metadata assertions (presented wholly as independent triples).

This will not suffice for a future specification.

There still exists some conceptual overlap among our categories.

For example, the "Type" and "Technical" metadata categories overlap.

If CBK typing is done based on technical differences, then these two categories blur. However, it is well established that all categorization schemes are imperfect and incomplete. 73 As a strategy to mobilize CBKs, we look forward to further developing and refining our CBK metadata categories list and to learning more about CBK metadata from the real-world experiences of researchers, educators, clinicians, and other consumers who use CBKs in their work.

Computable biomedical knowledge artifacts (CBKs) vary widely in their complexity, goals, and anticipated audience. Each CBK offers knowledge of potential value for clinical care, public health, education, or for advancing biomedical science. Sharing of complex CBKs is key to support systems biology, precision health, population health, and learning health system initiatives.

To mobilize CBKs effectively, the value from sharing CBKs has to be greater than the costs of sharing them. For producers of CBKs, having easier ways to disseminate CBKs to those able to benefit is of prime importance. For consumers of CBKs, the ability to readily discover, deploy, and use CBKs to meet their clinical, educational, or scientific needs is most important.

Ultimately, a common metadata framework for CBKs can advance efforts to mobilize CBKs. As an initial step, we contribute a list of 13 metadata categories for making CBKs findable, accessible, interoperable, reusable, and trustable (FAIR+T).

Computable knowledge: an imperative for learning health systems

Desiderata for sharable computable biomedical knowledge for learning health systems

Framework for discovery of identity management information

The FAIR guiding principles for scientific data management and stewardship

Recommendations for Building and Maintaining Trust in Clinical Decision Support Knowledge Artifacts

Bibliographic Framework as a Web of Data: Linked Data Model and Supporting Services

De rigueur, and even useful: standards for the published literature and their relationship to medical informatics

Clinical concept value sets and interoperability in health data analytics

Development of the logical observation identifier names and codes (LOINC) vocabulary

The OBO foundry: coordinated evolution of ontologies to support biomedical data integration

PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability

The InterMed approach to sharable computer-interpretable guidelines: a review

Achieving evidence interoperability in the computer age: setting evidence on FHIR

The Kipoi repository accelerates community exchange and reuse of predictive models for genomics

Causal network discovery from biomedical and clinical data

BPMN for healthcare processes

Workflow-centric research objects: a first class citizen in the scholarly discourse

A platform to render biomedical software findable, accessible, interoperable, and reusable. ArXiv Prepr ArXiv170606087

Patient-centered precision health in a learning health care system: Geisinger's genomic medicine experience

A rapid-learning health system: what would a rapid-learning health system look like, and how might we get there? Health Aff (Millwood)

Artificial intelligence in health care: The hope, the hype, the promise, the peril. Natl Acad Med Prepub

Biomedical informatics: changing what physicians need to know and how they learn

A framework for distributed digital object services. Corporation Natl Res Initiat Rest

Digital Objects as Drivers towards Convergence in Data Infrastructures

Effect of clinical decision-support systems: a systematic review

Digital knowledge objects and digital knowledge object clusters: unit holdings in a learning health system knowledge repository

Scalable collaborative infrastructure for a learning healthcare system (SCILHS): architecture

A multi-layered framework for disseminating knowledge for computer-based decision support

Disseminating medical knowledge: the PROforma approach

Aboutness: Towards foundations for the information artifact ontology

The knowledge object reference ontology (KORO): a formalism to support management and sharing of computable biomedical knowledge for learning health systems

Human problem solving: the state of the theory in 1970

To share is human! Advancing evidence into practice through a National Repository of interoperable clinical decision support

The NLM value set authority center

Drug and disease model resources: a consortium to create standards and tools to enhance model-based drug development

Ten simple rules for reproducible computational research

GitHub repository recommendation for academic papers

We need a GitHub for academic research. Slate Published Online

Metadata and digital information

Research communication futures: a perspective on the FORCE11 scholarly communication institute

The center for expanded data annotation and retrieval

The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata that Describe Scientific Experiments. Paper presented at: International Semantic Web Conference

Ready, set, GO FAIR: accelerating convergence to an internet of FAIR data and services. Paper presented at: DAMDID/RCDL

A metadata scheme for DataCite

The RDA's metadata standards directory: information gathering

Exponential growth and the shifting global center of gravity of science production

Dublin Core Metadata Initiative. Dublin core metadata element set

The Dublin core metadata element set. RFC 5013

Ontology-based metadata management in medical domains

Transforming unstructured clinical free-text corpora into reconfigurable medical digital collections

The CIDOC conceptual reference module: an ontological approach to semantic interoperability of metadata

Specifying semantic conformance profiles in reusable learning object metadata

Identifying impact of software dependencies on replicability of biomedical workflows

A metadata architecture for digital libraries

DC: Library of Congress

Prov-o: The prov ontology. W3C recommendation

Automating experiments using semantic data in a bioinformatics grid

Handling topical metadata regarding the validity and completeness of multiple-source information: a possibilistic approach

GRADE guidelines: 3. Rating the quality of evidence

Evaluation Methods in Biomedical Informatics

Relational knowledge: the foundation of higher cognition

Semantic information and the network theory of account

Relational knowledge. Relational Knowledge Discovery

PREMIS 3.0 ontology: improving semantic interoperability of preservation metadata

GRADE: an emerging consensus on rating quality of evidence and strength of recommendations

The GRADE working group clarifies the construct of certainty of evidence

Software citation principles

Towards FAIR principles for research software

It is time for computable evidence synthesis: the COVID-19 knowledge accelerator initiative

Sorting Things Out: Classification and Its Consequences


We thank Helen Pan for organizing and supporting our meetings.

Robin Ann Yurk https://orcid.org/0000-0001-5482-0475