key: cord-0425425-gksiltal
authors: Crusoe, Michael R.; Abeln, Sanne; Iosup, Alexandru; Amstutz, Peter; Chilton, John; Tijanić, Nebojša; Ménager, Hervé; Soiland-Reyes, Stian; Gavrilovic, Bogdan; Goble, Carole
title: Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language
date: 2021-05-14
journal: nan
DOI: 10.1145/3486897
sha: f791dce07eb1dd9251f9bffb801d6e075091c98d
doc_id: 425425
cord_uid: gksiltal

Computational Workflows are widely used in data analysis, enabling innovation and decision-making. In many domains (bioinformatics, image analysis, and radio astronomy) the analysis components are numerous and written in multiple different computer languages by third parties. However, many competing workflow systems exist, severely limiting the portability of such workflows: this hinders the transfer of workflows between different systems, projects, and settings, leads to vendor lock-in, and limits their generic re-usability. Here we present the Common Workflow Language (CWL) project, which produces free and open standards for describing command-line tool based workflows. The CWL standards provide a common but reduced set of abstractions that are both used in practice and implemented in many popular workflow systems. The CWL language is declarative, which allows expressing computational workflows constructed from diverse software tools, each executed through its command-line interface. Being explicit about the runtime environment and any use of software containers enables portability and reuse. The CWL project is not specific to a particular analysis domain, it is community-driven, and it produces consensus-built standards. Workflows written according to the CWL standards are a reusable description of that analysis that is runnable on a diverse set of computing environments. These descriptions contain enough information for advanced optimization without additional input from workflow authors. The CWL standards support polylingual workflows, enabling portability and reuse of such workflows and easing, for example, scholarly publication, fulfillment of regulatory requirements, and collaboration within and between academic research and industry, while reducing implementation costs. CWL has been taken up by a wide variety of domains and industries, and support has been implemented in many major workflow systems.

Computational Workflows are widely used in data analysis, enabling innovation and decision-making for the modern society. But their growing popularity is also a cause for concern: unless we standardize computational reuse and portability, the use of workflows may end up hampering collaboration. How can we enjoy the common benefits of computational workflows and also eliminate such risks?

To answer this general question, we advocate in this work for workflow thinking as a shared way of reasoning across all domains and practitioners, introduce Common Workflow Language (CWL) as a pragmatic set of standards for describing and sharing computational workflows, and discuss the principles around which these standards have become central to a diverse community of users across multiple fields of science and engineering. This article focuses on an overview of the CWL standards and the CWL project and is complemented by the technical detail available in the CWL standards themselves 1 .

Workflow thinking is a form of "conceptualizing processes as recipes and protocols, structured as [work-or] dataflow graphs with computational steps, and subsequently developing tools and approaches for formalizing, analyzing and communicating these process descriptions" [15] . It introduces an abstraction, the workflow, which helps decouple expertise in a specific domain, for example of specific science or engineering fields, from expertise in computing. Derived from workflow thinking, a computational workflow describes a process for computing where different parts of the process (the tasks) are inter-dependent, e.g., a task can start processing after its predecessors have (partially) completed and where data flows between tasks.

Currently, many competing systems exist that enable simple workflow execution (workflow runners) or offer a comprehensive management of workflows and data (workflow management systems), each with their own syntax or method for describing workflows and infrastructure requirements. This limits computational reuse and portability. In particular, although data-flows are becoming increasingly complex, most workflow abstractions do not enable explicit specification of data-flows, which significantly increases the cost for third parties to reuse and port a workflow.

1 https://w3id.org/cwl/v1.2/

We thus identify an important problem for the broad adoption of workflow thinking in practice: although communities require polylingual workflows (workflows that execute tools written in multiple different computer languages) and multi-party workflows, adopting and managing different workflow systems is costly and difficult. In this work, we propose to tame this complexity through a common abstraction that covers the majority of features used in practice, and is (or can be) implemented in many workflow systems.

In the computational workflow depicted in Figure 1, practitioners solved the problem by adopting the CWL standards. We posit in this work that the CWL standards provide the common abstraction that can help solve the main problems of sharing workflows between institutions and users. CWL achieves this by providing a declarative language that allows expressing computational workflows constructed from diverse software tools, each executed through its command-line interface, with the inputs and outputs of each tool clearly specified and with inputs possibly resulting from the execution of other tools. We also set out to introduce the CWL standards, with a three-fold focus: (1) the CWL standards focus on maintaining a separation of concerns between the description and execution of tools and workflows, proposing a language that includes only operations commonly used across multiple communities of practice; (2) the CWL standards support workflow automation, scalability, abstraction, provenance, portability, and reusability; and (3) the CWL project takes a principled, community-first, open-source and open-standard approach which enables this result.

The CWL standards are the product of an open and free standards-making community. While the CWL project began in the bioinformatics domain, the many contributors to the CWL project shaped the standards so that they could be useful anywhere that experiences the problem of "many tools written in many programming languages by many parties". Since the ratification of the first version in 2016, the CWL standards have been used in other fields including hydrology 2, radio astronomy 3, geo-spatial analysis [13, 22, 30], and high energy physics [5], in addition to fast-growing bioinformatics fields like (meta-)genomics [26] and cancer research [23]. The CWL standards are featured in the US FDA sponsored and adopted IEEE Std 2791 ™ -2020 standard [1] and the Netherlands' National Plan for Open Science [32]. Free and open-source implementations of the CWL standards are listed in Table 1. Additionally, there are multiple commercially supported systems that support the CWL standards for executing workflows; they are available from vendors such as Curii (Arvados) 4 [...] Bridges 8. The flexibility of the CWL standards enabled, for example, rapid collaboration on and prototyping of a COVID-19 public database and analysis resource [16]. The separation of concerns proposed by the CWL standards enables diverse projects, and can also benefit engineering and large industrial projects. Likewise, users of Docker (or other software container technologies) that distribute analysis tools can use just the CWL Command Line Tool standard to provide a structured, workflow-independent description of how to run their tool(s) in the container, what data must be provided to the container, and what results to expect and where to find them in the container.

Figure 1 : Excerpt from a large microbiome bioinformatics CWL workflow [26]. This part of the workflow (which is interpretable/executable on its own) aims to match the workflow inputs of genomic sequences to provided sequence models, which are dispatched to four sub-workflows (e.g., find_16S_matches); the sub-workflows are not detailed in the figure. The sub-workflow outputs are then collated to identify unique sequence hits, which are provided as overall workflow outputs. Arrows define the connection between tasks and imply their partial ordering, depicted here as layers of tasks that may execute concurrently. Workflow steps (e.g., mask_rRNA_and_tRNA) execute command line tools, shown here with indicators for their different programming languages (e.g., [Py] for Python, [C] for the C language). (Diagram adapted from https://w3id.org/cwl/view/git/7bb76f33bf40b5cd2604001cac46f967a209c47f/workflows/rna-selector.cwl , which was originally retrieved from a corresponding CWL workflow of the EBI Metagenomics project, itself a conversion of the "rRNASelector" [24] program into a well structured workflow allowing for better parallelization of execution and provenance tracking.)

Key Insights [in CACM box]

Toward computational reuse and portability of polylingual, multi-party workflows, the CWL project makes the following contributions:

(1) CWL is a set of standards for describing and sharing computational workflows.
(2) The CWL standards are used daily in many science and engineering domains, including by multi-stakeholder teams.

Workflows, and standards-based descriptions thereof, hold the potential to solve key problems in many domains of science and engineering. This section explains why.

In many domains, workflows include diverse analysis components, written in multiple (different) computer languages, by both endusers and third-parties. Such polylingual and multi-party workflows are already common or dominant in data-intensive fields like bioinformatics, image analysis, and radio astronomy; we envision they could bring important benefits to many other domains.

To thread data through analysis tools, domain experts such as bioinformaticians use specialized command-line interfaces [12, 29] and other domains use their own customized frameworks [3, 6] .

Workflow engines also help with efficient management of the resources used to run scientific workloads [7, 10] .

The workflow approach helps compose an entire application of these command-line analysis tools: developers build graphical or textual descriptions of how to run these command-line tools, and scientists and engineers connect their inputs and outputs so that the data flows through. An example of a complex workflow problem is metagenomic analysis, for which Figure 1 illustrates a subset (a sub-workflow).

In practice, many research and engineering groups use workflows of the kind described in Figure 1. However, as highlighted in a recent "Technology Toolbox" article [28] in the journal Nature, these groups typically lack the ability to share and collaborate across institutions and infrastructures without costly manual translation of their workflows.

Using workflow techniques, especially with digital analysis processes, has become quite popular and does not look to be slowing down: one workflow management system recently celebrated its 10,000th citation 9 ; and over 298 computational data analysis workflow systems are known 10 .

A process, digital or otherwise, may grow to such complexity that the authors and users of that process have difficulties in understanding its structure, scaling the process, managing the running of the process, and keeping track of what happened in previous enactments of the process. Process dependencies may be undocumented, obfuscated, or otherwise effectively invisible; even an extensively documented process may be difficult to understand by outsiders or newcomers if a common framework or vocabulary is lacking. The need to run the process more frequently or with larger inputs is unlikely to be achieved by the initial entity (i.e., either script or person) running the process. What seemed once a reasonable manual step (run this command here and then paste the result there; then call this person for permission) will, under the pressure of porting and reusing, become a bottleneck. Informal logs (if any) will quickly become unsuitable for answering an organization's need to understand what happened, when, by whom, and to which data.

Workflow techniques aim to solve these problems by providing the Abstraction, Scaling, Automation, and Provenance (A.S.A.P.) features [8] . Workflow constructs enable a clear abstraction about the components, the relationships between components, and the inputs and outputs of the components turning them into well-labeled tools with documented expectations. This abstraction enables scaling (execution can be parallelized and distributed), automation (the abstraction can be used by a workflow engine to track, plan, and manage execution of tasks), and provenance tracking (descriptions of tasks, executors, inputs, outputs; with timestamps, identifiers (unique names), and other logs, can be stored in relation to each other to later answer structured queries).

Although workflows are very popular, prior to the CWL standards every workflow system was incompatible with every other. This means that users not using the CWL standards are required to express their computational workflows in a different way every time they have to use another workflow system, leading to local success but global unportability.

9 https://galaxyproject.org/blog/2020-08-10k-pubs/

10 https://s.apache.org/existing-workflow-systems

The success of workflows is now their biggest drawback: users are locked into a particular vendor, project, and often a particular hardware setup. This hampers sharing and re-use. Even nonacademics suffer from this situation, as the lack of standards (or the lack of their adoption) hinders effective collaboration on computational methods within and between companies. Likewise, this unportability affects public-private partnerships and the potential for technology transfer from public researchers.

A second significant problem is that incomplete method descriptions are common when computational analysis is reported in academic research [17] . Reproduction, re-use, and replication [11] of these digital methods requires a complete description of what computer applications were used, how exactly they were used, and how they were connected to each other. For precision and interoperability, this description should also be in an appropriate standardized machine-readable format.

A standard for sharing and reusing workflows can provide a solution to describing portable, re-usable workflows while also being workflow-engine and vendor-neutral.

Sharing workflow descriptions based on standards also addresses the second problem: the availability of the workflow description provides the information needed when sharing, and the quality of the description provided by a structured, standards-based approach is much higher than the current approach of casual, unstructured, and almost always incomplete descriptions in scientific reports. Moreover, the operational parts of the description can be provided automatically by the workflow management system, rather than by domain experts.

While (data) standards are commonly adopted and have become expected for funded projects in knowledge representation fields, the same cannot yet be said about workflows and workflow engines.

Workflow techniques can be implemented in many ways, i.e., with varying degrees of formalism, which tends to correlate with execution flexibility and features. Typically, whereas the most informal techniques require that all processing components are written in the same programming language or are at least callable from the same programming language, the formal workflow techniques tend to allow components to be developed in multiple programming languages. Among the informal techniques, the do-it-yourself approach uses the built-in capabilities of a particular programming language. For example, Python provides a threading library, and the Java-based Apache Hadoop [31] provides MapReduce capabilities. To gain more flexibility when working with a particular programming language, general third-party libraries, such as ipyparallel 11 , can enable remote or distributed execution without having to re-write one's code.

A more explicit workflow structure can be achieved by using a workflow library focusing on a specific programming language.

For example, in Parsl [3] , the workflow constructs ("this is a unit of processing", "here are the dependencies between the units") are made explicit and added by the developer to a Python script, to upgrade it to a scalable workflow. (While we list Parsl here as an example of a monolingual workflow system, it also contains explicit support for executing external command-line tools.)

Two approaches can accommodate polylingual workflows where the components are written in more than one programming language, or where the components come from third-parties and the user does not want to or cannot modify them: use of per-language add-in libraries or the use of the Portable Operating System Interface command-line interface (POSIX CLI) [14] . The use of per-language add-in libraries entails either explicit function calls (e.g., using ctypes in Python to call a C library 12 ) or the addition of annotations to the user's functions, and requires mapping/restricting to a common cross-language data model.

Essentially all programming languages support the creation of POSIX CLIs, which are familiar to many Linux and macOS users: scripts or binaries that can be invoked on the shell with a set of arguments, that read and write files, and that are executed in a separate process. Choosing the POSIX command-line interface as the point of coordination means the connection between components is made by an array of string arguments representing program options (including paths to data files) along with string-based environment variables (key-value pairs). Using the command-line as a coordination interface has the advantage of not needing additional implementation in every programming language, but has the disadvantages of process start-up time and a very simple data model. (As a polylingual workflow standard, CWL uses the POSIX CLI data model.)
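The coordination point described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "tool" is a hypothetical one-line script standing in for a third-party program, and the names `run_tool`, `TOOL`, and `GREETING` are invented for this example. It shows the entire interface: an array of string arguments, string environment variables, files in a working directory, and an exit code plus output streams on completion.

```python
import os
import subprocess
import sys
import tempfile


def run_tool(argv, env, workdir):
    """Run one command-line tool in a separate process and working directory."""
    completed = subprocess.run(
        argv,                 # array of string arguments (options and file paths)
        env=env,              # string key-value environment variables
        cwd=workdir,          # the tool's own working directory
        capture_output=True,  # collect the standard output and error streams
        text=True,
    )
    return completed.returncode, completed.stdout, completed.stderr


# Hypothetical stand-in for a third-party tool: it reads one environment
# variable and one file whose path is given as the first argument.
TOOL = "import os, sys; print(os.environ['GREETING'], open(sys.argv[1]).read().strip())"

with tempfile.TemporaryDirectory() as workdir:
    input_path = os.path.join(workdir, "name.txt")
    with open(input_path, "w") as handle:
        handle.write("workflow\n")
    env = {**os.environ, "GREETING": "hello"}
    exit_code, stdout, stderr = run_tool(
        [sys.executable, "-c", TOOL, input_path], env, workdir
    )
```

Nothing richer than strings, files, and an exit code crosses the process boundary, which is the "very simple data model" the paragraph refers to.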

The Common Workflow Language standards aim to cover the common needs of users and the commonly implemented features of workflow runners or platforms. The remainder of this section presents an overview of the CWL features, how they translate to executing workflows in CWL format, and where the CWL standards are not helpful. The CWL standards support polylingual and multi-party workflows, for which they enable computational reuse and portability (see also the CACM Box for main features). To do so, each release of the CWL standards has two 13 main components: (1) a standard for describing command line tools; and (2) a standard for describing workflows that compose such tool descriptions. The goal of the CWL Command Line Tool Description Standard 14 is to describe how a particular command line tool works: what the inputs and parameters are, and their types; how to add the correct flags and switches to the command line invocation; and where to find the output files.

The CWL standards define an explicit language, both in syntax and in its data and execution model. Its textual syntax is derived from YAML 15 . This syntax does not restrict the amount of detail; for example, Figure 2A depicts a simple example with sparse detail, and Figure 2B depicts the same example but with the execution augmented with further details. Each input to a tool has a name and a type (e.g., File, see label 1 in the figure). Authors of tool descriptions are encouraged to include documentation and labels for all components (i.e., as in Figure 2B ), to enable the automatic generation of helpful visual depictions and even Graphical User Interfaces for any given CWL description. Metadata about the tool description authors themselves encourages attribution of their efforts. As shown in Figure 2B , item 3, these tool descriptions can contain well-defined hints or mandatory requirements, such as which software container to use or how much compute resources are required (memory, number of CPU cores, disk space, and/or the maximum time or deadline to complete the step or entire workflow).
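To make the shape of such a tool description concrete, the following Python sketch builds one as a plain dictionary and prints it as JSON, which is an accepted serialization of CWL syntax. It is not the example from Figure 2: the wrapped command (`wc -l`), the container image name, the resource numbers, and all identifiers are assumptions chosen for illustration.

```python
import json

# Hypothetical CommandLineTool description; not the tool shown in Figure 2.
tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "label": "Count lines in a text file",
    "doc": "Illustrative wrapper around the POSIX `wc -l` command.",
    "baseCommand": ["wc", "-l"],
    # Hints are optional; an engine may ignore the ones it cannot honor.
    "hints": {
        "DockerRequirement": {"dockerPull": "debian:stable-slim"},  # assumed image
        "ResourceRequirement": {"coresMin": 1, "ramMin": 256},      # assumed sizes
    },
    # Every input has a name and a type; the binding places it on the command line.
    "inputs": {
        "text_file": {
            "type": "File",
            "label": "Text file to count",
            "inputBinding": {"position": 1},
        }
    },
    # Capture standard output into a file and expose it as the tool's output.
    "stdout": "line_count.txt",
    "outputs": {"line_count": {"type": "stdout"}},
}

print(json.dumps(tool, indent=2))
```

The labels and `doc` field carry no execution semantics; they are the kind of documentation that lets viewers and user interfaces be generated from the description.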

The CWL execution model is explicit: Each tool's runtime environment is explicit and any required elements must be specified by the CWL tool-description author (in contrast to hints, which are optional) 16 . Each tool invocation uses a separate working directory, populated according to the CWL tool description, e.g., with the input files explicitly specified by the workflow author. Some applications expect particular filenames, directory layouts, and environment variables, and there are additional constructs in the CWL Command Line Tool standard to satisfy their needs.

The explicit runtime model enables portability, by being explicit about data locations. As Figure 3 indicates, this enables execution of CWL workflows on diverse environments as provided by various implementations of the CWL standards: the local environment of the author-scientist (e.g., a single desktop computer, laptop, or workstation), a remote batch production-environment (e.g., a cluster, an entire datacenter, or even a global multi-datacenter infrastructure), and an on-demand cloud environment.

The CWL standards explicitly support the use of software container technologies, such as Docker and Singularity, to enable portability of the underlying analysis tools. Figure 2B , item 2, illustrates the process of pulling a Docker container-image from the Quay.io registry; then, the workflow engine automates the mounting of files and folders within the container. The container included in the figure has been developed by a trusted author and is commonly used in the bioinformatics field with an expectation its results are reproducible. Indeed, the use of containers can be seen as a confirmation that a tool's execution is reproducible, when using only its explicitly declared runtime-environment. Similarly, when distributed execution is desired, no changes to the CWL tool-description are needed: because the file or directory inputs are already explicitly defined in the CWL description, the (distributed) workflow runner can handle (without additional configuration) both job placement and data routing between compute nodes.

Via these two features (special handling of data paths; the optional but recommended use of software containers), the CWL standards enable portability (execution "without change"). Although various factors not controllable by software container technology can affect portability (e.g., variation in the underlying operating system kernel; variation in processor results), in practice the exact same software container and data inputs lead to portability without further adjustment from the user.

15 JSON is an acceptable subset of YAML, and common when converting from another format to CWL syntax.

16 Details on how the CWL Command Line Tool standard specifies that tool executors should set up and control the runtime environment are available at https://w3id.org/cwl/v1.2/CommandLineTool.html#Runtime_environment, which also specifies which directories tools are allowed to write to.

Figure 3 : Example of CWL portability. The same workflow description runs on the scientist's own laptop or single machine, on any batch production-environment, and on any common public or private cloud. The CWL standards enable execution portability by being explicit about data locations and execution models. (Panel label: Local execution on Linux, macOS, and MS Windows via the CWL reference implementation (cwltool) and Docker/uDocker/Singularity/podman/...)

To support features that are not in the CWL standards, the CWL standards define extension points that permit (namespaced) vendor-specific features in explicitly defined ways. If these extensions do not fundamentally change how the tool should operate, then they are added to the hints list and other CWL compatible engines can ignore them. However, if the extension is required to properly run the tool being described, e.g., due to the need for some specialized hardware, then the extension is listed under requirements and CWL compatible engines can recognize and explicitly declare their inability to execute that CWL description.
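The difference between the two lists can be sketched as the check an engine might perform before running a description. This is a simplified illustration, not the logic of any particular CWL implementation; the function name, the exception, the set of supported features, and the `acme:GPURequirement` extension are all hypothetical.

```python
# Features this imaginary engine knows how to honor (assumed for the example).
SUPPORTED = {"DockerRequirement", "ResourceRequirement"}


class UnsupportedRequirement(Exception):
    """Raised when a description demands a feature the engine lacks."""


def check_process(description, supported=SUPPORTED):
    """Return the hints that will be ignored; refuse unknown requirements."""
    for name in description.get("requirements", {}):
        if name not in supported:
            # Mandatory and unknown: declare the inability to execute.
            raise UnsupportedRequirement(name)
    # Optional and unknown: safe to skip, but worth reporting.
    return [name for name in description.get("hints", {}) if name not in supported]


# The same hypothetical vendor extension, once as a hint, once as a requirement.
as_hint = {"hints": {"acme:GPURequirement": {"gpus": 1}}}
as_requirement = {"requirements": {"acme:GPURequirement": {"gpus": 1}}}

ignored = check_process(as_hint)
try:
    check_process(as_requirement)
    refused = False
except UnsupportedRequirement:
    refused = True
```

Under these assumptions the first description still runs, with the extension reported as ignored, while the second is rejected up front rather than failing partway through execution.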

The CWL Workflow Description Standard 17 builds upon the CWL Command Line Tool Standard: it has the same YAML- or JSON-style syntax, with explicit workflow level inputs, outputs, and documentation (see Figure 2C ). A workflow description consists of a list of steps, each comprising a CWL CommandLineTool or a CWL sub-workflow and re-exposing its tool's required inputs. Inputs for each step are connected by referencing the name of either the common workflow inputs or particular outputs of other steps. The workflow outputs expose selected outputs from workflow steps, making explicit which intermediate step outputs will be returned from the workflow. All connections include identifiers, which CWL document authors are encouraged to name meaningfully, e.g., reference_genome instead of input7.
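The wiring described above can be sketched as a small workflow description, again written as a Python dictionary that serializes to JSON. The two steps, the referenced files `align.cwl` and `summarize.cwl`, and every identifier are hypothetical; the helper `upstream_steps` is likewise an illustration of how step-to-step connections can be read off the `in` sources.

```python
import json

# Hypothetical two-step workflow; the tool files it references are not provided.
workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "doc": "Illustrative wiring of two steps through named connections.",
    "inputs": {
        "reference_genome": {"type": "File"},
        "reads": {"type": "File"},
    },
    "steps": {
        "align": {
            "run": "align.cwl",
            # Step inputs refer to workflow-level inputs by name.
            "in": {"reference": "reference_genome", "reads": "reads"},
            "out": ["alignment"],
        },
        "summarize": {
            "run": "summarize.cwl",
            # "step/output" refers to an output of another step.
            "in": {"alignment": "align/alignment"},
            "out": ["report"],
        },
    },
    # Only selected step outputs are returned from the workflow.
    "outputs": {
        "report": {"type": "File", "outputSource": "summarize/report"},
    },
}


def upstream_steps(step):
    """Names of the steps whose outputs feed the given step."""
    return sorted({src.split("/")[0] for src in step["in"].values() if "/" in src})


print(json.dumps(workflow, indent=2))
```

Because every connection is a named reference, the dependency of `summarize` on `align` is recoverable from the document alone, without running anything.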

CWL workflows form explicit data flows, as required for the particular computational analysis. The connectivity between steps defines the partial execution order. Parallel execution of steps is permitted and encouraged whenever multiple steps have all of their inputs satisfied, e.g., in Figure 1 , find_16S_matches and find_S5_matches are at the same data dependency level and can execute concurrently or sequentially in any order. Additionally, a scatter construct allows the repeated execution of a CWL step (perhaps overlapping in time, depending on the resources available) where most of the inputs are the same except for one or more inputs that vary. This is done without requiring the modification of the underlying tool description. Starting with CWL version 1.2, workflows can also conditionally skip execution of a (tool or workflow) step, based upon a specified intermediate input or custom boolean evaluation. Combining these features allows for a flexible branch mechanism that allows workflow engines to calculate data dependencies before the workflow starts, and thus retains the predictability of the data flow paradigm.
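Two of these ideas, the partial order implied by connectivity and the scatter construct, can be sketched in Python. This is a conceptual illustration rather than how any CWL engine is implemented: the dependency table is invented (only the two step names come from the text, and `collate_unique_hits` is hypothetical), and `scatter` here runs sequentially where an engine could overlap the executions.

```python
def execution_layers(dependencies):
    """Group steps into layers; all steps in a layer have their inputs satisfied."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    done, layers = set(), []
    while remaining:
        ready = sorted(step for step, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle detected: no step has all inputs available")
        layers.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return layers


def scatter(step_function, fixed_inputs, scatter_name, values):
    """Repeat one step, varying a single input and keeping the others fixed."""
    return [step_function(**fixed_inputs, **{scatter_name: value}) for value in values]


# Assumed dependency table: step name -> steps whose outputs it consumes.
dependencies = {
    "find_16S_matches": [],
    "find_S5_matches": [],
    "collate_unique_hits": ["find_16S_matches", "find_S5_matches"],
}
layers = execution_layers(dependencies)


def count_matches(model, sequence):
    """Toy stand-in for a tool: count occurrences of a model in a sequence."""
    return sequence.count(model)


counts = scatter(count_matches, {"model": "AC"}, "sequence", ["ACAC", "GG"])
```

Steps in the same layer may run concurrently or in any order, and the step function itself is unchanged by being scattered, mirroring the point that the underlying tool description needs no modification.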

In contrast to hard-coded approaches that rely on implicit filepaths particular for each workflow, CWL workflows are more flexible, reusable, and portable (which enables scalability). The use in the CWL standards of explicit runtime environments, combined with explicit inputs/outputs to form the data flow, enables step reordering and explicit handling of iterations. The same features enable scalable remote execution and, more generally, flexible use of runtime environments. Moreover, individual tool definitions from multiple workflows can be reused in any new workflow.

CWL workflow descriptions are also future-proof. Forward compatibility of CWL documents is guaranteed, as each CWL document declares which version of the standards it was written for and minor versions do not alter the required features of the major version. A stand-alone upgrader 18 can automatically upgrade CWL documents from one version to the next, and many CWL-aware platforms will internally update user-submitted documents at runtime.

CWL is a set of standards, not a particular software product to install, purchase, or rent. The CWL standards need to be implemented to be useful; a list of some implementations of the CWL standards is in Table 1 . Workflow/tool runners that claim compliance with the CWL standards are allowed significant flexibility in how and where they execute a user's CWL documents as long as they fulfill the requirements written in those documents. For example, they are allowed (and encouraged) to distribute execution of a workflow across all available computers that can fulfill the resource requirements specified by the user. Aspects of execution not defined by the CWL standards include (web) APIs for workflow execution and real-time monitoring.

18 https://pypi.org/project/cwl-upgrader/

For example, details about when a step should be considered ready for execution are available in §4 of the CWL Workflow Description standard 19 , but once all the inputs are available, the exact timing is up to the workflow engine itself.

Step execution may result in a temporary or permanent failure, as defined in §4 of CWL Workflow Description standard 20 . It is up to the workflow engine to control any automatic attempts to recover from failures, e.g., to re-execute a Workflow step. Most workflow engines that implement the CWL standards offer the feature of attempting a number of re-executions, as set by the user, before reporting permanent failure.
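A retry policy of this kind can be sketched as follows. The exception classes, the function `run_with_retries`, and the `max_attempts` parameter are hypothetical names for this illustration; the CWL standards leave recovery behaviour to the engine, so this shows one plausible policy, not a prescribed one.

```python
class TemporaryFailure(Exception):
    """The step might succeed if executed again."""


class PermanentFailure(Exception):
    """The step is not expected to succeed on re-execution."""


def run_with_retries(step, max_attempts):
    """Re-execute a step after temporary failures, up to a user-set limit."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return step()
        except TemporaryFailure as error:
            last_error = error  # remember the cause and try again
    # Attempts exhausted: report a permanent failure.
    raise PermanentFailure(f"gave up after {max_attempts} attempts") from last_error


# Toy step that fails twice before succeeding.
attempts = {"count": 0}


def flaky_step():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TemporaryFailure("node unavailable")
    return "done"


result = run_with_retries(flaky_step, max_attempts=5)
```

A `PermanentFailure` raised by the step itself is deliberately not caught, so it propagates immediately without further attempts.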

The CWL community has developed the following optimizations without requiring that users re-write their workflows to benefit:

(1) Automatic streaming of data inputs and outputs instead of waiting for all the data to be downloaded or uploaded (where those data inputs or outputs are marked with "streamable: true")
(2) Workflow step placement based upon data location [18] , resource needs, and/or cost of data transfer [19]
(3) The re-use of the results from previously computed steps, even from a different workflow, as long as the inputs are identical. This can be controlled by the user via the "WorkReuse" directive 21 .
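The third optimization can be sketched as a cache keyed on a step's description and inputs. This is a conceptual illustration only: the class `ReusingRunner`, the function `cache_key`, and the `enable_reuse` parameter (a stand-in for the user-facing directive mentioned above) are hypothetical, and the sketch assumes file inputs are already represented by a content checksum so that identical inputs produce identical keys.

```python
import hashlib
import json


def cache_key(tool_description, inputs):
    """Derive a stable key from the tool description and its inputs."""
    payload = json.dumps({"tool": tool_description, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ReusingRunner:
    """Run steps, returning stored results when an identical step was run before."""

    def __init__(self, execute):
        self._execute = execute  # function(tool_description, inputs) -> result
        self._results = {}
        self.executions = 0

    def run(self, tool_description, inputs, enable_reuse=True):
        key = cache_key(tool_description, inputs)
        if enable_reuse and key in self._results:
            return self._results[key]
        self.executions += 1
        result = self._execute(tool_description, inputs)
        self._results[key] = result
        return result


# Toy executor standing in for a real tool invocation.
runner = ReusingRunner(lambda tool, inputs: sum(inputs["numbers"]))
tool = {"class": "CommandLineTool", "baseCommand": "sum"}

first = runner.run(tool, {"numbers": [1, 2, 3]})
second = runner.run(tool, {"numbers": [1, 2, 3]})  # served from the cache
```

Because the key does not mention the enclosing workflow, a result computed in one workflow can be reused by another, as long as the step description and inputs match.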

Real world usage at scale: CWL users and vendors routinely report that they analyze 5000 whole genome sequences in a single workflow execution; one customer of a commercial vendor reported a successful run of a workflow that contained an 8,000-wide step; the entire workflow had 25,000 container executions. By design, the CWL standards do not impose any technical limitations on the size of files processed or on the number of tasks run in parallel. The major scalability bottlenecks are hardware-related: not having enough machines with enough memory, compute, or disk space to process more and more data at a larger scale. As these boundaries move in the future with technological advances, the CWL standards should be able to keep up and not be a cause of limitations.

The CWL standards were designed for a particular style of command-line tool based data analysis. Therefore, the following situations are out of scope and not appropriate (or possible) to describe using CWL syntax:

(1) Safe interaction with stateful (web) services
(2) Real-time communication between workflow steps
(3) Interactions with command line tools besides 1) constructing the command line and making available file inputs (both user provided and synthesized from other inputs just prior to execution) and 2) consuming the output of the tool once its execution is finished, in the form of files created/changed, the POSIX standard output and error streams, and the POSIX exit code of the tool
(4) Advanced control-flow techniques beyond conditional steps
(5) Runtime workflow graph manipulations: dynamically adding or removing new steps during workflow execution, beyond any predefined conditional step execution tests that are in the original workflow description
(6) Workflows that contain cycles: "repeat this step or sub-workflow a specific number of times" or "repeat this step or sub-workflow until a condition is met." 22
(7) Workflows that need particular steps run at or during a specific day/time-frame

Given the numerous and diverse set of potential users, implementers, and other stakeholders, we posit that a project like CWL requires the combined development of code, standards, and community. Indeed, these requirements were part of the foundational design principles for CWL (Section 4.1); in the long run, these have fostered free and open source software (Sidebar B, in Section 4.2), and a vibrant and active ecosystem (Section 4.3).

The CWL project is based on a set of five principles:

Principle 1: The core of the project is the community of people who care about its goals.

Principle 2: To achieve the best possible results, there should be few, if any, barriers to participation. Specifically, to attract people with diverse experiences and perspectives, there must be no cost to participate.

Principle 3: To enable the best outcomes, project outputs should be used as people see fit. Thus, the standards themselves must be licensed for reuse, with no acquisition price.

Principle 4: The project must not favor any one company or group over another, but neither should it try to be all things to all people. The community decides.

Principle 5: The concepts and ideas must be tested frequently: tested and functional code is the beginning of evaluating a proposal, not the end.

In time, the CWL project members learned that this approach is a superset of the OpenStand Principles 23, a joint "Modern Paradigm for Standards" promoted by the IAB, IEEE, IETF, Internet Society, and W3C. The CWL project additions to the OpenStand Principles are: (1) to keep participation free of cost, and (2) the explicit choice of the Apache 2.0 license for all its text, conformance tests, and reference implementations.

Necessary and sufficient: All these principles have proven to be essential for the CWL project. For example, the zero cost and open-source license (Principles 2 and 3) have enabled many implementations of the CWL standards, several of which re-use different

22 Supporting cycles/loops as an optional feature has been suggested for a future version of the CWL standards, but it has yet to be put forth as a formal proposal with a prototype implementation. As a workaround, one can launch a CWL workflow from within a workflow system that does support cycles, as documented in the eWaterCycle case study with Cylc [27].

By 2021, the CWL standards have gained much traction and are currently widely supported in practice. In addition to the implementations in Table 1, Galaxy [2] 25 and Pegasus [10] 26 have in-development support for the CWL standards as well.

Wide adoption benefits from our principles: The CWL standards include conformance tests, but the CWL community does not yet test or certify implementations of the CWL standards, or specific technology stacks. Instead, the authors and service providers of workflow runners and workflow management systems self-certify support for the CWL standards, based on a particular technology configuration they deploy and maintain.

CWL plugins for text/code editors exist for Atom, vim, emacs, Visual Studio Code, IntelliJ, gedit, and any text editor that supports the "language server protocol" 28 standard.

There are tools to generate CWL syntax from Python (via argparse/click or via functions), ACD 29, CTD 30, and annotations in IPython Jupyter Notebooks. Libraries to generate and/or read CWL documents exist in many languages: Python, Java, R, Go, Scala, JavaScript, TypeScript, and C++.
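To give a sense of what generating a CWL description from an argparse-based Python program involves, here is a minimal sketch. It is not the implementation of any of the existing generators: the function `argparse_to_cwl`, the type mapping, and the example program `count-kmers` are assumptions made for this illustration, and the sketch reads the private `parser._actions` attribute of argparse, which may change between Python versions.

```python
import argparse
import json

# Assumed mapping from argparse "type" callables to CWL type names.
CWL_TYPES = {None: "string", str: "string", int: "int", float: "float"}


def argparse_to_cwl(parser, base_command):
    """Translate an ArgumentParser into a CommandLineTool description (dict)."""
    inputs = {}
    position = 1
    for action in parser._actions:  # private attribute, see the caveat above
        if action.dest == "help":
            continue  # the automatic -h/--help flag is not a tool input
        binding = {}
        if action.option_strings:
            binding["prefix"] = action.option_strings[-1]  # prefer the long form
        else:
            binding["position"] = position  # positionals keep their order
            position += 1
        if action.nargs == 0:
            cwl_type = "boolean"  # flags such as action="store_true"
        else:
            # argparse cannot tell paths from text, so files appear as strings.
            cwl_type = CWL_TYPES.get(action.type, "string")
        if action.option_strings and not action.required:
            cwl_type += "?"  # optional input in CWL shorthand
        entry = {"type": cwl_type, "inputBinding": binding}
        if action.help:
            entry["doc"] = action.help
        inputs[action.dest] = entry
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        "inputs": inputs,
        "outputs": {},  # outputs cannot be inferred from the parser alone
    }


# Hypothetical command-line program used only for this example.
parser = argparse.ArgumentParser(prog="count-kmers")
parser.add_argument("reads", help="sequence file to scan")
parser.add_argument("-k", "--kmer-size", type=int, help="k-mer length")
parser.add_argument("--canonical", action="store_true", help="merge reverse complements")

tool = argparse_to_cwl(parser, "count-kmers")
print(json.dumps(tool, indent=2))
```

A description generated this way would still need manual review, for example to change file arguments from `string` to `File` and to declare the outputs.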

Beyond the ratified initial and updated CWL standards released over the last six years, the CWL community has developed many tools, software libraries, connected specifications, and has shared CWL descriptions for popular tools. For example, there are software development kits for both Python 31 and Java 32 that are generated automatically from the CWL schema; this allows programmers to load, modify, and save CWL documents using an object oriented model that has direct correspondence to the CWL standards themselves. CWL SDKs for other languages are possible by extending the code generation routines 33 . (See Sidebar B in Section 4.2.2 for practical details.)

The CWL standards address the acute need to reuse (and, correspondingly, to share) information on workflow execution, authoring, and provenance. The CWLProv 34 prototype was created to show how existing standards [4, 21, 25] can be combined to represent the provenance of a specific execution of a CWL workflow [20]. Although, to date, CWLProv has only been implemented in the CWL reference runner, interest is high in additional implementations and further development.

The problem of standardizing computational reuse is only increasing in prominence and impact. Addressing this problem, various domains in science, engineering, and commerce have already started to migrate to workflows, but efforts focusing on the portability and even definition of workflows remain scattered. In this work we raise awareness of this problem and propose a community-driven solution.

The Common Workflow Language (CWL) is a family of standards for the description of command line tools and of workflows made from these tools. It includes many features developed in collaboration with the community: support for software containers, resource requirements, workflow-level conditional branching, etc. Built on a foundation of five guiding principles, the CWL project delivers open standards, open-source code, and an open community.
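The three features named above can be seen together in a small workflow description, written here as a Python dictionary and serialized as JSON (a subset of YAML). This is a minimal sketch that has not been validated against a workflow runner; the tool name `trim-reads`, the container image `example.org/trim-reads:1.0`, the resource figures, and the helper `conditional_steps` are assumptions made for this example.

```python
import json

workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {"reads": "File", "run_trim": "boolean"},
    "outputs": {
        # Optional type: a skipped step produces null instead of a file.
        "trimmed": {"type": "File?", "outputSource": "trim/trimmed_reads"}
    },
    "steps": {
        "trim": {
            # Workflow-level conditional: the step runs only if this is true.
            "when": "$(inputs.run_trim)",
            "in": {"reads": "reads", "run_trim": "run_trim"},
            "out": ["trimmed_reads"],
            "run": {
                "class": "CommandLineTool",
                "baseCommand": "trim-reads",  # hypothetical tool
                "requirements": {
                    # Software container that provides the tool (hypothetical image).
                    "DockerRequirement": {"dockerPull": "example.org/trim-reads:1.0"},
                    # Minimum resources: 2 cores and 4096 MiB of RAM.
                    "ResourceRequirement": {"coresMin": 2, "ramMin": 4096},
                },
                "inputs": {
                    "reads": {"type": "File", "inputBinding": {"position": 1}}
                },
                "outputs": {"trimmed_reads": {"type": "stdout"}},
                "stdout": "trimmed.fastq",
            },
        }
    },
}


def conditional_steps(description):
    """List the names of the steps that carry a 'when' condition."""
    return [name for name, step in description["steps"].items() if "when" in step]


print(json.dumps(workflow, indent=2))
print("conditional steps:", conditional_steps(workflow))
```

Nothing in the description says where or how the step is executed; the container and resource declarations only state what the step needs, which is what leaves scheduling decisions to the workflow system.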

For the past six years, the community around CWL has developed organically. Organizations looking to write, use, or fund data analysis workflows based upon command-line tools should adopt or even require the CWL standards, because the CWL standards offer a common yet reduced set of capabilities that are both used in practice and implemented in many popular workflow systems. CWL is further valuable because it is supported by a large-scale community, diverse fields have already adopted it, and its adoption is rapidly growing. Specifically,

(1) By using a reduced set of capabilities, the CWL standards limit the complexity encountered by users when they start to use them, and by operators when they have to implement them. (Feedback from the community indicates these are appreciated.)

(2) By using declarative syntax, CWL allows users to specify workflows even if they do not know exactly where the workflows would (later) run.

(3) The CWL project is governed in the public interest and produces freely available open standards. The CWL project itself is not a specific workflow management system, workflow runner, or vendor. This allows potential users, operators, and vendors to avoid lock-in and be more flexible in the future.

(4) By offering standards, the CWL project distinguishes itself especially for the complex interactions that appear in scientific and engineering collaborations. These interactions include defining workflows from many different tools (or steps), sharing workflows, long-term archiving, fulfilling requirements of regulators (e.g., US FDA), and making workflow executions auditable and reproducible. (This is particularly useful in cooperative environments, where groups that compete with each other need to collaborate, or in scientific papers where the paper results can be reused very efficiently if the analysis is described in a CWL workflow with publicly available software containers for all steps.)

(5) The CWL standards are already implemented, adopted, and used, with many production-grade implementations available as open source and at zero cost. Thus, the different communities of users of the CWL standards already offer numerous workflow and tool descriptions. (This is akin to how the Python ecosystem of shared libraries, code, and recipes is already helpful.)

To conclude: this is a call for others to embrace workflow thinking and join the CWL community in creating and sharing portable and complete workflow descriptions. With the CWL standards, the methods are included and ready to (re)use!

The CWL project is immensely grateful to the following self-identified CWL Community members and their contributions to the project: 

Alexander (Sasha) Wait Zaranek (Conceptualization, Funding Acquisition)

Business Development, Content, Documentation, Examples, Event Organizing, Maintenance, Packaging, Answering Questions

Bogdan Gavrilovic (Conceptualization, Software, Validation, Bug Reports, Blogposts, Maintenance, Tools, Answering Questions, Reviewed Contributions, User Testing), Carole A. Goble (Conceptualization, Funding Acquisition, Resources, Supervision, Audio, Business Development, Content, Examples, Event Organizing, Tools, Talks, Videos).

Funding acknowledgements: European Commission grants BioExcel-2 (SSR)

ELIXIR-EXCELERATE (SSR, HM) H2020

ELIXIR the research infrastructure for lifescience data

Various universities have also co-sponsored this project; we thank Vrije Universiteit of Amsterdam, the Netherlands, where the first three authors have their primary affiliation.

REFERENCES

[1] 2020. IEEE Standard for Bioinformatics Analyses Generated by High-Throughput Sequencing (HTS) to Facilitate Communication

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses

Parsl: Pervasive Parallel Programming in Python

Using a suite of ontologies for preserving workflow-centric research objects

KNIME - the Konstanz Information Miner: version 2.0 and beyond

Workflow Management in Condor

Scientific Workflows and Provenance: Introduction and Research Opportunities

From the desktop to the grid: scalable bioinformatics via workflow conversion

Pegasus, a workflow management system for science automation

From Repeatability to Reproducibility and Corroboration

Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software

POSIX.1-2008 (IEEE Std 1003.1™-2008 and The Open Group Technical Standard Base Specifications

Workflows and Provenance: Toward Information Science Solutions for the Natural Sciences

Alexander (Sasha) Wait Zaranek, and Pjotr Prins. 2020. COVID-19 PubSeq: Public SARS-CoV-2 Sequence Resource

Reproducibility in Scientific Computing

TR-19-01: A Cloud-Agnostic Framework for Geo-Distributed Data-Intensive Applications

PIVOT: Cost-Aware Scheduling of Data-Intensive Applications in a Cloud-Agnostic System

Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv

The BagIt File Packaging Format (V1.0). RFC 8493

The Cancer Genomics Cloud: Collaborative, Reproducible, and Democratized-A New Paradigm in Large-Scale Computational Research

rRNASelector: A computer program for selecting ribosomal RNA encoding sequences from metagenomic and metatranscriptomic shotgun libraries

The W3C PROV family of specifications for modelling provenance metadata

MGnify: the microbiome analysis resource in 2020

Workflow Automation for Cycling Systems: The Cylc Workflow Engine

Workflow systems turn raw data into scientific knowledge

Ten recommendations for creating usable bioinformatics command line software

An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

Nationaal plan open science