Making Computational Workflows FAIR




Recipe Overview

Reading Time: 15 minutes
Executable Code: Yes
Recipe Type: Hands-on

Main Objectives

The main objectives of this recipe are to:

  • Provide guidance on the resources available to help developers and data scientists make the workflows used in daily tasks (extract-load-transform, quality control, deployment, or analytical workflows) available in open formats and reusable.

  • Provide guidance for regulatory submissions of nucleic acid sequence analyses using the BioCompute Object (BCO) specification.

  • Highlight the active nature of the field and the fast-evolving environments and platforms developed for these tasks.

  • Provide an example using the Apache Airflow framework to illustrate the process.


Graphical Overview

FAIRification Objectives, Inputs and Outputs

Actions.Objectives.Tasks | Input | Output
validation | Common Workflow Language (CWL) | report
text annotation | EDAM | annotated text

Table of Data Standards

Data Formats | Terminologies | Models
CWL | EDAM | BioCompute Object (IEEE 2791-2020)

Tools:

Name | URL | Type
Apache Airflow | https://airflow.apache.org/ | workflow engine
Galaxy | https://galaxy.aws.biochemistry.gwu.edu/root/login?redirect=%2F | workflow engine
HIVE | https://hive.aws.biochemistry.gwu.edu/dna.cgi?cmd=main | workflow engine
BioCompute Platform | https://portal.aws.biochemistry.gwu.edu/sign-in | workflow engine
SevenBridges BioCompute App | https://sbg.github.io/biocompute/ | workflow engine
CWL-Airflow | https://barski-lab.github.io/cwl-airflow/ | adapter


Main Content

Workflows are ubiquitous in the data science ecosystem. The ability to automate repetitive tasks, build complex pipelines, and schedule and distribute tasks to cloud infrastructures has popularized the use of workflow engines, while also reducing the risk of errors associated with human operator fatigue. Workflow engines such as Galaxy [2], Snakemake [5], Cromwell [7], KNIME [3], Apache Airflow [1], and Toil [6], to name a few, have popularized the use of workflows in life science computational applications. However, buying into a particular platform can also become a source of difficulty when exchanging information with other platforms or migrating away from the initial choice. Hence, a community of experts has dedicated efforts to defining open specifications for the description of workflows, as well as supporting tools such as converters.

Using an example based on a Next Generation Sequencing (NGS) application, this recipe shows the reader how to make workflows more interoperable and reusable with existing, off-the-shelf tools.

1. CWL: Common Workflow Language - A brief overview

  • CWL, short for Common Workflow Language, is an open standard developed by a consortium of experts, including workflow engine developers, data scientists, data analysts and bioinformaticians.

  • The CWL specification is available from: https://www.commonwl.org/v1.2/Workflow.html

  • CWL uses a YAML syntax to describe workflow steps, tools, inputs, outputs and parameters (see the minimal example after this list).

  • CWL provides a platform-independent workflow description, meaning that a workflow should ideally be described once and then be executable on any CWL-aware workflow engine.

  • CWL is implemented by a growing number of platforms, which are listed on the CWL website.

  • The CWL user guide is available here: http://www.commonwl.org/user_guide/
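
As a minimal sketch of that syntax (modelled on the introductory example in the CWL user guide; the file name echo-tool.cwl and the message input are illustrative assumptions), the following CommandLineTool wraps the Unix echo command:

#!/usr/bin/env cwl-runner
# echo-tool.cwl: a minimal, illustrative CommandLineTool
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1   # pass the value as the first positional argument
outputs: []         # this trivial tool does not capture any output

Such a description can then be executed with any CWL runner, for example: cwltool echo-tool.cwl --message "Hello, CWL".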

2. Conditional Workflow and the CWL when keyword:

When describing a protocol, it is often desirable to specify what to do if a specific situation arises. Computational workflows are no different: it is in fact quite common to need a specific set of steps to run only if a threshold or condition is met. The Common Workflow Language therefore provides a dedicated keyword, when, to represent such situations. The following block, taken from the CWL user guide, shows how it can be used:

source: http://www.commonwl.org/user_guide/24_conditional-workflow/index.html

class: Workflow
cwlVersion: v1.2
inputs:
  val: int

steps:

  # step1 runs only when its "when" expression evaluates to true,
  # i.e. when the input value is below 1; otherwise the step is skipped
  # and its output is null.
  step1:
    in:
      in1: val
      a_new_var: val
    run: foo.cwl
    when: $(inputs.in1 < 1)
    out: [out1]

  # step2 runs the same tool, but only when the input value is above 2.
  step2:
    in:
      in1: val
      a_new_var: val
    run: foo.cwl
    when: $(inputs.a_new_var > 2)
    out: [out1]

outputs:
  out1:
    type: string
    outputSource:
      - step1/out1
      - step2/out1
    # select the first non-null result among the (possibly skipped) steps
    pickValue: first_non_null

requirements:
  # needed to evaluate the JavaScript "when" expressions
  InlineJavascriptRequirement: {}
  # needed because out1 collects from multiple upstream sources
  MultipleInputFeatureRequirement: {}
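
To exercise the conditional logic with a CWL runner such as cwltool, an input object can be supplied; the file names conditional-wf.cwl and conditional-job.yml below are illustrative assumptions:

# conditional-job.yml: example input object (hypothetical file name)
# With val: 0, step1's condition (val < 1) holds and step2 is skipped,
# so out1 receives step1's result through pickValue: first_non_null.
val: 0

Running cwltool conditional-wf.cwl conditional-job.yml would then execute only step1; a value of 3 would instead trigger only step2.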

3. Semantic Markup of CWL workflows

CWL documents can be annotated with Schema.org or EDAM vocabulary elements to support findability.

The block of code below shows how this is done, combining both vocabularies in a single annotated tool description.

#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool

label: An example tool demonstrating metadata.
doc: Note that this is an example and the metadata is not necessarily consistent.

hints:
  ResourceRequirement:
    coresMin: 4

inputs:
  aligned_sequences:
    type: File
    label: Aligned sequences in BAM format
    format: edam:format_2572
    inputBinding:
      position: 1

baseCommand: [ wc, -l ]

stdout: output.txt

outputs:
  report:
    type: stdout
    format: edam:format_1964
    label: A text file that contains a line count

s:author:
  - class: s:Person
    s:identifier: https://orcid.org/0000-0002-6130-1021
    s:email: mailto:dyuen@oicr.on.ca
    s:name: Denis Yuen

s:contributor:
  - class: s:Person
    s:identifier: http://orcid.org/0000-0002-7681-6415
    s:email: mailto:briandoconnor@gmail.com
    s:name: Brian O'Connor

s:citation: https://dx.doi.org/10.6084/m9.figshare.3115156.v2
s:codeRepository: https://github.com/common-workflow-language/common-workflow-language
s:dateCreated: "2016-12-13"
s:license: https://spdx.org/licenses/Apache-2.0 

s:keywords: edam:topic_0091 , edam:topic_0622
s:programmingLanguage: C

$namespaces:
 s: https://schema.org/
 edam: http://edamontology.org/

$schemas:
 - https://schema.org/version/latest/schemaorg-current-http.rdf
 - http://edamontology.org/EDAM_1.18.owl

4. Publishing Workflows as CWL in WorkflowHub.eu:

  • Workflows are digital objects which can and should be preserved.

  • A number of repositories exist and may be used to deposit workflows.

  • One may use a generic repository such as Zenodo to do so (see the recipe on depositing in the Zenodo generic repository).

  • Preferably, one should use a specialized registry such as WorkflowHub.eu.

5. Tools: Apache Airflow playing with CWL

Apache Airflow is "a platform created by the community to programmatically author, schedule and monitor workflows", to quote the project's site. It has established itself in industry settings and enjoys broad uptake.

Apache Airflow represents workflows as Directed Acyclic Graphs (DAGs) and allows these to be serialized as JSON documents.

A defining characteristic of Apache Airflow is that workflows are generated from code, as the sketch below illustrates. For more information, refer to this tutorial: https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html.
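
As a minimal sketch (assuming Airflow 2.x; the DAG id, file name and task commands are illustrative placeholders, not part of any real pipeline), a DAG is defined in plain Python:

# minimal_dag.py: an illustrative three-step Airflow DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="minimal_etl",             # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,           # run only when triggered manually
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load      # declare the DAG edges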

A tool developed by Michael Kotliar, Andrey V Kartashov and Artem Barski brings CWL support to the Apache Airflow framework, meaning that CWL-expressed workflows can now be executed on the platform [4].

A key step in this linkage is the conversion of a CWL-expressed workflow into an Apache Airflow DAG, which can then be executed.

With this example, we aim to raise awareness of the value of platform-independent expressions of workflows.

6. BioCompute Object format, an IEEE specification suited for use in regulatory applications

If computational analyses of sequence data are performed in the context of clinical trials, for instance to demonstrate the transcriptomic response to a drug or to show the safety of a compound in populations of distinct genetic backgrounds using genotyping information, it is a regulatory requirement of the US FDA that the computational workflows be submitted when seeking approval. The availability of such information in this context is a prerequisite for FDA auditors to examine the data.

The IEEE 2791-2020 specification, also known as BCO for BioCompute Object, is a specification designed for this purpose.

This has been made possible by the fast-track submission of a new data format specifically tailored to ensure reproducibility and an unambiguous description of key workflow descriptors.

What are the main features of a BioCompute Object?

  • a BioCompute Object is serialized as a JSON document. A typical BCO looks like the example at:

source: https://github.com/biocompute-objects/bco-ro-example-chipseq/blob/main/data/chipseq_20200910.json
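
While the linked document is lengthy, a heavily trimmed, illustrative skeleton of the top-level BCO domains (all field values below are placeholders, not taken from the linked example) looks roughly like this:

{
  "object_id": "https://example.org/BCO_000001",
  "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
  "etag": "d41d8cd98f00b204e9800998ecf8427e",
  "provenance_domain": {
    "name": "Example NGS analysis",
    "version": "1.0",
    "created": "2020-09-10T00:00:00Z",
    "modified": "2020-09-10T00:00:00Z",
    "contributors": [],
    "license": "https://spdx.org/licenses/CC-BY-4.0.html"
  },
  "usability_domain": [
    "Free-text statement of what the pipeline is for and when to use it"
  ],
  "description_domain": { "pipeline_steps": [] },
  "execution_domain": {
    "script": [],
    "script_driver": "cwl-runner",
    "software_prerequisites": [],
    "external_data_endpoints": [],
    "environment_variables": {}
  },
  "io_domain": { "input_subdomain": [], "output_subdomain": [] }
}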

  • a BioCompute Object can be packaged as an RO-Crate.

source: https://github.com/biocompute-objects/bco-ro-example-chipseq/blob/main/data/ro-crate-metadata.json
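
The RO-Crate packaging amounts to a small ro-crate-metadata.json file describing the crate's contents; a minimal sketch (following RO-Crate 1.1 conventions, with the BCO file name taken from the example repository above) could look like this:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
      "about": { "@id": "./" }
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "hasPart": [ { "@id": "chipseq_20200910.json" } ]
    },
    {
      "@id": "chipseq_20200910.json",
      "@type": "File",
      "name": "BioCompute Object describing a ChIP-seq analysis workflow"
    }
  ]
}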

  • a BioCompute Object can be integrated with HL7 FHIR as a Provenance Resource.

{
  "resourceType": "Provenance",
  "id": "example-biocompute-object",
  "text": {
    "status": "generated",
    "div": "<div xmlns=\"http://www.w3.org/1999/xhtml\">\n\t\t\t<p>\n\t\t\t\t<b>Generated Narrative with Details</b>\n\t\t\t</p><p>\n\t\t\t\t<b>id</b>: example-biocompute-object</p><p>\n\t\t\t\t<b>target</b>: <a href=\"http://build.fhir.org/sequence-example.html\">MolecularSequence/example</a>\n\t\t\t</p><p>\n\t\t\t\t<b>period</b>: 2017-6-6 --&gt; (ongoing)</p><p>\n\t\t\t\t<b>recorded</b>: 2016-6-9 8:12:14</p><p>\n\t\t\t\t<b>reason</b>: antiviral resistance detection (Details: [not stated] code null = 'null', stated as\n         'antiviral resistance detection')</p>\n\t\t\t<h3>Agents</h3>\n\t\t\t<table>\n\t\t\t\t<tr>\n\t\t\t\t\t<td>-</td>\n\t\t\t\t\t<td>\n\t\t\t\t\t\t<b>Role</b>\n\t\t\t\t\t</td>\n\t\t\t\t\t<td>\n\t\t\t\t\t\t<b>Who</b>\n\t\t\t\t\t</td>\n\t\t\t\t</tr>\n\t\t\t\t<tr>\n\t\t\t\t\t<td>*</td>\n\t\t\t\t\t<td>Author (Details: http://hl7.org/fhir/provenance-participant-role code author = 'Author',\n             stated as 'null')</td>\n\t\t\t\t\t<td>\n\t\t\t\t\t\t<a href=\"http://build.fhir.org/practitioner-example.html\">Practitioner/example</a>\n\t\t\t\t\t</td>\n\t\t\t\t</tr>\n\t\t\t</table>\n\t\t\t<h3>Entities</h3>\n\t\t\t<table>\n\t\t\t\t<tr>\n\t\t\t\t\t<td>-</td>\n\t\t\t\t\t<td>\n\t\t\t\t\t\t<b>Role</b>\n\t\t\t\t\t</td>\n\t\t\t\t\t<td>\n\t\t\t\t\t\t<b>Reference</b>\n\t\t\t\t\t</td>\n\t\t\t\t</tr>\n\t\t\t\t<tr>\n\t\t\t\t\t<td>*</td>\n\t\t\t\t\t<td>source</td>\n\t\t\t\t\t<td>\n\t\t\t\t\t\t<a href=\"https://hive.biochemistry.gwu.edu/cgi-bin/prd/htscsrs/servlet.cgi?pageid=bcoexample_1\">Biocompute example</a>\n\t\t\t\t\t</td>\n\t\t\t\t</tr>\n\t\t\t</table>\n\t\t</div>"
  },
  "target": [
    {
      "reference": "MolecularSequence/example"
    }
  ],
  "occurredPeriod": {
    "start": "2017-06-06"
  },
  "recorded": "2016-06-09T08:12:14+10:00",
  "activity": {
    "text": "antiviral resistance detection"
  },
  "agent": [
    {
      "type": {
        "coding": [
          {
            "system": "http://terminology.hl7.org/CodeSystem/v3-ParticipationType",
            "code": "AUT"
          }
        ]
      },
      "who": {
        "reference": "Practitioner/example"
      }
    }
  ],
  "entity": [
    {
      "role": "source",
      "what": {
        "identifier": {
          "type": {
            "coding": [
              {
                "system": "https://hive.biochemistry.gwu.edu",
                "code": "biocompute",
                "display": "obj.1001"
              }
            ]
          },
          "value": "https://hive.biochemistry.gwu.edu/cgi-bin/prd/htscsrs/servlet.cgi?pageid=bcoexample_1"
        }
      }
    }
  ]
}
  • a BioCompute Object may reference a CWL-expressed workflow, thus increasing interoperability.

Several tools currently support the BCO format, including the platforms listed in the Tools table above (HIVE, Galaxy, the BioCompute Platform, and the SevenBridges BioCompute App).

Conclusion

This recipe highlighted important considerations to bear in mind when dealing with workflows, as these digital objects have become essential information carriers in data science.

While there is no shortage of tools and frameworks for building, saving and executing workflows, ensuring that workflows can be found, interpreted by machines without human intervention, and executed are essential aspects of reusability and interoperability.

Data scientists and information managers should therefore tap into the standardization efforts capable of ensuring appropriate provenance tracking and information preservation.

This knowledge can be harnessed to decide whether to trust the results of an analysis or transformation process, or whether to perform new ones.

Reference

Authors