4. Selecting terminologies and ontologies

Recipe Overview

Reading Time

15 minutes

Executable Code

No

Difficulty

Selecting terminologies and ontologies

Recipe Type

Guidance

Audience

Principal Investigator, Data Manager, Terminology Manager, Data Scientist, Ontologist

Maturity Level & Indicator

[F+MM-1.1C] [F+MM-1.2C]

Cite me with FCB020

4.1. Main Objectives

The main purpose of this recipe is to provide guidance on selecting the most suitable semantic artefacts for a given research context. For life science and biomedical projects, that context is largely set by the main theme of the work, i.e. risk assessment, clinical trials, drug discovery or fundamental research.

4.2. Graphical Overview

4.3. Capability & Maturity Table

Capability	Initial Maturity Level	Final Maturity Level
Interoperability	minimal	repeatable

4.4. Context is everything

The domain of operation will generally dictate the semantic framework that is most suited to a given dataset. In several fields, data standardization has advanced far enough that adopting a complete stack of standards, both syntactic and semantic, is a sound decision.

Here, we present the three most common scenarios in biomedical research, based on experience garnered during IMI eTRIKS [4]:

Clinical Trial Data
Observational Health Data
Basic research context

4.4.1. Clinical Trial Data

Operating in the field of clinical trials means that datasets are generated during interventional studies, in which researchers influence and control the predictor variables, usually different intensity levels of therapeutic agents, in order to gain insight into benefits for patient outcomes. In this context, regulatory requirements dictate that data be recorded in standard forms to allow review and appraisal by regulators, such as FDA reviewers in the US. The CDISC standards are the de facto standard in this area, and they mandate the use of semantic resources such as:

Semantic Resource	Domain	Service
CDISC vocabulary	clinical trial data	EVS
NCI Thesaurus	biomedicine	EVS,Bioportal,OLS
SNOMED-CT	pathology	EVS,Bioportal(§)
UMLS	pathology	EVS,Bioportal(§)
LOINC	laboratory tests	LOINC
RxNORM	drugs	Bioportal
GUDID	instruments	FDA

All are available from the NCI EVS system, LOINC, OLS or Bioportal.

Warning

Some resources are only available under restrictive licences that prevent derivative work, which may limit access and use. Furthermore, some licences are expensive.

4.4.2. Observational Health Data

This context refers to data collected during observational studies which, in contrast to interventional studies, draw inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints [1]. This is typically the case in epidemiological work or exposure follow-up studies in the context of risk assessment and evaluation of clinical outcomes. Observational health data can also include electronic health records (EHR) or administrative insurance claims, and allow research aimed at acquiring real-world evidence from large corpora of data. In this specific context, one model and its associated set of standards have been particularly successful. With information on several hundred million patients structured using the Observational Medical Outcomes Partnership (OMOP) model, the Observational Health Data Sciences and Informatics (OHDSI) open-science community has laid the foundation for a widely adopted data model. Therefore, building a FAIRification process around the standard stack produced by the OHDSI community needs to be considered when operating in such a data context.

Semantic Resource	Domain	Service
CDISC vocabulary	clinical trial data	EVS
NCI Thesaurus	biomedicine	EVS,Bioportal,OLS
SNOMED-CT	pathology	EVS,Bioportal(§)
UMLS	pathology	EVS,Bioportal(§)
LOINC	laboratory tests	LOINC
RxNORM	drugs	Bioportal

For a more detailed overview of, and deep dive into, the OHDSI and OMOP semantic support, we recommend reading the chapter dedicated to controlled terminology in the Book of OHDSI [2].

4.4.3. Basic research context

This refers to datasets and research output being generated using model organisms and cellular systems in the context of basic, fundamental research. In this arena, the regulatory pressure is much less present but this does not rule out data management best practices and proper archival requirements. As a consequence of fewer constraints, researchers are often confronted with a sea of options. This and the next sections aim to provide some guidance when tasked with deciding on which semantic resource to use.

Tip

An important consideration to bear in mind when selecting semantic resources is to assess whether or not data archival in public repositories will be required. For instance, submitting to NCBI Gene Expression Omnibus Data archive places no particular constraints on data annotations but if depositing to EMBL-EBI ArrayExpress, then selecting a resource such as the Experimental Factor Ontology (EFO) for annotating data could ease deposition.

Tip

The FAIRsharing registry [5] is an ELIXIR resource which provides invaluable content, as the catalogue offers an overview of the various semantic artefacts used by public data repositories.

4.5. Selecting Terminologies

4.5.1. Use Cases and General Recommendations

The use and implementation of common terminologies enables the normalisation and harmonisation of both variable labels and allowed values for each field. Implementing the use of common terminologies in the data collection or curation workflow will ensure consistency of the annotation across all data. This is particularly important if data is generated at multiple partner sites and/or by multiple individuals.
If data fields are annotated with terms from freely chosen ontologies (rather than those dictated by a common model such as CDISC), care should be taken to avoid picking terms from ontologies at random. If a set of concepts is entirely available in one ontology, this ontology should be preferred over a set of ontologies. Mapping services such as OxO are available to verify whether a term of interest in one ontology has an equivalent term in another ontology.
Restrictions of allowed values for a given field should ideally be limited to a single ontology and better yet, to a single branch of a chosen ontology. This will vastly improve the semantic queryability as well as the consistency and interoperability of the data.
Many ontologies and vocabularies reuse concepts from other ontologies, in line with best practice in ontology design, to limit duplication of efforts and proliferation of parallel synonymous concepts. Care should however be taken to use concepts in the most appropriate environment. This is usually their original source unless they are used as part of a larger set of terms. As an example, the Experimental Factor Ontology (EFO) reuses concepts from a range of ontologies, including species from the NCBI taxonomy, assays from OBI, and diseases and phenotypes from MONDO and HPO. If annotating a dataset or resource which covers all of these concepts, it therefore makes sense to use EFO as the primary annotation source. However, if only annotations for species are required, the NCBI taxonomy should be used directly to ensure completeness, since not all species in NCBItaxon will have been imported into EFO.
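
As a minimal sketch of the single-branch recommendation above, the following Python snippet checks that every value recorded for a field belongs to one branch of one ontology. The parent map, the `EX:` identifiers and the function names are hypothetical and made up for illustration; they are not taken from any real ontology release. In practice the hierarchy would be loaded from the ontology file or obtained from a lookup service.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical child -> parent map standing in for an ontology's is_a hierarchy.
# The identifiers are invented for illustration only.
PARENTS: Dict[str, Optional[str]] = {
    "EX:0000001": None,          # root of the toy ontology
    "EX:0000010": "EX:0000001",  # root of a 'disease' branch
    "EX:0000011": "EX:0000010",
    "EX:0000012": "EX:0000011",
    "EX:0000020": "EX:0000001",  # root of an 'assay' branch
    "EX:0000021": "EX:0000020",
}


def is_in_branch(term: str, branch_root: str,
                 parents: Dict[str, Optional[str]]) -> bool:
    """Return True if `term` is `branch_root` or one of its descendants."""
    current: Optional[str] = term
    seen = set()  # guards against accidental cycles in the map
    while current is not None and current not in seen:
        if current == branch_root:
            return True
        seen.add(current)
        current = parents.get(current)  # unknown terms end the walk
    return False


def values_outside_branch(values: Iterable[str], branch_root: str,
                          parents: Dict[str, Optional[str]]) -> List[str]:
    """List the values that do not fall under the allowed branch."""
    return [v for v in values if not is_in_branch(v, branch_root, parents)]
```

Calling `values_outside_branch(field_values, "EX:0000010", PARENTS)` during curation returns the annotations that need review, including terms from other ontologies, which are simply absent from the parent map.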

4.5.2. Selection Criteria

A set of widely accepted criteria for selecting terminologies (or other reporting standards) does not exist. There are, however, a number of excellent publications, such as “A sea of standards for omics data: sink or swim?” [7] and “Ten Simple Rules for Selecting a Bio-ontology” [3], that provide some guidance on the subject. Below is a set of suggested criteria for evaluating the suitability of a terminology resource.

Exclusion criteria:
- 🔸 Absent licence or terms of use (indicator of usability)
- 🔸 Restrictive licences or terms of use with restrictions on redistribution and reuse
- 🔸 Absence of term definitions
- 🔸 Absence of sufficient class metadata (indicator of quality)
- 🔸 Absence of sustainability indicators (absence of funding records)
Inclusion criteria:
- 🔰 Scope and coverage meets the requirements of the concept identified
- 🔰 Unique URI, textual definition and IDs for each term
- 🔰 Resource releases are versioned
- 🔰 Size of resource (indicator of coverage)
- 🔰 Number of classes and subclasses (indicator of depth)
- 🔰 Number of terms having definitions and synonyms (indicator of richness)
- 🔰 Presence of a help desk and contact point (indicator of community support)
- 🔰 Presence of term submission tracker/issue tracker (indicator of resource agility and capability to grow upon request)
- 🔰 Potential integrative nature of the resource (as indicator of translational application potential)
- 🔰 Licensing information available (as indicator of freedom to use)
- 🔰 Use of a top level ontology (as indicator of a resource built for generic use)
- 🔰 Pragmatism (as indicator of actual, current real life practice)
- 🔰 Possibility of collaborating: the resource accepts complaints/remarks that aim to fix or improve the terminology, and the resource organisation commits to acting on them within a short time frame (one month after receipt?)
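
The criteria above can be turned into a simple screening step. The sketch below is only one possible encoding: the `TerminologyProfile` fields, the subset of inclusion criteria retained and the unweighted count used as a score are all assumptions made for illustration, since the recipe does not prescribe any scoring scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TerminologyProfile:
    """Facts gathered about a candidate resource (illustrative fields)."""
    name: str
    # Exclusion-related facts
    has_licence: bool
    licence_allows_reuse: bool
    has_term_definitions: bool
    has_class_metadata: bool
    has_sustainability_evidence: bool
    # A few of the inclusion-related facts
    covers_required_scope: bool = False
    releases_versioned: bool = False
    has_issue_tracker: bool = False
    uses_top_level_ontology: bool = False


def exclusion_reasons(p: TerminologyProfile) -> List[str]:
    """Return the exclusion criteria the resource triggers (empty if none)."""
    checks = [
        (p.has_licence, "absent licence or terms of use"),
        (p.licence_allows_reuse, "restrictive licence or terms of use"),
        (p.has_term_definitions, "absence of term definitions"),
        (p.has_class_metadata, "absence of sufficient class metadata"),
        (p.has_sustainability_evidence, "absence of sustainability indicators"),
    ]
    return [reason for ok, reason in checks if not ok]


def inclusion_score(p: TerminologyProfile) -> int:
    """Count the inclusion criteria met (unweighted, an assumption)."""
    return sum([
        p.covers_required_scope,
        p.releases_versioned,
        p.has_issue_tracker,
        p.uses_top_level_ontology,
    ])
```

A candidate with a non-empty `exclusion_reasons` list would be set aside before scores are compared; the remaining candidates can then be ranked by `inclusion_score`, with the final choice still depending on the context of the data.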

4.5.3. Set of Core Terminologies

The terminologies presented here have been organized by theme and scope. When possible, sections are organized by granularity levels, progressing from macroscopic scale (organism) to microscopic scale (tissue, cells) and molecular scale (macromolecules, proteins, small molecules, xenobiotic chemicals). Domains also cover processes or actions and their participants or agents but also can be organized from general/generic (disease) to specialized/specific (infectious disease).

4.5.3.1. Organism, Organism Parts and Developmental Stages

The resources listed here focus on providing structured vocabularies to describe taxonomic and anatomical information.

Scope	Name	File location	Top-Level Ontology	Licence	Issue Tracker URI	Comment
Organism	NCBITaxonomy	http://purl.obolibrary.org/obo/ncbitaxon.owl	none specified	UMLS license
Vertebrate Anatomy	UBERON	http://purl.obolibrary.org/obo/uberon/ext.owl http://purl.obolibrary.org/obo/uberon/ext.obo	BFO	CC-by 3.0 Unported Licence	https://github.com/obophenotype/uberon/issues	Integrative Resource engineered to go across species
Human Anatomy	Foundational Model of Anatomy (FMA)	http://purl.obolibrary.org/obo/fma.owl		CC-by 3.0 Unported Licence	https://sourceforge.net/p/obo/foundational-model-of-anatomy-fma-requests/	Excellent cross-referencing with Uberon
Human Developmental Stages	Human Developmental Stages	http://purl.obolibrary.org/obo/hsapdv.owl		CC-by 3.0 Unported Licence
Mouse Anatomy	Mouse Anatomy (MA)	http://purl.obolibrary.org/obo/ma.owl		CC-by 4.0	https://github.com/obophenotype/mouse-anatomy-ontology/issues
Strain	Rat Strain Ontology	http://purl.obolibrary.org/obo/rs.owl		CC-by 4.0	https://github.com/rat-genome-database/RS-Rat-Strain-Ontology/issues

In research, many different model organisms are used (e.g. dogs, monkeys…) and specialized resources are available for many model organisms, including C. elegans, Drosophila, Xenopus, Zebrafish, plants and fungi. Use the selection criteria introduced earlier to gauge their value in the data management workflow and their impact on data integration tasks.

4.5.3.2. Diseases and Phenotype

Biology is a complex field and observable manifestations of biological processes in living organisms vary, dependent on genetic background and environmental factors. Working on correlating genetic features with observable (phenotypic) ones, biologists rely heavily on such variables in the quest for disease biomarkers, which could be used to identify possible therapeutic targets. The main challenge is to ensure efficient machine-actionable descriptions of these observable features.

Scope	Name	File location	Top-Level Ontology	Licence	Issue Tracker URI
Pathology/Disease (generic)
	SNOMED-CT	View on Bioportal		SNOMED license - part of the UMLS license
	NCI Thesaurus	http://evs.nci.nih.gov/ftp1/NCI_Thesaurus		NCI license
	International Classification of Diseases (ICD-10)	View on WHO site		WHO license
	Unified Medical Language System (UMLS)	https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html		UMLS license
	Disease Ontology Identifiers (DOID)	http://purl.obolibrary.org/obo/doid.owl	BFO	CC0 1.0 Universal	https://github.com/DiseaseOntology/HumanDiseaseOntology/issues
	MONDO Disease Ontology^*	http://purl.obolibrary.org/obo/mondo.owl	BFO	CC-BY 4.0	https://github.com/monarch-initiative/mondo/issues
	Infectious Disease Ontology (IDO)	https://code.google.com/p/infectious-disease-ontology/source/browse/trunk/src/ontology/ido-core/ido-main.owl	BFO	CC-by 3.0 Unported Licence	https://code.google.com/p/infectious-disease-ontology/issues/list
Phenotype
	Human Phenotype (HP)	http://purl.obolibrary.org/obo/hp.owl	BFO	HPO Licence	https://github.com/obophenotype/human-phenotype-ontology/issues/
	Medical Dictionary for Regulatory Activities Terminology (MedDRA)	View on Bioportal		Academic: freely accessible; Commercial: contact MSSO	https://mssotools.com/webcr/ (login required)
	Mammalian Phenotype (MP)	http://purl.obolibrary.org/obo/mp.owl		CC-BY 4.0	https://github.com/obophenotype/mammalian-phenotype-ontology/issues

^*MONDO was born of an effort to harmonise disease definitions from a number of sources, including OMIM (Online Mendelian Inheritance in Man), Orphanet, EFO and DOID, with work in progress to include NCIt. The OWL version includes axiomatisation using CL, Uberon, GO, HP, RO & NCBITaxon. The ontology is under active development by a range of ontology and domain experts. If no other limiting requirements dictate the use of an alternative ontology (e.g. use of NCIt as part of a CDISC-compliant dataset), it is therefore the most recommended open source ontology from the above list.

As with anatomy in the previous section, there is a growing body of organism-specific phenotype resources, such as C. elegans, Drosophila, Fission Yeast, Xenopus and Zebrafish.

4.5.3.3. Pathology and Disease Specific Resources

There is a wide range of ontologies available for specific diseases or disease types. Some examples are given below but this list is by no means exhaustive. Check ontology repositories such as OLS, Bioportal or the OBO Foundry for up-to-date lists of available ontologies.

Scope	Name	File location	Top-Level Ontology	Licence
Malaria	Malaria Ontology (IDOMAL)		BFO	CC0 1.0 Universal
Alzheimer Disease	Alzheimer’s Disease Ontology (ADO)	https://www.scai.fraunhofer.de/content/dam/scai/de/downloads/bioinformatik/ontologies/ADO/ADO.zip	BFO
Rare disorder	Orphanet Rare Disease Ontology (ORDO)	View on Bioportal		CC-BY 4.0

4.5.3.4. Cellular entities

Continuing our review of semantic resources by granularity level, this section details a number of reference resources which provide coverage for describing cell types, cell lines [1] and cellular phenotypes.

Scope	Name	File location	Top-Level Ontology	Licence	Issue Tracker URI
Cell	Cell Ontology (CL)	http://purl.obolibrary.org/obo/cl.owl http://purl.obolibrary.org/obo/cl.obo	BFO	CC-by 4.0	https://code.google.com/p/cell-ontology/issues/list
Cell Lines
	Cellosaurus	ftp://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo ftp://ftp.expasy.org/databases/cellosaurus		CC-by 4.0
	Cell Line Ontology (CLO)	https://github.com/CLO-ontology/CLO/blob/master/src/ontology/clo.owl	BFO	CC-by 3.0 Unported Licence	https://github.com/CLO-ontology/CLO/issues
Cell Molecular Phenotype	Cell Molecular Phenotype Ontology (CMPO)	https://github.com/EBISPOT/CMPO/releases/			https://github.com/EBISPOT/CMPO/issues

4.5.3.5. Molecular Entities

This section highlights the major and most widely used OBO Foundry resources for molecules of biological relevance as well as molecular structures, biological processes and cellular components.

Scope	Name	File location	Top-Level Ontology	Licence	Issue Tracker URI
Chemicals and Small Molecules	Chemical Entities of Biological Interest (ChEBI)	ChEBI	BFO	CC-by 4.0	https://github.com/ebi-chebi/ChEBI/issues
Gene Function, Molecular Component, Biological Process	Gene Ontology (GO)	http://purl.obolibrary.org/obo/go.obo http://purl.obolibrary.org/obo/go.owl	BFO	CC-by 4.0	http://sourceforge.net/p/geneontology/ontology-requests/
Protein/peptide	Protein Ontology (PRO)	https://proconsortium.org	BFO	CC-by 4.0	https://github.com/PROconsortium/PRoteinOntology/issues

Besides these open ontologies, in the context of clinically relevant work where drug formulations require recording and description, the following resources are relevant.

Scope	Name	File location	Top-Level Ontology	Licence	Issue Tracker URI
Drug
	National Drug File	View on Bioportal		NIH license
	The Drug Ontology (DRON)	http://purl.obolibrary.org/obo/dron.owl	BFO	CC-by 3.0 Unported Licence	https://ontology.atlassian.net/browse/DRON
	RxNORM	View on Bioportal		RxNORM license - part of the UMLS license

4.5.3.6. Assays and Technologies

The resources listed in this section provide key descriptors bridging data acquisition procedures (as used in a clinical setting and wet lab work) with instruments, units of measurement, endpoints and, sometimes, the biological processes or molecular entities of biological significance. Some of the resources are specialized semantic artefacts developed to support the standardized reporting of data modalities.

Scope	Name	File location	Top-Level Ontology	Licence	Issue Tracker URI
Radiology	Radiology Lexicon (RADLex)	View on Bioportal
Medical Imaging	DICOM	http://dicom.nema.org/medical/dicom/current/output/chtml/part16/chapter_D.html
Sample Processing/Reagents/Instruments Assay Definition	Ontology for Biomedical Investigations (OBI)	http://purl.obolibrary.org/obo/obi.owl	BFO	CC-by 4.0	https://github.com/obi-ontology/obi/issues
Biological screening assays and their results including high-throughput screening (HTS)	BioAssay Ontology (BAO)	http://www.bioassayontology.org/bao/bao_complete_bfo_dev.owl	BFO	CC-by-SA 4.0 International
Mass Spectrometry (instrument/acquisition parameter/spectrum related information)	HUPO Proteomics Standards Initiative-Mass Spectrometry controlled vocabulary (PSI-MS)	https://github.com/HUPO-PSI/psi-ms-CV	none specified	CC-by 4.0	https://github.com/HUPO-PSI/psi-ms-CV/issues
NMR Spectroscopy (instrument/acquisition parameter/spectrum related information)	Nuclear Magnetic Resonance Controlled Vocabulary (NMR-CV)	http://nmrml.org/cv/v1.0.rc1/nmrCV.owl	BFO	CC0 1.0 Universal	https://github.com/nmrML/nmrML/issues?state=open
Laboratory test	Logical Observation Identifier Names and Codes (LOINC)	LOINC and RELMA Complete Download File https://loinc.org/downloads/	none specified	RELMA license
Units	Units Ontology (UO)	http://purl.obolibrary.org/obo/uo.owl		CC-by 3.0 Unported Licence	https://github.com/bio-ontology-research-group/unit-ontology/issues

Some multi-domain ontologies such as the NCI Thesaurus (NCIt) and the Experimental Factor Ontology (EFO) also cover aspects of the above domains such as assays and sample collection and processing. Depending on the overall context of a resource selection process, it can make more sense to use a multi-domain ontology with suitable coverage to improve consistency and interoperability within a resource or dataset.

Finally, a resource exists that describes statistical measures, statistical tests or methods as well as statistically relevant graphical representations. It may be used for reporting results and annotating experimental results.

Scope	Name	File location	Top-Level Ontology	Licence	Issue Tracker URI
Experimental Design, Statistical Methods and Statistical Measures	Statistical Methods Ontology (STATO)	http://stato-ontology.org	BFO	CC-by 3.0 Unported Licence	https://github.com/ISA-tools/stato/issues?state=open

4.5.4. Relations

Also known as OWL Properties, relations are often overlooked by data scientists who are not knowledge engineers or ontologists. They are essential components: when correctly crafted, with a proper understanding of the logical constraints available in semantic languages such as OWL, they are exploited by tools known as reasoners to carry out the following key tasks:

Ontology logical consistency checks
Automatic classification and inference tasks
Entailments, i.e. detection of logical consequences resulting from axiomatic definitions (closely related to the point above)

This is particularly important when processing billions of facts expressed as RDF statements.

One also needs to understand the limitations in expressivity of the current semantic web languages and their associated axiomatics, as well as the computational constraints associated with inference. For a more in-depth review of such topics, the reader is invited to consult the following work [6].

In the field of Biology and Biomedicine, the OBO Foundry coordinates the development of interoperable ontologies. At the core of this interoperation lies the Relation Ontology released under the CC0 1.0 Universal license.

Relation Ontology	File	Variant
Relation Ontology	ro.owl	Canonical edition
Relation Ontology in obo format	ro.obo	Has imports merged in
RO Core relations	ro/core.owl	Minimal subset intended to work with BFO-classes
RO base ontology	ro/ro-base.owl	Axioms defined within RO and to be used in imports for other ontologies
Interaction relations	ro/subsets/ro-interaction.owl
Ecology subset	ro/subsets/ro-eco.owl	For use in ecology and environmental science
Neuroscience subset	ro/subsets/ro-neuro.owl	For use in neuroscience

As knowledge graphs and property graphs gain importance, we can expect the range and depth of relations to mature and expand as more expressivity is needed and progress is made by reasoner technology to fully exploit their benefits. This would also have to be placed in the context of advances in Text Mining and Machine Learning, where unsupervised methods start to demonstrate strong potential to detect relations between entities.

The following is an example of how a defined class may be created in an ontology. The code snippet shows one such class being expressed to create a type by specifying a number of axioms. These use relations (aka OWL Properties), which may be set to

'B cell, CD19-positive'
equivalentClass :
    'lymphocyte of B lineage, CD19-positive' 
    and ( 'has plasma membrane part' some 'CD19 molecule') 
    and ( 'in taxon' some Mammalia) 
    and ( 'capable of' some 'B cell mediated immunity')

Any class satisfying these patterns may be classified by an OWL reasoner as a child of that class. The following class has properties that all satisfy the requirements of the defined class declared above (e.g. “Homo sapiens” is_a “Mammalia”, etc.), so it will be classified automatically (i.e. without human intervention) by a reasoner such as ELK or HermiT as a child of ‘B cell, CD19-positive’.

'human B cell, CD19-positive'
Class:
    ( 'has plasma membrane part' some 'B-lymphocyte antigen CD19 isoform h2')
    and ( 'in taxon' some 'Homo sapiens') 
    and ( 'capable of' some 'B cell tolerance induction in mucosal-associated lymphoid tissue')

This notion is important to grasp, as it also explains why not all ontologies are compatible: they may differ significantly in the underlying axioms they rely on to establish their hierarchies using reasoners.
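
To make the classification step more concrete, here is a deliberately simplified Python sketch of the check a reasoner performs on the two classes above. It is not an OWL reasoner and does not reproduce how ELK or HermiT work: it only compares "relation some filler" restrictions, ignores the named superclass 'lymphocyte of B lineage, CD19-positive', and relies on a small `IS_A` map whose direct links are assumptions made for illustration (real hierarchies have many intermediate classes).

```python
from typing import Dict, Optional, Set, Tuple

# A restriction (relation, filler) is read as "relation some filler".
Restriction = Tuple[str, str]

# Assumed is_a links between fillers, collapsed to single steps for brevity.
IS_A: Dict[str, str] = {
    "Homo sapiens": "Mammalia",
    "B-lymphocyte antigen CD19 isoform h2": "CD19 molecule",
    "B cell tolerance induction in mucosal-associated lymphoid tissue":
        "B cell mediated immunity",
}

# Restrictions of the defined class 'B cell, CD19-positive'.
DEFINED_CLASS: Set[Restriction] = {
    ("has plasma membrane part", "CD19 molecule"),
    ("in taxon", "Mammalia"),
    ("capable of", "B cell mediated immunity"),
}

# Restrictions of the candidate class 'human B cell, CD19-positive'.
CANDIDATE: Set[Restriction] = {
    ("has plasma membrane part", "B-lymphocyte antigen CD19 isoform h2"),
    ("in taxon", "Homo sapiens"),
    ("capable of",
     "B cell tolerance induction in mucosal-associated lymphoid tissue"),
}


def is_subclass_of(child: str, parent: str) -> bool:
    """Walk the is_a links upwards from `child` looking for `parent`."""
    current: Optional[str] = child
    seen = set()  # guards against cycles in the toy map
    while current is not None and current not in seen:
        if current == parent:
            return True
        seen.add(current)
        current = IS_A.get(current)
    return False


def satisfies(candidate: Set[Restriction], definition: Set[Restriction]) -> bool:
    """True if every restriction of the definition is met by a candidate
    restriction on the same relation with an equal or more specific filler."""
    return all(
        any(rel == d_rel and is_subclass_of(filler, d_filler)
            for rel, filler in candidate)
        for d_rel, d_filler in definition
    )
```

With these inputs, `satisfies(CANDIDATE, DEFINED_CLASS)` holds, which mirrors why the human class is placed under the defined class. Changing the assumed `IS_A` links changes the outcome, which illustrates how ontologies built on different axioms can yield different hierarchies.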

4.6. Conclusions

Selecting semantic resources depends on many different factors. However, the most important factor remains the context of the data and associated landscape of data standards as well as the ultimate integration goal, which will dictate the final choice.

The selection process remains guided by the need to maximize the potential of data integration with datasets of similar nature and similar value. It also requires a good understanding of the technical and sometimes legal implications these choices will have.

4.6.1. What should I read next?

How to build an application ontology? Building an application ontology with ROBOT
How to select an ontology service? Selecting an ontology lookup service
How to deploy an ontology server? Portals and lookup services
How to establish a minimal metadata profile? Metadata profile validation in RDF

4.7. References

4.8. Authors

Authors

Name	Affiliation	Contribution
Philippe Rocca-Serra	University of Oxford	Writing - Original Draft
Susanna-Assunta Sansone	University of Oxford	Writing - Review & Editing, Funding Acquisition
Danielle Welter	University of Luxembourg	Writing - Review & Editing
Alasdair J G Gray	Heriot Watt University	Writing - Review & Editing