13. Extraction, transformation, and loading process¶
13.1. Main Objectives¶
This recipe identifies tools for data extraction, transformation, and loading (ETL).
ETL is the process of collecting data from one source and persisting it in another designated system, in which the data may be represented differently [1].
A common use case in data science is to build a scalable and portable ETL system that extracts data from various sources, transforms the data into a cohesive dataset, and loads the data into an internal or public database to support data exploration.
This recipe aims to serve as a starting point for designing ETL workflows, rather than a comprehensive list of all available tools.
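To make the extract-transform-load pattern concrete, below is a minimal illustrative sketch of such a workflow in Python, using pandas and SQLite purely as stand-ins for a source export and a target database; the file name, column name, and table name are hypothetical placeholders, not a recommendation of specific tools.

```python
# Minimal ETL sketch: extract a CSV export, harmonise it, load it into SQLite.
# File, column, and table names are hypothetical placeholders.
import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    # Extract: read raw records from a source system export.
    return pd.read_csv(path)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: harmonise column names and drop incomplete records.
    df = raw.rename(columns=str.lower)
    return df.dropna(subset=["sample_id"])


def load(df: pd.DataFrame, db: str = "warehouse.db") -> None:
    # Load: persist the curated table in the target database.
    with sqlite3.connect(db) as conn:
        df.to_sql("samples", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract("source_export.csv")))
```

In practice each of the three steps is typically delegated to one of the tools listed below, but the overall shape of the pipeline stays the same.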
The lists of tools were generated either by manual curation, which reflects what is being used in industry, or by automatic discovery from the bio.tools repository.
Warning
The content in these tables was generated in March 2021. For updated content, please check the FAIR tooling repository. To provide feedback on this content or report issues, please use the FAIR Cookbook GitHub issue tracker.
13.2. Graphical Overview¶
The figure below shows different ETL-related operations and their relationships, together with related tools and recipes.
The table below gives an overview of the ETL tools identified; details of each tool are provided later in this recipe. For ETL tools for the RDF data model, please check the dedicated recipe.
| Data query and retrieval | Data transformation | Data deposition |
|---|---|---|
| TAMR | SDTM | REDCap |
| BERT | TriFacta | TransMART |
| TERMite | OMOP | Stardog |
| Oracle Clinical | Collibra | PostgreSQL |
| GeoBoost2 | Apache Hadoop | Chemotion |
| MartView | Talend | |
| CLOBNET | Informatica | |
| PDBUSQLExtractor | OpenRefine-metadata-extension | |
| repo | DISQOVER | |
| SNPator | Query Tabular | |
| PyGMQL | ms-data-core-api | |
| PlantPIs | | |
| BAGET | | |
| NCBI Resources | | |
| CPAD | | |
Warning
The tools listed above aim to provide a basic overview of what is available on the market; this is not a formal recommendation. If you think key tools are missing or the list needs an update, please contact us via the GitHub issue tracker.
13.3. Requirements¶
Knowledge requirement:
General familiarity with data cleaning and data loading tasks.
13.4. Capability & Maturity Table¶
| Capability | Initial Maturity Level | Final Maturity Level |
|---|---|---|
| Interoperability | minimal | automated |
13.5. FAIRification Objectives, Inputs and Outputs¶
| Actions.Objectives.Tasks | Input | Output |
|---|---|---|
13.6. Operations¶
Data query and retrieval is the process of identifying and extracting structured or unstructured data from different sources, such as data management systems.
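As an illustration of this step, the sketch below retrieves matching record identifiers from the NCBI E-utilities `esearch` endpoint using the `requests` library; the query term and record limit are arbitrary examples, and retrieval from any other source API follows the same request/parse pattern.

```python
# A minimal retrieval sketch against the NCBI E-utilities esearch endpoint.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def search_pubmed(term, retmax=20):
    # Ask the API for identifiers of matching records, returned as JSON.
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    response = requests.get(ESEARCH, params=params, timeout=30)
    response.raise_for_status()
    # The identifier list sits under esearchresult/idlist in the JSON payload.
    return response.json()["esearchresult"]["idlist"]


print(search_pubmed("extract transform load"))
```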
Manually curated list of tools
| Tool | Description | License | Topics | Resource type |
|---|---|---|---|---|
| TAMR | A cloud-native data mastering solution (cloud MDM) that accelerates analytics through machine learning (ML), available on Google Cloud, Azure, and AWS. | Commercial license | Information integration, data unification | |
| BERT | Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google, aiming to extract context-sensitive features from an input text. | Apache License 2.0 | Relationship extraction | Data model |
| TERMite | TERMite (TERM identification, tagging & extraction) is a fast named entity recognition (NER) and extraction engine for semantic analytics. | Commercial license | Information extraction | |
| Oracle Clinical | A single application and infrastructure for electronic data capture and clinical data management, leveraging the Oracle database; it enables management of all clinical trial data in a single system, improving accuracy, visibility, and data integrity. | Commercial license | Clinical data | |
Automatically created list of tools by querying Bio.Tools.
| Tool | Description | License | Topics | Resource type |
|---|---|---|---|---|
| GeoBoost2 | A natural language processing pipeline for GenBank metadata enrichment for virus phylogeography. | Apache-2.0 | Biotechnology, Natural language processing, Workflows, Public health and epidemiology, Infectious disease | Web application |
| MartView | Tool for data retrieval and data mining that integrates data from Ensembl. Through the web interface it allows you to apply a series of filters to create custom datasets, which can be converted to several useful output formats. | N/A | Pathology, Data integration and warehousing, Database management, Molecular interactions, pathways and networks, Gene transcripts | Web application |
| CLOBNET | Cloud-based machine learning system (CLOBNET) that is an open-source, lean infrastructure for electronic health record (EHR) data integration and is capable of extract, transform, and load (ETL) processing. | BSD-2-Clause | Machine learning, Medicines research and development | Web application (for internal use) |
| PDBUSQLExtractor | Scalable extraction of big macromolecular data in the Azure Data Lake environment. | GPL-3.0 | Nucleic acid structure analysis | C# library |
| repo | A data manager meant to avoid manual storage/retrieval of data to/from the file system. It builds one (or more) centralized repositories where R objects are stored with rich annotations, including corresponding code chunks, and can easily be searched and retrieved. | GPL-3.0 | Data submission, annotation and curation, Data management, Database management | R library |
| SNPator | Originally designed to help CeGen users handle, retrieve, transform, and analyze the genetic data generated by the genotyping facilities of the institution. It is also open to external users who want to upload their own genotype data to take advantage of its data processing features, ranging from simple format transformations to complex biostatistical calculations. | N/A | Bioinformatics, Data quality management, Genetic variation, GWAS study | Web application |
| PyGMQL | Scalable data extraction and analysis for heterogeneous genomic datasets, based on GMQL. | Apache-2.0 | Imaging, Workflows, Database management | Python library |
Data portals with extraction functionalities discovered from Bio.Tools.
| Tool | Description | License | Topics | Resource type |
|---|---|---|---|---|
| PlantPIs | Web querying system for a database collecting plant protease inhibitors data. | | Plant biology, Gene and protein families | Database portal |
| BAGET | Web service designed to facilitate extraction of specific gene and protein sequences from completely determined prokaryotic genomes. Query results can be exported as a rich text format file for printing, archival, or further analysis. | N/A | DNA, Sequencing, Model organisms, Sequence analysis | Web application |
| NCBI Resources | Analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. | Open Access | Data mining, Data integration and warehousing, Database management, Molecular interactions, pathways and networks, Rare diseases | Web application, Database portal |
| CPAD | Curated Protein Aggregation Database (CPAD) is an integrated database containing information on amyloidogenic proteins, aggregation-prone regions, aggregation kinetics, and the structure of proteins relevant to aggregation. | N/A | Proteomics, Data submission, annotation and curation, Proteins, Small molecules, Literature and language | Data portal |
| GenomicsDB | Advancing clinical cohort selection with genomics analysis on a distributed platform. Highly performant data storage in C++ for importing, querying, and transforming variant data with Java/Spark. Used in GATK4. | MIT License | Exome sequencing, Systems medicine, Personalised medicine, Complementary medicine, Biobank | Command line |
| | A library to query and transform genomic data from indexed files. | MIT License | Oncology, Workflows, DNA | Python library |
13.6.1. Data transformation¶
Data transformation is the process of converting the data format, structure, and values for a specific goal, which may include meeting specific data standards.
Common operations in data transformation are listed below; a minimal code sketch of these operations follows the list.

- Data annotation: the process of labeling data or metadata.
- Data validation: the process of ensuring data meets data standards, in terms of both content and format.
- Data aggregation: the process of gathering data and presenting it, usually in a summarized format.
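The sketch below illustrates the three operations with pandas on a small in-memory table; the column names and the "standard" being enforced are hypothetical and stand in for whatever data model a real workflow targets.

```python
# A minimal sketch of annotation, validation, and aggregation with pandas.
import pandas as pd

records = pd.DataFrame({
    "sample_id": ["S1", "S2", None, "S4"],
    "tissue":    ["liver", "Liver", "kidney", "heart"],
    "value":     [1.2, 3.4, 5.6, 7.8],
})

# Annotation: label each record with a harmonised, lower-cased tissue term.
records["tissue_label"] = records["tissue"].str.lower()

# Validation: keep only records that satisfy the (hypothetical) standard of
# having a non-missing sample identifier.
valid = records.dropna(subset=["sample_id"])

# Aggregation: summarise the validated records per tissue label.
summary = valid.groupby("tissue_label")["value"].mean().reset_index()
print(summary)
```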
Manually curated list of tools
| Tool | Description | License | Topics | Resource type |
|---|---|---|---|---|
| SDTM | SDTM provides a standard for organizing and formatting data to streamline processes in collection, management, analysis, and reporting. | N/A | Data aggregation, data warehousing, Standard for Exchange of Nonclinical Data | Data standard |
| TriFacta | A data preparation tool for data quality improvement, data transformation, and building data pipelines. | Commercial license | | |
| OMOP | The OMOP Common Data Model allows for the systematic analysis of disparate observational databases. | N/A | Data transformation | Data model |
| Collibra | An enterprise-oriented data governance platform that provides tools for data management and stewardship. | Commercial license | Data catalogue, data governance, data linkage | |
| Apache Hadoop | A framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines. | | Distributed computing | |
| Talend | A unified approach that combines rapid data integration, transformation, and mapping with automated quality checks to ensure trustworthy data at every step. | Commercial license | Data integration | |
| Informatica | Connects to and fetches data from heterogeneous sources and processes the data. | Commercial license | Data integration | |
Automatically created list of tools by querying Bio.Tools.
| Tool | Description | License | Topics | Resource type |
|---|---|---|---|---|
| OpenRefine-metadata-extension | Enables a post-hoc FAIRification workflow: load an existing dataset, perform data wrangling tasks, add FAIR attributes to the data, generate a linked data version of the data and, finally, push the result to an online FAIR data infrastructure to make it accessible and discoverable. | MIT License | Data submission, annotation and curation, Data identity and mapping | Extension |
| DISQOVER | Data integration platform for public, licensed, and internal data. The Data Ingestion Engine enables transforming data into Linked Data, which can be searched, navigated, and analysed via the user interface and the API. | N/A | Biomedical science, Omics, Biology, Chemistry, Medicine | |
| | Construction of customized protein databases based on RNA-Seq data with or without genome guidance, database searching, post-processing, and report generation. | GPL-2.0 | Proteomics, Proteomics experiment, Data submission, annotation and curation, Data integration and warehousing, Database management, Data governance, RNA-seq | R library |
| Query Tabular | Galaxy-based tool which manipulates tabular files. Query Tabular automatically creates a SQLite database directly from a tabular file within a Galaxy workflow. The SQLite database can be saved to the Galaxy history and further processed to generate tabular outputs containing the desired information and formatting. | CC-BY-4.0 | Bioinformatics, Workflows | Galaxy-based |
| ms-data-core-api | The primary purpose of the ms-data-core-api library is to provide commonly used classes and an object model for proteomics experiments. You may also find it useful for your own computational proteomics projects. | Apache-2.0 | Proteomics, Proteomics experiment, Software engineering | Java library |
13.6.2. Data Deposition¶
Data deposition is the process of loading data into hosting (end) destinations, such as public data archives and data warehouses.
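As a minimal illustration of the loading step, the sketch below deposits curated records into a local SQLite table using the standard-library `sqlite3` module; the database file, table name, and schema are hypothetical, and a real deposition would target the archive's or warehouse's own submission interface.

```python
# A minimal loading sketch: bulk-insert curated records into a relational table.
import sqlite3

curated_records = [
    ("S1", "liver", 1.2),
    ("S2", "kidney", 3.4),
]

# The connection context manager wraps the statements in one transaction.
with sqlite3.connect("warehouse.db") as conn:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (sample_id TEXT, tissue TEXT, value REAL)"
    )
    # Parameterised bulk insert of the curated rows.
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", curated_records)
```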
Manually curated list of tools
| Tool | Description | License | Topics | Resource type |
|---|---|---|---|---|
| REDCap | REDCap is a secure web application for building and managing online surveys and databases; it can be used to collect virtually any type of data in any environment. | | Cloud migration | Web application |
| TransMART | An open-source data warehouse designed to store large amounts of clinical data from clinical trials, as well as data from basic research. | | Data storage, clinical data | |
| Stardog | Triple store database that provides an enterprise knowledge graph as a FAIR+ data catalogue. | Commercial license | Data integration, data catalog | |
| PostgreSQL | A free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. | | Relational database | |
Automatically created list of tools by querying Bio.Tools.
| Tool | Description | License | Topics | Resource type |
|---|---|---|---|---|
| Chemotion | Repository for chemistry research data that provides solutions for current challenges in storing research data in a feasible manner, allowing the conservation of domain-specific information in a machine-readable format. | CC-BY-4.0 | Data submission, annotation and curation, Workflows, Chemistry | Data repository |
13.6.3. Example use case¶
To show how some of these tools may be used, the following related recipes provide further guidance with real-life examples.
13.7. References¶
1. Wikipedia. Extract, transform, load. URL: https://en.wikipedia.org/wiki/Extract,_transform,_load
13.8. Authors¶
| Name | ORCID | Affiliation | Type | ELIXIR Node | Contribution |
|---|---|---|---|---|---|
| | | EMBL-EBI | | | Writing - Original Draft |
| | | Barcelona Supercomputing Centre | | | Writing - Original Draft |
| Sukhi Singh | | The Hyve | | | Tool curation |
| | | University of Oxford | | | Writing - Review & Editing |