Datasets of relevance

Recipe Overview

Reading Time: 15 minutes
Executable Code: No
Difficulty: Datasets of relevance
Recipe Type: Guidance
Audience: Principal Investigator, Data Manager, Terminology Manager, Data Scientist
Maturity Level & Indicator: [F+MM-1.1C] [F+MM-1.2C]
Cite me with FCB0XX

Main Objectives

The FAIR Cookbook aims to provide hands-on, practical advice on how to deliver FAIR data through interactions with Innovative Medicines Initiative projects. These research projects, by nature, often involve patient-centric information. Dealing with real-world data and human-centric information, clinical data in particular, is challenging. It most often requires interacting with Data Access Committees (DACs) and undergoing a vetting process, which can be lengthy and convoluted. This can become a hindrance if the focus of the work is to deliver training on the computational methods available to deal with such data rather than on data custody-related tasks, however important these are.

This FAIR Cookbook recipe aims to provide a list of relevant resources belonging to the realm of clinical data so that readers can, with minimal hassle:

familiarize themselves with the data types (for instance, what Electronic Health Records look like).
familiarize themselves with the procedures to gain access to sensitive data.
obtain datasets with which to work and hone computational skills.

The recipe will cover two types of datasets:

real datasets, such as the MIMIC-III dataset [2], which corresponds to actual medical notes data. Data access requests must be made, but the data are made available to computational scientists for research purposes.
synthetic datasets, which are available without restrictions since they are produced by computational methods and are independent of any real patient. While handy, this type of data may come with a number of limitations that prospective users need to be aware of.

Electronic Health Records: The MIMIC-III Critical Care Database

Electronic Medical Notes: The EBM NLP

Synthea Electronic Health Records

One of the main bottlenecks for data miners is the limited availability of electronic health record datasets, which is due to HIPAA concerns. To bypass this roadblock, several tools have been developed to generate synthetic datasets that are free of any restrictions. Below, we provide information about one such tool.

https://github.com/synthetichealth/synthea/wiki
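Once a synthetic patient table is in hand, a first look at its contents can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: it uses a small, made-up inline sample instead of real tool output, and the column names (`Id`, `BIRTHDATE`, `GENDER`) as well as the helper `summarise_patients` are hypothetical. They should be checked against the documentation of the export actually used.

```python
import csv
import io
from collections import Counter
from datetime import date

# Hypothetical sample mimicking a synthetic patients CSV export.
# The column names (Id, BIRTHDATE, GENDER) are assumptions for illustration.
SAMPLE_CSV = """Id,BIRTHDATE,GENDER
p-001,1980-05-17,F
p-002,1975-11-02,M
p-003,2001-01-30,F
"""


def summarise_patients(handle, reference_date):
    """Count patients by gender and compute ages in whole years at reference_date."""
    genders = Counter()
    ages = []
    for row in csv.DictReader(handle):
        genders[row["GENDER"]] += 1
        born = date.fromisoformat(row["BIRTHDATE"])
        # Subtract one year if the birthday has not yet occurred in the reference year.
        had_birthday = (reference_date.month, reference_date.day) >= (born.month, born.day)
        ages.append(reference_date.year - born.year - (0 if had_birthday else 1))
    return genders, ages


genders, ages = summarise_patients(io.StringIO(SAMPLE_CSV), date(2020, 1, 1))
print(dict(genders))
print(ages)
```

To run the same summary on a file on disk, one would pass `open(path, newline="")` instead of the `io.StringIO` wrapper. A fixed reference date is used rather than today's date so that the result stays reproducible.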

Synthetic Electronic Medical Notes: the OMOP CDMv5 Test Data

Clinical Trial Data in CDISC SDTM format

Observational Data in OMOP CDM format

Conclusions

This content provides you with a set of resources to kick-start your exploration of unstructured text in a clinical context. These resources are useful for gaining familiarity with these data types. Remember to understand the data stewardship requirements that come with handling real clinical data, as well as the limitations associated with some synthetic datasets.

What should I read next?

How to request data access and deal with data access committees?
How to do NER on EHR with NLP?
How to deal with unstructured text?

References

Authors

Name

ORCID

Affiliation

Type

ELIXIR Node

Contribution

Philippe Rocca-Serra

University of Oxford

Writing - Original Draft

Susanna-Assunta Sansone