Creating a community annotation compliant ISA descriptor: Dataset Maturity Level 2¶
Abstract:¶
With the recipe, we will continue using the ISA model to capture study metadata and dataset description, but we will be moving up on the scale of data set maturity by showing how to implement four important aspects:
community agreed experimental design descriptors
community agreed annotation levels for description of biological materials
extensive semantic annotation and reduction of free text annotation
machine-readable and more interoperable format: serialization to ISA-JSON
Let’s get started! Loading the ISA-API¶
from isatools.model import (
Comment,
Investigation,
Study,
StudyFactor,
FactorValue,
OntologyAnnotation,
Characteristic,
OntologySource,
Material,
Sample,
Source,
Protocol,
ProtocolParameter,
ParameterValue,
Process,
Publication,
Person,
Assay,
DataFile,
plink
)
import os
Creating an ISA investigation¶
investigation = Investigation()
Declaring Semantic Resources
used to annotated ISA objects¶
ncbitaxon = OntologySource(name="NCIBTaxon",
description="NCBI Taxonomy")
efo = OntologySource(name="EFO",
description="Experimental Factor Ontology")
obi = OntologySource(name='OBI',
description="Ontology for Biomedical Investigations")
chebi = OntologySource(name="CHEBI",
description="Chemical Entity of Biological Interest")
pato = OntologySource(name='PATO',
description="Phenotype and Trait Ontology")
investigation.ontology_source_references = [chebi,efo,obi,pato,ncbitaxon]
Building and Annotation ISA objects to a Community Compliant
level¶
The annotation requirements we are dealing with are:
those of ENA Short Read Archive Repository on the one hand
those of a Minimal Information checklist: MINSEQE minimum information for a sequencing experiment.
study = Study(filename="s_BII-S-3-synthesis.txt")
study.identifier = "BII-S-3-synth"
study.title = "Metagenomes and Metatranscriptomes of phytoplankton blooms from an ocean acidification mesocosm experiment"
study.description = "Sequencing the metatranscriptome can provide information about the response of organisms to \
varying environmental conditions. We present a methodology for obtaining random whole-community mRNA from a complex \
microbial assemblage using Pyrosequencing. The metatranscriptome had, with minimum contamination by ribosomal RNA,\
significant coverage of abundant transcripts, and included significantly more potentially novel proteins than in the\
metagenome. This experiment is part of a much larger experiment. We have produced 4 454 metatranscriptomic datasets \
and 6 454 metagenomic datasets. These were derived from 4 samples."
study.submission_date = "2008-08-15"
study.public_release_date = "2008-08-15"
These NCBI SRA related ISA Comments fields are required and must be present for the ISA SRAconverter
to be invoked later
src_comment_sra1 = Comment(name="SRA Broker Name", value="OXFORD")
src_comment_sra2 = Comment(name="SRA Center Name", value="OXFORD")
src_comment_sra3 = Comment(name="SRA Center Project Name", value="OXFORD")
src_comment_sra4 = Comment(name="SRA Lab Name", value="Oxford e-Research Centre")
src_comment_sra5 = Comment(name="SRA Submission Action", value="ADD")
study.comments.append(src_comment_sra1)
study.comments.append(src_comment_sra2)
study.comments.append(src_comment_sra3)
study.comments.append(src_comment_sra4)
study.comments.append(src_comment_sra5)
These ISA Comments
are optional and may be used to report funding information
src_comment_st1 = Comment(name="Study Funding Agency", value="")
src_comment_st2 = Comment(name="Study Grant Number", value="")
study.comments.append(src_comment_st1)
study.comments.append(src_comment_st2)
Declaring all the protocols used in the ISA study.¶
tip
Note also the declaration of ISA Protocol Parameters
when needed.
study.protocols = [
Protocol(name="environmental material collection - standard procedure 1",
description="Waters samples were prefiltered through a 1.6 um GF/A glass fibre filter to reduce Eukaryotic \
contamination. Filtrate was then collected on a 0.2 um Sterivex (millipore) filter which was frozen in \
liquid nitrogen until nucelic acid extraction. CO2 bubbled through 11000 L mesocosm to simulate ocean \
acidification predicted conditions. Then phosphate and nitrate were added to induce a phytoplankton bloom.",
protocol_type=OntologyAnnotation(term="sample collection"),
parameters=[
ProtocolParameter(parameter_name=OntologyAnnotation(term="filter pore size"))
]
),
Protocol(name="aliquoting-0",
description="aliquoting",
protocol_type=OntologyAnnotation(term="sample collection")),
Protocol(name="nucleic acid extraction",
description="Total nucleic acid extraction was done as quickly as possible using the method of \
Neufeld et al, 2007.",
protocol_type=OntologyAnnotation(term="nucleic acid extraction")
),
Protocol(name="mRNA extraction - standard procedure 3",
description="RNA MinElute + substrative Hybridization + MEGAclear For transcriptomics, total RNA\
was separated from the columns using the RNA MinElute clean-up kit (Qiagen) and checked for integrity of\
rRNA using an Agilent bioanalyser (RNA nano6000 chip). High integrity rRNA is essential for subtractive\
hybridization. Samples were treated with Turbo DNA-free enzyme (Ambion) to remove contaminating DNA.\
The rRNA was removed from mRNA by subtractive hybridization (Microbe Express Kit, Ambion), and absence\
of rRNA and DNA contamination was confirmed using the Agilent bioanalyser. The mRNA was further\
purified with the MEGAclearTM kit (Ambion). Reverse transcription of mRNA was performed using\
the SuperScript III enzyme (Invitrogen) with random hexamer primers (Promega).\
The cDNA was treated with RiboShredderTM RNase Blend (Epicentre) to remove trace RNA contaminants. \
To improve the yield of cDNA, samples were subjected to random amplification using the \
GenomiPhi V2 method (GE Healthcare). GenomiPhi technology produces branched DNA molecules \
that are recalcitrant to the pyrosequencing methodology. Therefore amplified samples were treated with\
S1 nuclease using the method of Zhang et al.2006.",
protocol_type=OntologyAnnotation(term="labeling") #nucleic acid extraction
),
Protocol(name="genomic DNA extraction - standard procedure 4",
description="superscript+random hexamer primer",
protocol_type=OntologyAnnotation(term="nucleic acid extraction")
),
Protocol(name="reverse transcription - standard procedure 5",
description="",
protocol_type=OntologyAnnotation(term="reverse transcription"),
),
Protocol(name="library construction",
description="",
protocol_type=OntologyAnnotation(term="library construction"),
parameters=[ProtocolParameter(parameter_name=OntologyAnnotation(term="library strategy")),
ProtocolParameter(parameter_name=OntologyAnnotation(term="library layout")),
ProtocolParameter(parameter_name=OntologyAnnotation(term="library selection"))
]
),
Protocol(name="nucleic acid sequencing", #pyrosequencing - standard procedure 6",
description="1. Sample Input and Fragmentation: The Genome Sequencer FLX System supports the sequencing of \
samples from a wide variety of starting materials including genomic DNA, PCR products, BACs, and cDNA. \
Samples such as genomic DNA and BACs are fractionated into small, 300- to 800-base pair fragments. \
For smaller samples, such as small non-coding RNA or PCR amplicons, fragmentation is not required. \
Instead, short PCR products amplified using Genome Sequencer fusion primers can be used for immobilization\
onto DNA capture beads as shown below.",
protocol_type=OntologyAnnotation(term="nucleic acid sequencing"),
parameters=[ProtocolParameter(parameter_name=OntologyAnnotation(term="sequencing instrument"))
]
),
Protocol(name="sequence analysis - standard procedure 7",
description="",
protocol_type=OntologyAnnotation(term="data transformation")
)
]
Adding a Study Design descriptor to the ISA Study object
intervention_design = OntologyAnnotation(term_source=obi)
intervention_design.term = "intervention design"
intervention_design.term_accession = "http://purl.obolibrary.org/obo/OBI_0000115"
study.design_descriptors.append(intervention_design)
Declaring the ISA Study Factors
, i.e. The Independent Variables of the Experiment¶
study.factors = [
StudyFactor(name="compound",
factor_type=OntologyAnnotation(term="chemical substance",
term_accession="http://purl.obolibrary.org/obo/CHEBI_59999",
term_source=chebi)),
StudyFactor(name="dose",
factor_type=OntologyAnnotation(term="dose",
term_accession="http://www.ebi.ac.uk/efo/EFO_0000428",
term_source=efo)),
StudyFactor(name="collection time",
factor_type=OntologyAnnotation(term="time",
term_accession="http://purl.obolibrary.org/obo/PATO_0000165",
term_source=pato))
]
Declaring the ISA FactorValues
i.e. The factor levels to each of the Indepedent Variables or ISA Study Factor
.¶
fv1 = FactorValue(factor_name=study.factors[0], value=OntologyAnnotation(term="carbon dioxide"))
fv2 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="high"))
fv3 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="normal"))
fv4 = FactorValue(factor_name=study.factors[2], value="may 13th, 2006")
fv5 = FactorValue(factor_name=study.factors[2], value="may 19th, 2006")
Adding the publications associated to the study¶
study.publications = [
Publication(doi="10.1371/journal.pone.0003042",
pubmed_id="18725995",
title="Detection of large numbers of novel sequences in the metatranscriptomes \
of complex marine microbial communities.",
status=OntologyAnnotation(term="indexed in PubMed"),
author_list="Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P, Joint I."),
Publication(doi="10.1111/j.1462-2920.2008.01745.x",
title="Potential for phosphonoacetate utilization by marine bacteria in temperate coastal waters",
pubmed_id="18783384",
status=OntologyAnnotation(term="indexed in PubMed"),
author_list="Gilbert JA, Thomas S, Cooley NA, Kulakova A, Field D, Booth T, McGrath JW, Quinn JP, Joint I.")
]
Adding the authors of the study¶
study.contacts = [
Person(first_name="Jack",
last_name="Gilbert",
affiliation="Plymouth Marine Laboratory",
email="jagi@pml.ac.uk",
address="Prospect Place, Plymouth, United Kingdom",
comments=[Comment(name="Study Person REF",
value="")],
roles=[OntologyAnnotation(term="principal investigator role"),
OntologyAnnotation(term="SRA Inform On Status"),
OntologyAnnotation(term="SRA Inform On Error")]
),
Person(first_name="Dawn",
last_name="Field",
affiliation="NERC Centre for Ecology and Hydrology",
address="CEH Oxford, Oxford, United Kingdom",
comments=[Comment(name="Study Person REF",
value="")],
roles=[OntologyAnnotation(term="principal investigator role")]
)
]
Now, creating ISA Source and Sample objects and building what is the BII-S-3 Study table
study.sources = [Source(name="GSM255770"), Source(name="GSM255771"),Source(name="GSM255772"),Source(name="GSM255773")]
study.samples = [Sample(name="GSM255770"), Sample(name="GSM255771"), Sample(name="GSM255772"), Sample(name="GSM255773")]
# Note how the treatment groups are defined as sets of factor values attached to the ISA.Sample object
study.samples[0].factor_values=[fv1,fv2,fv4]
study.samples[1].factor_values=[fv1,fv3,fv4]
study.samples[2].factor_values=[fv1,fv2,fv5]
study.samples[3].factor_values=[fv1,fv3,fv5]
characteristic_organism = Characteristic(category=OntologyAnnotation(term="Organism"),
value=OntologyAnnotation(term="marine metagenome",
term_source=ncbitaxon,
term_accession="http://purl.obolibrary.org/obo/NCBITaxon_408172"))
Now creating a Process showing a Protocol Application
using Source as input and producing Sample as output.
for i in range(len(study.sources)):
study.sources[i].characteristics.append(characteristic_organism)
study.process_sequence.append(Process(executes_protocol=study.protocols[0],
inputs=[study.sources[i]],
outputs=[study.samples[i]])
)
Now appending the ISA Study object to the ISA Investigation object
investigation.studies = [study]
Now, creating the ISA objects to represent assays and acquired raw data Starting by declaring the 2 types of assays used in BII-S-3:
assay = Assay(filename="a_gilbert-assay-Gx.txt")
assay.measurement_type = OntologyAnnotation(term="metagenome sequencing",
term_accession="http://purl.obolibrary.org/obo/OBI_0002623",
term_source=obi)
assay.technology_type = OntologyAnnotation(term="nucleotide sequencing",
term_accession="http://purl.obolibrary.org/obo/OBI_0000626",
term_source=obi)
assay_tx = Assay(filename="a_gilbert-assay-Tx.txt")
assay_tx.measurement_type = OntologyAnnotation(term="transcription profiling",
term_accession="http://purl.obolibrary.org/obo/OBI_0000424",
term_source=obi)
assay_tx.technology_type = OntologyAnnotation(term="nucleotide sequencing",
term_accession="http://purl.obolibrary.org/obo/OBI_0000626",
term_source=obi)
Warning
make sure the ISA plink function
connects the first and last protocols found in a chain of protocols.
for i, sample in enumerate(study.samples):
# # create an aliquoting process which creates new samples from existing samples created in a study.processSequence
# aliquoting_process = Process(executes_protocol=study.protocols[1])
# aliquot = Sample(name="aliquot-{}".format(i), derives_from=sample)
# aliquoting_process.inputs.append(sample)
# aliquoting_process.outputs.append(aliquot)
# create an extraction process that executes the extraction protocol
extraction_process = Process(executes_protocol=study.protocols[0])
# extraction process takes as input a sample, and produces an extract material as output
char_ext = Characteristic(category=OntologyAnnotation(term="Material Type"),
value=OntologyAnnotation(term="pellet"))
char_ext1 = Characteristic(category=OntologyAnnotation(term="quantity"),
value=OntologyAnnotation(term="loads"))
# extraction_process.inputs.append(aliquot)
extraction_process.inputs.append(sample)
material = Material(name="extract-{}".format(i))
material.type = "Extract Name"
material.characteristics.append(char_ext)
material.characteristics.append(char_ext1)
extraction_process.outputs.append(material)
# create a sequencing process that executes the sequencing protocol
sequencing_process = Process(executes_protocol=study.protocols[7])
sequencing_process.name = "assay-name-{}".format(i)
sequencing_process.inputs.append(extraction_process.outputs[0])
# sequencing_process.inputs.append(material)
# Sequencing process usually has an output data file
datafile = DataFile(filename="sequenced-data-{}".format(i), label="Raw Data File")
data_comment = Comment(name="data_comment",value="data_value")
datafile.comments.append(data_comment)
sequencing_process.outputs.append(datafile)
# Ensure Processes are linked forward and backward. plink(from_process, to_process) is a function to set
# these links for you. It is found in the isatools.model package
# plink(aliquoting_process, sequencing_process)
plink(extraction_process, sequencing_process)
# make sure the extract, data file, and the processes are attached to the assay
assay.samples.append(sample)
# assay.samples.append(aliquot)
assay.other_material.append(material)
assay.data_files.append(datafile)
assay.process_sequence.append(extraction_process)
assay.process_sequence.append(sequencing_process)
# create an extraction process that executes the extraction protocol
extraction_process_tx = Process(executes_protocol=study.protocols[2])
# extraction process takes as input a sample, and produces an extract material as output
extraction_process_tx.inputs.append(sample)
# extraction_process_tx.outputs.append(sample)
# extraction_process_tx.outputs=[]
# material_tx = Material(name="extract-{}".format(i))
# material_tx.type = "Extract Name"
# extraction_process_tx.outputs.append(material_tx)
# labeling process takes as input an extract, and produces a labeled extract material as output
labeling_process_tx = Process(executes_protocol=study.protocols[3])
# labeling_process_tx.inputs.append(sample)
# labeling_process_tx.inputs=[]
material_tx = Material(name="le-{}".format(i))
material_tx.type = "Labeled Extract Name"
labeling_process_tx.outputs.append(material_tx)
# create a sequencing process that executes the sequencing protocol
# seq_pv1 = ParameterValue(category="library layout", value="SINGLE")
seq_pv2 = ParameterValue(category=ProtocolParameter(parameter_name=OntologyAnnotation(term="library layout")),
value=OntologyAnnotation(term="SINGLE"))
sequencing_process_tx = Process(executes_protocol=study.protocols[6], parameter_values=[seq_pv2])
sequencing_process_tx.name = "assay-name-tx-{}".format(i)
sequencing_process_tx.inputs.append(labeling_process_tx.outputs[0])
# Sequencing process usually has an output data file
datafile_tx = DataFile(filename="sequenced-data-tx-{}".format(i), label="Raw Data File")
sequencing_process_tx.outputs.append(datafile_tx)
# Ensure Processes are linked forward and backward. plink(from_process, to_process) is a function to set
# these links for you. It is found in the isatools.model package
# plink(extraction_process_tx, labeling_process_tx)
# plink(labeling_process_tx, sequencing_process_tx)
plink(extraction_process_tx, sequencing_process_tx)
# make sure the extract, data file, and the processes are attached to the assay
assay_tx.samples.append(sample)
assay_tx.other_material.append(material_tx)
assay_tx.data_files.append(datafile_tx)
assay_tx.process_sequence.append(extraction_process_tx)
assay_tx.process_sequence.append(labeling_process_tx)
assay_tx.process_sequence.append(sequencing_process_tx)
Adding both assays to the ISA Study object¶
study.assays.append(assay)
study.assays.append(assay_tx)
# assay_g = investigation.studies[0].assays[1]
# assay_t = investigation.studies[0].assays[0]
# assay_t.samples=investigation.studies[0].samples
# extract1 = Material(name="GSM255770.e1")
# extract1.type = "Extract Name"
# extract2 = Material(name="GSM255771.e1")
# extract2.type = "Extract Name"
# extract3 = Material(name="GSM255772.e1")
# extract3.type = "Extract Name"
# extract4 = Material(name="GSM255773.e1")
# extract4.type = "Extract Name"
# extract5 = Material(name="GSM255774.e1")
# extract5.type = "Extract Name"
# assay_t.other_material.append(extract1)
# assay_t.other_material.append(extract2)
# assay_t.other_material.append(extract3)
# assay_t.other_material.append(extract4)
# assay_t.other_material.append(extract5)
# for i in range(len(study.samples)):
# assay_t.process_sequence.append(Process(
# executes_protocol=study.protocols[1],
# inputs=study.samples[i],
# outputs=assay_t.other_material[i]
# ))
# data=DataFile(filename="sequenced-data-{}".format(i), label="Raw Data File")
# assay_t.process_sequence.append(Process(
# executes_protocol=study.protocols[3],
# inputs=assay_t.other_material[i]
# ))
# plink(assay_t.process_sequence[0], assay_t.process_sequence[1])
# assay_t.process_sequence[-1].outputs.append(data)
# assay_t.data_files.append(data)
Writing the ISA object representation to file with the ISA-API dump
function¶
from isatools.isatab import dump
# note the use of the flag for explicit serialization on factor values on assay tables
dump(investigation, "./output/BII-S-3-synth/", write_factor_values_in_assay_table=True)
Reading the ISA document from disk back in, loading it into memory and writing to disk again to check that the ISA-API load function works nominally¶
from isatools.isatab import load
with open(os.path.join("./output/BII-S-3-synth/","i_investigation.txt")) as bii3stest:
roundtrip = load(bii3stest)
from isatools.convert import isatab2json
from isatools import isajson
import json
isa_json = isatab2json.convert('./output/BII-S-3-synth/', validate_first=False, use_new_parser=True)
isa_j = json.dumps(
isa_json, cls=isajson.ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': ')
)
et Voilà!
Authors¶
Authors
Name |
ORCID |
Affiliation |
Type |
ELIXIR Node |
Contribution |
---|---|---|---|---|---|
University of Oxford |
Writing - Original Draft |