Perhaps the first question should be which genomes have been sequenced for the SARS-CoV-2 virus:
SPARQL sparql/genomes.rq (run, edit)
SELECT ?genome WHERE {
wd:Q82069695 wdt:P527/wdt:P6800 ?genome .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
}
Which lists these genome URLs:
genome |
https://gisaid.org/CoV2020 |
https://www.ncbi.nlm.nih.gov/assembly/GCF_009858895.2 |
https://www.ncbi.nlm.nih.gov/genome/86693 |
https://www.ncbi.nlm.nih.gov/nuccore/1798174254 |
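The queries in this chapter can also be run from a script rather than through the query page. The following is a minimal sketch, assuming the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and only the Python standard library. The function names and the User-Agent string are placeholders chosen for this example, not part of the book's tooling.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint; the wd:, wdt:, wikibase: and bd:
# prefixes used in the query are predefined there.
ENDPOINT = "https://query.wikidata.org/sparql"

# The genome query from above, unchanged.
GENOMES_QUERY = """
SELECT ?genome WHERE {
  wd:Q82069695 wdt:P527/wdt:P6800 ?genome .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
}
"""

def build_request(query):
    """Build a GET request that asks the endpoint for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder User-Agent (an assumption): replace with your own details.
        "User-Agent": "sparql-example-script/0.1",
    }
    return urllib.request.Request(f"{ENDPOINT}?{params}", headers=headers)

def run_query(query):
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        data = json.load(response)
    return data["results"]["bindings"]
```

Calling `run_query(GENOMES_QUERY)` should return one binding per genome URL, with the URL itself under `row["genome"]["value"]`.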
Multiple variants of the virus genome made it into the international news. Originally these were known as a Danish variant, a South-African variant, and a South-England variant. However, the variants were merely first discovered there; they are not caused by anything related to the region. The following variants are listed in Wikidata, including their PANGO lineage codes:
These were found in Wikidata with this query:
SPARQL sparql/sarscov2Variants.rq (run, edit)
SELECT DISTINCT ?variant ?variantLabel ?pango WHERE {
VALUES ?variantType { wd:Q15304597 wd:Q75913269 }
{ ?variant p:P31 [ ps:P31 ?variantType ; pq:P642 wd:Q82069695 ] . }
UNION
{ ?variant wdt:P31 wd:Q104450895 }
OPTIONAL { ?variant wdt:P9632 ?pango }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
}
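Because ?pango sits inside an OPTIONAL, some result rows will not carry a lineage code at all. The sketch below shows one way to handle that when reading the results in the standard SPARQL JSON format. The function name and the sample rows are made up for illustration (the entity URIs and labels are placeholders); only the B.1.1.7 code comes from the text.

```python
def rows_to_variants(bindings):
    """Turn SPARQL JSON bindings into plain dicts; ?pango may be unbound."""
    variants = []
    for row in bindings:
        variants.append({
            "variant": row["variant"]["value"],
            "label": row.get("variantLabel", {}).get("value"),
            # An unbound OPTIONAL variable is simply missing from the row.
            "pango": row.get("pango", {}).get("value"),
        })
    return variants

# Made-up rows in the SPARQL JSON results shape (placeholder URIs and labels).
sample = [
    {"variant": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_PLACEHOLDER_1"},
     "variantLabel": {"type": "literal", "value": "placeholder variant A"},
     "pango": {"type": "literal", "value": "B.1.1.7"}},
    {"variant": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_PLACEHOLDER_2"},
     "variantLabel": {"type": "literal", "value": "placeholder variant B"}},
]

for entry in rows_to_variants(sample):
    print(entry["label"], entry["pango"] or "(no PANGO code)")
```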
These variants are all SARS-CoV-2, but the combination of sequence variants found in each of them gives them different properties. For example, VUI–202012/01 (also known as B.1.1.7) has a combination of 17 sequence variants, see this write up. It must be noted that many of these 17 sequence variants are found in other SARS-CoV-2 variants too.
We can list all sequence variants that are in Wikidata (out of a few thousand known ones!) with this query:
SPARQL sparql/sequenceVariants.rq (run, edit)
SELECT ?variant ?variantLabel ?sequence ?sequenceLabel ?taxon ?taxonLabel WHERE {
?variant wdt:P3433 ?sequence .
?sequence wdt:P703 / wdt:P171* wd:Q82069695 .
?variant wdt:P703 ?taxon .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
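The property path `wdt:P703 / wdt:P171*` in this query reads as: follow one "found in taxon" (P703) link, then zero or more "parent taxon" (P171) links, and check whether that ends at SARS-CoV-2. As an illustration only, the sketch below mimics that matching on small in-memory dictionaries. All identifiers in the toy data are hypothetical and do not correspond to Wikidata items.

```python
def matches_path(item, target, found_in_taxon, parent_taxon):
    """Mimic `item wdt:P703 / wdt:P171* target` on in-memory dicts."""
    # One mandatory P703 step gives the starting taxa.
    for taxon in found_in_taxon.get(item, ()):
        seen = set()
        stack = [taxon]
        # Zero or more P171 steps: check the taxon itself, then its ancestors.
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(parent_taxon.get(current, ()))
    return False

# Hypothetical toy data: item -> taxa it is found in, taxon -> parent taxa.
found_in_taxon = {
    "sequenceA": ["strainX"],        # reaches the target via one parent step
    "sequenceB": ["SARS-CoV-2"],     # reaches the target with zero parent steps
    "sequenceC": ["otherVirus"],     # never reaches the target
}
parent_taxon = {"strainX": ["SARS-CoV-2"], "otherVirus": ["otherFamily"]}

for name in ("sequenceA", "sequenceB", "sequenceC"):
    print(name, matches_path(name, "SARS-CoV-2", found_in_taxon, parent_taxon))
```

The SPARQL query above applies this same matching to every sequence item in Wikidata.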
This gives us this list:
Each sequence variant is a change in a gene encoded by the viral RNA and can cause a change in the protein encoded by that gene. The following two sections list all genes and proteins. An interesting online book is available under the title A sequence alignment and analysis of SARS-CoV-2 spike glycoprotein [1].
The RNA of SARS-CoV-2 has been sequenced. Therefore, the open reading frames are known and identified. We can query for the gene information in Wikidata with this query:
SPARQL sparql/virusGenes.rq (run, edit)
SELECT ?gene ?geneLabel ?ncbigene WHERE {
?gene wdt:P703 wd:Q82069695 ; wdt:P31 wd:Q7187 .
OPTIONAL { ?gene wdt:P351 ?ncbigene }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
}
Which gives us these genes:
Alternatively, we may be interested in the proteins of the coronavirus. We can get those with this query:
SPARQL sparql/virusProteins.rq (run, edit)
SELECT ?protein ?proteinLabel ?short ?refseq ?uniprot ?guideToPharma WHERE {
?protein wdt:P703 wd:Q82069695 ; wdt:P31 wd:Q8054 .
OPTIONAL { ?protein wdt:P637 ?refseq }
OPTIONAL { ?protein wdt:P352 ?uniprot }
OPTIONAL { ?protein wdt:P5458 ?guideToPharma }
OPTIONAL { ?protein wdt:P1813 ?short }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
} ORDER BY ASC(?protein) ASC(?uniprot)
Which gives us these proteins:
Thanks to work done by a team at the online BioHackathon in April 2020, macromolecular structures from the Complex Portal [2,3] have been entering Wikidata:
SPARQL sparql/complexes.rq (run, edit)
SELECT ?cpx ?complex ?complexLabel WHERE {
?complex wdt:P7718 ?cpx ;
wdt:P703 wd:Q82069695
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
}
Listing these complexes:
For the proteins, we can then query for the PDB structures [4]:
SPARQL sparql/virusProteinsPDB.rq (run, edit)
SELECT ?protein ?proteinLabel ?refseq ?uniprot ?pdb WHERE {
?protein wdt:P703 wd:Q82069695 ; wdt:P31 wd:Q8054 .
?protein wdt:P638 ?pdb .
OPTIONAL { ?protein wdt:P637 ?refseq }
OPTIONAL { ?protein wdt:P352 ?uniprot }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
}
The full list has become quite long and can be found on the linked sparql/virusProteinsPDB.rq page, so we will just visualize the number of PDB entries per protein here:
Which was created with this query:
SPARQL sparql/virusProteinsPDBBubbleChart.rq (run, edit)
#defaultView:BubbleChart
SELECT ?protein ?proteinLabel (COUNT(?pdb) AS ?count) WHERE {
?protein wdt:P703 wd:Q82069695 ; wdt:P31 wd:Q8054 .
?protein wdt:P638 ?pdb .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en". }
} GROUP BY ?protein ?proteinLabel
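The same per-protein count can be reproduced on the client side from the rows of the earlier sparql/virusProteinsPDB.rq query. The sketch below assumes the rows are in the SPARQL JSON results format. It collects PDB identifiers in a set, because the OPTIONAL RefSeq and UniProt columns in that query can repeat a protein/PDB pair across several rows. The function name and all sample values are placeholders, not real identifiers.

```python
from collections import defaultdict

def pdb_counts(bindings):
    """Count distinct ?pdb values per protein from SPARQL JSON bindings."""
    pdb_ids = defaultdict(set)
    for row in bindings:
        # Prefer the human-readable label; fall back to the entity URI.
        key = row.get("proteinLabel", row["protein"])["value"]
        pdb_ids[key].add(row["pdb"]["value"])
    return {protein: len(ids) for protein, ids in pdb_ids.items()}

def make_row(protein, label, pdb):
    """Build one placeholder result row in the SPARQL JSON shape."""
    return {
        "protein": {"type": "uri", "value": protein},
        "proteinLabel": {"type": "literal", "value": label},
        "pdb": {"type": "literal", "value": pdb},
    }

# Placeholder rows: the first pair is repeated, as an OPTIONAL match might cause.
sample_rows = [
    make_row("http://www.wikidata.org/entity/Q_PROTEIN_A", "protein A", "PDB_ID_1"),
    make_row("http://www.wikidata.org/entity/Q_PROTEIN_A", "protein A", "PDB_ID_1"),
    make_row("http://www.wikidata.org/entity/Q_PROTEIN_A", "protein A", "PDB_ID_2"),
    make_row("http://www.wikidata.org/entity/Q_PROTEIN_B", "protein B", "PDB_ID_3"),
]

print(pdb_counts(sample_rows))
```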