This is a preliminary prototype for working with IEDB-style data in a git repository. Requirements are GNU Make, Python 3 (see requirements.txt
), and Java 8.
The data is in a separate repository: https://github.com/jamesaoverton/covic-db-data-prototype
A demo is here: https://cvdb.ontodev.com
Supporting Google Sheet is here: https://docs.google.com/spreadsheets/d/11ItLLoXY7_r2lDazY4prMERHMb-7czrim91UABnvbYM
This repository serves two purposes:
- build an ontology to support the CoVIC-DB project
- provide a library and command-line tool for working with CoVIC-DB data
Building the ontology requires GNU Make, Java 8+, and xlx2csv. Run make ontology
.
We use this ontology to build the src/covicdbtools/config.json
file used by the library and command-line tool, and after that point Java is not required.
The library is a Python 3 module. We suggest using virtualenv
and installing with pip install -e .
. Please set the CVDB_DATA
environment variable to the directory where your git data repositories will be stored, e.g. data/
. Common functionality is exposed through covicdbtools.api
:
api.initialize()
create the required git data repositories in$CVDB_DATA
api.fetch_template("antibodies")
fetch the empty antibodies Excel templateapi.submit_antibodies(name, email, organization, request)
submit an Excel template filled with antibody entriesapi.create_dataset(name, email, assay_type)
create a new dataset for the given assay typeapi.fetch_template(dataset_id)
fetch the empty Excel template for this datasetapi.submit_assays(name, email, dataset_id, request)
submit an Excel template filled with assay entriesapi.promote_dataset(name, email, dataset_id)
promote a dataset from staging to public
All these functions return response
dictionaries, which include a "status"
indicating success or failure and a "message"
. Failed responses may include "errors"
. The fetch_template
and submit_*
responses will usually include "content"
with BytesIO
for an Excel file.
The API functionality is also accessible through a command-line client at src/covicdbtools/cli.py
, which pip
should alias to cvdb
. See cvdb --help
for more information.
One of our goals is to create Linked Data. At the most basic level, this means:
- all of our public identifiers can be expanded to public URLs
- we distinguish individual things from classes of things, and use ontologies to identify the classes
IRIs are a superset of URLs that allow for URNs and for Unicode characters. We'll use "IRI" throughout, but all of our IRIs are in the HTTP(S) URL subset.
IRIs are globally unique, but can be inconveniently long, so we use the CURIE standard for our IDs. For example, if we want to talk about the class of all mice, we can use the term 'Mus musculus' from the NCBI Taxonomy. The IRI http://purl.obolibrary.org/obo/NCBITaxon_10090 has been assign to 'Mus musculus'. We can define a prefix 'NCBITaxon' with base 'http://purl.obolibrary.org/obo/NCBITaxon_' for use throughout our system. Then we can refer to 'Mus musculus' by the ID 'NCBITaxon:10090', which is short and convenient, and expands to a globally unique IRI.
We define a list of all our prefixes in ontology/prefixes.tsv
. We use IDs wherever possible. If a column of a table contains an ID, we adopt the convention of appending '_id'. If a column contains the ID for a class, we append '_type_id'.
ab_id | host_type_id | isotype |
---|---|---|
ab:1 | NCBITaxon:10090 | foo |
Our convention is to store tables as TSV on the filesystem. In Python, we use the csv.DictReader
and represent tables as lists of OrderedDict
s.
We usually want to associate a human-readable label with each ID. We define functions for adding '_label' columns, with results like this:
ab_id | ab_label | host_type_id | host_type_label | isotype |
---|---|---|---|---|
ab:1 | mAb 1 | NCBITaxon:10090 | Mus musculus | foo |
We also define functions for removing the '_label' columns. If a table does not include labels, we call it a "concise table". If a table does include labels, we call it a "labelled template".
To help display tables as pretty HTML and Excel, we define a "grid", consisting of header rows and data rows. Each cell is represented by a Python dictionary, containing a "value" key, which may also include "iri", "label", "status", "comment", and other keys. We provide functions for converting labelled tables to grids, and grids to HTML and Excel. Our example table can be converted to this HTML:
Antibody | Host | Isotype |
---|---|---|
mAb 1 | Mus musculus | foo |