Authors: Thoba Lose thoba@sanbi.ac.za and Peter van Heusden pvh@sanbi.ac.za
The early 2000s saw an evolution of the storage of genome annotation from ad-hoc file based stores to storage in relational databases. While many genome projects construct their own ad-hoc storage schemas, the Chado schema for PostgreSQL relational databases stands out as one of the best documented and most commonly used relational schemas for genome annotation storage.
At SANBI we considered the Chado schema when working on the Asian seabass (L.calcarifer) genome annotation project, but chose not to use it, in a large part due to the lack of code examples for using the schema and also in part due to the lack disconnect between the highly relational-centric model of Chado and the object-oriented design we had for the entities in the seabass genome annotation. We crafted our own schema linked to a PostgreSQL database using SQLAlchemy but it was incomplete and had poor performance for some use cases. Drawing on that experience and SANBI experience with the Neo4j graph database we set about creating a genome annotation schema for a Neo4j graph database.
The modeling of a set of known entity classes in a relational database is a well understood problem: classes correspond to relations (tables), entities (instances) correspond to rows in tables, and relationships between objects are either mapped as foreign key relationships or using join tables (for many to many relationships). In recent years object-relational mappers (ORMs) such as SQLAlchemy and Hibernate have made mapping between an object model and its relational representation automatic. These have been supplemented with schema migration systems that add a version number to the schema and records changes to the schema in a reproducible fashion.
The Neo4j graph database differs from relational databases in that it is schema-less, and applies the property graph model where entities (nodes) and relationships (edges) have properties, thus allowing for fast search operations and act as a guide to, but not a prescription for, type. In a sense this corresponds more closely to a system of prototype objects than a class system. Object-graph mappers (OGMs) such as neomodel, the Neo4j-OGM, and that in version 3 of py2neo allow a certain degree of domain modeling in the object model however largely without the ability to impose constraints.
In both relational and graph databases a choice has to be made as to where to "split" an object. In general this is guided by cardinality: a subset of attributes which is in a one-to-one relationship with an object tends to be merged with the object. A difficulty arises when the degree of specialisation determines the cardinality of a relation. For example, relationship between a protein entity and a Uniprot ID might be 1 to 1. However, the relationship between a protein entity and external database IDs is probably not one to one as the same protein entity is identified in multiple databases.
In a relational database each entity (e.g. an external database ID) consists of a row in a database, with performance being optimised at retrieving such rows using indexes etc. Complete "documents" (e.g. all info about a protein) are constructed by joining multiple tables. Relational database schema designers reduce the burden of constructing such documents by creating "denormalised" schemas that merge multiple potentially independent entities into a smaller number of tables. In a graph database each entity is a node and the relationships and properties represented by edges.
Conceivably a query optimised graph database would also be constructed by merging entities to add attributes from multiple nodes to a single node that more closely resembles the final "document" being presented as a result of the query.
For commentary on performance and optimisation of Neo4j databases set:
This description takes the Chado sequence table as a starting point.
At the core of the Chado-derived graph model is the Feature
entity that corresponds
on the one hand with a feature as described in GFF3
(although Chado features are a superset of GFF3 features, including both
the GFF3 "reference sequence" as a feature and non-located features)
and on the other hand with a type of biological sequence of a type as described
the Sequence Ontology (SO). The terms of
the sequence ontology are stored in Chado as CvTerms that are associated with
features.
In addition features have identifiers in external databases. Chado distinguishes
between a primary identifier and secondary (e.g. synonym) identifiers, something
that is probably not necessary in a graph database, where e.g. relationship
properties could be used to identify the primary
DbXref.
In addition, the schema-less nature of graph databases, alleviates the need
to model each type of sequence as an abstract "feature" and instead nodes can
be created for type Exon
, mRNA
, etc. Again, due to the schema-less nature, the
correctness of the contents of a graph database cannot be assured and it might
be useful to write a "test suite" that tests for the biological validity (as
guided for example by the SO) of nodes and relationships in the database.
(Neo4j does support constraints on attributes to a certain extent, although the property existence constraints that might be used to specify the required attributes of a class are only available in Neo4j Enterprise Edition, see Cypher query-constraints. The support for constraints is not exposed in py2neo (yet).)
In addition or as a partial alternative, SO terms such as mRNA
can be
inserted into the database to allow the statement "there shall be no mRNA node
that is not related to the mRNA term" to be asserted. However, since retrieving
all nodes with a particular label is a rapid operation, the SO and validation
of database entities might be kept outside of the database.
The organism
entity quite naturally maps to a type of graph node. In the current
Chado model there is scope for arbitrary attributes, for example a strain name,
to be added as key/value pairs in the organism_prop
table. Arbitrary attributes
can be added to nodes in a Neo4j graph database, so such an arrangement is
not necessary. On the level of the model, however, some form of consistent
specification is necessary, perhaps through subclassing a particular node class.
(py2neo supports subclassing: Subclassing in py2neo)
Locations in Chado are entities independent of but related to features. Again
this can be adopted for a Neo4j model fairly directly. While there is no widely
accepted standard for describing locations on biological sequences, the
FALDO ontology gives one additional
model to incorporate. Since genome assemblies are treated as features, the relationship
Feature -[LOCATED_AT]-> Location -[ON]-> Feature
describes a feature's location relative to another feature (for example an Exon on a Contig).
Publications
in Neo4j also map quite naturally to nodes in a Neo4j database,
with the caveat that the author
entity should include either an ORCID ID or
a cross reference to relate the author
entity to external databases (Dbxref
) such
as ORCID, Google Scholar and ResearchGate. As with other relationship tables
in Chado the pub_relationship
table maps to a Neo4j relationship type
with corresponding label. Again, the range of possible type labels poses a
challenge. In Chado these are stored as CvTerms
which in turn relate to a
controlled vocabulary (CV
). There is no way in Neo4j to link a relationship
to a relationship, so one cannot import the relationship type labels as
Nodes that relate to the relationships they are used in. This forces the
controlled vocabulary for relationship types to be specified outside the
database.
In general, however, CvTerms
for annotation sources such as the Gene Ontology
(GO) can be imported as nodes in Neo4j, with the node label referring to their
annotation source (e.g. GO
, Interpro
, etc). Where semantically richer
ways of expressing relationships, such as the LEGO
model, exist, these could be incorporated into the graph database.
This topic requires further discussion.
The below image depicts a graph model based on the CV
, Sequence
, and Publication
Chado modules.
The discussion above should illustrate that, given a sufficiently detailed modeling effort, it is practical to map the genome annotation currently stored in a Chado schema to a Neo4j database model using the following method:
- Each Entity table is represented by a Label
- Taking note of JOIN tables and replacing them wih relationships
- Columns in JOIN tables become relationship properties
- Each row in a Entity table is a Node
- Columns on these tables become node properties
The Neo4j model would then have the same expressive power as Chado. Two further points require discussion
Schema versioning in relational databases has can be expressed as a set of operations on the schema model. For object relational mappers such as SQLAlchemy there are extensions such as Alembic that assist in automatically detecting the differences between versions of a model and expressing these differences in terms of schema update operations. A similar module for a Neo4j object graph mapper would have to update all instances of a node type if a node changed attributes, a much more heavyweight operation than a relational table update. As OGMs mature there might be a community effort in this regard. Right now, to our knowledge, no such effort exists.
As annotation is edited it will be useful to store a list of
changes, similar to git
commits, that describe what has been updated
and allow a database version to be specified. As far as we are aware,
Chado does not support this, but it should be possible to support it
in Neo4j if changes can be encoded in some change specification language
and arranged into a set of related update nodes.
As databases are, in our design, often stored using Docker containers, we might be able to use dvol.
To allow re-use of code built to target the Chado database schema it will be useful to have some tool to translate a Chado database into its Neo4j equivalent. Since this discussion has focused primarily on the sequence module of Chado it has not considered the full scape of the Chado specification. However, for the part it has considered, the most practical way to migrate from Chado to Neo4j would be to extend a tool that already maps Chado to an object model and then map that object model to the proposed graph model. The Apollo annotation editor is one tool that might be adapted by adding a database interface based on Neo4j-OGM and the schema currently being developed at SANBI.
For a draft implementation of this model using py2neo see this.