Spike: Versioning
Aperta has long lacked a coherent concept and model of versions for papers. We have several different models for versions, depending on whether we are storing versions of a paper's content, its attachments, or the answers to questions. Some things are not versioned at all, or if they are, the old version data is stored in opaque JSON blobs. Retrieving the past state of a paper is difficult.
This spike is an attempt to arrive at a sensible mental model, an appropriate data model, and a set of steps for reaching that data model incrementally, bringing the existing data models into one coherent model piece by piece.
- A consistent data model for versions
- Retrieving information about an older version should be as easy as retrieving data about the current version: they should use the same structure
- It should be possible to lock down older version data
- While our current use case should include numbered versions, e.g. v0.0, v0.1, v1.0, etc., we should allow the possibility of storing intermediary versions in the future
I created a testbed Rails application with the data models we can use: https://github.com/Tahi-project/versioning-spike
The idea is to have a one-to-many relationship between a paper and its versions. A row in the `versions` table represents one and only one version, and each version is represented by one row in the `versions` table. This table can be used to pull together all the information about a version by relations with the `versions` table.
We can then use a many-to-many or many-to-one intermediary model between this `versions` model and anything we want to version.
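As a rough illustration of these relationships, here is a minimal in-memory sketch in plain Ruby. It is not code from the spike repository: the struct names (`Paper`, `PaperVersion`, `VersionAnswer`) and both helper methods are hypothetical stand-ins for the tables and queries.

```ruby
# Hypothetical in-memory stand-ins for the tables; names are illustrative only.
Paper         = Struct.new(:id, :title)
PaperVersion  = Struct.new(:id, :paper_id, :number)
VersionAnswer = Struct.new(:version_id, :answer_id) # one join row

# One paper has many versions.
def versions_of(versions, paper_id)
  versions.select { |v| v.paper_id == paper_id }
end

# A versioned thing can be shared by several versions through the join table.
def versions_sharing(version_answers, answer_id)
  version_answers.select { |va| va.answer_id == answer_id }.map(&:version_id)
end

paper    = Paper.new(1, "Example paper")
versions = [PaperVersion.new(1, 1, "v0.0"), PaperVersion.new(2, 1, "v0.1")]

# Answer 7 did not change between versions, so both versions point at it.
joins = [VersionAnswer.new(1, 7), VersionAnswer.new(2, 7), VersionAnswer.new(2, 8)]

versions_of(versions, paper.id).map(&:number)
versions_sharing(joins, 7)
```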
- Guidelines for database design
  - If the thing (some piece of information about a paper) never changes, store it on the `papers` table (e.g. creator).
  - If there can be multiple things on a given version, e.g. answers, create a join table from version to answers to represent a many-to-many relationship
    - This is necessary because all, any, or none of the items could change between versions. A many-to-many join means that, e.g., an author could be shared across multiple versions (because it does not change) and any version can have many authors
    - See the answer model in the spike repository
  - If the thing to be versioned can have multiple “versions” within a version, e.g. we want to keep each docx that is uploaded, we can define a many-to-many join table with a “sub_version” (better names requested) integer column
    - When a new version is created, we simply copy the latest sub_version over to the new version and give it `sub_version: 1`
    - Each additional uploaded docx has an incremented `sub_version` column
    - This allows us to save all attachments uploaded, associate them with a version, and start new versions with a good starting point
    - I have not yet spiked this out, but it should be easy to do
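Since the sub_version idea has not been spiked out, the following is only a sketch of how the bookkeeping might work, using plain Ruby structs in place of database tables. `VersionAttachment`, `add_upload`, and `carry_over` are hypothetical names chosen for this example.

```ruby
# Hypothetical join row: one row per (version, attachment) pair.
# sub_version counts uploads within a single version.
VersionAttachment = Struct.new(:version_id, :attachment_id, :sub_version)

# Record a newly uploaded file against a version, incrementing sub_version.
def add_upload(joins, version_id, attachment_id)
  latest = joins.select { |j| j.version_id == version_id }.map(&:sub_version).max || 0
  row = VersionAttachment.new(version_id, attachment_id, latest + 1)
  joins << row
  row
end

# Start a new version from the latest upload of the previous version,
# resetting sub_version to 1. Returns nil if there is nothing to carry over.
def carry_over(joins, old_version_id, new_version_id)
  latest = joins.select { |j| j.version_id == old_version_id }.max_by(&:sub_version)
  return nil unless latest

  row = VersionAttachment.new(new_version_id, latest.attachment_id, 1)
  joins << row
  row
end

joins = []
add_upload(joins, 1, 100) # first docx on version 1
add_upload(joins, 1, 101) # second docx on version 1
carry_over(joins, 1, 2)   # version 2 starts from attachment 101
```

Every upload stays in the join table, so earlier files for a version remain retrievable by filtering on `version_id`.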
To create a new version of an article:
- Create a new row in the version table
- For every row in the join tables where version_id == the previous version
  - Create a new row in the versioned thing join table pointing at the new version and the existing thing
- For every row in a sub-versioned join table where version_id == the previous version
  - Create a new row with sub_version 1 in the sub-versioned join table pointing at the new version and the existing thing
- Mark the new version as the latest version
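The steps above can be sketched in plain Ruby, with arrays standing in for tables. All names here (`Version`, `VersionThing`, `VersionSub`, `create_version`) are hypothetical. One assumption is made explicit in the code: for the sub-versioned join, only the row with the highest sub_version is carried over, following the design guideline about copying the latest sub_version.

```ruby
# Hypothetical in-memory tables; names are illustrative only.
Version      = Struct.new(:id, :paper_id, :latest)
VersionThing = Struct.new(:version_id, :thing_id)
VersionSub   = Struct.new(:version_id, :thing_id, :sub_version)

def create_version(versions, version_things, version_subs, paper_id)
  previous = versions.find { |v| v.paper_id == paper_id && v.latest }

  # Step 1: create the new version row.
  new_version = Version.new((versions.map(&:id).max || 0) + 1, paper_id, false)
  versions << new_version

  if previous
    # Step 2: point the new version at every existing thing (no thing is copied).
    version_things.select { |j| j.version_id == previous.id }.each do |j|
      version_things << VersionThing.new(new_version.id, j.thing_id)
    end

    # Step 3 (assumption): carry over only the latest sub_version, reset to 1.
    latest_sub = version_subs.select { |j| j.version_id == previous.id }
                             .max_by(&:sub_version)
    version_subs << VersionSub.new(new_version.id, latest_sub.thing_id, 1) if latest_sub

    previous.latest = false
  end

  # Step 4: mark the new version as the latest.
  new_version.latest = true
  new_version
end
```

Only join rows are written, which is why creating a version stays cheap regardless of how large the versioned things are.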
To edit a deduplicated versioned has-many thing:
- If this thing belongs to old versions only, prevent the change
- If this thing belongs to the latest version:
  - If this thing is shared by multiple versions (that is, if the versioned thing join table contains more than one entry where thing_id == this thing's id)
    - Create a copy of this thing and update the thing_id for the old versions to point at that new thing_id
    - Go to the next step
  - If this thing is used only by the latest version (if the versioned thing join table contains exactly one entry where thing_id == this thing's id)
    - Modify the thing table
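This is essentially a copy-on-write scheme. A minimal sketch in plain Ruby, with arrays standing in for the thing table and the join table, might look like the following. `Thing`, `VersionThing`, and `edit_thing` are hypothetical names, and raising `ArgumentError` is just one possible way to "prevent" a change.

```ruby
# Hypothetical in-memory tables; names are illustrative only.
Thing        = Struct.new(:id, :value)
VersionThing = Struct.new(:version_id, :thing_id)

def edit_thing(things, joins, latest_version_id, thing_id, new_value)
  rows = joins.select { |j| j.thing_id == thing_id }

  # Things that belong only to old versions are locked.
  unless rows.any? { |j| j.version_id == latest_version_id }
    raise ArgumentError, "thing #{thing_id} belongs only to old versions"
  end

  thing = things.find { |t| t.id == thing_id }

  if rows.size > 1
    # Shared with old versions: preserve the current state in a copy
    # and repoint the old versions at that copy.
    copy = Thing.new(things.map(&:id).max + 1, thing.value)
    things << copy
    rows.reject { |j| j.version_id == latest_version_id }
        .each { |j| j.thing_id = copy.id }
  end

  # Now only the latest version points at this row, so modify it in place.
  thing.value = new_value
  thing
end

things = [Thing.new(1, "NIH")]
joins  = [VersionThing.new(1, 1), VersionThing.new(2, 1)]
edit_thing(things, joins, 2, 1, "NSF") # version 1 keeps "NIH" via a copy
```

After the first edit the thing is no longer shared, so later edits on the same version skip the copy and update the row directly.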
To find an article's deduplicated versioned has-many thing for a given version:
- Query version join where version_id == the version you want
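As a sketch of that lookup, again in plain Ruby with arrays in place of tables (`VersionedThing`, `ThingJoin`, and `things_for_version` are hypothetical names), the same method serves old and current versions alike:

```ruby
# Hypothetical in-memory tables; names are illustrative only.
VersionedThing = Struct.new(:id, :value)
ThingJoin      = Struct.new(:version_id, :thing_id)

# Resolve the things for any version through the join table.
def things_for_version(things, joins, version_id)
  ids = joins.select { |j| j.version_id == version_id }.map(&:thing_id)
  things.select { |t| ids.include?(t.id) }
end

things = [VersionedThing.new(1, "old funder"), VersionedThing.new(2, "new funder")]
joins  = [ThingJoin.new(1, 1), ThingJoin.new(2, 2)]

things_for_version(things, joins, 1) # state of the old version
things_for_version(things, joins, 2) # state of the latest version
```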
Advantages:
- Fast versioning
- Fast saves once the copy is made
- Reasonably fast saves even when we need to create a new "thing" row
- Possible to prevent changes to previous versions
- Reasonably easy reconstitution of a previous version (as easy as finding the latest version)
- Extendable to any existing data model (we can put a join table in between any of them and a version)
Disadvantages:
- Complicates the data model
- Will require hooking into before_update or similar Active Record callbacks, possibly in a complex way
Incremental migration steps:
- New infrastructure
  - Rename “versioned_text” to “paper_version”
  - We will need to migrate S3 asset locations
- Funder migration
  - Move funders model through versions model
  - Backfill old versions with snapshots
  - Continue creating snapshots, but also use new versions+funder model
  - Serialize new versioned funder to frontend
  - Change frontend diffing to use new funder version model
- Refine migration based on lessons learned from funder migration
- Author migration
  - Consolidate author/group author models, moving position column to authors model
  - Repeat steps of funder migration
- Repeat for answers
- Repeat for figures
- Repeat for reviewer recommendations
- Clean up
  - Remove old snapshot code
- Attachment migrations
  - Create new intermediary join for version_attachments
  - Move latest attachments to that
  - Migrate snapshots to the new model
  - Flatten out `versions` (paper_trail) into the new model
  - This work is probably going to be pretty complicated, since we have paper_trail, snapshots, and our existing deduplication system for attachments
Diagram: Versioned Thing Schema (attached as application/gliffy+json and as Versioned Thing Schema.png)