
Find a way to rebuild the arc indices in version order #31

@jrdh

Description


The idea behind the arc indices was that they would hold the old record versions in roughly version order, so *-arc-0 would contain the oldest record versions and (for example) *-arc-19 would hold newer versions.

This is only the case when starting from scratch and indexing after each new version of a set of records is created. If you resync, the records are iterated over and all of their versions are added to the arc indices before moving on to the next record. This means *-arc-0 ends up holding every old version of a given record, so the arcs group documents by record rather than by version.

Actually achieving the desired result from a resync is going to be fiddly. My first idea for a process to do this would be:

  1. read every record from the resource's data collection
  2. count how many documents need to be indexed at each version (taking into account any options changes)
  3. calculate how many arcs are needed based on the version counts
  4. assign each version to an arc (or arcs if needed)
  5. re-read every record from the resource's data collection and, as the elasticsearch documents are generated, assign them to an arc based on their version as per step 4

This will be pretty labour intensive (the records have to be read twice!) and will be fiddly to write, since options changes and deletions have to be accounted for correctly.
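A minimal sketch of steps 2-4, assuming `records` is an iterable of objects exposing a `.versions` attribute and that each arc should hold a fixed number of documents; `plan_arcs`, `record.versions` and `DOCS_PER_ARC` are made-up names here rather than anything in this codebase, and the options/deletions handling is deliberately left out:

```python
from collections import Counter

# hypothetical cap on how many documents each arc index should hold
DOCS_PER_ARC = 1_000_000


def plan_arcs(records):
    # step 2: count how many documents would be indexed at each version
    # (options changes and deletions are ignored here, which is the fiddly bit)
    counts = Counter()
    for record in records:
        for version in record.versions:
            counts[version] += 1

    # steps 3 & 4: walk the versions oldest to newest, filling one arc at a
    # time and spilling a version across arcs when it doesn't fit
    assignments = {}  # version -> [(arc number, docs in that arc), ...]
    arc, space = 0, DOCS_PER_ARC
    for version in sorted(counts):
        remaining = counts[version]
        assignments[version] = []
        while remaining:
            chunk = min(remaining, space)
            assignments[version].append((arc, chunk))
            remaining -= chunk
            space -= chunk
            if not space:
                arc += 1
                space = DOCS_PER_ARC
    return assignments
```

Step 5 would then re-read the records and use the returned assignments to route each generated document to the right arc.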

Another possible methodology would be:

  1. read every record from the resource's data collection
  2. create the elasticsearch documents for each version and write them to temporary files on disk, one for each version
  3. read the files in version order, adding the documents to elasticsearch and incrementing the arc index as normal

This is much easier to understand and implement, but relies on having enough disk space to store all the documents (or some clever way of clearing down data once certain thresholds have been reached, e.g. once an arc's worth of data has been added to the oldest version file, index it and remove it). I've had some good compression results from storing JSON documents in gzipped jsonl files, so that may be a route worth going down that wouldn't use many GBs of disk space.
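A rough sketch of this file-based approach, using gzipped jsonl as suggested; `generate_docs` (yielding `(version, doc)` pairs for a record) and `index_doc` (pushing a document into a given arc) are hypothetical stand-ins passed in as callables rather than real functions from this repo:

```python
import gzip
import json
from pathlib import Path


def resync_via_files(records, generate_docs, index_doc, tmp_dir: Path,
                     docs_per_arc: int = 1_000_000):
    tmp_dir.mkdir(parents=True, exist_ok=True)

    # step 2: append each generated document to a gzipped jsonl file, one file
    # per version (note this keeps one file handle open per version)
    files = {}
    try:
        for record in records:
            for version, doc in generate_docs(record):
                if version not in files:
                    files[version] = gzip.open(tmp_dir / f"{version}.jsonl.gz", "at")
                files[version].write(json.dumps(doc) + "\n")
    finally:
        for f in files.values():
            f.close()

    # step 3: replay the files oldest version first, rolling over to the next
    # arc every docs_per_arc documents and deleting each file once indexed
    arc, in_arc = 0, 0
    for version in sorted(files):
        path = tmp_dir / f"{version}.jsonl.gz"
        with gzip.open(path, "rt") as f:
            for line in f:
                index_doc(json.loads(line), arc)
                in_arc += 1
                if in_arc >= docs_per_arc:
                    arc += 1
                    in_arc = 0
        path.unlink()
```

Deleting each version file as soon as it has been indexed keeps peak disk usage down, which pairs well with the gzip compression mentioned above.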

Labels: enhancement
