
Find a way to rebuild the arc indices in version order #31

@jrdh

Description


The idea behind the arc indices was that they would hold the old record versions in roughly version order, so *-arc-0 would contain the oldest record versions and (for example) *-arc-19 would hold newer versions.

This is only the case when starting from scratch and indexing after each new version of a set of records is created. If you resync, the records are iterated over and all of their versions are added to the arc indices before moving on to the next record. This means *-arc-0 ends up holding every old version of a given record, so the arcs group documents by record rather than by version.

Actually achieving the desired result from a resync is going to be fiddly. My first idea for a process to do this would be:

  1. read every record from the resource's data collection
  2. count how many documents need to be indexed at each version (taking into account any options changes)
  3. calculate how many arcs are needed based on the version counts
  4. assign each version to an arc (or arcs if needed)
  5. re-read every record from the resource's data collection and, as the elasticsearch documents are generated, assign them to an arc based on their version as per step 4

This will be pretty labour intensive (the records have to be read twice!) and will be fiddly to write, since options changes and deletions have to be accounted for correctly.
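A minimal sketch of steps 2-4, assuming `records` is an iterable of objects exposing a `.versions` attribute and that each arc should hold a fixed number of documents; `plan_arcs`, `record.versions` and `DOCS_PER_ARC` are made-up names here rather than anything in this codebase, and the options/deletions handling is deliberately left out:

```python
from collections import Counter

# hypothetical cap on how many documents each arc index should hold
DOCS_PER_ARC = 1_000_000


def plan_arcs(records):
    # step 2: count how many documents would be indexed at each version
    # (options changes and deletions are ignored here, which is the fiddly bit)
    counts = Counter()
    for record in records:
        for version in record.versions:
            counts[version] += 1

    # steps 3 & 4: walk the versions oldest to newest, filling one arc at a
    # time and spilling a version across arcs when it doesn't fit
    assignments = {}  # version -> [(arc number, docs in that arc), ...]
    arc, space = 0, DOCS_PER_ARC
    for version in sorted(counts):
        remaining = counts[version]
        assignments[version] = []
        while remaining:
            chunk = min(remaining, space)
            assignments[version].append((arc, chunk))
            remaining -= chunk
            space -= chunk
            if not space:
                arc += 1
                space = DOCS_PER_ARC
    return assignments
```

Step 5 would then re-read the records and use the returned assignments to route each generated document to the right arc.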

Another possible methodology would be:

  1. read every record from the resource's data collection
  2. create the elasticsearch documents for each version and write them to temporary files on disk, one for each version
  3. read the files in version order, adding the documents to elasticsearch and incrementing the arc index as normal

This is much easier to understand and implement, but relies on having enough disk space to store all the documents (or some clever way of clearing down data once certain thresholds have been reached, e.g. once an arc's worth of data has been added to the oldest version file, index it and remove it). I've had some good compression results from storing JSON documents in gzipped jsonl files, so that may be a route worth going down that wouldn't use many GBs of disk space.
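A rough sketch of this file-based approach, using gzipped jsonl as suggested; `generate_docs` (yielding `(version, doc)` pairs for a record) and `index_doc` (pushing a document into a given arc) are hypothetical stand-ins passed in as callables rather than real functions from this repo:

```python
import gzip
import json
from pathlib import Path


def resync_via_files(records, generate_docs, index_doc, tmp_dir: Path,
                     docs_per_arc: int = 1_000_000):
    tmp_dir.mkdir(parents=True, exist_ok=True)

    # step 2: append each generated document to a gzipped jsonl file, one file
    # per version (note this keeps one file handle open per version)
    files = {}
    try:
        for record in records:
            for version, doc in generate_docs(record):
                if version not in files:
                    files[version] = gzip.open(tmp_dir / f"{version}.jsonl.gz", "at")
                files[version].write(json.dumps(doc) + "\n")
    finally:
        for f in files.values():
            f.close()

    # step 3: replay the files oldest version first, rolling over to the next
    # arc every docs_per_arc documents and deleting each file once indexed
    arc, in_arc = 0, 0
    for version in sorted(files):
        path = tmp_dir / f"{version}.jsonl.gz"
        with gzip.open(path, "rt") as f:
            for line in f:
                index_doc(json.loads(line), arc)
                in_arc += 1
                if in_arc >= docs_per_arc:
                    arc += 1
                    in_arc = 0
        path.unlink()
```

Deleting each version file as soon as it has been indexed keeps peak disk usage down, which pairs well with the gzip compression mentioned above.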

Labels: enhancement
