Persistence: purge unreferenced Objs (WIP) #9688

Draft
wants to merge 2 commits into
base: main
from

snazy
Member

@snazy snazy commented Oct 2, 2024

This is an attempt to implement the algorithm mentioned in PR #9401.

The Obj.referenced() attribute contains the timestamp at which the object was last "referenced", that is, when a write of the object was last attempted. It is ...

  • set when an object is first persisted via storeObj()
  • updated in the database when storeObj() did not persist the object because it already exists
  • set/updated via upsertObj()
  • updated via updateConditional()
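
The rules above can be illustrated with a small in-memory model. This is only a sketch: the class `ReferencedSketch`, its method signatures, the explicit `now` parameter and the string payloads are hypothetical and are not the Nessie persistence API. It also assumes that `updateConditional()` touches the timestamp only when the update succeeds, which the description does not state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of the `referenced` rules; not the real Persist API. */
final class ReferencedSketch {
  private final Map<String, String> payloads = new HashMap<>();
  private final Map<String, Long> referencedAt = new HashMap<>();

  /** Returns true if the object was newly persisted, false if it already existed. */
  boolean storeObj(String id, String payload, long now) {
    boolean created = payloads.putIfAbsent(id, payload) == null;
    // Set on first persist; otherwise only the timestamp is updated.
    referencedAt.put(id, now);
    return created;
  }

  /** Unconditionally writes the object and sets/updates the timestamp. */
  void upsertObj(String id, String payload, long now) {
    payloads.put(id, payload);
    referencedAt.put(id, now);
  }

  /** Assumption: the timestamp is only updated when the conditional update succeeds. */
  boolean updateConditional(String id, String expected, String replacement, long now) {
    if (!payloads.replace(id, expected, replacement)) {
      return false;
    }
    referencedAt.put(id, now);
    return true;
  }

  /** Timestamp of the last write attempt; throws if the object is unknown. */
  long referenced(String id) {
    return referencedAt.get(id);
  }
}
```

In this model a second `storeObj()` call for an existing ID leaves the payload untouched and only moves `referenced` forward, which is the property the purge algorithm relies on.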

Let's assume that there is a mechanism to identify the IDs of all referenced objects (it would be very similar to what the export functionality does). The algorithm to purge unreferenced objects must never delete an object that is referenced at any point in time. It must also handle the case in which an object was unreferenced when the purge-unreferenced-objects routine started but became referenced while the routine was running.

An approach could work as follows:

  1. Memoize the current timestamp (minus some wall-clock drift adjustment).
  2. Identify the IDs of all referenced objects. If the set of IDs is large, a bloom filter could be used instead; a false positive then only causes an unreferenced object to be retained.
  3. Then scan all objects in the repository. An object can be purged if ...
        * its ID is not in the set (or bloom filter) generated in step 2 ...
        * AND its referenced timestamp is less than the memoized timestamp.
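
The three steps can be sketched against an in-memory map. Everything named here is hypothetical: `PurgeSketch`, the `objs` map of object ID to `referenced` timestamp, the `driftAllowance` parameter and the `identifyReferencedIds` supplier stand in for whatever the real implementation uses. The point of the sketch is the ordering: the timestamp is memoized before the referenced IDs are collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Supplier;

/** Hypothetical sketch of the purge steps; not the implementation in this PR. */
final class PurgeSketch {
  /**
   * @param objs maps object ID to its `referenced` timestamp (stand-in for the database)
   * @param now current wall-clock time, same unit as the timestamps
   * @param driftAllowance safety margin for wall-clock drift between nodes
   * @param identifyReferencedIds stand-in for the "identify referenced objects" mechanism
   * @return the IDs that were purged, in sorted order
   */
  static List<String> purgeUnreferenced(
      Map<String, Long> objs,
      long now,
      long driftAllowance,
      Supplier<Set<String>> identifyReferencedIds) {
    // Step 1: memoize the timestamp BEFORE collecting referenced IDs.
    long memoizedTimestamp = now - driftAllowance;

    // Step 2: collect the IDs of all referenced objects.
    Set<String> referencedIds = identifyReferencedIds.get();

    // Step 3: scan a snapshot of all objects and purge the eligible ones.
    List<String> purged = new ArrayList<>();
    for (Map.Entry<String, Long> e : new TreeMap<>(objs).entrySet()) {
      boolean unreferenced = !referencedIds.contains(e.getKey());
      boolean oldEnough = e.getValue() < memoizedTimestamp;
      if (unreferenced && oldEnough) {
        objs.remove(e.getKey());
        purged.add(e.getKey());
      }
    }
    return purged;
  }
}
```

An object written after the memoized timestamp is kept even if it is not in the referenced set, because it may belong to a commit that was still in flight while step 2 ran.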

Any deletion in the backing database would follow the meaning of this pseudo SQL: DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp.
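
For a backend without SQL, the same "delete only if still old" semantics need a single atomic check-and-remove. A minimal sketch, assuming a `ConcurrentHashMap` from object ID to `referenced` timestamp as a stand-in for the database (the class and method names are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical in-memory equivalent of the conditional DELETE; not a real backend. */
final class ConditionalDeleteSketch {
  /**
   * Removes objId only if its `referenced` timestamp is less than memoizedTimestamp.
   * The check and the removal happen atomically inside computeIfPresent.
   *
   * @return true if the object was deleted
   */
  static boolean deleteIfOlderThan(
      ConcurrentHashMap<String, Long> objs, String objId, long memoizedTimestamp) {
    AtomicBoolean deleted = new AtomicBoolean(false);
    objs.computeIfPresent(
        objId,
        (id, referenced) -> {
          if (referenced < memoizedTimestamp) {
            deleted.set(true);
            return null; // returning null removes the mapping
          }
          return referenced; // touched in the meantime: keep it
        });
    return deleted.get();
  }
}
```

If a writer updates `referenced` between the scan and the delete, the condition no longer holds and the object is kept.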

Note that the referenced attribute may be inaccurate when an object is retrieved from the objects cache (that is, during normal operations). This is not a problem, because the referenced attribute is irrelevant for production accesses.

There are two edge cases / race conditions:

  • (for some backends): A storeObj() operation detects that the object already exists, then the purge routine deletes that object, and then storeObj() tries to update the referenced attribute. The result is the loss of that object. This race condition can only occur if the object existed but was not referenced.
  • While the referenced objects are being identified, a new named reference (branch / tag) is created that points to commit(s) which are identified as unreferenced and are later purged.
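
The first race can be replayed deterministically by splitting `storeObj()` into its two phases and running the purge in between. This is a single-threaded sketch with hypothetical names (`StoreVsPurgeRaceSketch`, `exists`, `touchReferenced`, `purge`); how a real backend splits the operation is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical replay of the storeObj() vs. purge interleaving; not real backend code. */
final class StoreVsPurgeRaceSketch {
  private final Map<String, Long> objs = new HashMap<>(); // object ID -> referenced

  /** storeObj() phase 1: detect that the object already exists. */
  boolean exists(String id) {
    return objs.containsKey(id);
  }

  /** storeObj() phase 2: update `referenced`; returns false if no object was updated. */
  boolean touchReferenced(String id, long now) {
    return objs.computeIfPresent(id, (k, v) -> now) != null;
  }

  /** Purge routine for a single object that is not in the referenced set. */
  boolean purge(String id, long memoizedTimestamp) {
    Long referenced = objs.get(id);
    if (referenced != null && referenced < memoizedTimestamp) {
      objs.remove(id);
      return true;
    }
    return false;
  }

  /** Returns {sawExisting, purged, touched, stillPresent} for the problematic interleaving. */
  static boolean[] replay() {
    StoreVsPurgeRaceSketch store = new StoreVsPurgeRaceSketch();
    store.objs.put("x", 100L); // exists, but is not referenced

    boolean sawExisting = store.exists("x"); // writer: "already there, only touch it"
    boolean purged = store.purge("x", 900L); // purge routine runs in between
    boolean touched = store.touchReferenced("x", 1000L); // writer: nothing left to update

    return new boolean[] {sawExisting, purged, touched, store.exists("x")};
  }
}
```

In this model the writer saw the object and assumes it is stored, yet the object is gone afterwards. The only visible signal is that the update in phase 2 touched nothing.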

@snazy snazy force-pushed the purge-unreferenced-objs branch 5 times, most recently from 2eec23c to bed28a3 Compare October 7, 2024 12:07
... to delete an `Obj` only if its `referenced()` timestamp has the expected value.