Filestore Improvements #108
base: main
Conversation
```python
from sentry.db.models import BoundedBigIntegerField, Model  # import path assumed

class FileBlob2(Model):
    organization_id = BoundedBigIntegerField(db_index=True)
```
We also use files outside of organization contexts in control silo. Currently we've cloned the File model relations into control silo models so that we would have similar storage/interfaces for user & sentry app avatars.
Would you want to align file usage in control silo as well?
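For context, here is a rough, hypothetical illustration of the "cloned relations" arrangement described above: the control silo carries look-alike models so avatar code gets the same storage interface. The class and field names below are illustrative assumptions, not the actual schema.

```python
# Hypothetical sketch of control-silo twins of the File/FileBlob relations.
# Field names and the placeholder app label are assumptions, not the real schema.
from django.db import models


class ControlFileBlob(models.Model):
    checksum = models.CharField(max_length=40, unique=True)
    size = models.BigIntegerField(null=True)
    path = models.TextField(null=True)  # location of the bytes in blob storage

    class Meta:
        app_label = "filestore_sketch"  # placeholder for the sketch


class ControlFile(models.Model):
    name = models.TextField()
    blobs = models.ManyToManyField(ControlFileBlob)

    class Meta:
        app_label = "filestore_sketch"
```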
User avatars are indeed an interesting problem. Do we have different limits on non-debug-files?
As in: A debug-file can be up to 2G right now, and it is internally chunked.
Can we get away with using a different model for avatars altogether? I would argue they are a lot smaller (limited to 1M maybe?), and it does not make sense to chunk those.
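To make that suggestion concrete, here is a minimal sketch of what a dedicated, unchunked avatar model could look like, assuming a hard 1 MiB cap. The class name, fields, and the validation hook are assumptions for illustration, not part of the RFC.

```python
# Minimal sketch, assuming avatars are capped at ~1 MiB and never chunked.
# AvatarBlob and its fields are hypothetical names, not part of the RFC.
from django.core.exceptions import ValidationError
from django.db import models

MAX_AVATAR_SIZE = 1024 * 1024  # assumed 1 MiB limit


class AvatarBlob(models.Model):
    checksum = models.CharField(max_length=40, unique=True)
    size = models.PositiveIntegerField()
    data = models.BinaryField()  # small enough to store inline, no chunking

    class Meta:
        app_label = "filestore_sketch"  # placeholder for the sketch

    def save(self, *args, **kwargs):
        # Reject anything above the assumed avatar limit before it hits storage.
        if self.size > MAX_AVATAR_SIZE:
            raise ValidationError("avatars are limited to 1 MiB")
        return super().save(*args, **kwargs)
```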
So if I read this correctly, you would still want to chunk files, but not store those chunks deduplicated, thus avoiding the atomic reference-counting problem?

How would you manage the migration from the old system to the new one? Will there ever be a cut-off date at which you can just hard-drop the old tables and the GCS storage?

As this whole blob-related discussion started off with my discovery of a race condition between blob upload and blob deletion: would this be solved by splitting off the staging area for uploads, as you suggested, from the long-term storage? As a reminder, the race condition is actually two separate TOCTOU (Time-of-Check-to-Time-of-Use) problems.

I believe the first problem can be solved by a dedicated per-org staging area, one that refreshes a chunk's TTL on every query by chunk-hash (a rough sketch follows below). The second problem can be solved by not storing blobs deduplicated, as suggested.

Not sure if that complexity would be worth it, or whether we can just store duplicated blobs. Deletions would be trivial, and would also work correctly for older files and blobs, as long as we do not have concurrent writes and deletes.
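As a rough sketch of that per-org staging area, the idea would be that every lookup by chunk-hash refreshes a timestamp, so only chunks untouched for the full TTL get reaped. All names here (StagedChunk, STAGING_TTL, the helper functions) are illustrative assumptions, not an agreed design.

```python
# Hypothetical per-org staging area: chunk rows carry a last_touched timestamp
# that is refreshed on every query by chunk-hash; a reaper only deletes rows
# whose TTL has fully elapsed. Names and fields are assumptions.
from datetime import timedelta

from django.db import models
from django.utils import timezone

STAGING_TTL = timedelta(hours=24)  # assumed TTL for staged chunks


class StagedChunk(models.Model):  # not deduplicated across orgs
    organization_id = models.BigIntegerField(db_index=True)
    checksum = models.CharField(max_length=40)
    path = models.TextField()  # location of the uploaded bytes in blob storage
    last_touched = models.DateTimeField(default=timezone.now, db_index=True)

    class Meta:
        app_label = "filestore_sketch"  # placeholder for the sketch
        unique_together = (("organization_id", "checksum"),)


def touch_staged_chunk(organization_id: int, checksum: str) -> bool:
    """Refresh the TTL when assembly checks for a chunk; returns whether it exists."""
    return (
        StagedChunk.objects.filter(organization_id=organization_id, checksum=checksum)
        .update(last_touched=timezone.now())
        > 0
    )


def reap_expired_chunks() -> None:
    """Delete staged chunks that have not been touched within the TTL."""
    StagedChunk.objects.filter(last_touched__lt=timezone.now() - STAGING_TTL).delete()
```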
Co-authored-by: Mark Story <mark@mark-story.com>
I don't know. I think I would allow chunking as part of the system, but I would force each chunk to be associated with its offset. Honestly, though, for most of the things we probably want to do, one huge chunk for the entire file is probably preferable in practical terms.
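A minimal sketch of "chunk tied to its offset": each stored chunk records the byte range it covers within its parent file, so assembly is an ordered scan and a single whole-file chunk is just the degenerate case. FileChunk and its fields are illustrative, not part of the RFC.

```python
# Hypothetical chunk model keyed by (file_id, offset); names are assumptions.
from django.db import models


class FileChunk(models.Model):
    file_id = models.BigIntegerField(db_index=True)
    offset = models.BigIntegerField()  # absolute byte offset within the file
    size = models.BigIntegerField()
    path = models.TextField()          # where the chunk bytes live in blob storage

    class Meta:
        app_label = "filestore_sketch"  # placeholder for the sketch
        unique_together = (("file_id", "offset"),)
```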
Can we reasonably get a histogram of chunk reuse? I would love to have some real data on what that reuse looks like.
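One hedged way to pull such a histogram with the Django ORM, assuming the existing file-to-blob mapping table (FileBlobIndex) keeps a foreign key named blob; the import path and field names are assumptions.

```python
# Sketch: map "number of files referencing a blob" -> "how many blobs have that count".
from collections import Counter

from django.db.models import Count

from sentry.models import FileBlobIndex  # assumed import path


def chunk_reuse_histogram() -> Counter:
    """Histogram of blob reuse counts across all FileBlobIndex rows."""
    per_blob = (
        FileBlobIndex.objects.values("blob_id")
        .annotate(refs=Count("id"))
        .values_list("refs", flat=True)
    )
    return Counter(per_blob)
```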
This is a meta RFC to cover some of the potential improvements to our filestore system.
Rendered RFC