Triox file storage #22

AaronErhardt · 2021-02-08T15:37:10Z

AaronErhardt
Feb 8, 2021
Maintainer

This is a suggestion for implementing a file storage API that Triox can build upon.

The key tasks the API should handle are the following:

Store user data
Manage quotas
Manage shares (shared folders and files) and external storage
Handle file synchronization
Optional per-user encryption
Federation with other storages
(Optional: classic file system access, for example with FUSE)

Any feedback is welcome!

Storage layout

User data

User data is stored at a specific location, similar to the current model (e.g. in data/users/:id). The API should be able to access user data identified by the user id and the relative path inside the user directory. All file operations should be asynchronous, and directories should be readable as a stream of data in a format like zip (to allow directory downloads).
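A minimal sketch of the lookup by user id and relative path, plus a zip export of a directory. The function names are hypothetical, the user id is assumed to be already validated, and the sketch is synchronous and builds the archive in memory for brevity, whereas the proposal asks for asynchronous, streamed operations.

```python
import io
import zipfile
from pathlib import Path


def resolve_user_path(data_root: Path, user_id: str, rel_path: str) -> Path:
    """Map (user id, relative path) to a location below data/users/:id."""
    user_root = (data_root / "users" / user_id).resolve()
    target = (user_root / rel_path).resolve()
    # Reject paths such as "../other_user/secret" that escape the user directory.
    if target != user_root and user_root not in target.parents:
        raise PermissionError(f"path escapes user directory: {rel_path}")
    return target


def zip_directory(directory: Path) -> bytes:
    """Pack all files below a directory into a zip archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(directory.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(directory).as_posix())
    return buffer.getvalue()
```

A directory download would then amount to `zip_directory(resolve_user_path(root, user_id, "docs"))`, with the in-memory buffer replaced by a streaming writer in a real implementation.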

A read-write lock mechanism should be used for all read and write operations. This means that if one client wants to write, all other clients have to wait before performing operations on the same location themselves. This mechanism could be very coarse at first, locking per user and per share, and be made more fine-grained in the future so that locks only affect operations on the same location.
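The coarse variant could be sketched as one read-write lock per user or share. `ReadWriteLock`, `LockRegistry` and the scope strings are hypothetical names, and the sketch uses threads rather than an async runtime. Note that this simple version lets a steady flow of readers starve a waiting writer.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class ReadWriteLock:
    """Allows many concurrent readers or exactly one writer."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            # A writer waits until no reader and no other writer is active.
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True
        try:
            yield
        finally:
            with self._cond:
                self._writing = False
                self._cond.notify_all()


class LockRegistry:
    """Coarse locking: one lock per scope, e.g. "user:42" or "share:7"."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks = defaultdict(ReadWriteLock)

    def for_scope(self, scope: str) -> ReadWriteLock:
        with self._guard:
            return self._locks[scope]
```

Moving to finer-grained locking later would mostly mean changing the scope key from a user or share id to a path prefix.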

Quotas are stored with the user's information and handled automatically by the API. On each write and delete operation the API keeps track of the bytes used by a user and returns an error if the quota is exceeded. Shares count as data of the owner and affect the owner's quota.
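The bookkeeping described here could look roughly like the following. `QuotaTracker` and `QuotaExceeded` are hypothetical names, and the sketch assumes the caller knows the size of the file being replaced.

```python
class QuotaExceeded(Exception):
    """Raised when a write would push a user over the quota."""


class QuotaTracker:
    def __init__(self, limit_bytes: int, used_bytes: int = 0) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = used_bytes

    def record_write(self, old_size: int, new_size: int) -> None:
        """Account for a write that replaces old_size bytes with new_size bytes."""
        projected = self.used_bytes - old_size + new_size
        if projected > self.limit_bytes:
            # The counter is left untouched so a rejected write costs nothing.
            raise QuotaExceeded(f"{projected} > {self.limit_bytes} bytes")
        self.used_bytes = projected

    def record_delete(self, size: int) -> None:
        """Give back the bytes of a deleted file."""
        self.used_bytes = max(0, self.used_bytes - size)
```

Writes into a share would call `record_write` on the tracker of the share's owner, not on the tracker of the user performing the write.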

Shares

Shared folders and files are stored in a separate directory, e.g. in data/shares/:id. Each share has its own unique id and information that specifies which users and groups are allowed to access the shared folder, and with which permissions. Inside the user data, shares are represented only as files that point to the id of the share; they are resolved automatically when relative paths are accessed through the API. The references to shared data can also be moved freely inside the user's data.

The API also stores the paths where users store their references to the share so that they can be easily deleted when the share is removed. Alternatively a garbage collection mechanism could delete invalid share references from time to time.
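One way to picture the reference files: while walking a relative path, the API checks whether a component is backed by a reference file and, if so, continues in data/shares/:id. The `.shareref` suffix and the plain-text file content are assumptions made for this sketch only; permission checks are left out.

```python
from pathlib import Path, PurePosixPath

REF_SUFFIX = ".shareref"  # hypothetical marker for share reference files


def resolve_with_shares(data_root: Path, user_id: str, rel_path: str) -> Path:
    """Resolve a user-relative path, following share references on the way."""
    pure = PurePosixPath(rel_path)
    if pure.is_absolute() or ".." in pure.parts:
        raise ValueError(f"invalid relative path: {rel_path}")
    current = data_root / "users" / user_id
    for part in pure.parts:
        ref = current / (part + REF_SUFFIX)
        if ref.is_file():
            # The reference file only contains the id of the share.
            share_id = ref.read_text().strip()
            if not share_id.isalnum():
                raise ValueError(f"malformed share reference: {ref}")
            current = data_root / "shares" / share_id
        else:
            current = current / part
    return current
```

Because the reference is an ordinary file, moving it to another folder changes the path under which the user sees the share without touching the shared data itself.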

Sync

The sync mechanism is based on trees that store the state of the synchronization. Taking advice from the new Dropbox sync engine this would result in three trees:

A remote tree stores the latest state of the user’s data in the cloud.
A local tree stores the last observed state of the user’s data on disk.
A synced tree expresses the last known fully synced state between the remote tree and the local tree.
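To illustrate how the three trees interact, the following sketch derives a list of sync actions by comparing the remote and local trees against the synced tree. Trees are simplified to flat dictionaries from path to an opaque state id, and the action names are made up for this example.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, str]  # path -> opaque state id (for example a content hash)


def plan_sync(remote: Tree, local: Tree, synced: Tree) -> List[Tuple[str, str]]:
    """Return (action, path) pairs that bring remote and local back in sync."""
    actions: List[Tuple[str, str]] = []
    for path in sorted(set(remote) | set(local) | set(synced)):
        base, r, l = synced.get(path), remote.get(path), local.get(path)
        remote_changed = r != base
        local_changed = l != base
        if r == l:
            # Both sides agree; only the synced tree may need to catch up.
            if remote_changed:
                actions.append(("mark_synced", path))
        elif remote_changed and local_changed:
            actions.append(("conflict", path))
        elif remote_changed:
            actions.append(("download" if r is not None else "delete_local", path))
        else:
            actions.append(("upload" if l is not None else "delete_remote", path))
    return actions
```

A missing entry (`None`) stands for "file does not exist", so deletions fall out of the same comparison as edits.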

Each user has their own sync trees. Each shared folder also has its own sync trees, which are part of the sync trees of the users that have access to the shared folder.

The sync tree must store uniquely identifiable information about a file's state. This can be done by storing a hash of the file path and the timestamp of the latest change (or retrieving it from the OS). Storing a hash of the file's content could be beneficial as well, as this allows skipping the sync of moved or copied files.
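A tree entry along these lines could be modelled as follows. SHA-256 and the field names are assumptions; the second function shows how a content hash lets a client recognise moved or copied files without transferring them again.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FileState:
    path_hash: str     # hash of the relative path
    modified_ns: int   # timestamp of the latest change
    content_hash: str  # hash of the file content


def file_state(rel_path: str, content: bytes, modified_ns: int) -> FileState:
    return FileState(
        path_hash=hashlib.sha256(rel_path.encode("utf-8")).hexdigest(),
        modified_ns=modified_ns,
        content_hash=hashlib.sha256(content).hexdigest(),
    )


def find_reusable(old: Dict[str, FileState], new: Dict[str, FileState]) -> Dict[str, str]:
    """Map each new path to an old path that already holds the same content."""
    by_content: Dict[str, str] = {}
    for path, state in old.items():
        by_content.setdefault(state.content_hash, path)
    return {
        path: by_content[state.content_hash]
        for path, state in new.items()
        if path not in old and state.content_hash in by_content
    }
```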

Delta-sync could be achieved in a lightweight manner by storing a binary diff of only the latest changes. Rare conflicts, where two clients have made local changes at the same location, should be resolved deterministically by accepting the first client's changes and discarding the other changes (of course, a sync client could notify the user before discarding changes).
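The "first client wins" rule can be made deterministic with a per-path revision counter: a change is only accepted if it was based on the revision the server currently holds. This revision scheme is an assumption of the sketch, not something the proposal prescribes, and `RevisionStore` is a hypothetical name.

```python
from typing import Dict, Tuple


class RevisionStore:
    """Accepts a change only if it is based on the current revision."""

    def __init__(self) -> None:
        self._files: Dict[str, Tuple[int, bytes]] = {}  # path -> (revision, content)

    def get(self, path: str) -> Tuple[int, bytes]:
        # Revision 0 with empty content stands for "no such file yet".
        return self._files.get(path, (0, b""))

    def submit(self, path: str, base_revision: int, content: bytes) -> bool:
        current, _ = self.get(path)
        if base_revision != current:
            # Another client committed first; the caller can notify the
            # user before discarding the local change.
            return False
        self._files[path] = (current + 1, content)
        return True
```

If two clients both edit revision 1 of a file, whichever `submit` call reaches the server first succeeds and the second one returns `False`.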

Open Questions

I don't expect the current elaboration to be complete, and I hope the feedback will help in finding and resolving more questions. Some questions about implementation details are already open:

Should the data be stored as binary chunks (blocks) without using a regular file system structure but an internal database to map paths to files and directories? This would probably help implementing encryption on top of this structure and allow syncing chunks of data instead of whole files.
How should the sync trees be stored on the server and on the client?
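To make the first question more concrete, a chunk-based layout could be sketched as a content-addressed block store plus an index from paths to chunk ids. Fixed-size chunks, SHA-256 ids and in-memory dictionaries are simplifying assumptions; a real store would persist both maps.

```python
import hashlib
from typing import Dict, List


class ChunkStore:
    def __init__(self, chunk_size: int = 4 * 1024 * 1024) -> None:
        self.chunk_size = chunk_size
        self.chunks: Dict[str, bytes] = {}     # chunk id -> chunk data
        self.index: Dict[str, List[str]] = {}  # path -> ordered chunk ids

    def put(self, path: str, data: bytes) -> List[str]:
        """Store a file and return the ids of chunks that were not known before."""
        ids: List[str] = []
        new_ids: List[str] = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            chunk_id = hashlib.sha256(chunk).hexdigest()
            if chunk_id not in self.chunks:
                self.chunks[chunk_id] = chunk
                new_ids.append(chunk_id)
            ids.append(chunk_id)
        self.index[path] = ids
        return new_ids

    def get(self, path: str) -> bytes:
        return b"".join(self.chunks[chunk_id] for chunk_id in self.index[path])
```

The list returned by `put` is what a sync client would need to transfer, which is why this layout allows syncing chunks of data instead of whole files. With fixed-size chunks this only pays off for in-place edits, since inserting bytes shifts every later chunk.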

File encryption

(For those interested in how Nextcloud handles encryption.)

General design:

Encryption shouldn't be global but per user and optional, except when the admin forces users to encrypt.
Triox never stores unencrypted private keys. Private keys must be encrypted by the user's credentials.
Nobody except the user has access to encrypted data (this also means there's no recovery if the user loses the password and recovery codes).

User data

Each user has his own public and private key, generated when the user logs in for the first time after encryption was enabled. The private key is protected by the user's password.

Each file has its own (slightly smaller) key that is encrypted with the public key of the user. When the user uploads a file, a new key is generated and the file is encrypted with it. The key is then encrypted with the user's public key and stored together with the file. If the user wants to read this file, the key of the file is decrypted with the private key of the user and can then be used to read the contents of the file.

Having a separate key per file is more complex at first look but has the advantage that it enables encrypted shares.

Auth process

On login, the private key is temporarily decrypted with the password the user submitted. Then a copy of the private key is created that is protected by a new, randomly generated password. This new password is sent to the client as an encrypted JWT claim; the client can then use its JWT to decrypt the private key and, with it, the files. The copy of the private key is automatically deleted by a background process when the JWT becomes invalid after a few hours.
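The lifetime handling of the temporary key copy could be sketched as below. The re-encryption of the private key and the JWT handling are deliberately left out; `SessionKeyStore`, its method names and the four-hour default are assumptions for illustration.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SessionKeyCopy:
    encrypted_private_key: bytes  # private key protected by the session password
    expires_at: float             # should match the expiry of the JWT


class SessionKeyStore:
    def __init__(self, lifetime_seconds: float = 4 * 3600) -> None:
        self.lifetime_seconds = lifetime_seconds
        self._copies: Dict[str, SessionKeyCopy] = {}

    @staticmethod
    def new_session_password() -> str:
        """Random password that would travel to the client inside the JWT."""
        return secrets.token_urlsafe(32)

    def put(self, session_id: str, encrypted_private_key: bytes,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._copies[session_id] = SessionKeyCopy(
            encrypted_private_key, now + self.lifetime_seconds)

    def get(self, session_id: str, now: Optional[float] = None) -> Optional[bytes]:
        now = time.time() if now is None else now
        copy = self._copies.get(session_id)
        if copy is None or copy.expires_at <= now:
            return None
        return copy.encrypted_private_key

    def purge_expired(self, now: Optional[float] = None) -> int:
        """Meant to be called periodically by a background task."""
        now = time.time() if now is None else now
        expired = [sid for sid, copy in self._copies.items() if copy.expires_at <= now]
        for sid in expired:
            del self._copies[sid]
        return len(expired)
```

`get` already refuses expired copies, so the background purge is only needed to make sure stale key material does not linger on the server.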

Shared folders

Shared files also have their own keys, but these keys are encrypted with the public keys of all users that are allowed to access the files. Each user can then decrypt the files using their private key.
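The bookkeeping behind this can be sketched without committing to a cipher: the file is encrypted once with its file key, and only that key is stored once per authorised user. The `wrap` and `unwrap` callables are hypothetical stand-ins for public-key encryption and private-key decryption; a real implementation would use an audited cryptography library here.

```python
import secrets
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical primitives, passed in so the sketch names no algorithm.
Wrap = Callable[[str, bytes], bytes]    # (user id, file key) -> wrapped key
Unwrap = Callable[[str, bytes], bytes]  # (user id, wrapped key) -> file key


@dataclass
class SharedFileKeys:
    wrapped: Dict[str, bytes] = field(default_factory=dict)  # user id -> wrapped file key


def create_file_keys(owner: str, wrap: Wrap) -> Tuple[bytes, SharedFileKeys]:
    """Generate a fresh file key and wrap it for the owner."""
    file_key = secrets.token_bytes(32)
    return file_key, SharedFileKeys({owner: wrap(owner, file_key)})


def grant_access(keys: SharedFileKeys, granting_user: str, new_user: str,
                 wrap: Wrap, unwrap: Unwrap) -> None:
    """Give new_user access by wrapping the existing file key for them."""
    if granting_user not in keys.wrapped:
        raise PermissionError(f"{granting_user} has no access to this file")
    file_key = unwrap(granting_user, keys.wrapped[granting_user])
    # The file content is untouched; only one more wrapped key is stored.
    keys.wrapped[new_user] = wrap(new_user, file_key)


def revoke_access(keys: SharedFileKeys, user: str) -> None:
    # A user who already saw the file key could have kept it, so a real
    # system would also need to re-encrypt the file with a new key.
    keys.wrapped.pop(user, None)
```

Sharing a file therefore costs one extra wrapped key per recipient rather than another encrypted copy of the file.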

Open Questions

Again, I don't expect the current elaboration to be complete and I hope for feedback on this section.
Some open questions are:

Should the encryption be purely client-side instead? This would make things more efficient on the server side but protecting secrets like the private key inside a web application running in the web browser isn't easy.
How should recovery codes or a similar mechanism be implemented to recover data after losing the password?
Can the en-/decryption process be optimized or simplified?
