tracking issue: File ID Manager

# File ID Manager design

Right now, there is no universal way in Jupyter for developers to track a file as it is created, modified, and renamed. This is essential when developers need data to be associated with a specific file across its lifetime. The **File ID Manager** (**FIM**) is a proposed Jupyter manager to associate immutable file IDs with files that tracks the path of a file across its entire lifetime. That is, a file ID always be solved to the current path of the corresponding file even when the file is moved.

## Use cases

<details>
<summary>Use case: Comments</summary>

Comments is a feature request that allows users to attach comments to JupyterLab documents. This offers a richer user collaboration experience, in alignment with Jupyter’s larger goal of providing stronger features along the same vein. Right now, the comments feature request has several proposed implementations for storing comments:

* Metadata within files (e.g. storing comments data within `.ipynb` metadata fields)
* Sidecar file adjacent to the target file which stores comments data
* SQLite database that maintains a mapping between the target file’s path and comments data

Leveraging metadata fields limits support of this feature to file types that support hidden metadata within the blob. This feature cannot work on raw text files. Furthermore, this implementation is not extensible as it requires a custom implementation for each file type.

Sidecar files mitigate this, but this implementation requires us to pollute the directory structure with sidecar files. Furthermore, users must remember to move or copy both files in tandem or else the comments data is no longer associated with the target file or with the new copy of the target file, respectively. This breaks existing scripts that perform such filesystem operations, and potential data loss leads to a very poor developer experience.

A SQLite database that abstracts the implementation away from end-users is the most promising implementation. Here we are able to leverage the benefits of using sidecar files without polluting directory structure or compromising on user and developer experience. However, there are two glaring issues with maintaining a database between the file path and the comments data:

* Users can move files. If a file is moved, then all the associated comments data is lost.
* Users can copy files. If a file is copied, then the associated comments data is not copied.
</details>

## Feature specification

The FIM is a new manager in Jupyter Server that supports the following key features:

* Generating a unique ID per file path
* Supporting retrieval of ID given a file path and retrieval of file path given an ID
* Maintaining this relationship by tracking a file across its lifetime (creation, movement, deletion)
    * On file moves, the same file must maintain the same ID. Given the ID of a file, this ID should resolve to the new path of said file after a move operation.
* Notifying other Jupyter Server services and the JupyterLab client on copies
    * This can be done via [event bus](https://github.com/jupyter-server/jupyter_server/pull/820) introduced in Jupyter Server 2.

Furthermore, all of these features must be implemented agnostic of the underlying operating system, filesystem, and kernel. This problem is surprisingly challenging however deceptively simple. In this document, we wish to outline the design of the FIM such that these features can be achieved in such a way. 

## Terminology

We will use a few custom terms to discuss the design on a more specific and granular basis.

* **Op:** A filesystem operation, mainly creates, moves, copies and deletes.

* **In-band**/**Out-of-band:** In-band ops are any ops performed through the Jupyter server Contents API, which is called by the JupyterLab UI. Out-of-band edits are other ops done through some other method, e.g. shell commands or drag-and-drop in Windows File Explorer.
* **Stat info**: file metadata returned from the `stat()` system call.
* **Ino**: inode number. An integer associated with each file that points to where its inode is located on disk. The inode stores all relevant metadata about the file, and thus the ino is preserved across file moves within the same filesystem.
* **Crtime**: the file’s creation time. May not be available on all platforms.
* **Similar stat info**: when the stat info of a previously deleted file and a newly created file have the same `ino` and `crtime` (falling back to `mtime` if `crtime` is not available on the platform). This indicates an out-of-band move.
* **Indexing**: creating a record that associates a file path with a unique, immutable, and non-reused file ID. We say that the file ID manager “indexes” a file when it stores an association between the file path with a file ID.
* **Indexed-but-moved**: a file which was previously indexed but moved out-of-band.
* **Disjoint move**: an out-of-band op involving deleting a file and creating a file with identical contents at a different path rather than moving a file with `mv`. Out-of-band disjoint moves are impossible to track without storing file contents in an object database like Git. Disjoint moves include:
    * `cp` the original file and then `rm` the original file
    * Moving to a different filesystem and back

## Takeaway: In-band ops are easy to track

Because the Contents API manager can directly call the FIM’s methods, it’s easy to track a file across its lifetime, since the FIM is informed of all ops happening to all files in a JupyterLab session. The rest of this document focuses on a strategy to track out-of-band ops.

## Looking to Git for inspiration

One key idea to note here is that tracking a file across its lifetime in a platform-agnostic manner is exactly what Git does. Git does not rely on a filesystem event daemon like `inotify` to do this, and relies purely on files themselves. This method ensures that Git works on pretty much every platform used today.

Git uses an index file `.git/index` to track all of the files under the root, and stores the original copies of each file in the objects database at `.git/objects`. Running `git init` on a new directory indexes all files (including directories) within the root, recursively. This can be shown with the `git ls-tree` command:

```
% git ls-tree -rt @
100644 blob c13c5f627fd143c64d90f2d8c730dc752d89ace7    .babelrc
100644 blob db228f0817ece3981ad5938f6cb6a8c79f83d472    .eslintignore
...
040000 tree 2eb004453e870badeb9c48f35066ab78eb78065a    docs
...
040000 tree e49014df8009c41124298a33a9011f182d99978c    examples
...
```

Each index entry has an object type. `blob` types represent files, while `tree` types represent directories.

Tracking a file across by reading its contents is very expensive as it requires disk reads. Hence Git relies on a heuristic obtained from file metadata. [`stat()`](https://docs.oracle.com/cd/E19048-01/chorus4/806-3328/6jcg1bm34/index.html) is a system call available to all POSIX-compliant platforms, along with some “mostly POSIX-compliant” platforms including Windows. It exposes file metadata that can be employed as a heuristic for tracking file moves. For a full list of metadata types see the [system call documentation](https://man7.org/linux/man-pages/man2/open.2.html). The most relevant ones used by [Git](https://git-scm.com/docs/racy-git/en) are the following:

* `st_mtime`: time last modified
* `st_ctime`: 
    * UNIX-like/BSD-like: time last changed (includes all file metadata such as permissions)
    * Windows: time created
* `st_ino`: inode number (preserved across moves)
* `st_uid`: owner user ID
* `st_gid`: owner group ID
* `st_size`: size in bytes

Git stores this stat info within each index entry and employs this as a heuristic. If the metadata is identical, then the file or directory is almost certainly unchanged.

Git also uses this stat info to detect new and deleted files under the Git root. When adding or deleting a file, the `mtime` of the immediate parent directory is changed.

* Note that this relation is not recursive; editing `foo/bar/baz.txt` only updates the `mtime` of `foo/bar` and not `foo`.

Thus if the stat info for the directory is different, Git can read the current contents of that directory and compare it to the old contents of the directory. Doing this across all directories under the root **allows Git to detect any created or deleted file under the git root**.

This functionality is deceptively powerful because it allows Git to **track moves very efficiently**. When you think about it, a file move is almost like deleting a file and creating a new file with similar stat info. Because this stat info is preserved across moves, whenever Git detects a new file, it can compare the stat info to any deleted files. If the stat info is identical, then the file was just moved. Otherwise, Git falls back to reading the contents of the deleted file (retained in the objects database) and diffing it against the new file. If the difference is [less than 50%](https://git-scm.com/docs/git-diff#Documentation/git-diff.txt--Mltngt), then Git considers the file to be renamed.

However, note that Git does not care about file copies at all. To Git, a new copy of a file is just a new file, with no history associated with it. There is also no way we can detect copies efficiently (purely from stat info) without running a diff against every single file under the index. Hence, this strategy does not detect out-of-band copies.

## Implementation proposal

We maintain a single table: `Files`. This has the following schema:

```
Files
id     | int
path   | string
ino    | int
crtime | int
mtime  | int
is_dir | int
```

`path`, `ino`, `is_dir` are indexed to speed lookups.

* `id` : the file ID 
* `path`: the file path
* `ino`: the inode number of the file
* `crtime`: the time the file was created
    * determined via `ctime` if on windows, `birthtime` on MacOS and other BSD-likes
* `mtime`: the time the file contents were last modified
* `is_dir`: 1 if the file is a directory, 0 otherwise.

### FIM.__init__()

Create SQLite tables and indices if necessary. Then index all directories under the server root.

### FIM._stat(path: str)

Retrieves a file’s stat info and returns it in a `StatStruct`:

```
class StatStruct:
    ino: int
    crtime: Optional[int]
    mtime: int
    is_dir: bool
```

### FIM._sync_file(path: str, stat_info: StatStruct)

This private method is what detects out-of-band moves. The key idea is:

* When we notice a file is no longer present in the filesystem, we do NOT delete the corresponding record in the Files table.
* When we index a new path, we first check the Files table for a record with the same `ino` and `crtime`. If `crtime` is not available, we fallback to verifying `mtime`.

If there is a record with similar stat info, we update the existing record with the new path and stat info, then return the file ID. Otherwise this method returns `None`.

```
old_path = "old"
new_path = "new"
Path(old_path).touch()
id = FIM.index(old_path)
os.rename(old_path, new_path)
assert id == FIM._sync_file(new_path, FIM._stat(new_path))
```

### FIM.index(path: str): number

First, call `_stat()` on `path` to make sure file exists. Otherwise return `None`.
Then, call `_sync_file` on `path` to check if file was indexed-but-moved. Return ID if so.
Finally, create a new record for the file at `path`. Return file ID.

### FIM.get_id(path: str): number

Same as `index()` except returns `None` if the file was not indexed-but-moved. Does not create a new record and file ID for the file at `path`.

### FIM._sync_all()

Syncs all new files under the entire server root. Files moved out-of-band can only appear under dirty directories, which are:

* Indexed directories with a different mtime.
* Unindexed directory

We iterate through all dirty directories under the server root and call `_sync_file()` on all of their contents. This ensures that the correct file path is associated with each file ID.

### FIM.get_path(id: number): str

Call `_sync_all()`. Then find the `path` associated with the file ID.
Next, verify if the file at the path exists. If not, then return `None`.
Otherwise return the path.

### FIM.[move, copy, delete]

More straightforward and not worth discussing here as these methods handle in-band ops that are easier to reason about.

## Summary of out-of-band ops handling

* Out-of-band moves: detected by `_sync_file()` and `_sync_all()` which are called by `get_id()` and `get_path()` respectively
* Out-of-band deletes: handled in `get_id()` and `get_path()` methods by verifying file exists before returning
* Out-of-band copies: not handled at all

## Known issues

* Disjoint moves are untrackable.
    * This is because when you delete a file and create a new file with the same contents, all the stat info is possibly completely changed. The only similarity is in the contents. Without maintaining an object database storing said contents, there is no way to track this.
* Out-of-band copies are untrackable.
    * Similarly to the above scenario, copies don’t have any shared stat info between the original and the copy. In fact, `inotify` doesn’t emit copy events at all.
* `get_path()` can be slow if you move a very large directory.
    * Very informal benchmark: Takes roughly 150 ms on a m5.12xlarge to call `get_path()` after moving a very large directory (`/arch`) in the Linux source tree.
* Possible race condition where quickly deleting a file and creating an unrelated file gets detected as a “move”.
    * `crtime` has a certain precision depending on the underlying filesystem/kernel. This can be 1 nanosecond (ext* with 256-byte inodes), 100 nanoseconds (NTFS), one second (ext* with 128-byte inodes), or two seconds (FAT/FAT32).
    * Thus, you can fool FIM’s heuristic by quickly deleting a file and then creating a new one (preserving the `ino`) without changing the `crtime`.
    * Likely not an issue; this is an already rare edge case that becomes even rarer with newer filesystems supporting higher timestamp precision.
* Filesystems/kernels that don’t have a `crtime` implementation are unable to detect moves followed by edits.
    * This is even worse for directories, since `mtime` for directories changes whenever a file underneath is added, deleted, or renamed.
    * This mainly affects UNIX-likes. Windows, OS X, and BSD derivatives are not affected.
    * The alternative is to ignore `mtime`. This is discussed further in the Open Questions section.

## FAQ

* Why not just use inos to identify a file?
    * Inos, although unique and preserved across moves, are reused if a file is deleted. **Inos guarantee uniqueness but not identity**. For example, if file `foo` has a ino of 1 and `boo` has a ino of 2, then if `foo` gets deleted and a new file `baz` gets created, then `baz` has a ino of 1. This is inappropriate for our use case as a file ID should never be reused; it should track the path of one and only one file across its lifetime.
    * In the above scenario, any comments attached to `foo` would get attached to `baz` after the ops execute.
    * Thus we should rely on some other data (ideally `crtime`) to give us more confidence in a file’s identity.

* Why not just use a filesystem event daemon to watch the contents of the server root?
    * These are not necessarily platform-agnostic, and don’t work when the server is offline. Furthermore, this implementation would run into many of the same problems already being tackled. For example, if a daemon is watching the server root, how will it detect me moving a file out of the server root and back in? It would have to rely on the same heuristics to determine file identity that we’re using here, namely `ino` and `crtime`.
* Why is the logic for `get_path()` so complex? Do we really need to sync all the possible dirty directories under the server root to associate the correct path to a given ID?
    * We can only detect a move after syncing the new file. `get_id()` is simple because we’re given the path, and hence can easily detect an out-of-band move. `get_path()` is more tricky because we’re not given the path. Hence, we need to sync every file under all dirty directories to do so.

## Open questions

* Is there any way for us to clean up records of a deleted file without losing track of moves? If not, will this become a performance bottleneck?
    * Right now, we only delete records if we detect a duplicate ino or when the FIM delete method is explicitly called.
    * This is because we don’t really have a way of distinguishing between an out-of-band move and an out-of-band delete until we index the new file. We need to retain the deleted file’s stat info in order to determine if the new indexed file was the result of a move.
    * Example: let’s say I move a file from an old path to a new path, without indexing the file at the new path. Then if FIM deletes the record associated with the old path, when FIM indexes the new path, it doesn’t have any record of the old file, and thus assigns a new file ID. Then all the data associated with the old file ID gets lost.
* How to track birth/creation time better on UNIX-like OSes? Is this a PR blocker?
    * Bit of kernel dev ~~drama~~ lore: https://www.anmolsarma.in/post/linux-file-creation-time/#fn:2
    * Basically ext4 filesystem does actually store birth time, but POSIX standards do not require this in `stat()` syscall.
    * Newer versions of Linux support the `statx()` syscall.
* Should we fall back to `mtime` if `crtime` is not available?
    * However, this leads to new IDs being created for moved files followed by edits and moved directories followed by any change in its entries (adding/removing/renaming an entry).
        * This results in any data associated with the moved-and-edited file to no longer be associated.
    * If we were to ignore `mtime` and only use `ino` to compare file identity on platforms where `crtime` is not available, then newly created files following a delete could be given the same ID.
        * The results in any data associated with the deleted file to be incorrectly associated with the unrelated new file.
    * I chose `mtime` fallback as the default behavior, because it’s possible in the future to warn users if data is associated with a file that no longer exists. It’s much trickier to determine if data was associated with the “wrong file”.
* How to test this service with NFS? Use a custom `ContentsManager` implementation?

## PRs

- https://github.com/jupyter-server/jupyter_server/pull/921

## Future steps

* [ ] Verified & tested NFS/EFS support
* [ ] UUIDs config option by popular demand
* [ ] Config option for SQLite PRAGMAs to improve performance at the cost of durability
    * Most notably: `journal_mode` and `synchronous`
* [ ] Run `_sync_all()` on an interval (e.g. 1s) when the server is on.
    * Question: is there any server performance implication of doing this? Can this be mitigated by offloading the work to a separate thread/process?
* [ ] Emit to event bus to inform client that a move happened in the user’s working directory. (users should be able to see updates in left panel file browser)
* [ ] Better benchmarks, preferably in CI/CD
* [ ] Documentation in readthedocs
* [ ] CLI options to invoke FID manager methods (e.g. `jupyter mv`, `jupyter cp`, etc.)



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

tracking issue: File ID Manager #940

File ID Manager design

Use cases

Feature specification

Terminology

Takeaway: In-band ops are easy to track

Looking to Git for inspiration

Implementation proposal

FIM.init()

FIM._stat(path: str)

FIM._sync_file(path: str, stat_info: StatStruct)

FIM.index(path: str): number

FIM.get_id(path: str): number

FIM._sync_all()

FIM.get_path(id: number): str

FIM.[move, copy, delete]

Summary of out-of-band ops handling

Known issues

FAQ

Open questions

PRs

Future steps

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

tracking issue: File ID Manager #940

Description

File ID Manager design

Use cases

Feature specification

Terminology

Takeaway: In-band ops are easy to track

Looking to Git for inspiration

Implementation proposal

FIM.init()

FIM._stat(path: str)

FIM._sync_file(path: str, stat_info: StatStruct)

FIM.index(path: str): number

FIM.get_id(path: str): number

FIM._sync_all()

FIM.get_path(id: number): str

FIM.[move, copy, delete]

Summary of out-of-band ops handling

Known issues

FAQ

Open questions

PRs

Future steps

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions