Description
File ID Manager design
Right now, there is no universal way in Jupyter for developers to track a file as it is created, modified, and renamed. This is essential when developers need data to be associated with a specific file across its lifetime. The File ID Manager (FIM) is a proposed Jupyter manager to associate immutable file IDs with files that tracks the path of a file across its entire lifetime. That is, a file ID always be solved to the current path of the corresponding file even when the file is moved.
Use cases
Use case: Comments
Comments is a feature request that allows users to attach comments to JupyterLab documents. This offers a richer user collaboration experience, in alignment with Jupyter’s larger goal of providing stronger features along the same vein. Right now, the comments feature request has several proposed implementations for storing comments:
- Metadata within files (e.g. storing comments data within
.ipynb
metadata fields) - Sidecar file adjacent to the target file which stores comments data
- SQLite database that maintains a mapping between the target file’s path and comments data
Leveraging metadata fields limits support of this feature to file types that support hidden metadata within the blob. This feature cannot work on raw text files. Furthermore, this implementation is not extensible as it requires a custom implementation for each file type.
Sidecar files mitigate this, but this implementation requires us to pollute the directory structure with sidecar files. Furthermore, users must remember to move or copy both files in tandem or else the comments data is no longer associated with the target file or with the new copy of the target file, respectively. This breaks existing scripts that perform such filesystem operations, and potential data loss leads to a very poor developer experience.
A SQLite database that abstracts the implementation away from end-users is the most promising implementation. Here we are able to leverage the benefits of using sidecar files without polluting directory structure or compromising on user and developer experience. However, there are two glaring issues with maintaining a database between the file path and the comments data:
- Users can move files. If a file is moved, then all the associated comments data is lost.
- Users can copy files. If a file is copied, then the associated comments data is not copied.
Feature specification
The FIM is a new manager in Jupyter Server that supports the following key features:
- Generating a unique ID per file path
- Supporting retrieval of ID given a file path and retrieval of file path given an ID
- Maintaining this relationship by tracking a file across its lifetime (creation, movement, deletion)
- On file moves, the same file must maintain the same ID. Given the ID of a file, this ID should resolve to the new path of said file after a move operation.
- Notifying other Jupyter Server services and the JupyterLab client on copies
- This can be done via event bus introduced in Jupyter Server 2.
Furthermore, all of these features must be implemented agnostic of the underlying operating system, filesystem, and kernel. This problem is surprisingly challenging however deceptively simple. In this document, we wish to outline the design of the FIM such that these features can be achieved in such a way.
Terminology
We will use a few custom terms to discuss the design on a more specific and granular basis.
-
Op: A filesystem operation, mainly creates, moves, copies and deletes.
-
In-band/Out-of-band: In-band ops are any ops performed through the Jupyter server Contents API, which is called by the JupyterLab UI. Out-of-band edits are other ops done through some other method, e.g. shell commands or drag-and-drop in Windows File Explorer.
-
Stat info: file metadata returned from the
stat()
system call. -
Ino: inode number. An integer associated with each file that points to where its inode is located on disk. The inode stores all relevant metadata about the file, and thus the ino is preserved across file moves within the same filesystem.
-
Crtime: the file’s creation time. May not be available on all platforms.
-
Similar stat info: when the stat info of a previously deleted file and a newly created file have the same
ino
andcrtime
(falling back tomtime
ifcrtime
is not available on the platform). This indicates an out-of-band move. -
Indexing: creating a record that associates a file path with a unique, immutable, and non-reused file ID. We say that the file ID manager “indexes” a file when it stores an association between the file path with a file ID.
-
Indexed-but-moved: a file which was previously indexed but moved out-of-band.
-
Disjoint move: an out-of-band op involving deleting a file and creating a file with identical contents at a different path rather than moving a file with
mv
. Out-of-band disjoint moves are impossible to track without storing file contents in an object database like Git. Disjoint moves include:cp
the original file and thenrm
the original file- Moving to a different filesystem and back
Takeaway: In-band ops are easy to track
Because the Contents API manager can directly call the FIM’s methods, it’s easy to track a file across its lifetime, since the FIM is informed of all ops happening to all files in a JupyterLab session. The rest of this document focuses on a strategy to track out-of-band ops.
Looking to Git for inspiration
One key idea to note here is that tracking a file across its lifetime in a platform-agnostic manner is exactly what Git does. Git does not rely on a filesystem event daemon like inotify
to do this, and relies purely on files themselves. This method ensures that Git works on pretty much every platform used today.
Git uses an index file .git/index
to track all of the files under the root, and stores the original copies of each file in the objects database at .git/objects
. Running git init
on a new directory indexes all files (including directories) within the root, recursively. This can be shown with the git ls-tree
command:
% git ls-tree -rt @
100644 blob c13c5f627fd143c64d90f2d8c730dc752d89ace7 .babelrc
100644 blob db228f0817ece3981ad5938f6cb6a8c79f83d472 .eslintignore
...
040000 tree 2eb004453e870badeb9c48f35066ab78eb78065a docs
...
040000 tree e49014df8009c41124298a33a9011f182d99978c examples
...
Each index entry has an object type. blob
types represent files, while tree
types represent directories.
Tracking a file across by reading its contents is very expensive as it requires disk reads. Hence Git relies on a heuristic obtained from file metadata. stat()
is a system call available to all POSIX-compliant platforms, along with some “mostly POSIX-compliant” platforms including Windows. It exposes file metadata that can be employed as a heuristic for tracking file moves. For a full list of metadata types see the system call documentation. The most relevant ones used by Git are the following:
st_mtime
: time last modifiedst_ctime
:- UNIX-like/BSD-like: time last changed (includes all file metadata such as permissions)
- Windows: time created
st_ino
: inode number (preserved across moves)st_uid
: owner user IDst_gid
: owner group IDst_size
: size in bytes
Git stores this stat info within each index entry and employs this as a heuristic. If the metadata is identical, then the file or directory is almost certainly unchanged.
Git also uses this stat info to detect new and deleted files under the Git root. When adding or deleting a file, the mtime
of the immediate parent directory is changed.
- Note that this relation is not recursive; editing
foo/bar/baz.txt
only updates themtime
offoo/bar
and notfoo
.
Thus if the stat info for the directory is different, Git can read the current contents of that directory and compare it to the old contents of the directory. Doing this across all directories under the root allows Git to detect any created or deleted file under the git root.
This functionality is deceptively powerful because it allows Git to track moves very efficiently. When you think about it, a file move is almost like deleting a file and creating a new file with similar stat info. Because this stat info is preserved across moves, whenever Git detects a new file, it can compare the stat info to any deleted files. If the stat info is identical, then the file was just moved. Otherwise, Git falls back to reading the contents of the deleted file (retained in the objects database) and diffing it against the new file. If the difference is less than 50%, then Git considers the file to be renamed.
However, note that Git does not care about file copies at all. To Git, a new copy of a file is just a new file, with no history associated with it. There is also no way we can detect copies efficiently (purely from stat info) without running a diff against every single file under the index. Hence, this strategy does not detect out-of-band copies.
Implementation proposal
We maintain a single table: Files
. This has the following schema:
Files
id | int
path | string
ino | int
crtime | int
mtime | int
is_dir | int
path
, ino
, is_dir
are indexed to speed lookups.
id
: the file IDpath
: the file pathino
: the inode number of the filecrtime
: the time the file was created- determined via
ctime
if on windows,birthtime
on MacOS and other BSD-likes
- determined via
mtime
: the time the file contents were last modifiedis_dir
: 1 if the file is a directory, 0 otherwise.
FIM.init()
Create SQLite tables and indices if necessary. Then index all directories under the server root.
FIM._stat(path: str)
Retrieves a file’s stat info and returns it in a StatStruct
:
class StatStruct:
ino: int
crtime: Optional[int]
mtime: int
is_dir: bool
FIM._sync_file(path: str, stat_info: StatStruct)
This private method is what detects out-of-band moves. The key idea is:
- When we notice a file is no longer present in the filesystem, we do NOT delete the corresponding record in the Files table.
- When we index a new path, we first check the Files table for a record with the same
ino
andcrtime
. Ifcrtime
is not available, we fallback to verifyingmtime
.
If there is a record with similar stat info, we update the existing record with the new path and stat info, then return the file ID. Otherwise this method returns None
.
old_path = "old"
new_path = "new"
Path(old_path).touch()
id = FIM.index(old_path)
os.rename(old_path, new_path)
assert id == FIM._sync_file(new_path, FIM._stat(new_path))
FIM.index(path: str): number
First, call _stat()
on path
to make sure file exists. Otherwise return None
.
Then, call _sync_file
on path
to check if file was indexed-but-moved. Return ID if so.
Finally, create a new record for the file at path
. Return file ID.
FIM.get_id(path: str): number
Same as index()
except returns None
if the file was not indexed-but-moved. Does not create a new record and file ID for the file at path
.
FIM._sync_all()
Syncs all new files under the entire server root. Files moved out-of-band can only appear under dirty directories, which are:
- Indexed directories with a different mtime.
- Unindexed directory
We iterate through all dirty directories under the server root and call _sync_file()
on all of their contents. This ensures that the correct file path is associated with each file ID.
FIM.get_path(id: number): str
Call _sync_all()
. Then find the path
associated with the file ID.
Next, verify if the file at the path exists. If not, then return None
.
Otherwise return the path.
FIM.[move, copy, delete]
More straightforward and not worth discussing here as these methods handle in-band ops that are easier to reason about.
Summary of out-of-band ops handling
- Out-of-band moves: detected by
_sync_file()
and_sync_all()
which are called byget_id()
andget_path()
respectively - Out-of-band deletes: handled in
get_id()
andget_path()
methods by verifying file exists before returning - Out-of-band copies: not handled at all
Known issues
- Disjoint moves are untrackable.
- This is because when you delete a file and create a new file with the same contents, all the stat info is possibly completely changed. The only similarity is in the contents. Without maintaining an object database storing said contents, there is no way to track this.
- Out-of-band copies are untrackable.
- Similarly to the above scenario, copies don’t have any shared stat info between the original and the copy. In fact,
inotify
doesn’t emit copy events at all.
- Similarly to the above scenario, copies don’t have any shared stat info between the original and the copy. In fact,
get_path()
can be slow if you move a very large directory.- Very informal benchmark: Takes roughly 150 ms on a m5.12xlarge to call
get_path()
after moving a very large directory (/arch
) in the Linux source tree.
- Very informal benchmark: Takes roughly 150 ms on a m5.12xlarge to call
- Possible race condition where quickly deleting a file and creating an unrelated file gets detected as a “move”.
crtime
has a certain precision depending on the underlying filesystem/kernel. This can be 1 nanosecond (ext* with 256-byte inodes), 100 nanoseconds (NTFS), one second (ext* with 128-byte inodes), or two seconds (FAT/FAT32).- Thus, you can fool FIM’s heuristic by quickly deleting a file and then creating a new one (preserving the
ino
) without changing thecrtime
. - Likely not an issue; this is an already rare edge case that becomes even rarer with newer filesystems supporting higher timestamp precision.
- Filesystems/kernels that don’t have a
crtime
implementation are unable to detect moves followed by edits.- This is even worse for directories, since
mtime
for directories changes whenever a file underneath is added, deleted, or renamed. - This mainly affects UNIX-likes. Windows, OS X, and BSD derivatives are not affected.
- The alternative is to ignore
mtime
. This is discussed further in the Open Questions section.
- This is even worse for directories, since
FAQ
-
Why not just use inos to identify a file?
- Inos, although unique and preserved across moves, are reused if a file is deleted. Inos guarantee uniqueness but not identity. For example, if file
foo
has a ino of 1 andboo
has a ino of 2, then iffoo
gets deleted and a new filebaz
gets created, thenbaz
has a ino of 1. This is inappropriate for our use case as a file ID should never be reused; it should track the path of one and only one file across its lifetime. - In the above scenario, any comments attached to
foo
would get attached tobaz
after the ops execute. - Thus we should rely on some other data (ideally
crtime
) to give us more confidence in a file’s identity.
- Inos, although unique and preserved across moves, are reused if a file is deleted. Inos guarantee uniqueness but not identity. For example, if file
-
Why not just use a filesystem event daemon to watch the contents of the server root?
- These are not necessarily platform-agnostic, and don’t work when the server is offline. Furthermore, this implementation would run into many of the same problems already being tackled. For example, if a daemon is watching the server root, how will it detect me moving a file out of the server root and back in? It would have to rely on the same heuristics to determine file identity that we’re using here, namely
ino
andcrtime
.
- These are not necessarily platform-agnostic, and don’t work when the server is offline. Furthermore, this implementation would run into many of the same problems already being tackled. For example, if a daemon is watching the server root, how will it detect me moving a file out of the server root and back in? It would have to rely on the same heuristics to determine file identity that we’re using here, namely
-
Why is the logic for
get_path()
so complex? Do we really need to sync all the possible dirty directories under the server root to associate the correct path to a given ID?- We can only detect a move after syncing the new file.
get_id()
is simple because we’re given the path, and hence can easily detect an out-of-band move.get_path()
is more tricky because we’re not given the path. Hence, we need to sync every file under all dirty directories to do so.
- We can only detect a move after syncing the new file.
Open questions
- Is there any way for us to clean up records of a deleted file without losing track of moves? If not, will this become a performance bottleneck?
- Right now, we only delete records if we detect a duplicate ino or when the FIM delete method is explicitly called.
- This is because we don’t really have a way of distinguishing between an out-of-band move and an out-of-band delete until we index the new file. We need to retain the deleted file’s stat info in order to determine if the new indexed file was the result of a move.
- Example: let’s say I move a file from an old path to a new path, without indexing the file at the new path. Then if FIM deletes the record associated with the old path, when FIM indexes the new path, it doesn’t have any record of the old file, and thus assigns a new file ID. Then all the data associated with the old file ID gets lost.
- How to track birth/creation time better on UNIX-like OSes? Is this a PR blocker?
- Bit of kernel dev
dramalore: https://www.anmolsarma.in/post/linux-file-creation-time/#fn:2 - Basically ext4 filesystem does actually store birth time, but POSIX standards do not require this in
stat()
syscall. - Newer versions of Linux support the
statx()
syscall.
- Bit of kernel dev
- Should we fall back to
mtime
ifcrtime
is not available?- However, this leads to new IDs being created for moved files followed by edits and moved directories followed by any change in its entries (adding/removing/renaming an entry).
- This results in any data associated with the moved-and-edited file to no longer be associated.
- If we were to ignore
mtime
and only useino
to compare file identity on platforms wherecrtime
is not available, then newly created files following a delete could be given the same ID.- The results in any data associated with the deleted file to be incorrectly associated with the unrelated new file.
- I chose
mtime
fallback as the default behavior, because it’s possible in the future to warn users if data is associated with a file that no longer exists. It’s much trickier to determine if data was associated with the “wrong file”.
- However, this leads to new IDs being created for moved files followed by edits and moved directories followed by any change in its entries (adding/removing/renaming an entry).
- How to test this service with NFS? Use a custom
ContentsManager
implementation?
PRs
Future steps
- Verified & tested NFS/EFS support
- UUIDs config option by popular demand
- Config option for SQLite PRAGMAs to improve performance at the cost of durability
- Most notably:
journal_mode
andsynchronous
- Most notably:
- Run
_sync_all()
on an interval (e.g. 1s) when the server is on.- Question: is there any server performance implication of doing this? Can this be mitigated by offloading the work to a separate thread/process?
- Emit to event bus to inform client that a move happened in the user’s working directory. (users should be able to see updates in left panel file browser)
- Better benchmarks, preferably in CI/CD
- Documentation in readthedocs
- CLI options to invoke FID manager methods (e.g.
jupyter mv
,jupyter cp
, etc.)