tracking issue: File ID Manager #940

Open
8 tasks
dlqqq opened this issue Aug 2, 2022 · 6 comments

Comments

@dlqqq
Contributor

dlqqq commented Aug 2, 2022

File ID Manager design

Right now, there is no universal way in Jupyter for developers to track a file as it is created, modified, and renamed. This is essential when developers need data to be associated with a specific file across its lifetime. The File ID Manager (FIM) is a proposed Jupyter manager that associates immutable file IDs with files and tracks the path of each file across its entire lifetime. That is, a file ID can always be resolved to the current path of the corresponding file, even when the file is moved.

Use cases

Use case: Comments

Comments is a feature request that allows users to attach comments to JupyterLab documents. This offers a richer user collaboration experience, in alignment with Jupyter’s larger goal of providing stronger features along the same vein. Right now, the comments feature request has several proposed implementations for storing comments:

  • Metadata within files (e.g. storing comments data within .ipynb metadata fields)
  • Sidecar file adjacent to the target file which stores comments data
  • SQLite database that maintains a mapping between the target file’s path and comments data

Leveraging metadata fields limits support of this feature to file types that support hidden metadata within the blob. This feature cannot work on raw text files. Furthermore, this implementation is not extensible as it requires a custom implementation for each file type.

Sidecar files mitigate this, but this implementation requires us to pollute the directory structure with sidecar files. Furthermore, users must remember to move or copy both files in tandem or else the comments data is no longer associated with the target file or with the new copy of the target file, respectively. This breaks existing scripts that perform such filesystem operations, and potential data loss leads to a very poor developer experience.

A SQLite database that abstracts the implementation away from end-users is the most promising implementation. Here we are able to leverage the benefits of using sidecar files without polluting directory structure or compromising on user and developer experience. However, there are two glaring issues with maintaining a mapping between the file path and the comments data:

  • Users can move files. If a file is moved, then all the associated comments data is lost.
  • Users can copy files. If a file is copied, then the associated comments data is not copied.

Feature specification

The FIM is a new manager in Jupyter Server that supports the following key features:

  • Generating a unique ID per file path
  • Supporting retrieval of ID given a file path and retrieval of file path given an ID
  • Maintaining this relationship by tracking a file across its lifetime (creation, movement, deletion)
    • On file moves, the same file must maintain the same ID. Given the ID of a file, this ID should resolve to the new path of said file after a move operation.
  • Notifying other Jupyter Server services and the JupyterLab client on copies
    • This can be done via event bus introduced in Jupyter Server 2.

Furthermore, all of these features must be implemented in a way that is agnostic of the underlying operating system, filesystem, and kernel. This problem is deceptively simple yet surprisingly challenging. In this document, we outline a design for the FIM that achieves these features under that constraint.

Terminology

We will use a few custom terms to discuss the design on a more specific and granular basis.

  • Op: A filesystem operation, mainly creates, moves, copies and deletes.

  • In-band/Out-of-band: In-band ops are any ops performed through the Jupyter Server Contents API, which is called by the JupyterLab UI. Out-of-band ops are any ops performed through some other method, e.g. shell commands or drag-and-drop in Windows File Explorer.

  • Stat info: file metadata returned from the stat() system call.

  • Ino: inode number. An integer associated with each file that points to where its inode is located on disk. The inode stores all relevant metadata about the file, and thus the ino is preserved across file moves within the same filesystem.

  • Crtime: the file’s creation time. May not be available on all platforms.

  • Similar stat info: when the stat info of a previously deleted file and a newly created file have the same ino and crtime (falling back to mtime if crtime is not available on the platform). This indicates an out-of-band move.

  • Indexing: creating a record that associates a file path with a unique, immutable, and non-reused file ID. We say that the file ID manager “indexes” a file when it stores an association between the file path with a file ID.

  • Indexed-but-moved: a file which was previously indexed but moved out-of-band.

  • Disjoint move: an out-of-band op involving deleting a file and creating a file with identical contents at a different path rather than moving a file with mv. Out-of-band disjoint moves are impossible to track without storing file contents in an object database like Git. Disjoint moves include:

    • cp the original file and then rm the original file
    • Moving to a different filesystem and back
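The claim that the ino survives a true move can be checked with a short script. This is only a sketch: the helper name ino_after_rename and the file names are made up for illustration, and it assumes both paths live on the same filesystem.

```python
import os
import tempfile
from pathlib import Path


def ino_after_rename(directory: str) -> tuple[int, int]:
    """Return the ino of a file before and after an in-place rename."""
    old_path = Path(directory) / "old.txt"
    new_path = Path(directory) / "new.txt"
    old_path.write_text("hello")
    ino_before = os.stat(old_path).st_ino
    os.rename(old_path, new_path)  # a true move within one filesystem
    ino_after = os.stat(new_path).st_ino
    return ino_before, ino_after


with tempfile.TemporaryDirectory() as tmp:
    before, after = ino_after_rename(tmp)
    print("ino preserved:", before == after)
```

A disjoint move (cp followed by rm) gives no such guarantee, which is why it cannot be tracked from stat info alone.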

Takeaway: In-band ops are easy to track

Because the Contents API manager can directly call the FIM’s methods, it’s easy to track a file across its lifetime, since the FIM is informed of all ops happening to all files in a JupyterLab session. The rest of this document focuses on a strategy to track out-of-band ops.

Looking to Git for inspiration

One key idea to note here is that tracking a file across its lifetime in a platform-agnostic manner is exactly what Git does. Git does not rely on a filesystem event daemon like inotify to do this, and relies purely on files themselves. This method ensures that Git works on pretty much every platform used today.

Git uses an index file .git/index to track all of the files under the root, and stores the original copies of each file in the objects database at .git/objects. Running git init followed by git add . on a new directory indexes all files within the root, recursively; once committed, files and directories alike appear as entries in the commit's tree. This can be shown with the git ls-tree command:

% git ls-tree -rt @
100644 blob c13c5f627fd143c64d90f2d8c730dc752d89ace7    .babelrc
100644 blob db228f0817ece3981ad5938f6cb6a8c79f83d472    .eslintignore
...
040000 tree 2eb004453e870badeb9c48f35066ab78eb78065a    docs
...
040000 tree e49014df8009c41124298a33a9011f182d99978c    examples
...

Each entry listed has an object type. blob types represent files, while tree types represent directories.

Tracking a file by reading its contents is very expensive, as it requires disk reads. Hence Git relies on a heuristic obtained from file metadata. stat() is a system call available on all POSIX-compliant platforms, along with some “mostly POSIX-compliant” platforms including Windows. It exposes file metadata that can be employed as a heuristic for tracking file moves. For a full list of metadata types, see the system call documentation. The most relevant ones used by Git are the following:

  • st_mtime: time last modified
  • st_ctime:
    • UNIX-like/BSD-like: time last changed (includes all file metadata such as permissions)
    • Windows: time created
  • st_ino: inode number (preserved across moves)
  • st_uid: owner user ID
  • st_gid: owner group ID
  • st_size: size in bytes

Git stores this stat info within each index entry and employs this as a heuristic. If the metadata is identical, then the file or directory is almost certainly unchanged.

Git also uses this stat info to detect new and deleted files under the Git root. When adding or deleting a file, the mtime of the immediate parent directory is changed.

  • Note that this relation is not recursive; adding or deleting foo/bar/baz.txt only updates the mtime of foo/bar and not foo.

Thus if the stat info for the directory is different, Git can read the current contents of that directory and compare it to the old contents of the directory. Doing this across all directories under the root allows Git to detect any created or deleted file under the git root.
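The directory-mtime behaviour described above can be observed directly. The following is a rough sketch, not part of the proposal: the helper dir_mtime_ns and the directory names are illustrative, and the directories are pinned to an old mtime first so the result does not depend on the filesystem's timestamp granularity.

```python
import os
import tempfile
from pathlib import Path


def dir_mtime_ns(path: str) -> int:
    """Return a directory's modification time in nanoseconds."""
    return os.stat(path).st_mtime_ns


with tempfile.TemporaryDirectory() as root:
    sub = Path(root) / "foo" / "bar"
    sub.mkdir(parents=True)
    # Pin both directories to a known old mtime so any change is unambiguous.
    for d in (sub.parent, sub):
        os.utime(d, ns=(0, 0))
    recorded = {str(d): dir_mtime_ns(str(d)) for d in (sub.parent, sub)}

    (sub / "baz.txt").write_text("new file")  # adds an entry to foo/bar only

    # A directory is "dirty" when its current mtime differs from the record.
    dirty = [d for d, mtime in recorded.items() if dir_mtime_ns(d) != mtime]
    only_bar_dirty = dirty == [str(sub)]
    print("only foo/bar is dirty:", only_bar_dirty)
```

Only the immediate parent is reported, so a scan has to compare every directory under the root rather than just the top level.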

This functionality is deceptively powerful because it allows Git to track moves very efficiently. When you think about it, a file move is almost like deleting a file and creating a new file with similar stat info. Because this stat info is preserved across moves, whenever Git detects a new file, it can compare the stat info to any deleted files. If the stat info is identical, then the file was just moved. Otherwise, Git falls back to reading the contents of the deleted file (retained in the objects database) and diffing it against the new file. If the difference is less than 50%, then Git considers the file to be renamed.

However, note that Git does not care about file copies at all. To Git, a new copy of a file is just a new file, with no history associated with it. There is also no way we can detect copies efficiently (purely from stat info) without running a diff against every single file under the index. Hence, this strategy does not detect out-of-band copies.
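To illustrate why copies are invisible to a stat-based heuristic, here is a small sketch (file names are arbitrary). It assumes the original and the copy sit on the same filesystem, where two coexisting files cannot share an ino unless they are hard links.

```python
import os
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.txt"
    duplicate = Path(tmp) / "copy.txt"
    original.write_text("same contents")
    shutil.copy2(original, duplicate)  # copy2 also copies mtime where possible

    original_stat = os.stat(original)
    copy_stat = os.stat(duplicate)
    # The copy is a distinct file, so it gets its own inode.
    same_ino = original_stat.st_ino == copy_stat.st_ino
    same_size = original_stat.st_size == copy_stat.st_size
    print("same ino:", same_ino, "same size:", same_size)
```

The size matches but the ino does not, so nothing in the stat info links the copy back to the original.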

Implementation proposal

We maintain a single table: Files. This has the following schema:

Files
id     | int
path   | string
ino    | int
crtime | int
mtime  | int
is_dir | int

The path, ino, and is_dir columns are indexed to speed up lookups.

  • id : the file ID
  • path: the file path
  • ino: the inode number of the file
  • crtime: the time the file was created
    • determined via ctime on Windows, and via birthtime on macOS and other BSD-likes
  • mtime: the time the file contents were last modified
  • is_dir: 1 if the file is a directory, 0 otherwise.
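One way this schema might be expressed with Python's sqlite3 module is sketched below. The index names and the AUTOINCREMENT choice (which keeps SQLite from reusing IDs) are assumptions made for illustration, not details taken from the proposal.

```python
import sqlite3

# Hypothetical DDL for the Files table described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS Files (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- never reused by SQLite
    path   TEXT NOT NULL,
    ino    INTEGER NOT NULL,
    crtime INTEGER,                            -- NULL when unavailable
    mtime  INTEGER NOT NULL,
    is_dir INTEGER NOT NULL CHECK (is_dir IN (0, 1))
);
CREATE INDEX IF NOT EXISTS ix_files_path ON Files (path);
CREATE INDEX IF NOT EXISTS ix_files_ino ON Files (ino);
CREATE INDEX IF NOT EXISTS ix_files_is_dir ON Files (is_dir);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the Files table and its indices if they do not exist yet."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
# Column names as reported by SQLite, in declaration order.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Files)")]
print(columns)
```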

FIM.init()

Create SQLite tables and indices if necessary. Then index all directories under the server root.

FIM._stat(path: str)

Retrieves a file’s stat info and returns it in a StatStruct:

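from dataclasses import dataclass
from typing import Optional


@dataclass
class StatStruct:
    ino: int
    crtime: Optional[int]  # None when the platform has no creation time
    mtime: int
    is_dir: bool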
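A possible shape for such a stat helper is sketched below. The function name stat_path and the nanosecond units are assumptions; the platform handling follows the rule stated in the schema section (ctime on Windows, birthtime on macOS and BSD-likes, nothing otherwise), and the actual implementation may differ.

```python
import os
import stat
import sys
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatStruct:
    ino: int
    crtime: Optional[int]
    mtime: int
    is_dir: bool


def stat_path(path: str) -> Optional[StatStruct]:
    """Return stat info for path, or None if the path does not exist."""
    try:
        raw = os.stat(path)
    except FileNotFoundError:
        return None

    if hasattr(raw, "st_birthtime"):
        # macOS and BSD-likes expose a true creation time (as float seconds).
        crtime: Optional[int] = int(raw.st_birthtime * 1e9)
    elif sys.platform == "win32":
        # On Windows, st_ctime is the creation time.
        crtime = raw.st_ctime_ns
    else:
        # Elsewhere st_ctime is the change time, so report no crtime at all.
        crtime = None

    return StatStruct(
        ino=raw.st_ino,
        crtime=crtime,
        mtime=raw.st_mtime_ns,
        is_dir=stat.S_ISDIR(raw.st_mode),
    )


with tempfile.TemporaryDirectory() as tmp:
    dir_info = stat_path(tmp)
    missing_info = stat_path(os.path.join(tmp, "does-not-exist"))
    print(dir_info, missing_info)
```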

FIM._sync_file(path: str, stat_info: StatStruct)

This private method is what detects out-of-band moves. The key idea is:

  • When we notice a file is no longer present in the filesystem, we do NOT delete the corresponding record in the Files table.
  • When we index a new path, we first check the Files table for a record with the same ino and crtime. If crtime is not available, we fall back to comparing mtime.

If there is a record with similar stat info, we update the existing record with the new path and stat info, then return the file ID. Otherwise this method returns None.

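import os
from pathlib import Path

# FIM is assumed to be an initialized File ID Manager instance.
old_path = "old"
new_path = "new"
Path(old_path).touch()
file_id = FIM.index(old_path)
os.rename(old_path, new_path)  # out-of-band move
assert file_id == FIM._sync_file(new_path, FIM._stat(new_path))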
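The lookup-then-update logic could look roughly like the following. This is a minimal sketch under stated assumptions: sync_file is a free function rather than a method, the table omits is_dir, and the demo deliberately stores no crtime so that it exercises the mtime fallback on every platform.

```python
import os
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


def sync_file(conn: sqlite3.Connection, path: str, ino: int,
              crtime: Optional[int], mtime: int) -> Optional[int]:
    """Re-point an existing record at path if its stat info is similar."""
    if crtime is not None:
        row = conn.execute(
            "SELECT id FROM Files WHERE ino = ? AND crtime = ?", (ino, crtime)
        ).fetchone()
    else:
        # No creation time available: fall back to comparing mtime.
        row = conn.execute(
            "SELECT id FROM Files WHERE ino = ? AND crtime IS NULL AND mtime = ?",
            (ino, mtime),
        ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE Files SET path = ?, mtime = ? WHERE id = ?", (path, mtime, row[0])
    )
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Files (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "path TEXT, ino INTEGER, crtime INTEGER, mtime INTEGER)"
)

with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "old")
    new_path = os.path.join(tmp, "new")
    Path(old_path).touch()
    st = os.stat(old_path)
    cur = conn.execute(
        "INSERT INTO Files (path, ino, crtime, mtime) VALUES (?, ?, NULL, ?)",
        (old_path, st.st_ino, st.st_mtime_ns),
    )
    original_id = cur.lastrowid

    os.rename(old_path, new_path)  # out-of-band move; ino and mtime survive
    st = os.stat(new_path)
    synced_id = sync_file(conn, new_path, st.st_ino, None, st.st_mtime_ns)
    synced_path = conn.execute(
        "SELECT path FROM Files WHERE id = ?", (original_id,)
    ).fetchone()[0]
```

The old record is never deleted when the file disappears, which is what lets the later lookup find it and reattach the same ID to the new path.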

FIM.index(path: str): number

First, call _stat() on path to make sure file exists. Otherwise return None.
Then, call _sync_file on path to check if file was indexed-but-moved. Return ID if so.
Finally, create a new record for the file at path. Return file ID.

FIM.get_id(path: str): number

Same as index(), except that it returns None instead of creating a new record and file ID when no existing record matches the file at path.
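The difference between index() and get_id() can be shown with a toy in-memory stand-in. Everything here is hypothetical: TinyFileIdIndex uses dictionaries instead of SQLite and treats (ino, mtime) as the "similar stat info" key, which is a simplification of the heuristic described above.

```python
import itertools
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional, Tuple

StatKey = Tuple[int, int]  # (ino, mtime_ns); stands in for "similar stat info"


class TinyFileIdIndex:
    """Hypothetical in-memory stand-in for the FIM, for illustration only."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)  # IDs are never reused
        self._by_stat: Dict[StatKey, int] = {}
        self._paths: Dict[int, str] = {}

    def _stat(self, path: str) -> Optional[StatKey]:
        try:
            raw = os.stat(path)
        except FileNotFoundError:
            return None
        return (raw.st_ino, raw.st_mtime_ns)

    def _sync_file(self, path: str, key: StatKey) -> Optional[int]:
        file_id = self._by_stat.get(key)
        if file_id is not None:
            self._paths[file_id] = path  # record the current path
        return file_id

    def get_id(self, path: str) -> Optional[int]:
        key = self._stat(path)
        if key is None:
            return None
        return self._sync_file(path, key)  # never creates a record

    def index(self, path: str) -> Optional[int]:
        key = self._stat(path)
        if key is None:
            return None
        file_id = self._sync_file(path, key)
        if file_id is None:  # not seen before: create a new record
            file_id = next(self._ids)
            self._by_stat[key] = file_id
            self._paths[file_id] = path
        return file_id


with tempfile.TemporaryDirectory() as tmp:
    a, b = os.path.join(tmp, "a.txt"), os.path.join(tmp, "b.txt")
    Path(a).touch()
    index = TinyFileIdIndex()
    unindexed = index.get_id(a)  # no record yet, and none is created
    first = index.index(a)       # creates the record
    os.rename(a, b)              # out-of-band move
    moved = index.get_id(b)      # resolves to the same ID
    missing = index.index(a)     # the old path no longer exists
```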

FIM._sync_all()

Syncs all new files under the entire server root. Files moved out-of-band can only appear under dirty directories, which are:

  • Indexed directories with a different mtime.
  • Unindexed directories.

We iterate through all dirty directories under the server root and call _sync_file() on all of their contents. This ensures that the correct file path is associated with each file ID.
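Finding the dirty directories might be sketched as below. The function iter_dirty_dirs and the indexed_mtimes mapping (directory path to recorded mtime) are illustrative stand-ins for a query against the Files table; the demo pins directory mtimes to zero so the comparison is deterministic.

```python
import os
import tempfile
from typing import Dict, Iterator


def iter_dirty_dirs(root: str, indexed_mtimes: Dict[str, int]) -> Iterator[str]:
    """Yield directories under root that are unindexed or whose mtime changed."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        recorded = indexed_mtimes.get(dirpath)
        if recorded is None or recorded != os.stat(dirpath).st_mtime_ns:
            yield dirpath


with tempfile.TemporaryDirectory() as root:
    clean = os.path.join(root, "clean")
    changed = os.path.join(root, "changed")
    os.mkdir(clean)
    os.mkdir(changed)
    # Pin mtimes to a known value, then record them as the "indexed" state.
    for d in (root, clean, changed):
        os.utime(d, ns=(0, 0))
    indexed = {d: os.stat(d).st_mtime_ns for d in (root, clean, changed)}

    # Adding an entry updates the mtime of "changed" only; the new
    # subdirectory itself has never been indexed.
    new_dir = os.path.join(changed, "unindexed")
    os.mkdir(new_dir)

    dirty = sorted(iter_dirty_dirs(root, indexed))
    print(dirty)
```

Each directory yielded this way would then have its contents passed to _sync_file().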

FIM.get_path(id: number): str

Call _sync_all(). Then find the path associated with the file ID.
Next, verify if the file at the path exists. If not, then return None.
Otherwise return the path.

FIM.[move, copy, delete]

More straightforward and not worth discussing here as these methods handle in-band ops that are easier to reason about.

Summary of out-of-band ops handling

  • Out-of-band moves: detected by _sync_file() and _sync_all() which are called by get_id() and get_path() respectively
  • Out-of-band deletes: handled in get_id() and get_path() methods by verifying file exists before returning
  • Out-of-band copies: not handled at all

Known issues

  • Disjoint moves are untrackable.
    • This is because when you delete a file and create a new file with the same contents, all of the stat info may change. The only similarity is in the contents. Without maintaining an object database storing said contents, there is no way to track this.
  • Out-of-band copies are untrackable.
    • Similarly to the above scenario, copies don’t have any shared stat info between the original and the copy. In fact, inotify doesn’t emit copy events at all.
  • get_path() can be slow if you move a very large directory.
    • Very informal benchmark: it takes roughly 150 ms on an m5.12xlarge to call get_path() after moving a very large directory (/arch) in the Linux source tree.
  • Possible race condition where quickly deleting a file and creating an unrelated file gets detected as a “move”.
    • crtime has a certain precision depending on the underlying filesystem/kernel. This can be 1 nanosecond (ext* with 256-byte inodes), 100 nanoseconds (NTFS), one second (ext* with 128-byte inodes), or two seconds (FAT/FAT32).
    • Thus, you can fool FIM’s heuristic by quickly deleting a file and then creating a new one (preserving the ino) without changing the crtime.
    • Likely not an issue; this is an already rare edge case that becomes even rarer with newer filesystems supporting higher timestamp precision.
  • Filesystems/kernels that don’t have a crtime implementation are unable to detect moves followed by edits.
    • This is even worse for directories, since mtime for directories changes whenever a file underneath is added, deleted, or renamed.
    • This mainly affects UNIX-likes. Windows, OS X, and BSD derivatives are not affected.
    • The alternative is to ignore mtime. This is discussed further in the Open Questions section.

FAQ

  • Why not just use inos to identify a file?

    • Inos, although unique and preserved across moves, are reused after a file is deleted. Inos guarantee uniqueness but not identity. For example, if file foo has an ino of 1 and boo has an ino of 2, and foo gets deleted and a new file baz gets created, then baz may be assigned an ino of 1. This is inappropriate for our use case as a file ID should never be reused; it should track the path of one and only one file across its lifetime.
    • In the above scenario, any comments attached to foo would get attached to baz after the ops execute.
    • Thus we should rely on some other data (ideally crtime) to give us more confidence in a file’s identity.
  • Why not just use a filesystem event daemon to watch the contents of the server root?

    • These are not necessarily platform-agnostic, and don’t work when the server is offline. Furthermore, this implementation would run into many of the same problems already being tackled. For example, if a daemon is watching the server root, how will it detect me moving a file out of the server root and back in? It would have to rely on the same heuristics to determine file identity that we’re using here, namely ino and crtime.
  • Why is the logic for get_path() so complex? Do we really need to sync all the possible dirty directories under the server root to associate the correct path to a given ID?

    • We can only detect a move after syncing the new file. get_id() is simple because we’re given the path, and hence can easily detect an out-of-band move. get_path() is more tricky because we’re not given the path. Hence, we need to sync every file under all dirty directories to do so.

Open questions

  • Is there any way for us to clean up records of a deleted file without losing track of moves? If not, will this become a performance bottleneck?
    • Right now, we only delete records if we detect a duplicate ino or when the FIM delete method is explicitly called.
    • This is because we don’t really have a way of distinguishing between an out-of-band move and an out-of-band delete until we index the new file. We need to retain the deleted file’s stat info in order to determine if the new indexed file was the result of a move.
    • Example: let’s say I move a file from an old path to a new path, without indexing the file at the new path. Then if FIM deletes the record associated with the old path, when FIM indexes the new path, it doesn’t have any record of the old file, and thus assigns a new file ID. Then all the data associated with the old file ID gets lost.
  • How to track birth/creation time better on UNIX-like OSes? Is this a PR blocker?
  • Should we fall back to mtime if crtime is not available?
    • However, this leads to new IDs being created for moved files followed by edits and moved directories followed by any change in its entries (adding/removing/renaming an entry).
      • As a result, any data associated with the moved-and-edited file is no longer associated with it.
    • If we were to ignore mtime and only use ino to compare file identity on platforms where crtime is not available, then newly created files following a delete could be given the same ID.
      • As a result, any data associated with the deleted file is incorrectly associated with the unrelated new file.
    • I chose mtime fallback as the default behavior, because it’s possible in the future to warn users if data is associated with a file that no longer exists. It’s much trickier to determine if data was associated with the “wrong file”.
  • How to test this service with NFS? Use a custom ContentsManager implementation?

PRs

Future steps

  • Verified & tested NFS/EFS support
  • UUIDs config option by popular demand
  • Config option for SQLite PRAGMAs to improve performance at the cost of durability
    • Most notably: journal_mode and synchronous
  • Run _sync_all() on an interval (e.g. 1s) when the server is on.
    • Question: is there any server performance implication of doing this? Can this be mitigated by offloading the work to a separate thread/process?
  • Emit to event bus to inform client that a move happened in the user’s working directory. (users should be able to see updates in left panel file browser)
  • Better benchmarks, preferably in CI/CD
  • Documentation in readthedocs
  • CLI options to invoke FID manager methods (e.g. jupyter mv, jupyter cp, etc.)
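As a rough illustration of the PRAGMA option listed above, the two settings could be applied as follows. The database file name and the chosen values (WAL and NORMAL) are assumptions for the example, not a recommendation from this proposal; both trade some durability for speed.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "file_id.db"))
    # PRAGMA journal_mode returns the mode that is actually in effect.
    journal_mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    conn.execute("PRAGMA synchronous = NORMAL")
    # PRAGMA synchronous reports an integer level (0 = OFF, 1 = NORMAL, 2 = FULL).
    synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
    conn.close()
    print(journal_mode, synchronous)
```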
@dlqqq
Contributor Author

dlqqq commented Aug 4, 2022

Notes from 8/4 server meeting:

  • Drop ID from model returned by GET Contents API. Build a separate API for requesting a file ID.
  • Migrate file ID service to a separate server extension given that ContentsManager no longer needs to rely on data returned from file ID service
  • Add some abstract base class for FileIdManager to guide developers on what public methods should be implemented
  • Windows getting dropped for Jupyter server 2.0? No need for Windows support?

@kevin-bates
Member

Windows getting dropped for Jupyter server 2.0? No need for Windows support?

The only discussion I recall regarding Windows was whether the "mtime approach" holds on Windows. I'm not aware of any notion that the server would no longer support Windows in 2.0.

@Zsailer
Member

Zsailer commented Aug 4, 2022

Windows getting dropped for Jupyter server 2.0? No need for Windows support?

No, sorry, this was a poorly made joke on my end in the Zoom chat, not a serious comment. (We have been battling some issues with Jupyter Server running on Windows recently, so my comment was a bit tongue-in-cheek 😅.)

Jupyter Server 2.0 will most certainly support Windows as best as possible. 😎

@kevin-bates
Member

Hi @dlqqq - I just finished reading the design section (excellent write-up - thank you!) and had some comments regarding its content.

  • This may sound like a nit, but the references to crtime should probably be ctime. The reason is that, as you had pointed out, this value is change time (although create time on Windows), and using crtime will lead folks to believe this is purely the file creation time, which it is not. I know I tend to think of crtime as create time.
  • In doing local tests with moves (and within the same disk), I see st_ctime changing - which makes sense because the metadata of the file has changed. This implies we can't really use ctime to determine moves, and, given the issue you raised with moved-and-edited files in the Open Questions above (where mtime is problematic), I'm not sure either is reliable.
    • Interestingly, when I move the file from a local FS to a Google Drive, I definitely get new inode information, but ctime takes the value of mtime and mtime remains the same as from the source (and the same goes for moves within the drive). This implies that Google drive is probably one of those sources that don't support st_ctime.
  • If this stat approach is taken, ino should probably be coupled with the device info (st_dev) to get the true uniqueness (within the given server).
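A sketch of what pairing the ino with device info could look like is below. The helper name file_identity is made up, and it assumes the move stays within one filesystem, where both st_dev and st_ino are expected to survive a rename.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple


def file_identity(path: str) -> Tuple[int, int]:
    """Return (st_dev, st_ino), which is unique across mounted filesystems."""
    raw = os.stat(path)
    return (raw.st_dev, raw.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "old")
    new_path = os.path.join(tmp, "new")
    Path(old_path).touch()
    identity_before = file_identity(old_path)
    os.rename(old_path, new_path)  # same filesystem, so the device is unchanged
    identity_after = file_identity(new_path)
    print(identity_before == identity_after)
```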

(Definite nit: the return value of FIM.get_path(id: number): number above should be str)

@dlqqq
Contributor Author

dlqqq commented Aug 22, 2022

@kevin-bates Sorry for not addressing your concerns earlier! I was out on vacation for a week, and had a small mountain of tasks waiting for me when I got back. Let me address your concerns:

This may sound like a nit, but the references to crtime should probably be ctime. The reason is that, as you had pointed out, this value is change time (although create time on Windows), and using crtime will lead folks to believe this is purely the file creation time, which it is not. I know I tend to think of crtime as create time.

Well, we are actually being consistent and not wrongly assigning st_ctime to crtime on UNIX-likes. If the platform does not support creation times, then crtime gets set to NULL. Thus the semantic meaning of crtime is preserved. If crtime exists for a record, it's guaranteed to mean that it's the file's creation time. You can take a look in _parse_raw_stat() to see for yourself.

In doing local tests with moves (and within the same disk), I see st_ctime changing - which makes sense because the metadata of the file has changed. This implies we can't really use ctime to determine moves, and, given the issue you raised with moved-and-edited files in the Open Questions above (where mtime is problematic), I'm not sure either is reliable.

Yup! That's because you renamed the file, which changes the file's associated metadata, which forces an update to st_ctime.

ctime takes the value of mtime and mtime remains the same as from the source (and the same goes for moves within the drive). This implies that Google drive is probably one of those sources that don't support st_ctime.

Well, not necessarily. When you "move" a file to a remote filesystem, you're really just deleting a file on the current FS and creating a new one on the remote FS. That's why your st_ctime == st_mtime. They are really both just the file creation times on the remote FS. There's really no way to preserve a file ID during this operation out-of-band since you're really just creating a new file (this is the "disjoint move" I reference in the design). If you do want to preserve file ID during this disjoint move, you must perform it in-band through the ContentsManager.

If this stat approach is taken, ino should probably be coupled with the device info (st_dev) to get the true uniqueness (within the given server).

Absolutely. Will definitely work on this once I get this work into a separate server extension.

(Definite nit: the return value of FIM.get_path(id: number): number above should be str)

👍

@echarles
Member

I may be missing something, but I don't understand how https://github.com/jupyter-server/jupyter_server_fileid is meant to be used by jupyter-server. Could you highlight in a few commands/configs what needs to be done to make the fileid extension functional with the server?
