red-knot: VfsFile input ingredient and a Vfs #11802

Merged
merged 4 commits into from
Jun 12, 2024

Conversation

MichaReiser
Member

@MichaReiser MichaReiser commented Jun 8, 2024

This PR adds the foundation for red-knot's Salsa integration. It starts by defining a virtual file system that is Salsa's view of the files on disk (metadata only for now; I'll add support for content in the next PR).

VFS

The Vfs supports operating on files from the local filesystem (disk, editor, WASM, in memory) or vendored stub files (this PR only adds a stub implementation). It is virtual because it supports multiple sources and it is Salsa's view of the file system state.

The most notable change in terms of what we discussed for vendored files is that this PR exposes two methods on the Vfs:

  • file: To "intern" a file system path.
  • vendored: To intern a vendored path

The motivation behind this is that the methods differ in their return type (and argument, but that's less important). For file, it's important that the implementation returns a File object even for files that don't exist, so that Salsa can track the fact that the query needs to be re-executed if such a file gets created later. This isn't necessary for vendored files because the vendored file system is read-only. Returning None in that case reduces the number of dependencies that Salsa has to track and, in turn, can result in better performance. Splitting the methods also makes them easier to call: I suspect that code paths will work exclusively with either file system paths or vendored paths, and it would be cumbersome if callers had to wrap the path in a VfsPath just for the interning.
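
To make the difference in return types concrete, here is a minimal sketch that does not use Salsa at all. The `Vfs` struct, its fields, the `with_vendored` constructor, and the `&mut self` receivers are assumptions for illustration only; the PR's actual methods work on a Salsa database and have different signatures.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical handle to an interned file; stands in for the Salsa `VfsFile` ingredient.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct VfsFile(u32);

/// Hypothetical, Salsa-free stand-in for the `Vfs`.
#[derive(Default)]
pub struct Vfs {
    /// Interned file system paths, including paths that don't exist (yet).
    files: HashMap<String, VfsFile>,
    /// Interned vendored paths.
    vendored: HashMap<String, VfsFile>,
    /// The vendored paths that exist; read-only after construction.
    vendored_paths: HashSet<String>,
    next_id: u32,
}

impl Vfs {
    pub fn with_vendored(paths: &[&str]) -> Self {
        Self {
            vendored_paths: paths.iter().map(|p| p.to_string()).collect(),
            ..Self::default()
        }
    }

    fn allocate(&mut self) -> VfsFile {
        let file = VfsFile(self.next_id);
        self.next_id += 1;
        file
    }

    /// Always returns a handle, even for a missing file, so that a later
    /// creation of the file has something to invalidate.
    pub fn file(&mut self, path: &str) -> VfsFile {
        if let Some(file) = self.files.get(path) {
            return *file;
        }
        let file = self.allocate();
        self.files.insert(path.to_string(), file);
        file
    }

    /// Returns `None` for a missing vendored file: the vendored file system
    /// never changes, so there is nothing to track.
    pub fn vendored(&mut self, path: &str) -> Option<VfsFile> {
        if !self.vendored_paths.contains(path) {
            return None;
        }
        if let Some(file) = self.vendored.get(path) {
            return Some(*file);
        }
        let file = self.allocate();
        self.vendored.insert(path.to_string(), file);
        Some(file)
    }
}
```

Interning the same file system path twice yields the same handle, whether or not the file exists, while a vendored path that is not shipped yields `None` and allocates nothing.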

FileSystem

This PR also introduces a new FileSystem trait that abstracts away how filesystem files are read. The goal of this is that we can support different environments:

  • WASM: Ruff might not have access to a full file system, or everything is in memory, as is the case in our playground today.
  • LSP: The LSP does support reading from files, but unsaved changes take precedence over the content on disk.
  • tests: Ideally, tests don't need to write the content to disk. Instead, they can use an in-memory file system.
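
As a rough illustration of the abstraction, the sketch below defines a pared-down trait with one OS-backed and one in-memory implementation. The method set (`exists`, `read`), the `&str` paths, and the `line_count` helper are assumptions chosen to keep the example short; the trait in this PR has a different surface.

```rust
use std::collections::HashMap;
use std::io;
use std::path::Path;

/// Hypothetical, pared-down version of the trait.
pub trait FileSystem {
    fn exists(&self, path: &str) -> bool;
    fn read(&self, path: &str) -> io::Result<String>;
}

/// Delegates to the real file system through `std::fs`.
pub struct OsFileSystem;

impl FileSystem for OsFileSystem {
    fn exists(&self, path: &str) -> bool {
        Path::new(path).exists()
    }

    fn read(&self, path: &str) -> io::Result<String> {
        std::fs::read_to_string(path)
    }
}

/// Keeps all files in a map; useful for tests and WASM-like environments.
#[derive(Default)]
pub struct MemoryFileSystem {
    files: HashMap<String, String>,
}

impl MemoryFileSystem {
    pub fn write_file(&mut self, path: &str, content: &str) {
        self.files.insert(path.to_string(), content.to_string());
    }
}

impl FileSystem for MemoryFileSystem {
    fn exists(&self, path: &str) -> bool {
        self.files.contains_key(path)
    }

    fn read(&self, path: &str) -> io::Result<String> {
        self.files.get(path).cloned().ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, format!("no such file: {path}"))
        })
    }
}

/// Example consumer that works with any implementation.
pub fn line_count(fs: &dyn FileSystem, path: &str) -> io::Result<usize> {
    Ok(fs.read(path)?.lines().count())
}
```

Code written against `&dyn FileSystem` does not need to know which environment it runs in, which is what lets tests avoid touching the disk.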

Crate name

I ended up creating a new crate because I couldn't find an existing crate that fits well. I first wanted to use ruff_source_file which is very similar. However, ruff_notebook depends on it and I suspect that this crate might depend on ruff_notebook in the long term (unless we use some Arc<dyn Trait + Eq> to represent file content).

The other reason why I didn't put it in ruff_source_file is that that crate is intended to store low-level data structures for files. A Vfs and a dependency on Salsa feel like overkill for such a crate.

Open Questions

  • dir walking: It's not entirely clear to me how to implement dir walking because walkdir doesn't support virtual filesystems. Ideally, both the memory file system and the real implementation would use the same dir walking mechanism to avoid differences in behavior.
  • File watching: I think file systems need either to automatically watch files for changes or to provide an API that allows setting up file watchers. They probably need a second API that allows pulling all changes (file added, removed, changed events). The host would call an apply_changes method that takes a mutable self whenever it observes file system changes, so that the Vfs can update its metadata.
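
The file watching idea is still an open question, so the following is only one possible shape for it. The `FileChange` enum, the `revision` counter, and the map-based `Vfs` are all hypothetical; only the `apply_changes` name and the mutable `self` come from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical change events that a host (CLI watcher, LSP, playground) collects.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileChange {
    Added { path: String, revision: u64 },
    Changed { path: String, revision: u64 },
    Removed { path: String },
}

/// Hypothetical metadata store: maps a path to its last seen revision.
#[derive(Default)]
pub struct Vfs {
    revisions: HashMap<String, u64>,
}

impl Vfs {
    /// Takes `&mut self` so that no query can observe a half-applied batch.
    pub fn apply_changes(&mut self, changes: impl IntoIterator<Item = FileChange>) {
        for change in changes {
            match change {
                FileChange::Added { path, revision }
                | FileChange::Changed { path, revision } => {
                    self.revisions.insert(path, revision);
                }
                FileChange::Removed { path } => {
                    self.revisions.remove(&path);
                }
            }
        }
    }

    pub fn revision(&self, path: &str) -> Option<u64> {
        self.revisions.get(path).copied()
    }
}
```

The host would batch up the events it pulled from the file system and hand them over in one call.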

@MichaReiser MichaReiser added the red-knot Multi-file analysis & type inference label Jun 8, 2024
@MichaReiser MichaReiser changed the title red-knot: First draft of the File input ingredient and a Vfs red-knot: File input ingredient and a Vfs Jun 8, 2024
permission,
})
}
VfsPath::Vendored(_) => todo!(),
Member Author

@MichaReiser MichaReiser Jun 8, 2024

@AlexWaygood what I have in mind here is that the implementation calls into the vendored crate to read the file metadata (vendored_path.last_modification_time(), vendored_path.exists(), and ruff_vendored::read_to_string(vendored_path)?).

Contributor

github-actions bot commented Jun 8, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser MichaReiser changed the title red-knot: File input ingredient and a Vfs red-knot: VfsFile input ingredient and a Vfs Jun 9, 2024
Comment on lines 22 to 23
/// Interns a path to a vendored file and returns a salsa `File` ingredient.
fn vendored_file(&self, path: &camino::Utf8Path) -> Option<VfsFile>;
Member Author

@AlexWaygood I think we ultimately want two different entry functions for regular files and vendored files, mainly because I think it's a bit annoying if you're working with a vendored path and you then need to convert it to a VfsPath just to get the VfsFile. Funnily, the first thing that the implementation would do is to dispatch on the path type.

@MichaReiser MichaReiser force-pushed the salsa-files-vfs branch 4 times, most recently from fa3bb86 to 95d1254 Compare June 10, 2024 06:30
Comment on lines +32 to +34
// TODO support untitled files for the LSP use case. Wrap a `str` and `String`
// The main question is how `as_std_path` would work for untitled files, that can only exist in the LSP case
// but there's no compile time guarantee that a [`OsFileSystem`] never gets an untitled file path.
Member Author

CC: @snowsignal I don't plan to support this as part of this PR, but it's something we need to figure out for red-knot/LSP.

Contributor

Thanks for pinging me on this!

@MichaReiser MichaReiser marked this pull request as ready for review June 10, 2024 06:32
@MichaReiser MichaReiser changed the base branch from main to set-minimal-rust-to-175 June 10, 2024 11:16
Base automatically changed from set-minimal-rust-to-175 to main June 10, 2024 12:39

codspeed-hq bot commented Jun 10, 2024

CodSpeed Performance Report

Merging #11802 will not alter performance

Comparing salsa-files-vfs (ce952b3) with salsa-files-vfs (a98db3d)

Summary

✅ 30 untouched benchmarks

@MichaReiser
Member Author

MichaReiser commented Jun 10, 2024

Okay, not having a db.file(VfsPath) method is going to be problematic. I realized this when implementing the module resolver where we have a path_to_module(db, VfsFilePath) function. I could match on the path and call the two different functions but that's rather annoying.

I have to figure out how we handle the signature difference between vendored paths and regular paths. I would very much like the possibility that vendored returns None if the file doesn't exist.

I plan to address this as a separate PR to reduce my rebasing work.

Edit: This is solved in #11826
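
For reference, the workaround mentioned in this comment (matching on the path and calling the two functions) would look roughly like the sketch below. It says nothing about how #11826 resolves the problem. The enum layout, the free functions `system_file` and `vendored_file`, and their placeholder bodies are assumptions for illustration.

```rust
/// Hypothetical two-variant path type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum VfsPath {
    FileSystem(String),
    Vendored(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VfsFile(u32);

/// Placeholder for the file system entry point: always yields a file.
fn system_file(path: &str) -> VfsFile {
    VfsFile(path.len() as u32)
}

/// Placeholder for the vendored entry point: pretends only `.pyi` stubs exist.
fn vendored_file(path: &str) -> Option<VfsFile> {
    path.ends_with(".pyi").then(|| VfsFile(path.len() as u32))
}

/// A caller that only has a `VfsPath` must dispatch by hand and widen the
/// return type to `Option`, even though the file system arm can never fail.
pub fn vfs_path_to_file(path: &VfsPath) -> Option<VfsFile> {
    match path {
        VfsPath::FileSystem(path) => Some(system_file(path)),
        VfsPath::Vendored(path) => vendored_file(path),
    }
}
```

The `Some(...)` wrapper in the first arm is the signature mismatch described above: every caller pays for the vendored case.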

@MichaReiser MichaReiser changed the title red-knot: VfsFile input ingredient and a Vfs red-knot: VfsFile input ingredient and a Vfs [salsa 1..] Jun 11, 2024
@MichaReiser MichaReiser changed the title red-knot: VfsFile input ingredient and a Vfs [salsa 1..] red-knot(salsa part 1): VfsFile input ingredient and a Vfs Jun 11, 2024
@MichaReiser MichaReiser changed the title red-knot(salsa part 1): VfsFile input ingredient and a Vfs red-knot[salsa part 1]: VfsFile input ingredient and a Vfs Jun 11, 2024
Contributor

@carljm carljm left a comment

Strong +1 on the decision to make this its own crate. I think in general making new crates for red-knot functionality should be preferred over squeezing it into existing crates; I think the latter will cause more confusion. In particular I don't think we should try to put red-knot's semantic model into the existing ruff_python_semantic crate.

Comment on lines 186 to 187
/// The unix permissions of the file. Only supported on unix systems. Always 0 on Windows
/// or when the file has been deleted.
Contributor

0 or None?

Should this be an Option<NonZeroU32>?

Member Author

It should be None.

The FS API uses a u32 as its return type. I'm not aware that 0 is ever a valid permission, but I want to avoid conversion errors, and using a NonZeroU32 doesn't give us much here (other than the size savings).
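
To put a number on the size savings: on common targets `Option<u32>` occupies 8 bytes, while `Option<NonZeroU32>` is guaranteed to occupy 4 because `None` reuses the zero bit pattern. The sketch below shows what the `NonZeroU32` variant would involve; the `FileMetadata` struct and its methods are hypothetical and not part of this PR.

```rust
use std::num::NonZeroU32;

/// Hypothetical metadata struct using the niche-optimized representation.
pub struct FileMetadata {
    permissions: Option<NonZeroU32>,
}

impl FileMetadata {
    /// Converts the raw mode reported by the file system API.
    /// A mode of 0 becomes `None`, so "mode 0" and "no permissions known"
    /// can no longer be told apart.
    pub fn from_mode(mode: u32) -> Self {
        Self {
            permissions: NonZeroU32::new(mode),
        }
    }

    /// Converts back to a plain `u32` for callers.
    pub fn permissions(&self) -> Option<u32> {
        self.permissions.map(NonZeroU32::get)
    }
}
```

The conversion in `from_mode` is the extra step that a plain `Option<u32>` avoids, which is the trade-off discussed in this thread.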

Comment on lines +20 to +31
/// ## Why do both the [`Vfs`] and [`FileSystem`](crate::FileSystem) trait exist?
///
/// It would have been an option to define [`FileSystem`](crate::FileSystem) in a way that all its operations accept
/// a [`VfsPath`]. This would have allowed unifying most of [`Vfs`] and [`FileSystem`](crate::FileSystem). The reason why they are
/// separate is that not all operations are supported for all [`VfsPath`]s:
///
/// * The only relevant operations for [`VendoredPath`]s are testing for existence and reading the content.
/// * The vendored file system is immutable and doesn't support writing, nor does it require watching for changes.
/// * There's no requirement to walk the vendored file system.
///
/// The other reason is that most operations know if they are working with vendored or file system paths.
/// Requiring them to convert the path to a `VfsPath` to test if the file exists is cumbersome.
Contributor

I do find this layering pretty confusing in trying to understand the code. I think partly it's just the naming: what we call FileSystem is already "virtual" in that it can represent on-disk, memory, etc. And then we have another "virtual file system" layer on top of that.

I'm also not sure of the accuracy of this statement:

The other reason is that most operations know if they are working with vendored or file system paths.

Maybe it will turn out to be the case? It's just not clear to me.

Member Author

I can say from using the API that it's working pretty well so far. But I admit that the terminology is confusing. Do you have any suggestions on how the naming could be improved, or what specifically do you find confusing?

I can try to improve the documentation. The way I think about FileSystem is less that it is a virtual file system and more that it is how we access the file system. What would you think of FileSystemDriver?

Contributor

I spent some more time looking at it and I think it's fine, if it seems to you to be working out well. I'm not sure that a rename of FileSystem to FileSystemDriver would make much difference; I think "Driver" ends up being kind of an empty filler word there that doesn't really communicate much.

It was hard for me to figure out the model for how these two things interact, since neither one encapsulates the other. It seems the model is that a Db has both a Vfs and a FileSystem, and it will use the FileSystem for looking up filesystem paths, and manage looking up vendored paths itself, though that hasn't been implemented yet, just stubbed.

Member Author

Maybe what's confusing here is the naming of Vfs? Maybe it should just be VfsFiles?

Anyway, I do think your concern is valid, and I went back and forth a couple of times between FileSystem handling all VfsPaths and FileSystem handling only specific paths. Maybe my reasoning that vendored paths are different enough to not pass them through FileSystem is unjustified, considering that calling write_file with a path pointing to a directory also fails. So extra care is needed anyway when handling paths that aren't checked at compile time. But I find it kind of nice if we can catch some of these errors at compile time.

The way I think about it is that FileSystem is a replacement for std::fs, that's it. It doesn't add support for using multiple file systems at once (vendored and the regular one), which is what Vfs does.

An entirely different design (not fully thought through) would be to ditch the FileSystem trait altogether. The main motivation for it is that we can support unsaved files and untitled files in the LSP use case. But we could support this in another way, by adding open_file and close_file functions to Vfs that allow manually overriding the content of a file.

The only functionality we would lose is that we can't "mock out" the file system during tests with an in-memory file system. But maybe that's not worth going through all that trouble. The WASM integration story would then require providing a "stub" FS implementation. It's not a big deal and it's something that e.g. emscripten provides out of the box, but it comes with a bit more boilerplate code than the option to just use a memory file system.
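
The alternative design is described as not fully thought through, so the following is only a guess at its shape. The map of open files, the `&str` paths, and the `read` method that falls back to `std::fs` are assumptions; only the `open_file` and `close_file` names come from the comment above.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical `Vfs` without a `FileSystem` trait: disk access is hard-wired
/// to `std::fs`, and editors override individual files.
#[derive(Default)]
pub struct Vfs {
    /// Content of files that are open in an editor (unsaved or untitled).
    open_files: HashMap<String, String>,
}

impl Vfs {
    /// Overrides the content of `path` until `close_file` is called.
    pub fn open_file(&mut self, path: &str, content: &str) {
        self.open_files.insert(path.to_string(), content.to_string());
    }

    /// Drops the override; later reads fall back to the disk.
    pub fn close_file(&mut self, path: &str) {
        self.open_files.remove(path);
    }

    /// Editor content takes precedence over the content on disk.
    pub fn read(&self, path: &str) -> io::Result<String> {
        match self.open_files.get(path) {
            Some(content) => Ok(content.clone()),
            None => std::fs::read_to_string(path),
        }
    }
}
```

Because the fallback goes straight to `std::fs`, tests could no longer swap in a memory file system, which is the loss described above.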

Either way, I don't consider the Vfs / FileSystem design that I proposed here as the final design. It requires some more iterating and I'm open to redesigning it completely.

  • We merge what we have now, with the risk that we might need to refactor some of the calling code
  • We strip out the FileSystem trait and revisit the design when implementing LSP support.

I'm leaning towards keeping what we have here. There's not a lot of downstream code depending on FileSystem, so a refactor should be fairly painless.

@MichaReiser
Member Author

Strong +1 on the decision to make this its own crate. I think in general making new crates for red-knot functionality should be preferred over squeezing it into existing crates; I think the latter will cause more confusion. In particular I don't think we should try to put red-knot's semantic model into the existing ruff_python_semantic crate.

I'm a bit surprised by this comment because that's what I proposed in Discord, and I didn't see any objection to doing this for ruff_python_semantic.

I can see how it can cause confusion. Mainly for imports when you get both Symbols (the ruff and red_knot one).

I don't think we can avoid mixing them long term. The red knot crates at least will have to depend on their corresponding ruff crates to use their functionality, and it then becomes less clear to me what the benefit of keeping them separate really is. I also think that it motivates us to integrate early and more often, compared to building this out entirely in different crates.

@carljm
Contributor

carljm commented Jun 12, 2024

I'm a bit surprised by this comment because that's what I proposed in discord and I didn't see any objection to doing this for ruff_python_semantic

Oh sorry! I think I do remember that Discord comment now; I guess I didn't think very carefully about it then.

I can see how it can cause confusion. Mainly for imports when you get both Symbols (the ruff and red_knot one).

I think it will just be generally hard to figure out how to structure things within one crate to make it clear what is red-knot and what is "v1". E.g. there is a top-level definition.rs in ruff_python_semantic, but red-knot will have its own version of Definitions; where do they go? Perhaps it will help me envision how this can work if I see the overall structure you have in mind within the crate to avoid just having a random assortment of sub-modules, some of which are v1 and some of which are red-knot, with no obvious way to tell which is which. If this structure is the outcome, I don't think that's a good outcome

I don't think we can avoid mixing them long term. The red knot crates at least will have to depend on their corresponding ruff crates to use their functionality, and it then becomes less clear to me what the benefit of keeping them separate really is. I also think that it motivates us to integrate early and more often, compared to building this out entirely in different crates.

I agree that red knot will probably have to depend on v1, but the dependency should not go in the other direction, and to me this is enough reason to keep them separate.

I'm not clear what sort of "integration" we actually envision between the v1 semantic model and the red-knot semantic model that this would encourage. Can you clarify what kind of integration you are thinking of? My understanding is that they will always remain separate, and rules will have to explicitly be ported in some way.

@MichaReiser
Member Author

I think it will just be generally hard to figure out how to structure things within one crate to make it clear what is red-knot and what is "v1". E.g. there is a top-level definition.rs in ruff_python_semantic, but red-knot will have its own version of Definitions; where do they go? Perhaps it will help me envision how this can work if I see the overall structure you have in mind within the crate to avoid just having a random assortment of sub-modules, some of which are v1 and some of which are red-knot, with no obvious way to tell which is which. If this structure is the outcome, I don't think that's a good outcome

I agree with this, and it's something I have started thinking about as well. I don't think what I've done in my follow-up PRs is ideal, and I want to iterate on it. For now, the red knot code is gated behind the red_knot feature. That means that, by default, the ruff_python_semantic crate compiles without any red knot functionality. This prevents any v1 code in ruff from accessing red_knot code.

I think what could be improved is that we move all red_knot code that isn't shared between v1 and red_knot into a red_knot module. Although it then quickly becomes unclear what should be in there. Should we move the db out as soon as a single v1 API uses any red knot code? What about code that doesn't exist in v1 at all, e.g. the module resolver?

Regardless, this would then be very close to having two separate crates. The difference I see is that moving files within a crate tends to be easier, and we can also already make use of the right crate visibilities (we don't need to make anything pub just so that we can access it from the red_knot crate).

I overall don't have a strong preference and I think I would just go with one approach for now. We don't need to get it right just now, it's easy to split the code out later.

I'm not clear what sort of "integration" we actually envision between the v1 semantic model and the red-knot semantic model that this would encourage. Can you clarify what kind of integration you are thinking of? My understanding is that they will always remain separate, and rules will have to explicitly be ported in some way.

I at least want to try to come up with a rule API that would work for both v1 and red_knot, at least for a majority of the rules. I would strongly prefer if we can avoid copying rules from ruff to the red_knot crate because that would come close to a rule-freeze for a couple of months. But this requires that ruff_linter has access to both red_knot and v1 code.

@MichaReiser
Member Author

MichaReiser commented Jun 12, 2024

I'll go ahead and merge this PR. This does not mean that we made a decision on how to proceed with FileSystem and Vfs or where we want to place the red_knot code in ruff_python_semantic. I just don't consider these two decisions as merge blocking and I would prefer to do this refactor as separate PRs on top of my entire stack anyway, because I don't enjoy suffering through rebases.

@MichaReiser MichaReiser changed the title red-knot[salsa part 1]: VfsFile input ingredient and a Vfs red-knot: VfsFile input ingredient and a Vfs Jun 12, 2024
@MichaReiser MichaReiser enabled auto-merge (squash) June 12, 2024 07:03
@MichaReiser MichaReiser merged commit 93973b9 into main Jun 12, 2024
19 checks passed
@MichaReiser MichaReiser deleted the salsa-files-vfs branch June 12, 2024 07:06
@carljm
Contributor

carljm commented Jun 12, 2024

I do see that visibilities could be a reason to share a crate, if there's a lot of use of "old" internals from red-knot code.

Ultimately I think sharing a crate could work fine, too, as long as we have a structure that makes it clear what is v1 and what is red-knot.

Member

@AlexWaygood AlexWaygood left a comment

Sorry for the slow review here. It took me a little bit of time to get my head round some of this, but it LGTM. Happy to open my own followup PR for some of my docs nits, unless there are any you particularly disagree with!


/// Path to a file or directory stored in [`FileSystem`].
///
/// The path is guaranteed to be valid UTF-8.
Member

Does this mean that we won't provide type-checking for Python scripts if their filename contains non-utf8 characters? Is this a limit Ruff already has when linting Python?

I think we can guarantee that vendored files are always going to be valid utf-8 but I'm not sure about non-vendored files

Member Author

Yes, it limits us to UTF-8 paths only, but Ruff already assumes UTF-8 today (we have so many path.to_str().unwrap() calls). Also, non-UTF-8 paths are extremely uncommon. I think all modern file systems support UTF-8 today (or at least, paths can be encoded to UTF-8).
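
As a small illustration of what the UTF-8 guarantee buys, the sketch below validates a path once at the boundary instead of calling `path.to_str().unwrap()` at every use site. It uses only the standard library; the `utf8_path` helper is hypothetical, and the PR itself builds on `camino::Utf8Path` rather than on a helper like this.

```rust
use std::path::Path;

/// Hypothetical boundary check: returns the path as `&str` if it is valid
/// UTF-8, or `None` so the caller can report or skip the file.
pub fn utf8_path(path: &Path) -> Option<&str> {
    path.to_str()
}

/// Downstream code can then work with `&str` directly, without unwrapping.
pub fn is_stub_file(path: &str) -> bool {
    path.ends_with(".pyi")
}
```

A caller would run the check once when a path enters the system and pass the validated string on from there.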

Labels
red-knot Multi-file analysis & type inference

4 participants