Ergonomics of String handling #26
Comments
One of the quotes from the survey:
|
This RFC extends the |
Is there anything like |
There was easy_strings. For rapid prototyping, I wonder if we should evaluate making "as easy as python" (e.g. internally clone or Arc everything, reducing borrows) versions of common vocabulary terms. Are there any others? |
I agree - but in this particular case your only option is to build it manually, as using
This looks awesome and I wasn't aware of it! There's also
IMO, if there was a single "easy by default" string which I could later tune/drop-in replace with std Strings I'd be all for it. Drop in replace is the hard part though, as all functionality must be replicated (traits, etc.) for the "easy don't care" string. |
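To make the "internally clone or Arc everything" idea concrete, here is a minimal sketch. `EasyStr` is a hypothetical name made up for this illustration; it is not the API of easy_strings or of any existing crate. It wraps `Arc<str>` so that clones are cheap and values can be passed around by value instead of by borrow.

```rust
use std::fmt;
use std::ops::{Add, Deref};
use std::sync::Arc;

/// Hypothetical "easy" string: cloning only bumps a reference count.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct EasyStr(Arc<str>);

impl From<&str> for EasyStr {
    fn from(s: &str) -> Self {
        EasyStr(Arc::from(s))
    }
}

impl From<String> for EasyStr {
    fn from(s: String) -> Self {
        EasyStr(Arc::from(s))
    }
}

/// Deref to `str` so the `&str` methods (len, contains, split, ...) are available.
impl Deref for EasyStr {
    type Target = str;
    fn deref(&self) -> &str {
        &self.0
    }
}

impl fmt::Display for EasyStr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

/// Concatenation takes both sides by value and allocates a new string.
impl Add for EasyStr {
    type Output = EasyStr;
    fn add(self, rhs: EasyStr) -> EasyStr {
        let mut out = String::with_capacity(self.len() + rhs.len());
        out.push_str(&self);
        out.push_str(&rhs);
        EasyStr::from(out)
    }
}
```

Even this small sketch needs five hand-written trait impls on top of the derives, which gives a feel for why the drop-in replacement is the hard part.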
Also, I've found |
There is some discussion of the hurdles of mixing Path/PathBuf with String in rust-lang/rust#49868 |
Along these lines, an "ergonomic" string type that supports multiplying against integers would be useful like in Python CC @Aaronepower |
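For comparison, std already covers the repetition itself through `str::repeat`; what Python adds is the operator syntax. A minimal sketch of that syntax follows. `Repeatable` is a hypothetical newtype introduced only for illustration, needed because `Mul<usize>` cannot be implemented for `String` from outside std.

```rust
use std::ops::Mul;

/// Hypothetical newtype; the orphan rule forbids implementing the foreign
/// trait `Mul<usize>` directly on the foreign type `String`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Repeatable(pub String);

impl Mul<usize> for Repeatable {
    type Output = Repeatable;
    fn mul(self, n: usize) -> Repeatable {
        // `str::repeat` is the existing std method doing the actual work.
        Repeatable(self.0.repeat(n))
    }
}
```

With this in place, `Repeatable("ab".to_string()) * 3` reads like the Python version, while plain `"ab".repeat(3)` works today without any wrapper.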
... I always assumed OsStr let you convert to/from bytes.
|
You can, on Unix only though: https://doc.rust-lang.org/std/os/unix/ffi/trait.OsStrExt.html On Windows, all you can get are the UTF-16 bytes (which is the raw representation): https://doc.rust-lang.org/std/os/windows/ffi/trait.OsStrExt.html On Unix, OsStrs are, AIUI, zero cost in that they represent the bytes from the platform as-is. On Windows, OsStrs are always transcoded to and from UTF-16 at the boundaries (with WTF-8 as the internal representation). |
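A minimal sketch of the two platform-specific extension traits linked above. The helper names `raw_units` and `from_raw_units` are hypothetical; only the `OsStrExt` and `OsStringExt` methods they call come from std.

```rust
use std::ffi::{OsStr, OsString};

/// Unix: `as_bytes` is a free borrow of the platform bytes; the copy here
/// exists only so that both platforms return an owned value.
#[cfg(unix)]
pub fn raw_units(s: &OsStr) -> Vec<u8> {
    use std::os::unix::ffi::OsStrExt;
    s.as_bytes().to_vec()
}

/// Windows: only UTF-16 code units are exposed, produced by transcoding.
#[cfg(windows)]
pub fn raw_units(s: &OsStr) -> Vec<u16> {
    use std::os::windows::ffi::OsStrExt;
    s.encode_wide().collect()
}

/// Unix: any byte sequence is accepted, valid UTF-8 or not.
#[cfg(unix)]
pub fn from_raw_units(units: Vec<u8>) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    OsString::from_vec(units)
}

/// Windows: any sequence of 16-bit units is accepted, valid UTF-16 or not.
#[cfg(windows)]
pub fn from_raw_units(units: Vec<u16>) -> OsString {
    use std::os::windows::ffi::OsStringExt;
    OsString::from_wide(&units)
}
```

Note that the unit type differs per platform (`u8` versus `u16`), so code calling these helpers still has to be written per platform.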
I've written a bit about this topic as it relates to byte strings: https://docs.rs/bstr/0.1.2/bstr/#file-paths-and-os-strings While it seems like we will eventually get string-like APIs on OsStr, that still won't be enough. Consider the perhaps somewhat common case of trying to match a file path against a regex. The regex machinery cannot know the internal representation of an OsStr, so your only real choice is to lossily convert it to UTF-8 on Windows and use the raw bytes on Unix. But the standard library doesn't make this particular use case easy and currently requires writing platform specific code. |
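The platform-specific code in question might look roughly like this. It is a sketch under assumptions: `path_bytes` and `contains_bytes` are hypothetical helpers, and the naive substring search merely stands in for a byte-oriented regex engine.

```rust
use std::borrow::Cow;
use std::path::Path;

/// Hypothetical helper: bytes suitable for handing to a byte-oriented matcher.
#[cfg(unix)]
pub fn path_bytes(path: &Path) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    // Raw platform bytes, borrowed as-is; nothing is lost.
    Cow::Borrowed(path.as_os_str().as_bytes())
}

#[cfg(not(unix))]
pub fn path_bytes(path: &Path) -> Cow<'_, [u8]> {
    // Lossy: anything that is not valid Unicode becomes U+FFFD.
    match path.to_string_lossy() {
        Cow::Borrowed(s) => Cow::Borrowed(s.as_bytes()),
        Cow::Owned(s) => Cow::Owned(s.into_bytes()),
    }
}

/// Naive byte substring search standing in for a regex engine.
pub fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` would panic, so the empty needle is handled first.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```

Returning `Cow` keeps the Unix path allocation-free while still allowing the owned, transcoded result elsewhere.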
I also spent a little bit of time looking at how other ecosystems handle this. For example, in the Go world, it will lossily convert Windows file paths to UTF-8 at the very lowest levels, so it's impossible to roundtrip file paths on Windows that contain invalid UTF-16 in Go. Nevertheless, despite searching for it, I could find no practical reports of this being a problem. I'm not sure what, if any, conclusions we can draw from that. |
So I fully admit, this is not an area I've looked into. When I see "lossy" wrt unicode, I expect tofu to be inserted. This makes me concerned that a string search might not match when it should and if I want to add a suffix to a non-UTF-8 file, it'll now look like garbage to the user. Is this accurate? If so, then that is why I'd be interested in (1) ... somehow. If it's not and only non-visible bytes are dropped, then 2/3 sound reasonable but I feel clarifying the behavior could be helpful for people concerned like me. |
It's the Unicode replacement codepoint,
In cases like these, it's generally helpful to construct an example. The case where
Notably, this is Windows only. Unix works fine. I'll also note that ripgrep has this bug, and I've never gotten a bug report. (Presumably, ripgrep has a substantial user base on Windows, since it ships with VS Code.) Of course, absence of evidence is not evidence of absence, but I'm an engineer, not a theoretician. :-)
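To illustrate the replacement behavior, here is a small sketch. It is not the example elided from the comment above; it uses `String::from_utf16_lossy` as a platform-independent stand-in for what lossy conversion does to a Windows path, and the names `lossy` and `sample_names` are hypothetical.

```rust
/// Decode UTF-16 code units lossily: each unpaired surrogate becomes U+FFFD.
pub fn lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Two hypothetical file names that differ only in an unpaired surrogate.
pub fn sample_names() -> (Vec<u16>, Vec<u16>) {
    let mut a: Vec<u16> = "foo".encode_utf16().collect();
    let mut b = a.clone();
    a.push(0xD800); // lone high surrogate: invalid UTF-16
    b.push(0xDC00); // lone low surrogate: also invalid, but a different name
    (a, b)
}
```

Both names decode to the same string, `foo` followed by U+FFFD, so the original code units cannot be recovered from the lossy form and a search for the original unit has nothing to match against.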
Did you see this part of the docs?
Could you say more about what is confusing here?
(1) would solve the failure case I described above with respect to substring/regex/glob search, presuming your regex/glob is constructed in such a way to match arbitrary bytes. It's a bit tenuous, but I don't think you can do any better for this specific case, other than perhaps dealing with the However, at least for byte strings, I don't think this solves the roundtrip problem elegantly. The problem is that byte strings are arbitrary bytes, so the conversion from With all that said, roundtripping invalid UTF-16 file paths on Windows is a precarious proposition. Consider, for example, a program that has the complex job of merely printing file paths as part of its output. If you are in the unenviable position of needing to deal with invalid UTF-16 file paths, then it's quite possible that you can't even print them correctly as output, because Windows consoles generally barf on anything that isn't valid UTF-16. That's why Rust's standard library will return an error if you attempt to write invalid UTF-8 to stdout. So in Windows, "roundtripping" a file path is really limited to "have a file path, change it in some way, and then use it in file system APIs."
I don't think "non-visible" is the correct characterization here. The bytes that are dropped are only meaningful with respect to a specific encoding, whereas "visibility" is really a property of a character itself. TL;DR - The byte string approach is basically arguing to not handle the case of invalid UTF-16 Windows paths by either lossily transcoding them (in which roundtripping can subtly fail, but searching generally works, modulo the corner case mentioned above) or by returning an error (that is surfaced to the user). Lossy transcoding is basically the acceptance that these file paths are rare, and that an errant substring search is likely even rarer. Returning an error is basically telling the end user to fix their file paths. |
So first, before you pointed it out, I did not notice the large section you wrote in bstr's docs on the topic. I was mostly going off what little I've noticed in the stdlib. And yes, it is important to consider the application and (1) the likelihood of running into problems and (2) the right solution for it. The concerns in my post were written from remembering the concerns but not fully remembering the application. I've had more time to think on it and my biggest concern is writing my own path library. I have two goals I'm oscillating between (1) cleanly abstracting the best path-related crates like So from this perspective
|
From the survey, string ergonomics are huge. Many, many CLI applications deal heavily with strings. In Rust, strings can be...difficult.
Granted, (IMO) Rust handles them correctly, but sometimes the correctness doesn't actually matter for a given problem domain and just adds unnecessary gyration.
For example, we have:
(&)('static) str
Cow
(&)String
(&)OsStr
(&)OsString
Understandably, it can be overwhelming. My personal opinion is we should first tackle/discuss the ergonomics of using OsStr(ing), as it's heavily used on Linux (where paths may not contain valid UTF-8). IMO OsStr should have the same user experience as &str/String.
We could also probably start by either listing known issues/inconsistencies or any current issue links/RFCs on the matter.
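As a starting point for such a list, here is a rough sketch of the gap between the two experiences, using only std. The function names are hypothetical and chosen for illustration; the observation that `OsStr` lacks a direct `ends_with` reflects stable std at the time of writing.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};
use std::path::Path;

/// With `&str`, a suffix check is one method call.
pub fn is_backup_str(name: &str) -> bool {
    name.ends_with(".bak")
}

/// `OsStr` has no `ends_with`; this goes through `Path::extension`, which
/// covers only this special case and is not an exact equivalent (a file
/// named just ".bak" has no extension).
pub fn is_backup_os(name: &OsStr) -> bool {
    Path::new(name).extension() == Some(OsStr::new("bak"))
}

/// Strict conversion: `None` when the name is not valid Unicode.
pub fn as_text(name: &OsStr) -> Option<&str> {
    name.to_str()
}

/// Lossy conversion: always succeeds, but may insert U+FFFD.
pub fn as_text_lossy(name: &OsStr) -> Cow<'_, str> {
    name.to_string_lossy()
}

/// Appending works without conversion, via `OsString::push`.
pub fn with_suffix(name: &OsStr, suffix: &str) -> OsString {
    let mut out = name.to_os_string();
    out.push(suffix);
    out
}
```

Every step that is a single infallible call on `&str` becomes either a detour through `Path`, an `Option`, or a `Cow` on `OsStr`.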