-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finalize name and location of UnvalidatedX #3546
Comments
I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger. |
Thanks for pointing this out, TIL 🤢 |
I would have also been opposed to the Schrödinger name on the grounds that it's too clever and requires thinking around three corners or knowing what the crate does already. I'd rather use something straightforward like |
Has removing the layer of abstraction and using the raw bytes underneath directly already been discussed? AIUI, we use UnvalidatedX in cases where we just need something comparable to use as a key in a map, or where the validation cost only needs to be paid conditionally - map.get("my_key".to_raw()) instead of .to_unvalidated() seems fine for the former, and the latter could follow Rust APIs like https://doc.rust-lang.org/stable/std/str/fn.from_utf8.html# ? |
Previous discussion and background why we want the extra layer: #2489 |
Two main reasons for the abstraction:
|
The current semantics of UnvalidatedStr are:
This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII. |
A name I've been using on my branch is |
|
I agree with @robertbastian that the "Or" is too inclusive here. But that does make me think, Otherwise maybe some non-negated antonyms of "trusted": |
Haha I love |
My problem with these alternatives is that (imo) they either require more thinking than |
I kinda like the Unvalidated prefix and don't feel bad about the negation. Could be DeferredStr instead ( I do like the zerovec crate being nice and clean, but I do admit that I'm also not too happy about proliferating more util crates. No strong opinion there. |
I agree with @Manishearth about the naming, |
This was a progression from my suggestion that it may be useful to rethink the semantics:
In practice, there's very little difference between the functionality of UnvalidatedStr and BytesOrStr except for some edge cases around serialization. So why not just embrace the other model that is more flexible. |
The |
Not opposed. |
I still like SchrödingerStr modulo the issues with the namesake historical figure. Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ? |
Should we consider a name like PotentiallyIllFormedUtf8? Is there a semantic difference? |
I really really dislike all the "clever" names. |
In this case I can't immediately think of anything and |
I am not very happy with that proposal because:
|
Bikeshed, with some help from gemini:
If we don't like any of those, how about choosing a letter of the alphabet, like
Most developers don't need to deeply understand why this type exists. They will most often see it as the key of a map. |
I'm fine with I don't really think any of the other names work well here. I think the un is warranted here. |
When sitting down to vote and take a closer look at the options, I didn't like the choice of word / prefix in those options to indicate the uncertainty of the well-formedness of the UTF-{8, 16} encoding adhered to by the string. Comments per option:
I want to offer a few more options to try to capture different angles. For example, when thinking about the type of element that we have a sequence of, we know more about it than
|
@sffc asked me to comment on the naming here: I think "potentially ill-formed" is the most correct term, but it makes for very long identifiers, so as a matter of type naming, I prefer
I think we shouldn't use
|
What I hear from the discussions here is that everyone is in general agreement that If I may try to summarize the pros and cons:
I'm still searching for the best compromise with the fewest cons. Doing this writeup makes me think that Can anyone verbalize any more downsides to the Also, if I failed to note any pros or cons of the options, please reply to this issue. |
I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something. Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like. |
The proposed name is Do you feel both |
Maybe is stronger (EDIT @sffc: PotentialUtf8 seems ok to me |
I added this to the "pros" of unvalidated. |
As stated above in #3546 (comment), I'd like to propose the following as a compromise. Crate Name: Primary types:
Potential (🤣) expansion types to be added if needed:
Although this was not anyone's first choice, it is the only option for which a fundamental flaw was not verbalized in this thread (except for perhaps quasi, whose objections I would characterize as more a distaste for the less-precise / clever naming convention). To read up on the flaws in every other option, see my post above. Please check the box if you have no objection to the proposal. |
I still prefer |
Maybe has a strong connotation of being an |
Ok, we'll schedule a follow-up meeting. Alternatively, we could cover this in tomorrow's ICU4X-WG call assuming all the attendees are able to make it (@hsivonen @sffc @Manishearth @echeran). |
I think there are basically two options:
|
How will this relate to https://github.com/BurntSushi/bstr ? |
@echeran is your comment a strong objection to I'm open to trying bstr instead. However it does have a non optional dependency on memchr. |
My take:
|
Yes, strong agree. I'd like us to be talking about pros and cons, not comparing at this stage, because a comparison is not in and of itself a blocking argument.
You address this later, but to explicitly talk about this: We already use DiplomatStr over FFI, and that translates to something natively meaningful on the other side. Which means it's unlikely this abstraction will ever "escape" over FFI. |
After thinking about this more, I'm OK with this. We have IDE autocomplete, etc., to deal with the identifier length. After a look at the |
Summary of discussion with @Manishearth @echeran @sffc:
|
Just to post this somewhere: I think it would not be completely unreasonable or inconsistent with Rust style to introduce the following type pub struct MaybeUtf8(pub [u8]);
impl MaybeUtf8 {
pub unsafe fn assume_utf8(&self) -> &str { ... }
pub fn try_to_utf8(&self) -> Result<&str, Utf8Error> { ... }
} The parallelism of |
I think my main gripe with that is still the usage pattern, where the point of this type is actually that you can often use it without ever having to deal with validating or assuming UTF8, it's quite useful without those two. Because of that it feels very different from the stdlib unsafe helpers, which are more enumlike and the usage pattern is extremely stateful. |
|
Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.
There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.
I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.
How about
schrodinger::SchrödingerStr
? (also re-exported without the diacritic)Discuss with:
Optional:
The text was updated successfully, but these errors were encountered: