Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case insensitivity #73

Open
vmx opened this issue Jun 8, 2020 · 15 comments
Open

Case insensitivity #73

vmx opened this issue Jun 8, 2020 · 15 comments

Comments

@vmx
Copy link
Member

vmx commented Jun 8, 2020

Currently we have Base encodings which are currently spec'ed as case-insensitive (Base32 and Base36). I can see that this is convenient from an application perspective. But wouldn't it make sense to be strict at the lowest level (the level of this spec) and reject wrongly cased encodings?

The convenient normalization can always happen at an application level, it would also always have enough information to do so.

Hence I'm in favour in being more strict and making things case sensitive.

PS: Thanks @hugomrdias for bringing this up.

@ribasushi
Copy link
Contributor

I would caution against strictness here. In e.g. DNS there are things like the 0x20 encoding. Yes it is a very old draft, but it is being guaranteed to work in many reference-implementations, and there are cases in the wild where not supporting case sensitivity on the resolution-boundary bleeds into the app-logic, for instance https://forum.mikrotik.com/viewtopic.php?t=128082

While we should (and already are) emit consistent casing at all times, I think we should continue accepting any-case strings in any context where DNS or other case-insensitive-by-the-book systems are being interfaced.

@vmx
Copy link
Member Author

vmx commented Jun 8, 2020

I think we should continue accepting any-case strings in any context where DNS or other case-insensitive-by-the-book systems are being interfaced.

Yes, but wouldn't this be a level higher than the multibase implementations?

@ribasushi
Copy link
Contributor

wouldn't this be a level higher

Technically: yes. But because everything is terrible I am worried that there are other cross-stack parts operating over the DNS boundary that are case sensitive, and more importantly are outside of our control.

Say you could decree: any time we get a b32/36 encoded string from inside a HTTP header, be it a redirect or something cookie related, we invariably lowercase the string first. Then someone comes around and tries to do something cute like the 0x20 encoding I mentioned earlier: we can no longer provide that, because our stack explicitly lc()ed everything across the board.

Does it matter whether the above scenario is left open? Probably not... it is rather implausible

But what is the benefit of not decoding "whatever comes" ? If anything the extra lc() operation introduced everywhere is an extra performance hit.

TLDR: what is the motivation to be lc-strict?

@vmx
Copy link
Member Author

vmx commented Jun 8, 2020

If anything the extra lc() operation introduced everywhere is an extra performance hit.

You could argue that it's a performance hit doing it on the lowest level, even if it might not be needed.

Though this is not about the performance implications.

TLDR: what is the motivation to be lc-strict?

This is based on experience that generally spoken, failing early and being strict reduces hard to find bugs.

@hugomrdias
Copy link
Member

i would love to be strict a follow the spec without blacklist/whitelist's

@ribasushi
Copy link
Contributor

ribasushi commented Jun 8, 2020

Reframing this issue here is how I read the proposal:

  1. Mandate that all inputs for any base <=36 encoding are lower-cased in all surface APIs ( i.e. any string starting with F, V, T, B, C and K is lowercased first and only then fed to the multibase parser )
  2. Mandate that all outputs for any base <=36 encoding are emitted lowercase
    3a. Remove support for the K encoding, before it had a chance to propagate
    3b. Remove support for the F, V, T, B, and C encodings when it is safe to do so

This is definitely doable, but it needs a justification. I am not saying there isn't one, my objection is that this justification hasn't been conveyed clearly yet.

@ribasushi
Copy link
Contributor

I updated the previous comment, cc-ing @lidel in case he hasn't seen this issue yet

@vmx
Copy link
Member Author

vmx commented Jun 8, 2020

No the proposal is to have a strict parser.

I don't think the uppercase versions need to be removed. I'm asking for having them case sensitive, i.e. a Multibase implementation would accept Base encoded string with prefix F, V, T, B, C and Konly if all the letters are uppercased. If you want to have things case-insensitive, you should solve that on the application layer.

@ribasushi
Copy link
Contributor

Hrm... I find it logically inconsistent that we effectively provide support for case insensitive encodings, but only accept them if the casing is the same in the entire string. We should:

  • EITHER reduce complexity by standardizing everything on lowercase
  • OR support mixed case with case-insensitive decoders ( note - in order to decode a mixed-case alphaet one DOES NOT need to use lc() beforehand: instead a proper decode-table should be constructed once and used throughout )

The "same case throughout" feels like strictness-for-strictness-sake. Could we bullet-point a few benefits?

@hugomrdias
Copy link
Member

We support several variations for each base, supporting mix case for all feels wrong and adds complexity. Plus i personally don't see the value besides been nice for hand written typos but maybe im missing something here.

@vmx
Copy link
Member Author

vmx commented Jun 8, 2020

To me it sounded strange to have a uppercase and a lowercase variant, but still being case-insensitive. I was only thinkingabout decoding step (from Base encoded to the original data).

But if I take the encoding (data to Base encoded version) into account, it makes sense. You might want to encode something to specifically uppercase or lowercase. So that would speak for having both variants.

The "same case throughout" feels like strictness-for-strictness-sake. Could we bullet-point a few benefits?

  • straight forward reasoning: the implementation takes one exact alphabet for encoding/decoding, no conversion is happening implicitly
  • less to implement => less bugs
  • roundtripable: decoding a Base encoded string, end encoding it with the same Multibase leads to the same result.

Update: Added bullet point about roundtrips

@ribasushi
Copy link
Contributor

It seems we need to figure out why @whyrusleeping originally added case-insensitive support for what we now refer to as "4648-derived encodings". In fact it wasn't until 2d108367e5e3d30 that @Stebalien separated them into distinct entities.

@ribasushi
Copy link
Contributor

To rephrase: originally the multibase table had noting to do with rfc4648, aside from coincidentally using the same alphabet (which notably is not ideal for CIDs, as it interferes with sorting, and thus indexing).

We seem to be converging to "let's be 4648-strict", whereas the original design rationale appears lost. This is my main objection: insufficient data for meaningful answer :)

@mikeal
Copy link

mikeal commented Jun 8, 2020

From a user perspective, matching the loose (case insensitive) rules found elsewhere would allow people to easily take existing base encoded strings, pre-pend a character for the multibase identifier, and be fine.

All of our own code works with strings that are entirely produced and consumed by multibase, but a lot of people on existing systems will have strings that are already encoded using whatever the dominant rules are in their language/library of choice.

The choice here is really about how much work we’re asking people to do in order to migrate to multibase from existing base encodings. Can they prepend a string or do we need them to do a full decode and then encode again with a proper multibase library.

@ribasushi
Copy link
Contributor

ribasushi commented Jun 9, 2020

Having slept on this, I am still very much 👎 on removing the requirement for a compliant implementation to handle mixed-case decoding where the alphabet permits doing so.

That said, I won't lose sleep if the specs change goes through, and will look into implementing the (non-trivial) extra code in go-b36 to return errors on mixcase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants