Case insensitivity #73

vmx · 2020-06-08T12:26:53Z

Currently we have Base encodings which are currently spec'ed as case-insensitive (Base32 and Base36). I can see that this is convenient from an application perspective. But wouldn't it make sense to be strict at the lowest level (the level of this spec) and reject wrongly cased encodings?

The convenient normalization can always happen at an application level, it would also always have enough information to do so.

Hence I'm in favour in being more strict and making things case sensitive.

PS: Thanks @hugomrdias for bringing this up.

ribasushi · 2020-06-08T13:01:23Z

I would caution against strictness here. In e.g. DNS there are things like the 0x20 encoding. Yes it is a very old draft, but it is being guaranteed to work in many reference-implementations, and there are cases in the wild where not supporting case sensitivity on the resolution-boundary bleeds into the app-logic, for instance https://forum.mikrotik.com/viewtopic.php?t=128082

While we should (and already are) emit consistent casing at all times, I think we should continue accepting any-case strings in any context where DNS or other case-insensitive-by-the-book systems are being interfaced.

vmx · 2020-06-08T13:10:06Z

I think we should continue accepting any-case strings in any context where DNS or other case-insensitive-by-the-book systems are being interfaced.

Yes, but wouldn't this be a level higher than the multibase implementations?

ribasushi · 2020-06-08T13:17:20Z

wouldn't this be a level higher

Technically: yes. But because everything is terrible I am worried that there are other cross-stack parts operating over the DNS boundary that are case sensitive, and more importantly are outside of our control.

Say you could decree: any time we get a b32/36 encoded string from inside a HTTP header, be it a redirect or something cookie related, we invariably lowercase the string first. Then someone comes around and tries to do something cute like the 0x20 encoding I mentioned earlier: we can no longer provide that, because our stack explicitly lc()ed everything across the board.

Does it matter whether the above scenario is left open? Probably not... it is rather implausible

But what is the benefit of not decoding "whatever comes" ? If anything the extra lc() operation introduced everywhere is an extra performance hit.

TLDR: what is the motivation to be lc-strict?

vmx · 2020-06-08T13:38:03Z

If anything the extra lc() operation introduced everywhere is an extra performance hit.

You could argue that it's a performance hit doing it on the lowest level, even if it might not be needed.

Though this is not about the performance implications.

TLDR: what is the motivation to be lc-strict?

This is based on experience that generally spoken, failing early and being strict reduces hard to find bugs.

hugomrdias · 2020-06-08T14:18:14Z

i would love to be strict a follow the spec without blacklist/whitelist's

ribasushi · 2020-06-08T14:23:51Z

Reframing this issue here is how I read the proposal:

Mandate that all inputs for any base <=36 encoding are lower-cased in all surface APIs ( i.e. any string starting with F, V, T, B, C and K is lowercased first and only then fed to the multibase parser )
Mandate that all outputs for any base <=36 encoding are emitted lowercase
3a. Remove support for the K encoding, before it had a chance to propagate
3b. Remove support for the F, V, T, B, and C encodings when it is safe to do so

This is definitely doable, but it needs a justification. I am not saying there isn't one, my objection is that this justification hasn't been conveyed clearly yet.

ribasushi · 2020-06-08T14:28:15Z

I updated the previous comment, cc-ing @lidel in case he hasn't seen this issue yet

vmx · 2020-06-08T14:41:56Z

No the proposal is to have a strict parser.

I don't think the uppercase versions need to be removed. I'm asking for having them case sensitive, i.e. a Multibase implementation would accept Base encoded string with prefix F, V, T, B, C and Konly if all the letters are uppercased. If you want to have things case-insensitive, you should solve that on the application layer.

ribasushi · 2020-06-08T14:50:51Z

Hrm... I find it logically inconsistent that we effectively provide support for case insensitive encodings, but only accept them if the casing is the same in the entire string. We should:

EITHER reduce complexity by standardizing everything on lowercase
OR support mixed case with case-insensitive decoders ( note - in order to decode a mixed-case alphaet one DOES NOT need to use lc() beforehand: instead a proper decode-table should be constructed once and used throughout )

The "same case throughout" feels like strictness-for-strictness-sake. Could we bullet-point a few benefits?

hugomrdias · 2020-06-08T15:00:24Z

We support several variations for each base, supporting mix case for all feels wrong and adds complexity. Plus i personally don't see the value besides been nice for hand written typos but maybe im missing something here.

vmx · 2020-06-08T15:04:12Z

To me it sounded strange to have a uppercase and a lowercase variant, but still being case-insensitive. I was only thinkingabout decoding step (from Base encoded to the original data).

But if I take the encoding (data to Base encoded version) into account, it makes sense. You might want to encode something to specifically uppercase or lowercase. So that would speak for having both variants.

The "same case throughout" feels like strictness-for-strictness-sake. Could we bullet-point a few benefits?

straight forward reasoning: the implementation takes one exact alphabet for encoding/decoding, no conversion is happening implicitly
less to implement => less bugs
roundtripable: decoding a Base encoded string, end encoding it with the same Multibase leads to the same result.

Update: Added bullet point about roundtrips

ribasushi · 2020-06-08T15:10:48Z

It seems we need to figure out why @whyrusleeping originally added case-insensitive support for what we now refer to as "4648-derived encodings". In fact it wasn't until 2d108367e5e3d30 that @Stebalien separated them into distinct entities.

ribasushi · 2020-06-08T15:13:06Z

To rephrase: originally the multibase table had noting to do with rfc4648, aside from coincidentally using the same alphabet (which notably is not ideal for CIDs, as it interferes with sorting, and thus indexing).

We seem to be converging to "let's be 4648-strict", whereas the original design rationale appears lost. This is my main objection: insufficient data for meaningful answer :)

mikeal · 2020-06-08T15:25:28Z

From a user perspective, matching the loose (case insensitive) rules found elsewhere would allow people to easily take existing base encoded strings, pre-pend a character for the multibase identifier, and be fine.

All of our own code works with strings that are entirely produced and consumed by multibase, but a lot of people on existing systems will have strings that are already encoded using whatever the dominant rules are in their language/library of choice.

The choice here is really about how much work we’re asking people to do in order to migrate to multibase from existing base encodings. Can they prepend a string or do we need them to do a full decode and then encode again with a proper multibase library.

ribasushi · 2020-06-09T11:03:30Z

Having slept on this, I am still very much 👎 on removing the requirement for a compliant implementation to handle mixed-case decoding where the alphabet permits doing so.

That said, I won't lose sleep if the specs change goes through, and will look into implementing the (non-trivial) extra code in go-b36 to return errors on mixcase.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Case insensitivity #73

Case insensitivity #73

vmx commented Jun 8, 2020

ribasushi commented Jun 8, 2020

vmx commented Jun 8, 2020

ribasushi commented Jun 8, 2020

vmx commented Jun 8, 2020

hugomrdias commented Jun 8, 2020

ribasushi commented Jun 8, 2020 •

edited

Loading

ribasushi commented Jun 8, 2020

vmx commented Jun 8, 2020

ribasushi commented Jun 8, 2020

hugomrdias commented Jun 8, 2020

vmx commented Jun 8, 2020 •

edited

Loading

ribasushi commented Jun 8, 2020

ribasushi commented Jun 8, 2020

mikeal commented Jun 8, 2020

ribasushi commented Jun 9, 2020 •

edited

Loading

Case insensitivity #73

Case insensitivity #73

Comments

vmx commented Jun 8, 2020

ribasushi commented Jun 8, 2020

vmx commented Jun 8, 2020

ribasushi commented Jun 8, 2020

vmx commented Jun 8, 2020

hugomrdias commented Jun 8, 2020

ribasushi commented Jun 8, 2020 • edited Loading

ribasushi commented Jun 8, 2020

vmx commented Jun 8, 2020

ribasushi commented Jun 8, 2020

hugomrdias commented Jun 8, 2020

vmx commented Jun 8, 2020 • edited Loading

ribasushi commented Jun 8, 2020

ribasushi commented Jun 8, 2020

mikeal commented Jun 8, 2020

ribasushi commented Jun 9, 2020 • edited Loading

ribasushi commented Jun 8, 2020 •

edited

Loading

vmx commented Jun 8, 2020 •

edited

Loading

ribasushi commented Jun 9, 2020 •

edited

Loading