Description
Before 6.2, UriComponentsBuilder parsed URIs with regular expressions. Generally, these split on the main component delimiters, ":", "/", "?", and "#", but did not enforce the allowed character set for each component. The resulting UriComponents could then encode any non-conforming characters.
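To illustrate what delimiter-only splitting looks like, here is a minimal sketch that uses the generic regular expression from RFC 3986, Appendix B. It is not the pattern UriComponentsBuilder actually used; the class and method names are hypothetical and exist only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterSplitSketch {

    // Generic pattern from RFC 3986, Appendix B (an assumption for illustration,
    // not Spring's own pattern). It splits on ":", "/", "?", and "#" only and
    // places no restriction on the characters inside each component.
    private static final Pattern URI_PATTERN = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /** Returns {scheme, authority, path, query, fragment}; absent components are null. */
    static String[] split(String uri) {
        Matcher m = URI_PATTERN.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a URI reference: " + uri);
        }
        return new String[] {m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)};
    }

    public static void main(String[] args) {
        // Spaces and curly braces pass through untouched, since only the
        // delimiters are inspected.
        String[] parts = split("https://example.com/hotels/{id}/my list?q=a b#top");
        String[] names = {"scheme", "authority", "path", "query", "fragment"};
        for (int i = 0; i < parts.length; i++) {
            System.out.println(names[i] + " = " + parts[i]);
        }
    }
}
```

Because nothing inside a component is validated, input with spaces or URI variable placeholders is accepted as-is, and any encoding has to happen in a later step.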
Regular expressions are convenient, but provide limited control and visibility. This is why in #32513 we added an implementation of the URL parsing algorithm from the WHATWG URL Living Standard, which browsers use to align on how to handle a wide range of cases leniently. While this provides more robust parsing than before, arguably on a server we can expect URLs that don't deviate from the RFC quite as far as what browsers need to be able to handle.
We can add a new parser that follows RFC syntax along the lines of the java.net.URI or Jetty's HttpURI parsers. The new parser should respect the main component delimiters, but otherwise leave some room for leniency within each component to allow some characters like spaces or curly braces (URI variables), similar to what the regular expressions did. UriComponents can then encode any non-conforming characters that remain after URI variables are expanded.
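For context on why that leniency matters, the following sketch shows how the JDK's java.net.URI treats the same kind of input. It only exercises java.net.URI and says nothing about how the proposed parser would be implemented; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StrictUriSketch {

    /** True if the single-argument java.net.URI constructor accepts the input as-is. */
    static boolean strictlyParses(String input) {
        try {
            new URI(input);
            return true;
        } catch (URISyntaxException ex) {
            return false;
        }
    }

    /** Builds a URI from separate components; this constructor quotes illegal characters. */
    static String encodeFromComponents(String scheme, String host, String path)
            throws URISyntaxException {
        return new URI(scheme, host, path, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // A conforming URI parses; a space or an unexpanded URI variable does not.
        System.out.println(strictlyParses("http://example.com/hotels/42"));
        System.out.println(strictlyParses("http://example.com/hotels/{id}"));
        System.out.println(strictlyParses("http://example.com/my list"));

        // Once the components are known, encoding can be applied per component.
        System.out.println(encodeFromComponents("http", "example.com", "/hotels/{id}/my list"));
    }
}
```

A parser that rejected braces and spaces outright, as the single-argument constructor does, could not accept URI templates, which is why the proposal keeps the strictness at the delimiter level and defers encoding to UriComponents.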
It should be possible to choose which parser to use: the RFC parser by default, or the WHATWG parser when more leniency or alignment with browsers is needed.
The topic of RFC vs WHATWG parsing was first brought up by @joakime in #33542. For broader context, and a possible future effort to standardize lenient parsing of user-provided URLs, see https://lists.w3.org/Archives/Public/ietf-http-wg/2024JulSep/0281.html.