Published IRL for URL parsing #4
Comments
@tomchristie: If the
Which of those two are you thinking of here?
@tomchristie: Could we name this project I absolutely have some thoughts on how we might want the API to look throughout, having worked on similar API surface areas in
@njsmith: There are a ton of URL packages already though: rfc3986, hyperlink (extracted from twisted), yarl (extracted from aiohttp), and that's just the modern maintained ones I know off the top of my head, there are at least half a dozen older ones. Is there a 2-second summary of what unique niche
@tomchristie: (An alternative would be to have
@tomchristie:
@njsmith: I've found I guess the crucial spec for an HTTP client is https://url.spec.whatwg.org/ Oh heh, and apparently @sethmlarson already implemented it in Python? https://pypi.org/project/whatwg-url/
Okay that's a blocker from my POV, yes. @sethmlarson:
Done
To be wrapped by http3 and whatever client library uses it, just as we previously did with rfc3986.
The URL library that knows exactly which sections are capable of taking anything and letting the user do anything but still remaining secure and RFC 3986 compliant. Also securely handles a bunch of stuff that's totally not RFC 3986 compliant but still happens in the wild. I'm currently writing a reply for a feature-set comparison but it'll take some time so I'll let you have these replies first. :)
Gotcha! Well that keeps things simple. There’s a related design question for http3, here... encode/httpx#113 @sethmlarson:
Plus I will say that because it's designed specifically for HTTP clients there are a few helpers that are non-trivial to implement yourself. There are another two I'm planning on adding for @sethmlarson:
I'll mention here explicitly that I have no problem at all changing anything about the interface. I basically worked in a frenzy over an hour or so last night so I'm not attached to anything yet. ;)
I'm not sure I find the SSRF argument convincing... it's definitely true that if you use library A to validate a URL against SSRF, and then library B to actually fetch the URL, and libraries A and B disagree on some parsing details, then that might create holes in your SSRF protection. But:
I'm surprised that WHATWG-URLs can't be schemeless... obviously every browser understands schemeless URLs! Do they expect user-agents to do some extra non-standardized normalization or what?
Yeah, the parser-difference SSRF argument is one that has been around for a while, and it was our primary motivator for making urllib3's URL parser compliant with RFC 3986. Here's the presentation, in case anyone hasn't seen it: https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf
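To make the parser-difference concern concrete, here is a minimal sketch of one common mitigation: parse the URL once, validate the parsed host, and hand the fetcher a URL rebuilt from those same parsed parts. It uses only the standard library's `urllib.parse` (not irl, rfc3986, or urllib3's parser); the `checked_url` helper and the `ALLOWED_HOSTS` allowlist are hypothetical names made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical allowlist, for illustration only.
ALLOWED_HOSTS = {"api.example.com"}


def checked_url(url):
    """Validate a URL and return it rebuilt from the parsed components.

    The host that is validated comes from the same parse that produces
    the returned URL, so validation and output cannot disagree here.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # .hostname drops userinfo and port and lowercases the host.
    if parts.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {parts.hostname!r}")
    return urlunsplit(parts)


print(checked_url("https://api.example.com/v1?x=1"))

try:
    # The real host is evil.example; "api.example.com" is only userinfo.
    checked_url("https://api.example.com@evil.example/")
except ValueError as exc:
    print("rejected:", exc)
```

This only helps if the component that actually performs the fetch does not re-parse the string with different rules, which is the gap the comment above describes.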
I read a bit more about the WHATWG URL spec and I think I get what's going on. The spec is all about how you handle URLs that appear over the network: in HTML tags like OTOH, there are cases where you have a URL but there's no base URL to resolve against, like when someone types a URL into a browser location bar, or calls Anyway, what does this mean for us? An HTTP client has to follow redirects, and it probably should do this the same way browsers do, which to me seems like a pretty compelling argument for using WHATWG rules there. Resolving links like a browser does is also a good thing to support; it's not quite in scope for an HTTP client itself, but lots of users will want to do this and it'd be nice if they could use the same library. But we also have to handle URLs where there's no base URL, like in When you say schemeless URLs appear "in the wild", what do you mean by that? People expect to be able to type them into interactive sessions, or something more than that?
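As a rough illustration of the two situations described above (a redirect target resolved against a base URL, versus a bare string with no base), here is a sketch using the standard library's `urllib.parse`. Note the assumption: `urljoin` follows RFC 3986-style reference resolution, not the WHATWG algorithm, so edge cases may differ from what browsers do. The `resolve_redirect` helper is a hypothetical name.

```python
from urllib.parse import urljoin, urlsplit


def resolve_redirect(request_url, location):
    """Resolve a redirect Location value against the URL that was requested."""
    return urljoin(request_url, location)


base = "https://example.com/a/b"

# Relative references only make sense because a base URL is available.
print(resolve_redirect(base, "../c"))               # path-relative
print(resolve_redirect(base, "/login"))             # absolute path
print(resolve_redirect(base, "//other.example/x"))  # scheme-relative

# With no base URL, a schemeless string has nothing to resolve against:
# the parser sees no scheme and no host, only a path.
parts = urlsplit("example.com/path")
print(repr(parts.scheme), repr(parts.netloc), repr(parts.path))
```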
The big use-case is typing into interactive shells; I agree we can remedy that case via heuristics to allow it at the client level. I'll have to take a closer look at WHATWG URL because its requirements for IPv6 hosts specifically are different from RFC 3986's: it doesn't support zone IDs. I know that those are used frequently; there have been a few issues on both requests' and urllib3's trackers about them since we modified our URL parser. Zone IDs make a lot less sense to a browser, but to a programmatic HTTP client they're useful in some situations.
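A client-level heuristic of the kind mentioned above might look roughly like the following sketch. Everything here is an assumption for illustration: the `with_default_scheme` name, the choice of `http` as the default, and the rule that only an explicit `scheme://` prefix counts as a scheme. Inputs such as `mailto:someone@example.com` would be mishandled by this rule, which is acceptable only because the helper is meant for HTTP clients.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "." or "-",
# and here it must be followed by "://" to count.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def with_default_scheme(url, default_scheme="http"):
    """Prepend a default scheme to input typed without one (hypothetical helper)."""
    url = url.strip()
    if _SCHEME_RE.match(url):
        return url  # already has an explicit scheme
    if url.startswith("//"):
        return f"{default_scheme}:{url}"  # scheme-relative form
    return f"{default_scheme}://{url}"


for typed in ("example.com/path", "localhost:8080", "//example.com/x",
              "https://example.com"):
    print(typed, "->", with_default_scheme(typed))
```

Requiring `://` sidesteps the ambiguity of `localhost:8080`, where a strict parser could read `localhost` as the scheme.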
cc @tomchristie would like your thoughts on this?
Sure, supporting zone IDs makes total sense. My intuition is that for things browsers support, we want to do the same thing that browsers do. But for situations that browsers don't support, we can do whatever makes sense. Also my sense is that there isn't really any ambiguity about how to support zone IDs, so if browsers do add support later then it'll match what we do anyway, right?
lol check out RFC 4007 Section 11 and RFC 6874, in which the authors of RFC 4007 forget you can't slap a Joking aside, we can handle them properly regardless of representation. :) So should the IRL package become WHATWG-URL + stuff that browsers don't do yet but we need? Or should I add it to whatwg-url even though it breaks the promise a bit?
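For readers unfamiliar with the two representations: RFC 4007 writes a zone ID after a bare `%` (as in `fe80::1%eth0`), while RFC 6874 requires the `%` to be percent-encoded as `%25` inside a URI host (as in `[fe80::1%25eth0]`). The sketch below accepts both forms. It is not how irl or urllib3 does it; `split_ipv6_zone` is a hypothetical helper, and treating a leading `25` as the encoded form is a heuristic that would misread a bare-`%` zone whose name really starts with `25`.

```python
import ipaddress


def split_ipv6_zone(host):
    """Split '[addr%25zone]' or '[addr%zone]' into (address, zone or None)."""
    if not (host.startswith("[") and host.endswith("]")):
        raise ValueError("expected a bracketed IPv6 literal")
    addr, sep, zone = host[1:-1].partition("%")
    if sep:
        # Heuristic: "%25" is the RFC 6874 encoding of the "%" delimiter.
        if zone.startswith("25"):
            zone = zone[2:]
        if not zone:
            raise ValueError("empty zone ID")
    # Validate the address part on its own; raises ValueError if invalid.
    address = str(ipaddress.IPv6Address(addr))
    return address, (zone if sep else None)


for literal in ("[fe80::1%25eth0]", "[fe80::1%eth0]", "[::1]"):
    print(literal, "->", split_ipv6_zone(literal))
```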
NOTE: This discussion started within Team discussion before it was pointed out that the public could not see the discussion thread
Would love some eyes on and thoughts about irl, which is essentially urllib3's URL parser in its own tiny library. I was in the middle of refactoring to remove the rfc3986 dependency from urllib3 due to its slow import time and wanted to make the work available for others (including http3) to use.