Header methods do not work well with repeated headers #125
Comments
If I channel my inner @ikreymer I think I would make a list of headers which are allowed to be repeated by the HTTP standard, and then meditate on that a bit. Can we make sensible defaults that work for the usual use cases?
The first two usually want to change a very limited set of the headers, the last two usually want to keep them the same. My suspicion is that we might be able to make everyone happy without requiring a change in the warcio API... just add some features.
Yeah, about that... Here's the relevant part from RFC 2616 (in section 4.2):
RFC 7230 phrases it differently but equivalently and adds 'or is a well-known exception'; it only lists
Now, I bet there are numerous servers out there that also repeat other headers that aren't actually allowed to be repeated. In other words, the standard is unfortunately pretty much useless for this, and one must always assume the worst case of 'anything goes'.
What headers would crawlers need to change? Requests and responses should be immutable once created, no? I feel like the most reasonable approach is probably to have
FWIW, pywb modifies
`StatusAndHeaders` does not fare well when header fields are repeated. Here is a list of some problems I've found in such cases:

- `get_header` always returns the first value.
- `replace_header` only replaces the last occurrence.
- `remove_header` only removes the last occurrence.

Apart from the usability impact when working with headers, this could cause problems in pywb if an HTTP response contains more than one `Content-Length` header. That is allowed by the specs, provided all values are equal. But pywb uses `replace_header` on a number of occasions, and that would only replace the last occurrence. If the value is not equal to the original length (which is very likely on link rewriting etc.), the produced HTTP response would be invalid, and I'm not sure what HTTP clients would do with it; some would almost certainly bark.

Arguably, `get_header` behaves correctly since there isn't a generic way to merge header fields into a single value (HTTP specifies one but it doesn't apply to some headers, in particular `Set-Cookie`, which can't be merged; WARC doesn't have the concept of merging headers at all). However, there should probably be a `get_headers` or `get_header_values` method that returns a tuple or list of all values for a header field name. I'm not sure about `replace_header` and `remove_header` at all.
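To make the discussion concrete, here is a rough sketch of what "all occurrences" semantics could look like. It deliberately does not touch warcio: headers are modelled as a plain list of `(name, value)` tuples, and `get_header_values`, `replace_all` and `remove_all` are hypothetical helper names chosen for illustration, not existing warcio methods. Collapsing duplicates into a single field on replace is just one possible policy, assumed here because it keeps a rewritten `Content-Length` consistent.

```python
from typing import List, Tuple

Header = Tuple[str, str]


def get_header_values(headers: List[Header], name: str) -> List[str]:
    """Return every value for `name` (case-insensitive), in original order."""
    name_lower = name.lower()
    return [value for key, value in headers if key.lower() == name_lower]


def remove_all(headers: List[Header], name: str) -> int:
    """Remove every occurrence of `name` in place; return how many were removed."""
    name_lower = name.lower()
    kept = [(k, v) for k, v in headers if k.lower() != name_lower]
    removed = len(headers) - len(kept)
    headers[:] = kept
    return removed


def replace_all(headers: List[Header], name: str, value: str) -> None:
    """Collapse all occurrences of `name` into one field carrying `value`.

    The new field takes the position of the first occurrence;
    if the header is absent, it is appended.
    """
    name_lower = name.lower()
    result: List[Header] = []
    replaced = False
    for k, v in headers:
        if k.lower() == name_lower:
            if not replaced:
                result.append((k, value))
                replaced = True
            # Later duplicates are dropped (assumed policy).
        else:
            result.append((k, v))
    if not replaced:
        result.append((name, value))
    headers[:] = result


# Hypothetical response headers with repeated fields.
headers = [
    ("Content-Type", "text/html"),
    ("Content-Length", "100"),
    ("Set-Cookie", "a=1"),
    ("Set-Cookie", "b=2"),
    ("Content-Length", "100"),
]

cookies = get_header_values(headers, "set-cookie")
replace_all(headers, "Content-Length", "250")
lengths = get_header_values(headers, "Content-Length")
```

With this policy, `Set-Cookie` values stay separate and are never merged, while a rewritten `Content-Length` ends up as a single field, so the two copies cannot disagree after link rewriting.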