Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Header methods do not work well with repeated headers #125

Open
JustAnotherArchivist opened this issue Dec 28, 2020 · 2 comments
Open

Header methods do not work well with repeated headers #125

JustAnotherArchivist opened this issue Dec 28, 2020 · 2 comments

Comments

@JustAnotherArchivist
Copy link
Contributor

JustAnotherArchivist commented Dec 28, 2020

StatusAndHeaders does not fare well when header fields are repeated. Here is a list of some problems I've found in such cases:

  • get_header always returns the first value.
  • replace_header only replaces the last occurrence.
  • remove_header only removes the last occurrence.

Apart from the usability impact when working with headers, this could cause problems in pywb if an HTTP response contains more than one Content-Length header. That is allowed by the specs, provided all values are equal. But pywb uses replace_header on a number of occasions, and that would only replace the last occurrence. If the value is not equal to the original length (which is very likely on link rewriting etc.), the produced HTTP response would be invalid, and I'm not sure what HTTP clients would do with it; some would almost certainly bark.

Arguably, get_header behaves correctly since there isn't a generic way to merge headers fields into a single value (HTTP specifies one but it doesn't apply to some headers, in particular Set-Cookie, which can't be merged; WARC doesn't have the concept of merging headers at all). However, there should probably be a get_headers or get_header_values method that returns a tuple or list of all values for a header field name. I'm not sure about replace_header and remove_header at all.

@wumpus
Copy link
Collaborator

wumpus commented Jan 19, 2021

If I channel my inner @ikreymer I think I would make a list of headers which are allowed to be repeated by the HTTP standard, and then meditate on that a bit. Can we make sensible defaults that work for the usual use cases?

  • crawling
  • playback
  • iterating over contents
  • reorganizing warcs to make sensible subsets

The first two usually want to change a very limited set of the headers, the last two usually want to keep them the same. My suspicion is that we might be able to make everyone happy without requiring a change in the warcio API... just add some features.

@JustAnotherArchivist
Copy link
Contributor Author

list of headers which are allowed to be repeated by the HTTP standard

Yeah, about that... Here's the relevant part from RFC 2616 (in section 4.2):

Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one field-name: field-value pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma.

RFC 7230 phrases it differently but equivalently and adds 'or is a well-known exception'; it only lists Set-Cookie as such an exception.

Now, Content-Length is not such a comma-separated list. But 7230 still talks about repeated values in section 3.3.2:

If a message is received that has multiple Content-Length header fields with field-values consisting of the same decimal value, or a single Content-Length header field with a field value containing a list of identical decimal values (e.g., Content-Length: 42, 42), indicating that duplicate Content-Length header fields have been generated or combined by an upstream message processor, then the recipient MUST either reject the message as invalid or replace the duplicated field-values with a single valid Content-Length field containing that decimal value prior to determining the message body length or forwarding the message.

I bet there are numerous servers out there that also repeat other headers that aren't actually allowed to be repeated. In other words, the standard is unfortunately pretty much useless for this, and one must always assume the worst case of 'anything goes'.


What headers would crawlers need to change? Requests and responses should be immutable once created, no?

I feel like the most reasonable approach is probably to have replace_header and remove_header simply replace/remove all occurrences of a header name. More complicated manipulation, if needed, can always be done by directly accessing the headers attribute, of course, but shouldn't normally be needed. I can't really think of any common case where you'd want to remove or replace only the last match.

FWIW, pywb modifies Access-Control-*, Content-Type, and Content-Length, as far as I can see; it also has some code that changes WARC headers here, but I don't immediately understand what that does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants