Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

warcio accepts a bare LF everywhere a CRLF is required by the spec #157

Closed
acidus99 opened this issue Sep 6, 2023 · 1 comment
Closed

Comments

@acidus99
Copy link

acidus99 commented Sep 6, 2023

warcio check (and possibly other modes for warcio) silently accepts a bare LF anywhere a CRLF is required per the WARC spec. This includes:

  • Allowing a field to end with only a LF
  • Allowing only a LF to delimit between the header and block
  • Allowing only two LF's between 2 records

See the attached WARC. (Github won't let me upload a plain WARC, so to keep things simple, this is just a gzipped file, not a per-record gzipped WARC)
only-lf-endings.warc.gz

Run warcio check only-lf-endings.warc and you get no warnings or errors. However look at the WARC in a hex editor.

Here we can see the bare LF between the fields in a record's header, instead of the required CRLFs:
image

Here we can see a completely invalid LF CR LF separating the warcinfo record's header from its block:

image

Finally, we can see just two LFs between the warcinfo record and the next record:

image

This is a contrived example for simplicity. I found this bug while troubleshooting a WARC which had a record with an incorrect Content-Length field. The Content-Length value was 1 larger than the actual block length, which caused warcio to include the first CR of the CR LF CR LF separator in the block bytes. This did trigger a "block digest failed" warning. Warcio then read the remaining LF CR LF and parsed that as a valid record separator, and continued parsing the rest of the records as normal. This was super confusing because other WARC parsing code/libraries were throwing an error about a wrong content length, while warcio check said everything was fine 😂 Took me a while to track down the problem!

Expected Behavior: warcio should not accept a bare LF where the spec requires a CRLF. warcio should report any missing CRLFs as errors.

@acidus99
Copy link
Author

Per the discussion with @wumpus and @ikreymer in #158, warcio check is not intended to be a WARC validator / linter. Detecting malformed or illegal characters is more appropriate for the proposed warcio test feature in progress at #66.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant