You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
warcio check (and possibly other modes for warcio) silently accepts a bare LF anywhere a CRLF is required per the WARC spec. This includes:
Allowing a field to end with only a LF
Allowing only a LF to delimit between the header and block
Allowing only two LF's between 2 records
See the attached WARC. (Github won't let me upload a plain WARC, so to keep things simple, this is just a gzipped file, not a per-record gzipped WARC) only-lf-endings.warc.gz
Run warcio check only-lf-endings.warc and you get no warnings or errors. However look at the WARC in a hex editor.
Here we can see the bare LF between the fields in a record's header, instead of the required CRLFs:
Here we can see a completely invalid LF CR LF separating the warcinfo record's header from its block:
Finally, we can see just two LFs between the warcinfo record and the next record:
This is a contrived example for simplicity. I found this bug while troubleshooting a WARC which had a record with an incorrect Content-Length field. The Content-Length value was 1 larger than the actual block length, which caused warcio to include the first CR of the CR LF CR LF separator in the block bytes. This did trigger a "block digest failed" warning. Warcio then read the remaining LF CR LF and parsed that as a valid record separator, and continued parsing the rest of the records as normal. This was super confusing because other WARC parsing code/libraries were throwing an error about a wrong content length, while warcio check said everything was fine 😂 Took me a while to track down the problem!
Expected Behavior: warcio should not accept a bare LF where the spec requires a CRLF. warcio should report any missing CRLFs as errors.
The text was updated successfully, but these errors were encountered:
Per the discussion with @wumpus and @ikreymer in #158, warcio check is not intended to be a WARC validator / linter. Detecting malformed or illegal characters is more appropriate for the proposed warcio test feature in progress at #66.
warcio check
(and possibly other modes forwarcio
) silently accepts a bare LF anywhere a CRLF is required per the WARC spec. This includes:See the attached WARC. (Github won't let me upload a plain WARC, so to keep things simple, this is just a gzipped file, not a per-record gzipped WARC)
only-lf-endings.warc.gz
Run
warcio check only-lf-endings.warc
and you get no warnings or errors. However look at the WARC in a hex editor.Here we can see the bare LF between the fields in a record's header, instead of the required CRLFs:
Here we can see a completely invalid LF CR LF separating the
warcinfo
record's header from its block:Finally, we can see just two LFs between the
warcinfo
record and the next record:This is a contrived example for simplicity. I found this bug while troubleshooting a WARC which had a record with an incorrect Content-Length field. The Content-Length value was 1 larger than the actual block length, which caused warcio to include the first CR of the
CR LF CR LF
separator in the block bytes. This did trigger a "block digest failed" warning. Warcio then read the remainingLF CR LF
and parsed that as a valid record separator, and continued parsing the rest of the records as normal. This was super confusing because other WARC parsing code/libraries were throwing an error about a wrong content length, whilewarcio check
said everything was fine 😂 Took me a while to track down the problem!Expected Behavior: warcio should not accept a bare LF where the spec requires a CRLF. warcio should report any missing CRLFs as errors.
The text was updated successfully, but these errors were encountered: