NOT READY: warcio test #66

Open: wants to merge 71 commits into base: develop
ff1f543
warcio test
wumpus Jan 26, 2019
ebb721f
documentation
wumpus Jan 26, 2019
7aa060d
tests
wumpus Jan 26, 2019
24f3000
tests
wumpus Jan 26, 2019
40f9fc6
coverage
wumpus Jan 26, 2019
c70e68e
python 2.7 test fix
wumpus Jan 26, 2019
1847633
python 2.7 fixes
wumpus Jan 26, 2019
2c676db
coverage
wumpus Jan 26, 2019
97ee457
py2 testing
wumpus Jan 27, 2019
df50151
py2 windows testing
wumpus Jan 27, 2019
858a752
coverage
wumpus Jan 28, 2019
5bfffea
branch coverage
wumpus Jan 28, 2019
bb31f14
py2 branch coverage
wumpus Jan 28, 2019
cc54259
py2 testing
wumpus Jan 28, 2019
2b8d596
add record ids to test
wumpus Jan 28, 2019
c704fe9
preserve capitalization in messages
wumpus Jan 28, 2019
3839fa1
capitals and colons
wumpus Jan 28, 2019
8b9032d
use valid record ids
wumpus Jan 28, 2019
2a10b23
warc-segment-number cleaner recommendation
wumpus Jan 28, 2019
81c9f0a
segment origin id
wumpus Jan 28, 2019
c78343a
timestamp checking
wumpus Jan 28, 2019
efe0fda
buglet
wumpus Jan 29, 2019
7a26644
global checks
wumpus Jan 30, 2019
1d6fd9d
check -v; capitalize most commentary
wumpus Jan 31, 2019
5b716b7
...
wumpus Feb 1, 2019
fb8e3fa
revisits and global detection with just one file
wumpus Feb 1, 2019
d243632
show errors for decompression and unchunking failures
wumpus Feb 1, 2019
29517c4
make this function reentrant
wumpus Feb 2, 2019
844807e
narrow exception; fix bug not reading to the end of a chunked buffer
wumpus Feb 2, 2019
a55afd3
...
wumpus Feb 2, 2019
a33a5eb
put tester output in external files
wumpus Feb 6, 2019
fec139a
wip
wumpus Apr 4, 2019
417eee1
merge
wumpus Apr 4, 2019
a471222
tweak to match new test files
wumpus Apr 5, 2019
a80a784
merge
wumpus Sep 9, 2019
30a86fe
tests pass
wumpus Sep 9, 2019
19dc8b3
warcio test
wumpus Jan 26, 2019
88dff09
documentation
wumpus Jan 26, 2019
c99bc2e
tests
wumpus Jan 26, 2019
0039335
tests
wumpus Jan 26, 2019
9b7c9ce
coverage
wumpus Jan 26, 2019
903ed1d
python 2.7 test fix
wumpus Jan 26, 2019
68938bd
python 2.7 fixes
wumpus Jan 26, 2019
234468a
coverage
wumpus Jan 26, 2019
e7f88e7
py2 testing
wumpus Jan 27, 2019
8662073
py2 windows testing
wumpus Jan 27, 2019
291460e
coverage
wumpus Jan 28, 2019
69080d5
branch coverage
wumpus Jan 28, 2019
2e1d820
py2 branch coverage
wumpus Jan 28, 2019
bbdb57b
py2 testing
wumpus Jan 28, 2019
fc2d7b4
add record ids to test
wumpus Jan 28, 2019
d1fe18e
preserve capitalization in messages
wumpus Jan 28, 2019
484da9c
capitals and colons
wumpus Jan 28, 2019
4687497
use valid record ids
wumpus Jan 28, 2019
bcfe672
warc-segment-number cleaner recommendation
wumpus Jan 28, 2019
7f715c0
segment origin id
wumpus Jan 28, 2019
2583f19
timestamp checking
wumpus Jan 28, 2019
8eb87e8
buglet
wumpus Jan 29, 2019
3a8747e
global checks
wumpus Jan 30, 2019
f7cd1db
check -v; capitalize most commentary
wumpus Jan 31, 2019
b570b6c
...
wumpus Feb 1, 2019
921e748
revisits and global detection with just one file
wumpus Feb 1, 2019
4265b62
show errors for decompression and unchunking failures
wumpus Feb 1, 2019
08e6bd9
make this function reentrant
wumpus Feb 2, 2019
d1f48ed
narrow exception; fix bug not reading to the end of a chunked buffer
wumpus Feb 2, 2019
6e44a44
...
wumpus Feb 2, 2019
59198eb
put tester output in external files
wumpus Feb 6, 2019
b61878e
wip
wumpus Apr 4, 2019
2d2b7d5
tests pass
wumpus Sep 9, 2019
f4bc076
merge
wumpus Nov 5, 2019
fc19c7d
comments
wumpus Feb 16, 2020
8 changes: 8 additions & 0 deletions README.rst
@@ -368,6 +368,14 @@ of WARC records, if possible. An exit value of 1 indicates a failure.
 ``warcio check -v`` will print verbose output for each record in the
 WARC file.
 
+Test
+~~~~
+
+The ``warcio test`` command will check one or more WARC files against
+the WARC standard, giving commentary about standards violations,
+recommendations, and other issues.
+
+
 Recompress
 ~~~~~~~~~~
 
17 changes: 11 additions & 6 deletions setup.py
@@ -4,6 +4,7 @@
 from setuptools import setup, find_packages
 from setuptools.command.test import test as TestCommand
 import glob
+import sys
 
 __version__ = '1.7.1'
 
@@ -21,6 +22,15 @@ def run_tests(self):
         errcode = pytest.main(['--doctest-modules', './warcio', '--cov', 'warcio', '-v', 'test/'])
         sys.exit(errcode)
 
+tests_require = [
+    'pytest',
+    'pytest-cov',
+    'httpbin==0.5.0',
+    'requests',
+]
+if sys.version_info < (3, 3):
+    tests_require.append('ipaddress')
+
 setup(
     name='warcio',
     version=__version__,
@@ -44,12 +54,7 @@ def run_tests(self):
 """,
     cmdclass={'test': PyTest},
     test_suite='',
-    tests_require=[
-        'pytest',
-        'pytest-cov',
-        'httpbin==0.5.0',
-        'requests',
-    ],
+    tests_require=tests_require,
     classifiers=[
         'Development Status :: 5 - Production/Stable',
         'Environment :: Web Environment',
File renamed without changes.
22 changes: 22 additions & 0 deletions test/data/example-digest-bad.warc.test
@@ -0,0 +1,22 @@
test/data/example-digest-bad.warc
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
payload digest failed: sha1:1112H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
error: Duplicate WARC-Record-ID: <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
error: Duplicate WARC-Record-ID: <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
error: Duplicate WARC-Record-ID: <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
global Concurrent-To checks
comment: WARC-Concurrent-To not found: <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007> WARC-Concurrent-To <urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>
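The expected output above reports `Duplicate WARC-Record-ID` for the second, third, and fourth records that share one ID, but not for the first. A minimal sketch of a check with that behavior follows; the function name `find_duplicate_record_ids` is hypothetical and is not taken from the PR's code.

```python
def find_duplicate_record_ids(record_ids):
    """Return one error message for each repeat of an already-seen ID.

    Hypothetical helper: the first occurrence of an ID is accepted,
    and every later occurrence is reported.
    """
    seen = set()
    errors = []
    for record_id in record_ids:
        if record_id in seen:
            errors.append('error: Duplicate WARC-Record-ID: ' + record_id)
        else:
            seen.add(record_id)
    return errors


# Four records sharing one ID, as in example-digest-bad.warc.
ids = ['<urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>'] * 4
errors = find_duplicate_record_ids(ids)
```

With four identical IDs this yields three messages, matching the three duplicate errors in the fixture output.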
16 changes: 16 additions & 0 deletions test/data/example.warc.test
@@ -0,0 +1,16 @@
test/data/example.warc
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
WARC-Type revisit
digest present but not checked (revisit)
recommendation: Missing recommended header: WARC-Refers-To
comment: This Heretrix extension never made it into the standard: WARC-Profile http://netpreserve.org/warc/1.0/revisit/uri-agnostic-identical-payload-digest
comment: Field was introduced after this warc version: 1.0 WARC-Refers-To-Target-URI http://example.com/
comment: Field was introduced after this warc version: 1.0 WARC-Refers-To-Date 2017-03-06T04:02:06Z
WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
56 changes: 56 additions & 0 deletions test/data/standard-torture-validate-field.warc
@@ -0,0 +1,56 @@
WARC/1.0
WARC-Target-URI: <http://example.com/>
WARC-Target-URI: example.com
WARC-Target-URI: ex ample.com
WARC-Target-URI: h<>ttp://example.com/
WARC-Type: does-not-exist
WARC-Type: CAPITALIZED
WARC-Concurrent-To: http://example.com/
WARC-Concurrent-To: <uri:urn:asdf-asdf-asdf>
WARC-Record-ID: <urn:uuid:torture-validate-field>
WARC-Date: 2017-03-06T04:03:53Z
WARC-Date: 2017-03-06T04:03:53.Z
Content-Type: asdf
Content-Type: has space/asdf
Content-Type: asdf/has space
Content-Type: asdf/has space;asdf
WARC-Block-Digest: asdf
WARC-Block-Digest: has space:asdf
WARC-Block-Digest: sha1:&$*^&*^#*&^
WARC-IP-Address: 1.2.3.4.5
WARC-Truncated: invalid
WARC-Warcinfo-ID: asdf:asdf
WARC-Filename: not-yet-tested
WARC-Profile: asdf
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Identified-Payload-Type: asdf
WARC-Segment-Origin-ID: http://example.com
WARC-Segment-Number: not-an-integer
WARC-Segment-Number: 0
WARC-Segment-Number: 1
WARC-Segment-Number: 2
WARC-Segment-Total-Length: 0
WARC-Segment-Total-Length: not-an-integer
WARC-Refers-To-Target-URI: http://example.com
WARC-Refers-To-Date: not-a-date
WARC-Refers-To-Filename: asdf
WARC-Refers-To-File-Offset: 1234
WARC-Unknown-Field: asdf
Content-Length: 0


WARC/1.1
WARC-Date: 2017-03-06T04:03:53Z
WARC-Date: 2017-03-06T04:03:53.Z
WARC-Date: 2017-03-06T04:03:53.0Z
WARC-Type: invalid
Content-Length: 0


WARC/1.1
WARC-Type: request
WARC-Segment-Number: 1
Content-Length: 0


WARC/invalid
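The `WARC-Date` values in this fixture exercise three cases: a plain timestamp, a trailing `.` with no digits, and a one-digit fraction. The sketch below shows one way such a check could be written; it is not the PR's implementation. The function name `check_warc_date` and the version tuple argument are assumptions, and the pattern deliberately handles only full-precision timestamps with an optional 1 to 9 digit fraction, ignoring any reduced-precision forms. The message strings mirror those in the accompanying `.warc.test` file.

```python
import re

# Full-precision UTC timestamp with an optional 1-9 digit fraction.
# Simplification: reduced-precision timestamps are not handled here.
_TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,9})?Z')


def check_warc_date(value, version):
    """Return a list of problems for a WARC-Date value.

    `version` is a tuple such as (1, 0) or (1, 1) (an assumed
    representation, chosen so versions compare naturally).
    """
    problems = []
    if not _TIMESTAMP.fullmatch(value):
        problems.append('Invalid timestamp')
    if '.' in value and version <= (1, 0):
        problems.append('WARC versions <= 1.0 may not have timestamps '
                        'with fractional seconds')
    return problems


plain = check_warc_date('2017-03-06T04:03:53Z', (1, 0))
bare_dot_v10 = check_warc_date('2017-03-06T04:03:53.Z', (1, 0))
bare_dot_v11 = check_warc_date('2017-03-06T04:03:53.Z', (1, 1))
fraction_v11 = check_warc_date('2017-03-06T04:03:53.0Z', (1, 1))
```

Keeping the syntax check and the version check separate lets a value like `2017-03-06T04:03:53.Z` in a WARC/1.0 record collect both messages.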
80 changes: 80 additions & 0 deletions test/data/standard-torture-validate-field.warc.test
@@ -0,0 +1,80 @@
test/data/standard-torture-validate-field.warc
WARC-Record-ID <urn:uuid:torture-validate-field>
WARC-Type does-not-exist
unknown hash algorithm name in block digest
error: uri must not be within <>: WARC-Target-URI <http://example.com/>
error: Duplicate field seen: WARC-Target-URI example.com
error: Invalid uri, no scheme: WARC-Target-URI example.com
error: Duplicate field seen: WARC-Target-URI ex ample.com
error: Invalid uri, no scheme: WARC-Target-URI ex ample.com
error: Invalid uri, contains whitespace: WARC-Target-URI ex ample.com
error: Duplicate field seen: WARC-Target-URI h<>ttp://example.com/
error: Invalid uri scheme, bad character: WARC-Target-URI h<>ttp://example.com/
error: Duplicate field seen: WARC-Type CAPITALIZED
error: uri must be within <>: WARC-Concurrent-To http://example.com/
error: Duplicate field seen: WARC-Date 2017-03-06T04:03:53.Z
error: Invalid timestamp: WARC-Date 2017-03-06T04:03:53.Z
error: WARC versions <= 1.0 may not have timestamps with fractional seconds: WARC-Date 2017-03-06T04:03:53.Z
error: Must contain a /: Content-Type asdf
error: Invalid subtype: Content-Type asdf
error: Duplicate field seen: Content-Type has space/asdf
error: Invalid type: Content-Type has space/asdf
error: Duplicate field seen: Content-Type asdf/has space
error: Invalid subtype: Content-Type asdf/has space
error: Duplicate field seen: Content-Type asdf/has space;asdf
error: Invalid subtype: Content-Type asdf/has space;asdf
error: Missing algorithm: WARC-Block-Digest asdf
error: Duplicate field seen: WARC-Block-Digest has space:asdf
error: Invalid algorithm: WARC-Block-Digest has space:asdf
error: Duplicate field seen: WARC-Block-Digest sha1:&$*^&*^#*&^
error: Invalid ip: WARC-IP-Address 1.2.3.4.5
error: uri must be within <>: WARC-Warcinfo-ID asdf:asdf
error: Duplicate field seen: WARC-Profile http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
error: Must contain a /: WARC-Identified-Payload-Type asdf
error: Invalid subtype: WARC-Identified-Payload-Type asdf
error: uri must be within <>: WARC-Segment-Origin-ID http://example.com
error: Must be an integer: WARC-Segment-Number not-an-integer
error: Duplicate field seen: WARC-Segment-Number 0
error: Must be 1 or greater: WARC-Segment-Number 0
error: Non-continuation records must always have WARC-Segment-Number: 1: WARC-Segment-Number 0
error: Duplicate field seen: WARC-Segment-Number 1
error: Duplicate field seen: WARC-Segment-Number 2
error: Non-continuation records must always have WARC-Segment-Number: 1: WARC-Segment-Number 2
error: Duplicate field seen: WARC-Segment-Total-Length not-an-integer
error: Must be an integer: WARC-Segment-Total-Length not-an-integer
error: Invalid timestamp: WARC-Refers-To-Date not-a-date
comment: Unknown WARC-Type: WARC-Type does-not-exist
comment: WARC-Type is not lower-case: WARC-Type CAPITALIZED
comment: Unknown WARC-Type: WARC-Type CAPITALIZED
comment: Unknown digest algorithm: WARC-Block-Digest asdf
comment: Invalid-looking digest value: WARC-Block-Digest sha1:&$*^&*^#*&^
comment: Unknown value, perhaps an extension: WARC-Truncated invalid
comment: Unknown value, perhaps an extension: WARC-Profile asdf
comment: Field was introduced after this warc version: 1.0 WARC-Refers-To-Target-URI http://example.com
comment: Field was introduced after this warc version: 1.0 WARC-Refers-To-Date not-a-date
comment: This Heretrix extension never made it into the standard: WARC-Refers-To-Filename asdf
comment: This Heretrix extension never made it into the standard: WARC-Refers-To-File-Offset 1234
comment: Unknown field, no validation performed: WARC-Unknown-Field asdf
WARC-Record-ID None
WARC-Type invalid
digest not present
error: Duplicate field seen: WARC-Date 2017-03-06T04:03:53.Z
error: Invalid timestamp: WARC-Date 2017-03-06T04:03:53.Z
error: Duplicate field seen: WARC-Date 2017-03-06T04:03:53.0Z
comment: Unknown WARC-Type: WARC-Type invalid
WARC-Record-ID None
WARC-Type request
digest not present
error: Segmented records must have both WARC-Segment-Number and WARC-Segment-Origin-ID
error: Missing required header: Content-Type
error: Missing required header: WARC-Date
error: Missing required header: WARC-Record-ID
error: Missing required header: WARC-Target-URI
recommendation: Do not segment WARC-Type request
saw exception ArchiveLoadFailed: Invalid WARC record, first line: WARC/invalid
skipping rest of file
global warcinfo checks
comment: WARC-Warcinfo-ID not found: <urn:uuid:torture-validate-field> WARC-Warcinfo-ID asdf:asdf
global Concurrent-To checks
comment: WARC-Concurrent-To not found: <urn:uuid:torture-validate-field> WARC-Concurrent-To <uri:urn:asdf-asdf-asdf>
comment: WARC-Concurrent-To not found: <urn:uuid:torture-validate-field> WARC-Concurrent-To http://example.com/
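The `Invalid ip` error for `1.2.3.4.5` above, together with the `ipaddress` backport added to `tests_require` for Python versions before 3.3, suggests the address check relies on the standard `ipaddress` module. That is an inference, not something the diff shows. A minimal Python 3 sketch under that assumption, with the hypothetical name `check_ip_address`:

```python
import ipaddress


def check_ip_address(value):
    """Return ['Invalid ip'] if value is not a valid IPv4 or IPv6 address.

    Hypothetical helper: ipaddress.ip_address raises ValueError for
    anything it cannot parse as either address family.
    """
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return ['Invalid ip']
    return []


bad = check_ip_address('1.2.3.4.5')
good_v4 = check_ip_address('1.2.3.4')
good_v6 = check_ip_address('::1')
```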
136 changes: 136 additions & 0 deletions test/data/standard-torture-validate-record.warc
@@ -0,0 +1,136 @@
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Refers-To: probhibited
Content-Length: 146

first line can't start with a space
test: invalid utf8 �(
test: lines should end with \r\n
foo:
bar

no colon
token cannot have a space:


WARC/1.0
WARC-Record-ID: <uri:uuid:test-empty-warc-fields>
WARC-Type: warcinfo
Content-Type: application/warc-fields
Content-Length: 0


WARC/1.0
WARC-Type: warcinfo
WARC-Record-ID: <uri:uuid:test-warcinfo-non-recommended-content-type>
Content-Type: not-application/warc-fields
Content-Length: 5

foo


WARC/1.0
WARC-Type: response
WARC-Record-ID: <uri:uuid:test-response-content-type>
WARC-Target-URI: HtTp://example.com/
Content-Type: text/plain
Content-Length: 0


WARC/1.0
WARC-Type: resource
WARC-Record-ID: <uri:uuid:test-resource-dns-content-type>
WARC-Target-URI: DnS:asdfasdf
Content-Type: text/plain
Content-Length: 0


WARC/1.0
WARC-Type: resource
WARC-Record-ID: <uri:uuid:test-resource-dns-empty>
WARC-Test-TODO: add another with valid block
WARC-Target-URI: DnS:asdfasdf
Content-Type: text/dns
Content-Length: 0


WARC/1.0
WARC-Type: resource
WARC-Record-ID: <uri:uuid:test-resource-not-dns>
WARC-Target-URI: foo:bar
Content-Length: 0


WARC/1.0
WARC-Type: request
WARC-Record-ID: <uri:uuid:test-request-content-type>
WARC-Target-URI: hTtP://example.com/
Content-Type: text/plain
Content-Length: 0


WARC/1.0
WARC-Type: request
WARC-Record-ID: <uri:uuid:test-request-content-type-with-ip>
WARC-Target-URI: hTtP://example.com/
WARC-IP-Address: 1.2.3.4
Content-Type: text/plain
Content-Length: 0


WARC/1.0
WARC-Type: metadata
WARC-Record-ID: <uri:uuid:test-metadata-warc-fields-empty>
Content-Type: application/warc-fields
Content-Length: 0


WARC/1.0
WARC-Type: metadata
WARC-Record-ID: <uri:uuid:test-metadata-not-warc-fields>
Content-Type: not-application/warc-fields
Content-Length: 0


WARC/1.0
WARC-Type: revisit
WARC-Record-ID: <uri:uuid:test-revisit-profile-unknown>
WARC-Profile: none
Content-Length: 0


WARC/1.0
WARC-Type: revisit
WARC-Record-ID: <uri:uuid:test-revisit-profile-future>
WARC-Profile: http://netpreserve.org/warc/1.1/revisit/identical-payload-digest
Content-Length: 0


WARC/1.0
WARC-Type: revisit
WARC-Record-ID: <uri:uuid:test-revisit-profile-good>
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
Content-Length: 0


WARC/1.0
WARC-Type: conversion
WARC-Record-ID: <uri:uuid:test-conversion>
Content-Length: 0


WARC/1.0
WARC-Type: continuation
WARC-Record-ID: <uri:uuid:test-continuation-segment-1>
WARC-Segment-Number: 1
Content-Length: 0


WARC/1.0
WARC-Type: continuation
WARC-Record-ID: <uri:uuid:test-continuation-segment-valid>
WARC-Segment-Number: 2
Content-Length: 0
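The two continuation records that close this fixture carry `WARC-Segment-Number` values of 1 and 2. The sketch below shows how a segment-number check could distinguish them; it is an illustration, not the PR's code. The function name `check_segment_number` is hypothetical. The first three messages are copied from the expected output of the field-validation fixture, while the message for continuation records is invented here, since the expected output for this file is not shown.

```python
def check_segment_number(value, warc_type):
    """Return a list of problems for a WARC-Segment-Number value."""
    try:
        number = int(value)
    except ValueError:
        return ['Must be an integer']

    problems = []
    if number < 1:
        problems.append('Must be 1 or greater')
    if warc_type == 'continuation':
        # Segment 1 belongs to the first, non-continuation record.
        # Hypothetical message wording, not taken from the PR.
        if number < 2:
            problems.append('Continuation records must have '
                            'WARC-Segment-Number greater than 1')
    elif number != 1:
        problems.append('Non-continuation records must always have '
                        'WARC-Segment-Number: 1')
    return problems


first_segment = check_segment_number('1', 'continuation')
later_segment = check_segment_number('2', 'continuation')
zero_in_request = check_segment_number('0', 'request')
```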

