Description
If a URL in a log line contains a double quote, s3logparse throws an exception. For example, the following line:
f383580bf9c9bd19055832f1bf43164c3e6c444a758287727f8f40ab00d461bc testing-1234xxx [20/Jun/2019:03:51:36 +0000] 1.2.3.4 - 63DAF67A2BC0B6C3 REST.GET.OBJECT %2522 "GET /" HTTP/1.1" 403 AccessDenied 243 - 8 - "referer test" "curl/7.38.0" - uidORkMVUSKEFetTC9FJG1qrSdT9DlxA97GPF8m2IsJlDxHLkV5VGzmkuTb8pXIym7B/J5XZlGU= - ECDHE-RSA-AES128-GCM-SHA256 - testing-1234xxx.s3.amazonaws.com TLSv1.2
throws:
File "s3logparse/s3logparse.py", line 46, in shift_int_fields
yield 0 if i == '-' else int(i)
ValueError: invalid literal for int() with base 10: 'HTTP/1.1"'
The request field here is '"GET /" HTTP/1.1"': the requested URL is '/"', so the quote inside the URL ends the field early and 'HTTP/1.1"' gets shifted into the numeric fields that follow, which is where the int() conversion blows up.
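A minimal way to reproduce it, assuming the parse_log_lines() entry point from the README fed a single-line iterable (I haven't checked whether there's a more direct per-line API):

```python
# Minimal reproduction sketch. Assumes the parse_log_lines() generator shown
# in the project README; the log line is the one quoted above.
from s3logparse import s3logparse

bad_line = (
    'f383580bf9c9bd19055832f1bf43164c3e6c444a758287727f8f40ab00d461bc '
    'testing-1234xxx [20/Jun/2019:03:51:36 +0000] 1.2.3.4 - 63DAF67A2BC0B6C3 '
    'REST.GET.OBJECT %2522 "GET /" HTTP/1.1" 403 AccessDenied 243 - 8 - '
    '"referer test" "curl/7.38.0" - '
    'uidORkMVUSKEFetTC9FJG1qrSdT9DlxA97GPF8m2IsJlDxHLkV5VGzmkuTb8pXIym7B/J5XZlGU= '
    '- ECDHE-RSA-AES128-GCM-SHA256 - testing-1234xxx.s3.amazonaws.com TLSv1.2'
)

for entry in s3logparse.parse_log_lines([bad_line]):
    print(entry)  # never reached: ValueError is raised while parsing the line
```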
The same thing can happen with the referer (and probably the user agent too), since those fields can also contain quotes and spaces: send a request with a referer like 'foo "bar"' and you can inject anything you want into the fields that come after it. It looks like this was reported to them way back in 2012, so I guess they're not going to fix it.
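To make the injection concrete, here's a simplified quote-aware splitter (not the library's actual code, just the same general approach) showing how a crafted referer produces extra tokens that land where later fields are expected:

```python
import re

# Simplified field splitter in the spirit of an S3 access-log parser:
# bracketed timestamps and double-quoted strings are single fields,
# everything else splits on whitespace. This is NOT s3logparse's real code.
FIELD_RE = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

tail = ('243 - 8 - "{referer}" "curl/7.38.0" - hostid - cipher - '
        'bucket.s3.amazonaws.com TLSv1.2')

benign = tail.format(referer='http://example.com/')
hostile = tail.format(referer='foo "bar" fake-host.example.com TLSv1.0')

print(FIELD_RE.findall(benign))
print(FIELD_RE.findall(hostile))  # 'bar"', 'fake-host.example.com', 'TLSv1.0"'
                                  # show up as extra fields, shifting everything
                                  # that comes after the referer
```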
Silently discarding lines that won't parse would be better than throwing an exception (at least the rest of the log would still parse). That could discard real data, though, since S3 key names can legitimately contain quotes. The request URL could be special-cased by not treating quotes as delimiters inside the URL portion of the request string; I don't think a space can be injected there, since that would make the request itself invalid.
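Rough sketch of both ideas; the wrapper assumes the same parse_log_lines() entry point as above, and the regex is just one possible way to match the request field:

```python
import re

from s3logparse import s3logparse


def tolerant_parse(lines):
    """Skip lines that won't parse instead of aborting the whole file."""
    for line in lines:
        try:
            yield from s3logparse.parse_log_lines([line])
        except ValueError:
            continue  # silently drop the bad line


# Possible special case for the request field: the URL can't contain an
# unencoded space (that would make the request line itself invalid), so match
# it as a single non-space token and only treat the quote after the HTTP
# version as the closing delimiter.
REQUEST_RE = re.compile(r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>HTTP/[^"]*)"')

m = REQUEST_RE.search('REST.GET.OBJECT %2522 "GET /" HTTP/1.1" 403 AccessDenied')
print(m.groupdict())  # {'method': 'GET', 'url': '/"', 'protocol': 'HTTP/1.1'}
```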
Handling the referer and user-agent problem is trickier. Parsing the tokens that follow them from the end of the line, working backwards from the TLS version, would keep those trailing fields from breaking. I don't know whether this ever happens outside of maliciously formatted requests, but it would prevent garbage from being injected into the host header and the other trailing fields. It would break if extra fields are ever appended to the end of the log format, though.
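Something like this, assuming the trailing field layout the example line above has (version id, host id, signature version, cipher suite, auth type, host header, TLS version):

```python
def split_trailing_fields(line, n_trailing=7):
    """
    Sketch of the 'parse from the end' idea: split the fixed fields off the
    right-hand side of the line. None of the assumed trailing fields (version
    id, host id, signature version, cipher suite, auth type, host header, TLS
    version) can contain a space, so quotes and spaces injected via the
    referer or user agent stay in the head portion. Breaks if extra fields
    are appended to the log format.
    """
    head, *trailing = line.rstrip('\n').rsplit(' ', n_trailing)
    # With the example line above, trailing[-1] is 'TLSv1.2' and the host
    # header is intact even if the referer/user agent contain quotes.
    return head, trailing
```

The head portion would still go through the existing left-to-right parsing for everything up to the user agent, so only the referer and user agent themselves could end up mangled by injected quotes.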