Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diffing server exception handling #185

Conversation

ftsalamp
Copy link
Contributor

@ftsalamp ftsalamp commented May 7, 2018

Hello everyone! I am the GSoC student selected by the Internet Archive to work on the project of bringing together the Wayback Machine and the scanner software. I wrote some code trying to improve the exception handling in the diffing_server. I also wrote some tests on the code I added.

I'm very excited to work on this project and looking forward to your comments.

Copy link
Member

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome to the project, @ftsalamp! This is wonderful. Thanks, especially, for including tests. I left a bunch of notes inline about how it could be even better.

A few general style notes, too:

  • Except for docstrings and multiline strings, we use single-quotes everywhere. (We should probably do it for those strings, too, but haven’t been consistent.)

  • Make sure to include spaces around operators.

  • Wherever possible, use format strings, string.format(), or string interpolation instead of string + string

@@ -15,6 +16,8 @@
)
import web_monitoring.html_diff_render
import web_monitoring.links_diff
import json
import sys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please alphabetize the imports!

@@ -67,17 +69,47 @@ def get(self, differ):
try:
func = self.differs[differ]
except KeyError:
self.send_error(404)
self.send_error(404, reason="Diffing method not found.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we’re improving these messages, one important practice we should follow is to, wherever reasonable, include information about what value was wrong and how it could be fixed.

This message would be even better if were something like:

self.send_error(404, reason=f'Unknown diffing method: `{differ}`. You can get a list of supported differs from the `/` endpoint.')

b = query_params.pop('b')
except KeyError:
self.send_error(
400, reason="Malformed request. A URL parameter is missing.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, it’s really helpful to be able to say what parameter was missing. So maybe something like:

self.send_error(
    400,
    reason='Malformed request. You must provide a URL as the value for both `a` and `b` query parameters.')

Even better to be able say specifically whether it was a, b, or both that were missing, but that’s a lot more code and I think the above is probably good enough :)

except (ValueError, IOError):
if res_a.code == 599:
self.send_error(
400, reason=str(res_a.error))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a comment describing why you are re-mapping the 599 code to 400 here? I think I understand why, but this is the sort of thing I think benefits from a clear explanation for others who are working on this code later :)

res_a.code, reason=str(res_a.error))
return

if res_b.error is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two error handling blocks (for res_a and res_b) are exactly the same, so it would be btter to encapsulate this error logic into a function that you can call once for each response.

except UndecodableContentError:
exceptionMessage = str(sys.exc_info()[1])
self.send_error(422, reason=exceptionMessage)
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These errors should already be handled in write_error(). Was that not working for your case?

write_error does some important extra work, like:

  • differentiating between errors that should be publicly visible and errors that should not (because that could leak security-sensitive information),
  • including stack traces only when the server is configured to do so (again, this is a big security concern), and
  • formatting the errors in a JSON structure (all output should be JSON here, though we are not perfect at that yet).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right about this. I think I was just too eager to handle everything, saw this in the stack trace and went ahead and re-handled it without checking the code first. I'm sorry about this.

self.assertEqual(response.code, 422)

def mockDiffingMethod(c_body):
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure all files end with a blank line!

return

def testInvalidURLaFormat(self):
response = self.fetch('http://localhost:'+str(self.get_http_port())+'/html_token?format=json&include=all&a=example.org&b=https://example.org', method='GET')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use a format string for this? e.g:

port = self.get_http_port()
response = self.fetch(
    f'http://localhost:{port}/html_token?format=json&include=all&a=example.org&b=https://example.org',
    method='GET')

Same for all the other tests here.


def testInvalidURLaFormat(self):
response = self.fetch('http://localhost:'+str(self.get_http_port())+'/html_token?format=json&include=all&a=example.org&b=https://example.org', method='GET')
self.assertEqual(response.code, 400)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven’t done a good job of this everywhere, but we are trying to make sure that this server always responds with JSON. It would be great if these tests didn’t only check the response code, but also checked that they:

  1. Have the application/json content-type
  2. Are parse-able JSON
  3. The JSON data is an object with at least the fields code (a number) and error (a string) present

Same for all the other tests here. You could probably write a method that does those assertions and use it in each of the tests.

app.listen(str(self.get_http_port()))
return

def testInvalidURLaFormat(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use underscore case for method names: def test_invalid_url_a_format. Same for the rest of these tests.

@lightandluck
Copy link
Collaborator

Woot! Welcome @ftsalamp! 🎉 That's awesome that you've taken this on and glad that you're here 😀

@ftsalamp
Copy link
Contributor Author

ftsalamp commented May 8, 2018

I think I have implemented all the requested changes. Thank you for your help and input @Mr0grog !

@weatherpattern
Copy link
Contributor

Hello @ftsalamp ! Thank you for your contribution!
Don't forget to add yourself to as a contributor.

def get_app(self):
app = df.make_app()
app.listen(str(self.get_http_port()))
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just return the app, not start it:

def get_app(self):
    return df.make_app()

Doing that enables you to write each test more simply:

def test_some_request(self):
    response = self.fetch('/html_token?whatever_args')
    # Instead of:
    port = self.get_http_port()
    response = self.fetch(f'http://localhost:{port}/html_token?whatever_args')

Sorry I missed this before.

@@ -2,11 +2,14 @@
from docopt import docopt
import hashlib
import inspect
import json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn’t used anymore, so you should remove it.

import re
import sys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn’t used anymore, so you should remove it.

import tornado.gen
import tornado.httpclient
import tornado.ioloop
import tornado.web
from tornado.stack_context import ExceptionStackContext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn’t used anymore, so you should remove it.

self.check_response_for_error(res_a)
self.check_response_for_error(res_b)
except tornado.httpclient.HTTPError:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m pretty sure this is a logic error. In check_response_for_error(), we have:

if response.error is not None:
    try:
        response.rethrow()
    except (ValueError, IOError):
        # send back the right error response
        raise tornado.httpclient.HTTPError(0)
    # But we didn’t handle tornado.httpclient.HTTPError errors!

If response.rethrow() raised a tornado.httpclient.HTTPError, it won’t ever get an error response because we didn’t handle it inside check_response_for_error(), but we still stopped here in DiffHandler.get() as if we did. It basically stops here, but never sends the error message. You can try this by adding a test:

def test_a_is_404(self):
        response = self.fetch(f'/html_token?format=json&include=all&a=http://httpstat.us/404&b=https://example.org')
        print(f'STATUS:: {response.code}')
        print('BODY:::: "%s"' % response.body.decode('utf-8'))
        self.assertEqual(response.code, 400)
        self.json_check(response)

…which fails and prints out:

STATUS:: 200
BODY:::: ""

if response.error is not None:
try:
response.rethrow()
except (ValueError, IOError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but IOError got merged into OSError in Python 3.3, so it might be better to use that instead.

except (ValueError, IOError):
#Response code == 599 means that no HTTP response was received.
#In this case the error code should become 400 indicating that the error was
#raised because of a bad request parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually a specific HTTP error for this case: 504 GATEWAY TIMEOUT https://httpstatuses.com/504

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no HTTP response is received tornado seems to return 599, according to the docs. http://www.tornadoweb.org/en/stable/httpclient.html#tornado.httpclient.HTTPError
How should we go about this? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, Tornado’s doing the correct thing here. Sorry if I was unclear — I didn’t mean we shouldn’t be checking for a 599; I meant our server should return a 504 when it receives a 599. 504 specifically means “this server timed out when asking another server for the data you requested.” It’s a more specific code than 400 in this case.

That said, it would likely be good to make sure here that we really are dealing with a timeout and not something else, like a malformed address that was never attempted (that should probably still be a 400 or maybe a 422). For example, I think your test_invalid_url_a_format test is triggering a socket.gaierror here (subclass of OSError), which would be a 400/422 because it’s a bad address, not a timeout.

Copy link
Member

@Mr0grog Mr0grog May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way, I’m not too worried about sending exactly the right code here; just handling this case better than we previously were is probably an improvement. We can do a separate PR later for more specific and correct codes here.

So, I guess I’m saying you can skip changing this if you want. I’m much more concerned about the logical error above: #185 (comment)

Copy link
Contributor Author

@ftsalamp ftsalamp May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my latest commit I fixed the logical error. check_response_for_error now also handles tornado.httpclient.HTTPError. I also added the test you gave as an example (thanks for this) and it no longer fails.

@Mr0grog
Copy link
Member

Mr0grog commented May 10, 2018

Sorry I wasn’t able to get back to this yesterday, @ftsalamp!

Copy link
Contributor

@danielballan danielballan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome, @ftsalamp!

@@ -90,7 +102,7 @@ def get(self, differ):
actual_hash = hashlib.sha256(res.body).hexdigest()
if actual_hash != expected_hash:
self.send_error(
500, reason="Fetched content does not match hash.")
500, reason = 'Fetched content does not match hash.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elaborating on @Mr0grog's request for spaces around operators, Python's recommended style does not put spaces around = when it's used in a keyword argument like

def f(x=3):
    ...

f(x=2)

only when it's used in a assignment like

y = 6

You can use a tool like flake8 to check the code style and catch these things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @danielballan ! Having worked for 6 months mostly with Swift, the tool you suggested is a great way to remember Python's coding style.

@danielballan
Copy link
Contributor

The tests are failing on a problem that was removed in #186, so I will rebase this on master to incorporate that fix in this PR.

@danielballan danielballan force-pushed the diffing_server-exception-handling branch from be33a03 to 50ef7db Compare May 11, 2018 03:18
@danielballan
Copy link
Contributor

Rebased.

@Mr0grog Mr0grog merged commit e6f0c43 into edgi-govdata-archiving:master May 16, 2018
@Mr0grog
Copy link
Member

Mr0grog commented May 16, 2018

Sorry I was so slow to get back to this! Thanks, @ftsalamp.

@Mr0grog Mr0grog mentioned this pull request Aug 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants