-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve exception handling for metrics store errors #1703
Conversation
I'd argue that this is a bug in the Elasticsearch Python client: BulkIndexError should be an ApiError, but it's not. If it was an |
I see it both ways, though there is a difference. My interpretation is an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending any concerns from @pquentin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is an explicit decision from the Python client that is mentioned in the 8.x migration guide: https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/migration.html#migration-error-types:
elasticsearch.helpers.errors.BulkIndexError
andelasticsearch.helpers.errors.ScanError
now useException
as a base class instead ofElasticsearchException
.
So that sounds good, I just left a few nits.
esrally/metrics.py
Outdated
except elasticsearch.helpers.BulkIndexError as e: | ||
for err in e.errors: | ||
err_type = err.get("index", {}).get("error", {}).get("type", None) | ||
if err.get("index", {}).get("status", None) not in (502, 503, 504, 429): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move that list of error codes to a constant now that it's use twice?
I know you did not come up with the list, but I'm not sure I find the list very convincing. Even 429 seems OK to retry after having slept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, and I'm not sure I understand your last two sentences:
I know you did not come up with the list, but I'm not sure I find the list very convincing. Even 429 seems OK to retry after having slept.
We do retry on 429s, but I can see that it can be easily misread. A side effect of using a constant is that it's now much more 'readable' (IMHO).
Addressed in 1d88020
tests/metrics_test.py
Outdated
}, | ||
] | ||
|
||
retriable_errors = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also understand this is not your code, but can we please stick to either retriable or retryable? (Both are correct: https://english.stackexchange.com/questions/305273/retriable-or-retryable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's most certainly a mistake to ask an Australian about proper use of the English language.
Addressed in 1d88020
esrally/metrics.py
Outdated
@@ -114,6 +114,27 @@ def guarded(self, target, *args, **kwargs): | |||
|
|||
try: | |||
return target(*args, **kwargs) | |||
except elasticsearch.helpers.BulkIndexError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please move this block just before except ApiError
to highlight that they are similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 1d88020
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating and for your patience! LGTM
Prior to this commit, Rally could hang indefinitely whenever its respective metrics store failed to
flush()
successfully.This is because the respective remote metric store's
close()
method only attempted to set itself as closed after a successful call to theflush()
method. If we failed toflush()
successfully, then we never set the metric store'sopened
attribute tofalse
, eventually leading to a deadlock where Rally couldn't ever shut down successfully:rally/esrally/driver/driver.py
Lines 918 to 921 in 431af5f
This commit ensures we properly close the respective metrics store by setting the
opened
attribute tofalse
prior to attempting a finalflush()
, as well as ensures we capture and log anyBulkIndexError
s raised by the bulk index helper method.Fixes #1697
You can reproduce this locally with a remote metrics store and by running this specific
http_logs
revision to generate a mappings exception:Fails due to metrics store mappings exception:
Output