Improve exception handling for metrics store errors #1703

b-deam · 2023-04-13T04:06:13Z

Prior to this commit, Rally could hang indefinitely whenever its respective metrics store failed to flush() successfully.

This is because the respective remote metric store's close() method only attempted to set itself as closed after a successful call to the flush() method. If we failed to flush() successfully, then we never set the metric store's opened attribute to false, eventually leading to a deadlock where Rally couldn't ever shut down successfully:

rally/esrally/driver/driver.py

Lines 918 to 921 in 431af5f

    
           def close(self): 
        
               self.progress_reporter.finish() 
        
               if self.metrics_store and self.metrics_store.opened: 
        
                   self.metrics_store.close()

This commit ensures we properly close the respective metrics store by setting the opened attribute to false prior to attempting a final flush(), as well as ensures we capture and log any BulkIndexErrors raised by the bulk index helper method.

Fixes #1697

You can reproduce this locally with a remote metrics store and by running this specific http_logs revision to generate a mappings exception:

Fails due to metrics store mappings exception:

$ esrally race --telemetry="gc" --target-host="localhost:9200" --track-revision="cffc8641055233485b9d1c9a773b1303c7c5313f" --track-repository="default" --track="http_logs" --challenge="append-index-only-with-ingest-pipeline" --on-error="abort" --client-options="timeout:240,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'" --track-params='{"bulk_indexing_clients": 1,"source_enabled": false, "ingest_pipeline": "grok" }' --pipeline="benchmark-only"  --kill-running-processes

Output

$ esrally race --telemetry="gc" --target-host="localhost:9200" --track-revision="cffc8641055233485b9d1c9a773b1303c7c5313f" --track-repository="default" --track="http_logs" --challenge="append-index-only-with-ingest-pipeline" --on-error="abort" --client-options="timeout:240,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'" --track-params='{"bulk_indexing_clients": 1,"source_enabled": false, "ingest_pipeline": "grok" }' --pipeline="benchmark-only"  --kill-running-processes

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [1415f777-8547-4193-bb0e-c8d02f05483f]
[INFO] Racing on track [http_logs], challenge [append-index-only-with-ingest-pipeline] and car ['external'] with version [8.6.2].

Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running check-cluster-health                                                   [100% done]
Running create-http-log-grok-pipeline                                          [100% done]
Running index-append-with-ingest-grok-pipeline                                 [  0% done]
[ERROR] Cannot race. Traceback (most recent call last):
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 116, in guarded
    return target(*args, **kwargs)
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 524, in bulk
    for ok, item in streaming_bulk(
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
    for data, (ok, info) in zip(
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 355, in _process_bulk_chunk
    yield from gen
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
    raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)
elasticsearch.helpers.BulkIndexError: 45 document(s) failed to index.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/b/perf/github.com/b-deam/rally/esrally/actor.py", line 92, in guard
    return f(self, msg, sender)
  File "/home/b/perf/github.com/b-deam/rally/esrally/driver/driver.py", line 303, in receiveMsg_WakeupMessage
    self.coordinator.post_process_samples()
  File "/home/b/perf/github.com/b-deam/rally/esrally/driver/driver.py", line 955, in post_process_samples
    self.sample_post_processor(raw_samples)
  File "/home/b/perf/github.com/b-deam/rally/esrally/driver/driver.py", line 1068, in __call__
    self.metrics_store.flush(refresh=False)
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 933, in flush
    self._client.bulk_index(index=self._index, items=self._docs)
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 92, in bulk_index
    self.guarded(elasticsearch.helpers.bulk, self._client, items, index=index, chunk_size=5000)
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 123, in guarded
    raise exceptions.RallyError(
esrally.exceptions.RallyError: Unretryable error encountered when sending metrics to remote metrics store: [mapper_parsing_exception]


Getting further help:
*********************
* Check the log files in /home/b/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/latest/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /home/b/.rally/logs.

--------------------------------
[INFO] FAILURE (took 63 seconds)
--------------------------------

pquentin · 2023-05-09T13:52:20Z

I'd argue that this is a bug in the Elasticsearch Python client: BulkIndexError should be an ApiError, but it's not.

If it was an ApiError, then this would solve our problem, right? In the meantime, should we catch both ApiError and BulkIndexError in the same except statement?

inqueue · 2023-05-09T17:31:03Z

I'd argue that this is a bug in the Elasticsearch Python client: BulkIndexError should be an ApiError, but it's not.

I see it both ways, though there is a difference. My interpretation is an ApiError is any non-2xx HTTP request status code whereas a BulkIndexError is an error strictly related to failed actions of the bulk request. The bulk HTTP request could end in a 2xx response even if it contains one or more errors of BulkIndexError. I think the ES Python client is doing the right thing.

inqueue

LGTM pending any concerns from @pquentin.

pquentin

Right, this is an explicit decision from the Python client that is mentioned in the 8.x migration guide: https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/migration.html#migration-error-types:

elasticsearch.helpers.errors.BulkIndexError and elasticsearch.helpers.errors.ScanError now use Exception as a base class instead of ElasticsearchException.

So that sounds good, I just left a few nits.

pquentin · 2023-05-09T13:42:29Z

esrally/metrics.py

+            except elasticsearch.helpers.BulkIndexError as e:
+                for err in e.errors:
+                    err_type = err.get("index", {}).get("error", {}).get("type", None)
+                    if err.get("index", {}).get("status", None) not in (502, 503, 504, 429):


Should we move that list of error codes to a constant now that it's use twice?

I know you did not come up with the list, but I'm not sure I find the list very convincing. Even 429 seems OK to retry after having slept.

Sure, and I'm not sure I understand your last two sentences:

I know you did not come up with the list, but I'm not sure I find the list very convincing. Even 429 seems OK to retry after having slept.

We do retry on 429s, but I can see that it can be easily misread. A side effect of using a constant is that it's now much more 'readable' (IMHO).

Addressed in 1d88020

pquentin · 2023-05-09T13:48:13Z

tests/metrics_test.py

+            },
+        ]
+
+        retriable_errors = [


I also understand this is not your code, but can we please stick to either retriable or retryable? (Both are correct: https://english.stackexchange.com/questions/305273/retriable-or-retryable)

It's most certainly a mistake to ask an Australian about proper use of the English language.

Addressed in 1d88020

pquentin · 2023-05-10T11:29:59Z

esrally/metrics.py

@@ -114,6 +114,27 @@ def guarded(self, target, *args, **kwargs):

            try:
                return target(*args, **kwargs)
+            except elasticsearch.helpers.BulkIndexError as e:


Can you please move this block just before except ApiError to highlight that they are similar?

Addressed in 1d88020

pquentin

Thanks for iterating and for your patience! LGTM

Improve exception handling for metrics store errors

431af5f

b-deam added bug Something's wrong :Metrics How metrics are stored, calculated or aggregated labels Apr 13, 2023

b-deam added this to the 2.8.0 milestone Apr 13, 2023

b-deam requested review from pquentin and inqueue April 13, 2023 04:06

b-deam self-assigned this Apr 13, 2023

Actually log full error

12dc523

inqueue approved these changes May 9, 2023

View reviewed changes

Formatting

a1c4b2f

pquentin requested changes May 10, 2023

View reviewed changes

PR comments

1d88020

b-deam requested a review from pquentin May 11, 2023 00:13

pquentin approved these changes May 11, 2023

View reviewed changes

b-deam merged commit 9d771b7 into elastic:master May 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve exception handling for metrics store errors #1703

Improve exception handling for metrics store errors #1703

b-deam commented Apr 13, 2023 •

edited

Loading

pquentin commented May 9, 2023

inqueue commented May 9, 2023

inqueue left a comment

pquentin left a comment

pquentin May 9, 2023

b-deam May 11, 2023

pquentin May 9, 2023

b-deam May 11, 2023

pquentin May 10, 2023

b-deam May 11, 2023

pquentin left a comment

	def close(self):
	self.progress_reporter.finish()
	if self.metrics_store and self.metrics_store.opened:
	self.metrics_store.close()

Improve exception handling for metrics store errors #1703

Improve exception handling for metrics store errors #1703

Conversation

b-deam commented Apr 13, 2023 • edited Loading

pquentin commented May 9, 2023

inqueue commented May 9, 2023

inqueue left a comment

Choose a reason for hiding this comment

pquentin left a comment

Choose a reason for hiding this comment

pquentin May 9, 2023

Choose a reason for hiding this comment

b-deam May 11, 2023

Choose a reason for hiding this comment

pquentin May 9, 2023

Choose a reason for hiding this comment

b-deam May 11, 2023

Choose a reason for hiding this comment

pquentin May 10, 2023

Choose a reason for hiding this comment

b-deam May 11, 2023

Choose a reason for hiding this comment

pquentin left a comment

Choose a reason for hiding this comment

b-deam commented Apr 13, 2023 •

edited

Loading