Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Intermittent uncaught(?) errors causing sentry spam with ValueError: Counters can only be incremented by non-negative amounts #14322

Closed
DMRobertson opened this issue Oct 28, 2022 · 1 comment · Fixed by #14323
Labels
A-Logging Synapse's logs (structured or otherwise). Not metrics. A-Metrics metrics, measures, stuff we put in Prometheus S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Sentry Issue was discovered by looking at Sentry reports on Matrix.org

Comments

@DMRobertson
Copy link
Contributor

E.g. https://sentry.tools.element.io/organizations/element/issues/27/?project=2&query=is%3Aunresolved&statsPeriod=24h

2022-10-28 21:46:03,088 - twisted - 274 - ERROR - sentinel - Traceback (most recent call last):
2022-10-28 21:46:03,095 - twisted - 274 - ERROR - sentinel -   File "/usr/lib/python3.9/wsgiref/handlers.py", line 137, in run
2022-10-28 21:46:03,098 - twisted - 274 - ERROR - sentinel -     self.result = application(self.environ, self.start_response)
2022-10-28 21:46:03,100 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/env-poetry/lib/python3.9/site-packages/prometheus_client/exposition.py", line 127, in prometheus_app
2022-10-28 21:46:03,101 - twisted - 274 - ERROR - sentinel -     status, headers, output = _bake_output(registry, accept_header, accept_encoding_header, params, disable_compression)
2022-10-28 21:46:03,104 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/env-poetry/lib/python3.9/site-packages/prometheus_client/exposition.py", line 103, in _bake_output
2022-10-28 21:46:03,106 - twisted - 274 - ERROR - sentinel -     output = formatter(registry)
2022-10-28 21:46:03,109 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/env-poetry/lib/python3.9/site-packages/prometheus_client/openmetrics/exposition.py", line 21, in generate_latest
2022-10-28 21:46:03,112 - twisted - 274 - ERROR - sentinel -     for metric in registry.collect():
2022-10-28 21:46:03,115 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/src/synapse/metrics/__init__.py", line 70, in collect
2022-10-28 21:46:03,117 - twisted - 274 - ERROR - sentinel -     for metric in REGISTRY.collect():
2022-10-28 21:46:03,119 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/env-poetry/lib/python3.9/site-packages/prometheus_client/registry.py", line 97, in collect
2022-10-28 21:46:03,122 - twisted - 274 - ERROR - sentinel -     yield from collector.collect()
2022-10-28 21:46:03,124 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/src/synapse/metrics/background_process_metrics.py", line 145, in collect
2022-10-28 21:46:03,130 - twisted - 274 - ERROR - sentinel -     process.update_metrics()
2022-10-28 21:46:03,136 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/src/synapse/metrics/background_process_metrics.py", line 177, in update_metrics
2022-10-28 21:46:03,147 - twisted - 274 - ERROR - sentinel -     _background_process_ru_utime.labels(self.desc).inc(diff.ru_utime)
2022-10-28 21:46:03,151 - twisted - 274 - ERROR - sentinel -   File "/home/synapse/env-poetry/lib/python3.9/site-packages/prometheus_client/me	trics.py", line 285, in inc
2022-10-28 21:46:03,156 - twisted - 274 - ERROR - sentinel -     raise ValueError('Counters can only be incremented by non-negative amounts.')
2022-10-28 21:46:03,160 - twisted - 274 - ERROR - sentinel - ValueError: Counters can only be incremented by non-negative amounts.

Presumably diff.ru_utime is negative depending how self._reported_stats and new_stats interact(??)

def update_metrics(self) -> None:
"""Updates the metrics with values from this process."""
new_stats = self._context.get_resource_usage()
if self._reported_stats is None:
diff = new_stats
else:
diff = new_stats - self._reported_stats
self._reported_stats = new_stats
_background_process_ru_utime.labels(self.desc).inc(diff.ru_utime)
_background_process_ru_stime.labels(self.desc).inc(diff.ru_stime)
_background_process_db_txn_count.labels(self.desc).inc(diff.db_txn_count)
_background_process_db_txn_duration.labels(self.desc).inc(
diff.db_txn_duration_sec
)
_background_process_db_sched_duration.labels(self.desc).inc(
diff.db_sched_duration_sec
)

@DMRobertson DMRobertson added A-Logging Synapse's logs (structured or otherwise). Not metrics. S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. A-Metrics metrics, measures, stuff we put in Prometheus Z-Sentry Issue was discovered by looking at Sentry reports on Matrix.org labels Oct 28, 2022
@DMRobertson
Copy link
Contributor Author

For each counter being inced, either:

  • negative values make sense, in which case we should probably ignore them (clamp upwards to 0); or
  • negative values don't make sense, and we should raise an exception as today
    • but if so I wish they wouldn't spam sentry

We already do this clamping somewhere else:

def update_metrics(self) -> None:
"""Updates the in flight metrics with values from this request."""
if not self.start_context:
logger.error(
"Tried to update a RequestMetric from the sentinel context.\n%s",
"".join(traceback.format_stack()),
)
return
new_stats = self.start_context.get_resource_usage()
diff = new_stats - self._request_stats
self._request_stats = new_stats
# max() is used since rapid use of ru_stime/ru_utime can end up with the
# count going backwards due to NTP, time smearing, fine-grained
# correction, or floating points. Who knows, really?
in_flight_requests_ru_utime.labels(self.method, self.name).inc(
max(diff.ru_utime, 0)
)
in_flight_requests_ru_stime.labels(self.method, self.name).inc(
max(diff.ru_stime, 0)
)
in_flight_requests_db_txn_count.labels(self.method, self.name).inc(
diff.db_txn_count
)
in_flight_requests_db_txn_duration.labels(self.method, self.name).inc(
diff.db_txn_duration_sec
)
in_flight_requests_db_sched_duration.labels(self.method, self.name).inc(
diff.db_sched_duration_sec
)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Logging Synapse's logs (structured or otherwise). Not metrics. A-Metrics metrics, measures, stuff we put in Prometheus S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Sentry Issue was discovered by looking at Sentry reports on Matrix.org
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant