
feat: make Telethon flood-waits visible in the log#124

Merged
GeiserX merged 2 commits into GeiserX:main from tondeaf:feat/flood-wait-visibility
Apr 25, 2026

Conversation

@tondeaf
Contributor

@tondeaf tondeaf commented Apr 18, 2026

Summary

  • Set flood_sleep_threshold=0 on the shared TelegramClient so every FloodWaitError surfaces, then wrap client.iter_messages at the two bulk call-sites in src/telegram_backup.py with iter_messages_with_flood_retry, which logs the wait, sleeps, and resumes from the last yielded id.
  • Bound the wrapper: MAX_FLOOD_RETRIES=5 consecutive flood-waits without progress (counter resets on each yield) before the wrapper gives up and re-raises; MAX_FLOOD_WAIT_SECONDS=600 clamp on e.seconds to defend against pathological values.
  • New FLOOD_WAIT_LOG_THRESHOLD env var (default 10s) suppresses noisy short-wait log lines while keeping the operator-visible long ones.
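
The diff itself isn't reproduced on this page, so here is a minimal sketch of the wrapper's shape reconstructed from the description above. FloodWaitError is a local stand-in for telethon.errors.FloodWaitError (so the snippet runs without Telethon installed), and the body is illustrative rather than the merged implementation:

```python
import asyncio
import logging
import os


class FloodWaitError(Exception):
    """Stand-in for telethon.errors.FloodWaitError (carries .seconds)."""
    def __init__(self, seconds):
        super().__init__(f"flood wait of {seconds}s")
        self.seconds = seconds


logger = logging.getLogger(__name__)

MAX_FLOOD_RETRIES = 5
MAX_FLOOD_WAIT_SECONDS = 600
FLOOD_WAIT_LOG_THRESHOLD = int(os.environ.get("FLOOD_WAIT_LOG_THRESHOLD", "10"))


async def iter_messages_with_flood_retry(client, entity, *, min_id=0, **kwargs):
    """Wrap client.iter_messages, resuming after flood-waits from the last id."""
    resume_from = min_id
    retries = 0
    while True:
        try:
            async for msg in client.iter_messages(entity, min_id=resume_from, **kwargs):
                yield msg
                if getattr(msg, "id", None) is not None:
                    resume_from = max(resume_from, msg.id)
                retries = 0  # any progress resets the consecutive-wait counter
            return  # iteration finished without hitting a flood-wait
        except FloodWaitError as e:
            retries += 1
            if retries > MAX_FLOOD_RETRIES:
                logger.error("FloodWait: exceeded %d retries, giving up (last_msg_id=%s)",
                             MAX_FLOOD_RETRIES, resume_from)
                raise
            wait = min(e.seconds, MAX_FLOOD_WAIT_SECONDS) + 1
            if e.seconds >= FLOOD_WAIT_LOG_THRESHOLD:
                logger.warning("FloodWait: sleeping %ds before resuming (last_msg_id=%s, retry=%d/%d)",
                               wait, resume_from, retries, MAX_FLOOD_RETRIES)
            await asyncio.sleep(wait)
```

Because min_id is keyword-only, callers can still forward other iter_messages options (max_id, reverse, etc.) through **kwargs without colliding with the resume cursor.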

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that causes existing functionality to change)
  • Documentation update
  • Infrastructure/CI change

Database Changes

  • Schema changes (Alembic migration required)
  • Data migration script added in scripts/
  • No database changes

Data Consistency Checklist

  • All chat_id values use marked format (via _get_marked_id())
  • All datetime values pass through _strip_tz() before DB operations
  • INSERT and UPDATE operations handle the same fields identically

Testing

  • Tests pass locally (python -m pytest tests/ -v)
  • Linting passes (ruff check .)
  • Formatting passes (ruff format --check .)
  • Manually tested in development environment

Security Checklist

  • No secrets or credentials committed
  • User input properly validated/sanitized — e.seconds is integer from Telethon and is clamped before use
  • Authentication/authorization properly checked — change is internal to the backup pipeline; no auth surface modified

Deployment Notes

  • New optional env var: FLOOD_WAIT_LOG_THRESHOLD (default 10; set to 0 to log every wait). Near no-op for existing deployments — behavior matches the prior log-everything default, except sub-10-second waits are now silent.

The problem

Telethon's TelegramClient silently sleeps through any FloodWaitError whose wait is ≤ flood_sleep_threshold (default 60s). On a large iter_messages pass — e.g. initial backup of a 100k-message chat — Telegram often hands back much longer waits, and the operator can't tell "crashed" from "rate-limited":

  • No log lines
  • No DB writes
  • Realtime listener is still firing
  • systemctl status still shows the service as active

The fix

  1. Set flood_sleep_threshold=0 on the shared TelegramClient so every FloodWaitError bubbles up as an exception rather than a silent sleep.
  2. Add iter_messages_with_flood_retry(client, entity, **kwargs) — a thin async generator that wraps client.iter_messages, logs FloodWait: sleeping Ns before resuming (last_msg_id=M, retry=N/MAX), sleeps, then resumes iteration from the last-seen message id.
  3. Bound the wrapper after the original review: MAX_FLOOD_RETRIES=5 consecutive flood-waits without progress before re-raising (the counter resets every time iteration yields a message, so a long backfill that hits one wait per chunk doesn't trip the cap), and MAX_FLOOD_WAIT_SECONDS=600 clamp on the sleep so a buggy/malicious server can't hang the process for hours.
  4. Use that wrapper at the two bulk-iteration call-sites in src/telegram_backup.py (full chat backup and gap-fill).

FLOOD_WAIT_LOG_THRESHOLD (default 10 s) controls the minimum wait to log, so trivial 1-2 s floods don't spam the log.

Before/after is qualitative:

  • Before: multi-minute flood-wait looks like a hang.
  • After: FloodWait: sleeping 47s before resuming (last_msg_id=87342, retry=1/5) — operator immediately knows what's happening.

Conflict notes

Rebased on top of the recently-merged SKIP_TOPIC_IDS feature (#118). Both conflict sites in src/telegram_backup.py are trivial — our flood-retry wrapper wraps the iter_messages call, and the topic-skip check stays inside the loop body where upstream added it.

Test plan

  • pytest tests/test_flood_wait_visibility.py — 10/10 pass (3 original + 7 added in review)
  • pytest tests/test_config.py — passes
  • Manually triggered in production on a large chat; log now shows the flood-wait line with retry=N/MAX and resumes cleanly from the correct message id.

Telethon's ``TelegramClient`` silently sleeps through any FloodWaitError
whose wait is ≤ ``flood_sleep_threshold`` (default 60s). On a large
``iter_messages`` pass (e.g. a 100k-message chat) Telegram often hands
back much longer waits, and the default behaviour makes the run look
like a hang: no log lines, no DB writes, realtime listener still firing.
The operator can't tell "crashed" from "rate-limited".

This change makes flood-waits explicit:

1. Set ``flood_sleep_threshold=0`` on the shared ``TelegramClient`` so
   every FloodWaitError bubbles up as an exception rather than a silent
   sleep.
2. Introduce ``iter_messages_with_flood_retry(client, entity, **kwargs)``
   — a thin async generator that wraps ``client.iter_messages``,
   logs ``FloodWait: sleeping Ns before resuming at message_id=M``,
   sleeps, then resumes iteration from the last-seen message id.
3. Use that wrapper at the two bulk-iteration sites in
   ``src/telegram_backup.py`` (full chat backup and gap-fill).

Config: ``FLOOD_WAIT_LOG_THRESHOLD`` (default 10s) controls the minimum
wait to log — trivial 1-2s floods stay quiet.

Before: multi-minute flood-wait looks like a crash.
After: operator sees ``FloodWait: sleeping 47s before resuming`` and
knows exactly what's happening.

Regression tests in ``tests/test_flood_wait_visibility.py``:
- Wrapper resumes iteration from the last-seen message id after a flood.
- Short floods below the log threshold are suppressed.
- Long floods emit a structured log line with wait duration + cursor.
@tondeaf tondeaf requested a review from GeiserX as a code owner April 18, 2026 21:52
@coderabbitai

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

Walkthrough

Changes introduce flood-wait handling for Telegram message iteration by always setting flood_sleep_threshold: 0 in client kwargs to surface errors, adding a retry mechanism with resume-from-last-id semantics for message iteration, and expanding test coverage for flood-wait scenarios.

Changes

  • Configuration Updates — src/config.py: Modified build_telegram_client_kwargs() and Config.get_telegram_client_kwargs() to always return {"flood_sleep_threshold": 0}, with proxy configuration conditionally added when available.
  • Flood Retry Logic — src/telegram_backup.py: Added iter_messages_with_flood_retry() async generator to catch FloodWaitError, log duration, sleep, and resume iteration from the last message id. Integrated into _backup_dialog() and _fill_gap_range() message loops.
  • Test Coverage — tests/test_config.py, tests/test_flood_wait_visibility.py: Updated existing config tests to expect flood_sleep_threshold: 0 in client kwargs. Added new test file with three tests verifying flood-sleep-threshold propagation, client initialization, and flood-retry resumption behavior with mocked asyncio sleep.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #104: Directly related—these changes modify the same Telegram client kwargs helper functions introduced in that PR by adding the mandatory flood_sleep_threshold: 0 baseline and conditional proxy attachment.
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 58.82%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title directly reflects the main change (making Telethon flood-waits visible via logging) and accurately summarizes the primary objective of the PR.
  • Description check — ✅ Passed: the description is comprehensive and follows the template structure, with all major sections completed (summary, type of change, database changes, testing, security, deployment notes).


@GeiserX
Owner

GeiserX commented Apr 18, 2026

Hey @tondeaf, excellent feature! Making flood-waits visible is a huge UX improvement for operators — "is it crashed or rate-limited?" is a really common pain point with Telethon. The min_id extraction via keyword-only parameter (*, min_id=0, **kwargs) is a smart design choice that prevents double-passing.
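
The double-passing point is worth making concrete: because min_id is declared as an explicit keyword-only parameter, it never lands in **kwargs, so the wrapper can substitute its own resume cursor on retry without a "got multiple values" TypeError. A toy illustration (the function here is hypothetical, just demonstrating the signature split):

```python
def iter_signature(entity, *, min_id=0, **kwargs):
    """Toy stand-in showing how a keyword-only min_id is split off from kwargs."""
    return min_id, kwargs


# Caller passes min_id alongside other iteration options:
cursor, rest = iter_signature("chat", min_id=5, max_id=10)
assert cursor == 5
assert rest == {"max_id": 10}  # min_id never ends up inside **kwargs

# ...so a retry can re-pass its own cursor safely with the remaining kwargs:
cursor, rest = iter_signature("chat", min_id=87342, **rest)
assert (cursor, rest) == (87342, {"max_id": 10})
```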

Here's the thorough review:

Code Quality (1 CRITICAL, 2 MEDIUM, 2 LOW)

CRITICAL — while True loop has no retry cap
If Telegram repeatedly returns FloodWaitError (e.g., account temporarily restricted), the loop retries forever. The backup process hangs indefinitely on a single chat, blocking all subsequent chats. Please add a maximum retry count:

```python
MAX_FLOOD_RETRIES = 5
retries = 0
while True:
    try:
        async for msg in client.iter_messages(entity, min_id=resume_from, **kwargs):
            yield msg
            if getattr(msg, "id", None) is not None:
                resume_from = max(resume_from, msg.id)
            retries = 0  # reset on successful progress
        return
    except FloodWaitError as e:
        retries += 1
        if retries > MAX_FLOOD_RETRIES:
            logger.error("FloodWait: exceeded %d retries, giving up (last_msg_id=%s)",
                         MAX_FLOOD_RETRIES, resume_from)
            raise
        logger.warning(...)
        await asyncio.sleep(e.seconds + 1)
```

MEDIUM — PR body advertises FLOOD_WAIT_LOG_THRESHOLD but it's not implemented
The description says "New config knob FLOOD_WAIT_LOG_THRESHOLD (default 10s) controls the minimum wait to log" — but this doesn't exist in the diff. All flood-waits are logged unconditionally. Either implement it or remove the claim from the PR body.

MEDIUM — e.seconds is trusted without an upper bound
A malicious/buggy server could send an extremely large FloodWaitError.seconds value, hanging the process for hours/days. Cap the sleep:

```python
MAX_FLOOD_WAIT_SECONDS = 600  # 10 minutes
wait_seconds = min(e.seconds + 1, MAX_FLOOD_WAIT_SECONDS)
await asyncio.sleep(wait_seconds)
```

LOW — Two extra blank lines introduced
At telegram_backup.py:698 and :790, there are two consecutive blank lines after the topic-skip blocks. PEP 8 recommends at most one within a function body.

LOW — (yrru-mix3) in test docstring
Same as PR #123 — opaque internal ID, replace with PR/issue number.

Security (APPROVE with above changes)

  • No rate-limit amplification — the wrapper sleeps the full requested duration before retrying.
  • Log line only exposes e.seconds (int) and resume_from (message ID int) — no PII, no entity names.
  • flood_sleep_threshold=0 is the right call for a backup tool where visibility matters.

Test Adequacy (2 HIGH gaps, 2 MEDIUM)

HIGH — No test for FloodWait before any message is yielded
The seeded_iter always yields id=1 before the error. But what if the very first API call raises FloodWaitError? resume_from stays at the original min_id. The initial fake_iter was defined for this but is dead code (overwritten by seeded_iter before use). Please remove the dead fake_iter and add a separate test for the no-yield-before-flood case.

HIGH — No test for multiple consecutive FloodWaitErrors
Production commonly sees 3-4 consecutive floods before success. Add a test that raises FloodWaitError 2-3 times, then succeeds.

MEDIUM — No test for max_id preservation across retries
The gap-fill call-site passes max_id=gap_end via **kwargs. It's forwarded correctly, but an explicit test would catch regressions.

MEDIUM — _fake_db fixture is overly broad
The autouse=True fixture reloads src.connection and src.telegram_backup for every test, including test_config_kwargs_include_flood_sleep_threshold_zero which doesn't need it. Consider narrowing the scope.
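
The consecutive-flood and cursor-tracking gaps above can both be driven by a small factory that fails a configurable number of times before yielding. A sketch of the harness (names hypothetical; a minimal inline retry loop stands in for the real wrapper, with the sleep omitted):

```python
import asyncio
from types import SimpleNamespace


class FloodWaitError(Exception):
    """Stand-in for telethon.errors.FloodWaitError."""
    def __init__(self, seconds):
        self.seconds = seconds


def make_flaky_iter(fail_times, ids):
    """Build a fake iter_messages that raises FloodWaitError `fail_times`
    times before yielding `ids`, recording every min_id it receives."""
    state = {"failures": 0, "min_ids": []}

    async def fake_iter(entity, *, min_id=0, **kwargs):
        state["min_ids"].append(min_id)
        if state["failures"] < fail_times:
            state["failures"] += 1
            raise FloodWaitError(1)
        for i in ids:
            if i > min_id:
                yield SimpleNamespace(id=i)

    return fake_iter, state


async def consume(fake_iter, max_retries=5):
    """Minimal retry loop standing in for the real wrapper (sleep omitted)."""
    resume, retries, seen = 0, 0, []
    while True:
        try:
            async for msg in fake_iter(None, min_id=resume):
                seen.append(msg.id)
                resume = max(resume, msg.id)
                retries = 0
            return seen
        except FloodWaitError:
            retries += 1
            if retries > max_retries:
                raise
```

With fail_times=3 the harness exercises the consecutive-flood path; recording min_ids (and, analogously, any forwarded max_id) catches cursor regressions.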

CodeRabbit

CodeRabbit was rate-limited on this PR. I'll retrigger it once the cooldown expires.

Thanks for the contribution! 🙏

@GeiserX
Owner

GeiserX commented Apr 18, 2026

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Apr 18, 2026

✅ Actions performed

Full review triggered.

@GeiserX
Owner

GeiserX commented Apr 19, 2026

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/test_flood_wait_visibility.py (1)

26-44: autouse=True is overkill and module-reload is a footgun.

This fixture reloads src.connection and src.telegram_backup for every test in the file — including test_config_kwargs_include_flood_sleep_threshold_zero which only touches src.config and doesn't need the DB stub at all. Reloading modules mid-test-suite can also leak state into unrelated tests via sys.modules mutation if this file is imported alongside others.

Narrow the scope: drop autouse=True and apply the fixture explicitly to the two tests that actually exercise src.telegram_backup / src.connection. The teardown branch at lines 42-44 is also effectively a no-op (the if is always true since the fixture just inserted the module) — either restore the original sys.modules["src.db"] or remove the branch.
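
The capture-then-restore dance described here can be sketched without pytest as a plain context manager (a fixture version is analogous; the function name is illustrative):

```python
import sys
import types
from contextlib import contextmanager


@contextmanager
def fake_module(name):
    """Install a stub module under `name`, restoring sys.modules on exit.

    Capturing the original entry up front means teardown either puts the
    real module back or deletes the stub, instead of leaking it.
    """
    original = sys.modules.get(name)
    stub = types.ModuleType(name)
    sys.modules[name] = stub
    try:
        yield stub
    finally:
        if original is None:
            del sys.modules[name]
        else:
            sys.modules[name] = original
```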

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_flood_wait_visibility.py` around lines 26 - 44, The _fake_db
fixture is too broad and the module reload teardown is unsafe; remove
autouse=True from the _fake_db fixture and instead add it as an explicit
argument to only the tests that need DB stubbing (the tests that import/use
src.telegram_backup and src.connection), and replace the teardown no-op with
restoring the prior sys.modules["src.db"] value (capture original =
sys.modules.get("src.db") at fixture start and on teardown set
sys.modules["src.db"] = original or delete it if original was None) so the
fixture (_fake_db), monkeypatch.setitem, and the importlib.reload calls no
longer leak state into unrelated tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/telegram_backup.py`:
- Around line 697-699: There are two instances of double blank lines inside
function bodies (one immediately after the statement "msg_data = await
self._process_message(message, chat_id)" and another in the later block around
similar message-processing logic); remove the extra blank line in each place so
there is at most one blank line between logical blocks per PEP 8 — update the
function(s) containing "_process_message" calls to collapse the consecutive
blank lines to a single blank line and re-run ruff format/check to ensure the
change passes linting.
- Around line 41-62: iter_messages_with_flood_retry currently retries forever,
sleeps for unbounded server-supplied seconds, doesn't honour
FLOOD_WAIT_LOG_THRESHOLD, and lacks type hints; fix by adding type hints for
parameters and return (e.g., AsyncIterator[Message] or appropriate Telethon
types), introduce a bounded retry counter (e.g., retry_count and MAX_RETRIES)
that increments on each FloodWaitError and is reset to 0 whenever resume_from
advances (i.e., after yielding a message), clamp the sleep to
MAX_FLOOD_WAIT_SECONDS (use min(e.seconds, MAX_FLOOD_WAIT_SECONDS) + 1) and only
log the FloodWait if e.seconds >= FLOOD_WAIT_LOG_THRESHOLD, and raise or abort
after exceeding MAX_RETRIES to avoid infinite loops (refer to
iter_messages_with_flood_retry, resume_from, e.seconds,
FLOOD_WAIT_LOG_THRESHOLD, MAX_FLOOD_WAIT_SECONDS, and retry_count/MAX_RETRIES).

In `@tests/test_flood_wait_visibility.py`:
- Around line 96-142: Remove the dead fake_iter definition (it’s overwritten by
seeded_iter) and add focused tests for iter_messages_with_flood_retry: (1) a
test where the first call to the fake iter_messages raises FloodWaitError before
yielding any message (ensure min_id stays 0 and subsequent call yields
messages), (2) a test that simulates consecutive FloodWaitError occurrences
before success to exercise retry-count logic, and (3) a test that captures the
kwargs passed into the fake iter_messages to assert that max_id (and other
kwargs) are preserved across retries (use SimpleNamespace client.iter_messages,
patch asyncio.sleep, and assert behavior against iter_messages_with_flood_retry
and the _fill_gap_range path).
- Around line 1-11: The module docstring uses a cryptic internal ID
"(yrru-mix3)"; replace that token in the top-of-file docstring with the actual
PR or issue reference (e.g. "PR `#1234`" or "Issue `#5678`") so git-blame/searchers
can find the context—edit the string in tests/test_flood_wait_visibility.py (the
file-level docstring surrounding "Flood-wait visibility") to include the real
PR/issue number instead of "(yrru-mix3)".


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9fb4dc56-f287-4704-ad25-fd93d6825d37

📥 Commits

Reviewing files that changed from the base of the PR and between d76873a and 76f7b4e.

📒 Files selected for processing (4)
  • src/config.py
  • src/telegram_backup.py
  • tests/test_config.py
  • tests/test_flood_wait_visibility.py

Address PR GeiserX#124 review feedback.

Source changes (src/telegram_backup.py):

- Add MAX_FLOOD_RETRIES=5 cap. The previous `while True` would loop
  forever on a single chat if Telegram kept rate-limiting (e.g. account
  restriction), blocking every later chat in the backup. The counter
  resets on every successful yield, so a long backfill that hits one
  flood-wait per chunk doesn't trip the cap; only consecutive waits
  without progress count.

- Add MAX_FLOOD_WAIT_SECONDS=600 sleep cap. Defensive against a
  pathologically large e.seconds value (buggy/malicious server) that
  would otherwise pin the process for hours.

- Add FLOOD_WAIT_LOG_THRESHOLD env var (default 10s) so routine short
  waits don't spam the log; long waits still surface. Set to 0 to log
  every wait. Implements the knob the PR description advertised.

- Drop two extra blank lines at telegram_backup.py:698 and :790
  (PEP 8 inside-function-body limit).

Test changes (tests/test_flood_wait_visibility.py):

- New: flood-before-any-yield (resume_from stays at original min_id).
- New: three consecutive flood-waits before success.
- New: counter-reset-on-progress (covers the fix above directly).
- New: gives-up-after-max-retries with FloodWaitError raised.
- New: e.seconds clamped at MAX_FLOOD_WAIT_SECONDS.
- New: max_id kwarg preservation across retries (gap-fill regression).
- New: short waits suppressed by FLOOD_WAIT_LOG_THRESHOLD.
- Removed dead `fake_iter` from the original resume test.
- Replaced opaque (yrru-mix3) ticket reference with PR GeiserX#124.
- Narrowed the autouse `_fake_db` fixture into an opt-in `fake_db`
  fixture so test_config_kwargs_include_flood_sleep_threshold_zero
  no longer pays the import-reload cost it didn't need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tondeaf
Contributor Author

tondeaf commented Apr 25, 2026

Addressed your feedback in the latest commit. Summary:

CRITICAL — unbounded retry loop

  • Added MAX_FLOOD_RETRIES = 5 with a counter that resets on every successful yield. A long backfill that hits one flood-wait per chunk doesn't trip the cap; only consecutive flood-waits without progress count, and after the cap the wrapper logs and re-raises so the chat fails closed instead of hanging.

MEDIUM — e.seconds upper bound

  • Added MAX_FLOOD_WAIT_SECONDS = 600; e.seconds is clamped before we sleep.

MEDIUM — FLOOD_WAIT_LOG_THRESHOLD advertised but not implemented

  • Implemented. Reads the env var (default 10s); waits below the threshold are silent, the rest log as before with retry=N/MAX appended for context. Set to 0 to log every wait.

LOW — extra blank lines at :698 and :790

  • Both fixed.

LOW — opaque (yrru-mix3) reference

  • Replaced the docstring's (yrru-mix3) token with a reference to PR GeiserX#124.

HIGH (tests) — flood before any yield

  • Added test_iter_with_flood_retry_handles_flood_before_any_yield. Removed the dead fake_iter from the original resume test.

HIGH (tests) — multiple consecutive floods

  • Added test_iter_with_flood_retry_survives_consecutive_floods (3 in a row before success) and test_iter_with_flood_retry_resets_counter_on_progress (yield-then-flood pattern past the cap, proving the reset).

MEDIUM (tests) — max_id preservation across retries

  • Added test_iter_with_flood_retry_preserves_max_id_kwarg.

MEDIUM (tests) — overly broad _fake_db autouse fixture

  • Narrowed from autouse=True to an opt-in fake_db fixture. test_config_kwargs_include_flood_sleep_threshold_zero no longer pays the import-reload cost.

Bonus tests

  • test_iter_with_flood_retry_gives_up_after_max_retries covers the new cap behavior end-to-end.
  • test_iter_with_flood_retry_caps_sleep_seconds covers MAX_FLOOD_WAIT_SECONDS clamping.
  • test_iter_with_flood_retry_suppresses_short_wait_logs covers FLOOD_WAIT_LOG_THRESHOLD.

Ready for another look.

@GeiserX GeiserX merged commit 29976dd into GeiserX:main Apr 25, 2026
2 checks passed
@GeiserX
Owner

GeiserX commented Apr 25, 2026

Thanks for the contribution, @tondeaf! Merged your PR and shipped a follow-up in #130 (v7.6.2) that hardens the retry logic further:

  • call_with_flood_retry() for non-iterator API calls (get_me, get_dialogs)
  • Negative e.seconds clamping
  • reverse=True validation to prevent silent data corruption
  • Defensive FLOOD_WAIT_LOG_THRESHOLD env parsing
  • 6 additional edge-case tests (16 total)

Released as v7.6.2.
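
call_with_flood_retry() from #130 isn't shown in this thread; one plausible shape for the non-iterator variant, with FloodWaitError stubbed so the sketch runs standalone (the body is a guess from the bullet list above, not the shipped code):

```python
import asyncio


class FloodWaitError(Exception):
    """Stand-in for telethon.errors.FloodWaitError."""
    def __init__(self, seconds):
        self.seconds = seconds


MAX_FLOOD_RETRIES = 5
MAX_FLOOD_WAIT_SECONDS = 600


async def call_with_flood_retry(coro_factory):
    """Await coro_factory(), retrying on FloodWaitError with a clamped sleep.

    Takes a factory rather than a coroutine because a coroutine object can
    only be awaited once; each retry needs a fresh call.
    """
    for attempt in range(1, MAX_FLOOD_RETRIES + 1):
        try:
            return await coro_factory()
        except FloodWaitError as e:
            if attempt == MAX_FLOOD_RETRIES:
                raise
            # clamp negative and oversized waits, per the hardening notes above
            wait = min(max(e.seconds, 0), MAX_FLOOD_WAIT_SECONDS) + 1
            await asyncio.sleep(wait)
```

Usage would wrap one-shot calls like `call_with_flood_retry(lambda: client.get_me())`.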
