Skip to content

Conversation

haiqi96
Copy link
Contributor

@haiqi96 haiqi96 commented Aug 8, 2025

Description

As the title describes, this PR adds the user doc for retention control.
Note that one feature described in this PR depends on #1208

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Locally built the docs and confirmed it looks as expected.

Summary by CodeRabbit

  • Documentation
    • Added a “Retention control” guide covering retention concepts, TTLs, sweep intervals, expiry rules, timezone handling, examples, race‑condition handling, observability, and recovery; linked it into the user‑guide and toctree.
    • Updated multi‑node docs to include a garbage‑collector controller and adjusted start order; replaced a Presto card with a Retention card and added a “Using Presto” card (grid now 4 items).
  • Chores
    • Enabled dollarmath support in the documentation build.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (4)
docs/src/user-guide/guides-retention.md (4)

24-25: Use MyST/Sphinx cross-refs instead of raw fragment links to the figure

Switch [Figure 1](#figure-1) to a {ref} link for robustness and to satisfy MD051. You already have the label (figure-1)=, so this is a drop-in change.

-criteria shown in [Figure 1](#figure-1):
+criteria shown in {ref}`Figure 1 <figure-1>`:
-For archives, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1)) is the
+For archives, $data\_timestamp$ (in the expiry criteria equation from {ref}`Figure 1 <figure-1>`) is the
-For search results, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1))
+For search results, $data\_timestamp$ (in the expiry criteria equation from {ref}`Figure 1 <figure-1>`)

Also applies to: 85-86, 141-142


117-122: State units for signed_utc_offset and add a quick example after the formula

This prevents misconfiguration and aligns with prior reviewer guidance. Per the team’s agreed logic, the signed subtraction is correct: adjusted = retention − signed_offset.

 $$adjusted\_retention\_period = retention\_period - signed\_utc\_offset$$
+
+where `signed_utc_offset` is the time zone offset in minutes (e.g., UTC+8 = +480, UTC−4 = −240).
+For example, if `retention_period = 60` and logs are in UTC−4, then
+`adjusted_retention_period = 60 - (−240) = 300` minutes.

186-188: Trim trailing whitespace (MD009)

There’s trailing space at the end of Line 186. Remove it to keep diffs clean and pass markdownlint.

-* If any query job is running, CLP conservatively calculates a **safe expiry timestamp** based on 
+* If any query job is running, CLP conservatively calculates a **safe expiry timestamp** based on

20-20: Clarify that current_time is UTC

Make UTC explicit where you define current_time to reduce confusion later in the document.

-| $current\_time$     | The time at which the garbage collector is performing a check.                                                                                                             |
+| $current\_time$     | The time (UTC) at which the garbage collector is performing a check.                                                                                                       |
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e993809 and d2800cf.

📒 Files selected for processing (1)
  • docs/src/user-guide/guides-retention.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T21:48:46.674Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:57-66
Timestamp: 2025-08-15T21:48:46.674Z
Learning: When dealing with timezone offset adjustments for retention periods in CLP, the correct formula is `adjusted_retention_period = retention_period - UTC_offset` (not using absolute value). For negative UTC offsets, this increases the retention period; for positive UTC offsets, this decreases it, properly compensating for how local timestamps are interpreted by the UTC-based garbage collector.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
PR: y-scope/clp#662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-guide/guides-retention.md
🪛 LanguageTool
docs/src/user-guide/guides-retention.md

[typographical] ~110-~110: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...The equation then becomes $08:00 &lt; 01:01 - 01:00$, which evaluates to false. Thus, ...

(DASH_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-retention.md

24-24: Link fragments should be valid

(MD051, link-fragments)


85-85: Link fragments should be valid

(MD051, link-fragments)


141-141: Link fragments should be valid

(MD051, link-fragments)


186-186: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
docs/src/user-guide/guides-retention.md (2)

117-121: Units and example for the signed UTC offset (reduces misconfiguration risk)

Spell out units and add a short numeric example below the formula.

 $$adjusted\_retention\_period = retention\_period - signed\_utc\_offset$$
+where `signed_utc_offset` is the time zone offset in minutes (e.g., UTC+8 = +480, UTC−4 = −240).
+For example, if `retention_period = 60` and the logs are in UTC−4, then
+`adjusted_retention_period = 60 - (−240) = 300` minutes.

3-4: Clarify scope: retention applies to both local filesystem and object storage

Make it explicit up front where deletions occur.

-CLP can automatically delete *archives* and/or *search results* once they're older than a configured
-retention period. This guide explains:
+CLP can automatically delete *archives* and/or *search results* from both the local filesystem and
+object storage once they're older than a configured retention period. This guide explains:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9f85fd3 and 9654d87.

📒 Files selected for processing (1)
  • docs/src/user-guide/guides-retention.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T21:48:46.674Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:57-66
Timestamp: 2025-08-15T21:48:46.674Z
Learning: When dealing with timezone offset adjustments for retention periods in CLP, the correct formula is `adjusted_retention_period = retention_period - UTC_offset` (not using absolute value). For negative UTC offsets, this increases the retention period; for positive UTC offsets, this decreases it, properly compensating for how local timestamps are interpreted by the UTC-based garbage collector.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
PR: y-scope/clp#662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-guide/guides-retention.md
🪛 LanguageTool
docs/src/user-guide/guides-retention.md

[typographical] ~110-~110: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...The equation then becomes $08:00 &lt; 01:01 - 01:00$, which evaluates to false. Thus, ...

(DASH_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-retention.md

24-24: Link fragments should be valid

(MD051, link-fragments)


85-85: Link fragments should be valid

(MD051, link-fragments)


141-141: Link fragments should be valid

(MD051, link-fragments)

🔇 Additional comments (2)
docs/src/user-guide/guides-retention.md (2)

117-121: Retention offset formula is correct (signed subtraction is required)

Using a signed offset matches the intended behaviour: negative offsets increase TTL; positive offsets decrease it. Good catch.


122-137: Add default for search result retention period

Include a one-liner under the YAML example to document the out-of-the-box default:

File: docs/src/user-guide/guides-retention.md
Location: immediately after the closing ``` of the results_cache YAML block

Proposed diff:

```yaml
  retention_period: 1440
  • By default, results_cache.retention_period is 60 minutes.

</details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly lgtm.

The only big comment I have is about the data expiry criteria in Figure 1. I really think that it should be arranged as current_time - data_timestamp > retention_period as it was before (though I do like the new variable names and the formatting). My rationale is that the current_time and data_timestamp variables are both times of day, while retention_period is a duration. I think that it is much more intuitive to subtract two times of day from one another to get a duration, and then compare those two durations with the > operator. If we leave it as it currently is, then we're subtracting a duration from a time of day, and then comparing two times of day with the > operator, which does not feel intuitive to me (to the point where I had to write it out and check whether it was correct).

That being said, I won't push back any farther than this comment; @kirkrodrigues I leave it to you to decide.

Also left three specific comments.

(figure-1)=
:::{card}

$$data\_timestamp < current\_time - retention\_period$$
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See high-level comment.

Co-authored-by: Quinn Taylor Mitchell <q.mitchell@mail.utoronto.ca>
Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (7)
docs/src/user-guide/guides-retention.md (7)

94-96: Add an explicit MyST label for the non-UTC section

Enables {ref} cross-references to resolve.

+ (handling-log-events-with-non-utc-timestamps)=
 #### Handling log events with non-UTC timestamps

76-81: Prefer {ref} over hash fragments for the non-UTC section link

Improves stability and consistency with other cross-refs.

-If your log events use timestamps that *aren't* in the UTC time zone, you will need to adjust the
-configured retention period to ensure expired archives are deleted at the correct time. See
-[Handling log events with non-UTC timestamps](#handling-log-events-with-non-utc-timestamps) for
-details.
+If your log events use timestamps that *aren't* in the UTC time zone, you will need to adjust the
+configured retention period to ensure expired archives are deleted at the correct time. See
+{ref}`Handling log events with non-UTC timestamps <handling-log-events-with-non-utc-timestamps>` for
+details.

141-142: Cross-reference Figure 1 with {ref} (search results section)

Aligns with earlier Figure link changes.

-For search results, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1))
+For search results, $data\_timestamp$ (in the expiry criteria equation from {ref}`Figure 1 <figure-1>`)

85-86: Cross-reference Figure 1 with {ref}

Keeps link style consistent and resolves the markdownlint MD051 hint.

-For archives, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1)) is the
+For archives, $data\_timestamp$ (in the expiry criteria equation from {ref}`Figure 1 <figure-1>`) is the

23-25: Use MyST cross-ref for Figure 1 (fixes MD051 and is more robust)

Switch the fragment link to a {ref}.

-When the garbage collector wakes up, it will scan for and delete any data that fits the expiry
-criteria shown in [Figure 1](#figure-1):
+When the garbage collector wakes up, it will scan for and delete any data that fits the expiry
+criteria shown in {ref}`Figure 1 <figure-1>`:

117-121: Clarify units for the UTC offset and add a quick numeric example

Keeps the correct signed-offset formula and makes it harder to misconfigure.

-To avoid this issue, you can adjust the retention period to account for the offset of the log
-events' time zone from UTC:
+To avoid this issue, you can adjust the retention period to account for the offset of the log
+events' time zone from UTC (use minutes for both values):
 
 $$adjusted\_retention\_period = retention\_period - signed\_utc\_offset$$
+
+where `signed_utc_offset` is the time zone offset in minutes (e.g., UTC+8 = +480, UTC−4 = −240).
+For example, if `retention_period = 60` and the logs are in UTC−4, then
+`adjusted_retention_period = 60 - (−240) = 300` minutes.

16-22: Add a brief global UTC note after the definitions

Gives readers an early heads-up and a stable link to the detailed guidance.

 | $data\_timestamp$   | The end of the time range for the data being evaluated for expiration (e.g., for an archive, this is the timestamp of the most recent log event contained in the archive). |
 
+:::{note}
+CLP assumes timestamps are in UTC. If your logs use local (non-UTC) timestamps, see
+{ref}`Handling log events with non-UTC timestamps <handling-log-events-with-non-utc-timestamps>` to
+adjust your retention period accordingly.
+:::
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9654d87 and edc2b35.

📒 Files selected for processing (1)
  • docs/src/user-guide/guides-retention.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T21:48:46.674Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:57-66
Timestamp: 2025-08-15T21:48:46.674Z
Learning: When dealing with timezone offset adjustments for retention periods in CLP, the correct formula is `adjusted_retention_period = retention_period - UTC_offset` (not using absolute value). For negative UTC offsets, this increases the retention period; for positive UTC offsets, this decreases it, properly compensating for how local timestamps are interpreted by the UTC-based garbage collector.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
PR: y-scope/clp#662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-guide/guides-retention.md
🪛 LanguageTool
docs/src/user-guide/guides-retention.md

[typographical] ~110-~110: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...The equation then becomes $08:00 &lt; 01:01 - 01:00$, which evaluates to false. Thus, ...

(DASH_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-retention.md

24-24: Link fragments should be valid

(MD051, link-fragments)


85-85: Link fragments should be valid

(MD051, link-fragments)


141-141: Link fragments should be valid

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (1)
docs/src/user-guide/guides-retention.md (1)

130-137: Document the default for results_cache.retention_period (symmetry with archives)

You state the default for archive_output.retention_period (null/disabled) but not for results_cache.retention_period. Please confirm the default and add it here for consistency.

Would you confirm the default value for results_cache.retention_period in the current release? If it's not null, add a short note after the YAML block (e.g., “By default, results are retained for X minutes.”). I can follow up with a patch once you confirm the value.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
docs/src/user-guide/guides-retention.md (2)

76-81: Add a brief global UTC note earlier for quicker orientation.

You have a solid warning here; consider also adding a short note right after the definitions so readers see the UTC assumption before the first figure/example.

Proposed insertion just after the definitions table (after Line 22):

+:::{note}
+CLP assumes timestamps are in UTC. If your logs use local (non‑UTC) timestamps, see
+{ref}`Handling log events with non-UTC timestamps <handling-log-events-with-non-utc-timestamps>` to
+adjust the configured retention period.
+:::

120-124: Clarify units and add a one-line numeric example to the signed-offset formula.

The formula is correct; specifying units and a quick example reduces misconfiguration.

-To avoid this issue, you can adjust the retention period to account for the offset of the log
-events' time zone from UTC:
+To avoid this issue, you can adjust the retention period to account for the offset of the log
+events' time zone from UTC (use minutes for both values):
 
 $$adjusted\_retention\_period = retention\_period - signed\_utc\_offset$$
+
+For example, if `retention_period = 60` and logs are in UTC−4 (`signed_utc_offset = −240`),
+then `adjusted_retention_period = 60 - (−240) = 300` minutes.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between edc2b35 and 15df283.

📒 Files selected for processing (1)
  • docs/src/user-guide/guides-retention.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T21:48:46.674Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:57-66
Timestamp: 2025-08-15T21:48:46.674Z
Learning: When dealing with timezone offset adjustments for retention periods in CLP, the correct formula is `adjusted_retention_period = retention_period - UTC_offset` (not using absolute value). For negative UTC offsets, this increases the retention period; for positive UTC offsets, this decreases it, properly compensating for how local timestamps are interpreted by the UTC-based garbage collector.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
PR: y-scope/clp#662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-guide/guides-retention.md
🪛 LanguageTool
docs/src/user-guide/guides-retention.md

[typographical] ~112-~112: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ... then becomes... $$is_expired = (01:01 - 08:00 &gt; 01:00)$$ ... which evaluates to...

(DASH_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-retention.md

24-24: Link fragments should be valid

(MD051, link-fragments)


85-85: Link fragments should be valid

(MD051, link-fragments)


157-157: Link fragments should be valid

(MD051, link-fragments)

🔇 Additional comments (2)
docs/src/user-guide/guides-retention.md (2)

206-207: Verify runtime behaviour: scheduler avoidance of expired archives.

“CLP will not search an archive once it is considered expired” is a strong behavioural guarantee. Please confirm the scheduler/engine enforces this (e.g., by filtering on a safe expiry timestamp in queries) so users don’t rely on an assumption that might change.

If confirmed, consider linking to the relevant module or adding a brief note about how the safe cut-off is computed and applied.


142-143: The verification script is running—I'll review the default value definitions for the config objects and confirm the actual defaults once it completes.

Comment on lines +108 to +116
When the garbage collector runs, it will evaluate the archive's expiry criteria, substituting
$08:00$ for $data\_timestamp$, and $01:01$ for $current\_time$, since $09:01$ AWST = $01:01$ UTC.
The equation then becomes...

$$is\_expired = (01:01 - 08:00 > 01:00)$$

... which evaluates to false. Thus, the garbage collector won't delete the archive; in fact, it
won't delete it until $09:01$ UTC, which is 8 hours later than it should've been deleted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Optional: add a short parenthetical noting omitted dates/seconds.

The time-of-day arithmetic is correct, but readers may wonder about dates. A brief parenthetical keeps the example unambiguous.

-... which evaluates to false. Thus, the garbage collector won't delete the archive; in fact, it
-won't delete it until $09:01$ UTC, which is 8 hours later than it should've been deleted.
+... which evaluates to false. Thus, the garbage collector won't delete the archive; in fact, it
+won't delete it until $09:01$ UTC (dates/seconds omitted for brevity), which is 8 hours later than
+it should've been deleted.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
When the garbage collector runs, it will evaluate the archive's expiry criteria, substituting
$08:00$ for $data\_timestamp$, and $01:01$ for $current\_time$, since $09:01$ AWST = $01:01$ UTC.
The equation then becomes...
$$is\_expired = (01:01 - 08:00 > 01:00)$$
... which evaluates to false. Thus, the garbage collector won't delete the archive; in fact, it
won't delete it until $09:01$ UTC, which is 8 hours later than it should've been deleted.
... which evaluates to false. Thus, the garbage collector won't delete the archive; in fact, it
won't delete it until $09:01$ UTC (dates/seconds omitted for brevity), which is 8 hours later than
it should've been deleted.
🧰 Tools
🪛 LanguageTool

[typographical] ~112-~112: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ... then becomes... $$is_expired = (01:01 - 08:00 &gt; 01:00)$$ ... which evaluates to...

(DASH_RULE)

🤖 Prompt for AI Agents
docs/src/user-guide/guides-retention.md around lines 108 to 116: the example's
time-of-day arithmetic omits dates and seconds which can confuse readers; add a
short parenthetical after the sentence that introduces the substituted times
clarifying that dates and seconds are omitted for brevity (e.g., “(dates and
seconds omitted for clarity)”), so the example remains unambiguous without
changing the math or times used.

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retained for 60 minutes (1 hour).

:::{note}
When a user runs consecutive queries in the webui without refreshing the page, the results of one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about? or maybe something like "current query"

Suggested change
When a user runs consecutive queries in the webui without refreshing the page, the results of one
When a user runs consecutive queries in the webui without refreshing the page, the results of the previous

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
docs/src/user-guide/guides-retention.md (4)

103-107: Prefer numeric UTC offsets over “AWST” to avoid DST/locale ambiguity (optional).

This keeps examples globally unambiguous; if you want to keep AWST for regional familiarity, no blocker.

Apply:

-* your log events use timestamps in the AWST timezone (UTC+8);
+* your log events use timestamps in a time zone with UTC+8;

76-81: Optional: switch the intra-doc link to a MyST {ref} and add a section label.

Keeps cross-references consistent with the Figure link.

Apply:

-See
-[Handling log events with non-UTC timestamps](#handling-log-events-with-non-utc-timestamps) for
-details.
+See {ref}`Handling log events with non-UTC timestamps <handling-log-events-with-non-utc-timestamps>` for details.

And add a label immediately above the subsection header (Line 94):

+(handling-log-events-with-non-utc-timestamps)=
 #### Handling log events with non-UTC timestamps

Note: If you intentionally avoid explicit labels, feel free to skip—this is purely for cross-ref consistency.


22-23: Add an early, global UTC note after the definitions (navigation aid).

Provides an immediate heads-up and a stable pointer to the detailed section.

Apply:

 | $data\_timestamp$   | The end of the time range for the data being evaluated for expiration (e.g., for an archive, this is the timestamp of the most recent log event contained in the archive). |
 
+:::{note}
+CLP assumes timestamps are in UTC. If your logs use local (non-UTC) timestamps, see
+[Handling log events with non-UTC timestamps](#handling-log-events-with-non-utc-timestamps) to adjust your retention period accordingly.
+:::
+
 When the garbage collector wakes up, it will scan for and delete any data that satisfies the expiry
 criteria shown in {ref}`Figure 1 <figure-1>`:

If you prefer to avoid hash fragments, I can convert that link to a {ref} by adding a label above the subsection.


120-124: Clarify units and add a concrete example for the signed UTC offset formula.

Prevents misconfiguration and makes the guidance immediately actionable. Retain the signed subtraction; do not use absolute value.

Apply:

 To avoid this issue, you can adjust the retention period to account for the offset of the log
-events' time zone from UTC:
+events' time zone from UTC (use minutes for both values):
 
 $$adjusted\_retention\_period = retention\_period - signed\_utc\_offset$$
+
+where `signed_utc_offset` is the time-zone offset in minutes (e.g., UTC+8 = +480, UTC−4 = −240).
+Example: for `retention_period = 60` and logs in UTC−4, `adjusted_retention_period = 60 - (−240) = 300` minutes.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 15df283 and fd6f121.

📒 Files selected for processing (1)
  • docs/src/user-guide/guides-retention.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T21:48:46.674Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:57-66
Timestamp: 2025-08-15T21:48:46.674Z
Learning: When dealing with timezone offset adjustments for retention periods in CLP, the correct formula is `adjusted_retention_period = retention_period - UTC_offset` (not using absolute value). For negative UTC offsets, this increases the retention period; for positive UTC offsets, this decreases it, properly compensating for how local timestamps are interpreted by the UTC-based garbage collector.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
PR: y-scope/clp#662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-guide/guides-retention.md
🪛 LanguageTool
docs/src/user-guide/guides-retention.md

[typographical] ~112-~112: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ... then becomes... $$is_expired = (01:01 - 08:00 &gt; 01:00)$$ ... which evaluates to...

(DASH_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-retention.md

24-24: Link fragments should be valid

(MD051, link-fragments)


85-85: Link fragments should be valid

(MD051, link-fragments)


157-157: Link fragments should be valid

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (1)
docs/src/user-guide/guides-retention.md (1)

88-92: The script will search across the repository for how "expiry", "Retention", "garbage", and "timestamp" are handled in the retention/garbage-collector code to confirm whether archives lacking message timestamps are excluded from retention. Once the results come back, I’ll verify the edge-case behavior and then update the note accordingly.

Comment on lines 75 to 76
5. (Optional) Configure retention periods for archives and search results. See
[retention control](guides-retention) for details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit out of scope of this doc, right?

@@ -63,6 +63,7 @@ guides-overview
guides-using-object-storage/index
guides-multi-node
guides-using-presto
guides-retention
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this above guides-multi-node? Same for the card in the overview.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fd6f121 and 0d8a48e.

📒 Files selected for processing (3)
  • docs/src/user-guide/guides-multi-node.md (2 hunks)
  • docs/src/user-guide/guides-overview.md (2 hunks)
  • docs/src/user-guide/index.md (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-guide/guides-overview.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (3)
docs/src/user-guide/guides-multi-node.md (1)

93-99: Confirming garbage_collector Group 2 placement and UTC-based expiry

  • The retention guide explicitly states that the garbage collector runs in the UTC time zone (see the warning and “Handling log events with non-UTC timestamps” section noting GC’s UTC-based scheduling).
  • The garbage_collector component is correctly placed in Group 2: it only depends on Group 1 services (database, queue, Redis, results_cache) and does not rely on any Group 3 workers to compute expirations or enqueue deletion tasks.

No changes needed here.

docs/src/user-guide/index.md (1)

62-67: Good placement of the new guide in the toctree

Adding guides-retention before guides-multi-node aligns with prior feedback about ordering. No issues from a Sphinx/MyST perspective with :glob: plus explicit entries here.

docs/src/user-guide/guides-overview.md (1)

16-20: Title casing and positioning look consistent

“Retention control” uses sentence case and appears before “Multi-node deployment,” matching the requested ordering. Looks good.

Comment on lines +34 to 35
| garbage_collector | Background process for retention control |
:::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Add cross-link for discoverability to the new component row

Point readers from the components table directly to the new guide.

-| garbage_collector     | Background process for retention control                        |
+| garbage_collector     | Background process for retention control; see [Retention control](guides-retention) |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| garbage_collector | Background process for retention control |
:::
| garbage_collector | Background process for retention control; see [Retention control](guides-retention) |
:::
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-multi-node.md around lines 34–35, the components
table row for "garbage_collector" should include a cross-link to the new guide
for discoverability; update the table cell to make the component name or its
description a Markdown link pointing to the new guide (for example the relative
path docs/src/user-guide/guides-garbage-collector.md or the correct guide
filename), ensuring the link text remains clear (e.g., "garbage_collector —
Background process for retention control") and the link URL points to the new
guide.

Comment on lines +29 to 35
:::{grid-item-card}
:link: guides-using-presto
Using Presto with CLP
^^^
How to use Presto to query compressed logs in CLP.
:::
::::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Optional copy tweak for parallelism across cards

Minor style suggestion to make descriptions crisper and consistent with the “How to …” phrasing used elsewhere.

-How to use Presto to query compressed logs in CLP.
+How to use Presto to query compressed logs stored by CLP.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
:::{grid-item-card}
:link: guides-using-presto
Using Presto with CLP
^^^
How to use Presto to query compressed logs in CLP.
:::
::::
:::{grid-item-card}
:link: guides-using-presto
Using Presto with CLP
How to use Presto to query compressed logs stored by CLP.
:::
::::
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-overview.md around lines 29 to 35, update the card
copy to use the "How to ..." phrasing for parallelism: change the card title
from "Using Presto with CLP" to "How to use Presto with CLP" (or otherwise
ensure the header matches the descriptive sentence), and confirm the description
reads "How to use Presto to query compressed logs in CLP." so the header and
body follow the same "How to ..." style used across other cards.

@kirkrodrigues kirkrodrigues changed the title docs(package): Add user documents for retention control; Add garbage collector to the components table in the readme. docs(package): Add user documents for retention control; Add garbage collector to the multi-node guide. Aug 20, 2025
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
docs/src/user-guide/guides-overview.md (1)

29-34: Optional: align title with “How to …” phrasing for parallelism

For consistency with the description and other cards that use “How to …”, consider updating the title.

-Using Presto with CLP
+How to use Presto with CLP
@@
-How to use Presto to query compressed logs in CLP.
+How to use Presto to query compressed logs stored by CLP.
docs/src/user-guide/guides-retention.md (2)

16-22: Add a short, upfront UTC note (addresses the UTC reminder and improves discoverability)

You explain non‑UTC handling later, but an early heads‑up reduces surprises and gives readers a quick pointer.

 | $data\_timestamp$   | The end of the time range for the data being evaluated for expiration (e.g., for an archive, this is the timestamp of the most recent log event contained in the archive). |
 
+:::{note}
+CLP assumes timestamps are in UTC. If your logs use local (non‑UTC) timestamps, see “Handling log events with non‑UTC timestamps” below to adjust your retention period accordingly.
+:::

120-124: Clarify units for signed_utc_offset and add a concrete example

Prevents misconfiguration and makes the formula directly actionable.

-To avoid this issue, you can adjust the retention period to account for the offset of the log
-events' time zone from UTC:
+To avoid this issue, you can adjust the retention period to account for the offset of the log
+events' time zone from UTC (use minutes for both values):
 
 $$adjusted\_retention\_period = retention\_period - signed\_utc\_offset$$
+
+where `signed_utc_offset` is the time zone offset in minutes (e.g., UTC+8 = +480, UTC−4 = −240).
+For example, if `retention_period = 60` and the logs are in UTC−4, then
+`adjusted_retention_period = 60 - (−240) = 300` minutes.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0d8a48e and 15ba738.

📒 Files selected for processing (2)
  • docs/src/user-guide/guides-overview.md (2 hunks)
  • docs/src/user-guide/guides-retention.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • docs/src/user-guide/guides-overview.md
  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-08-15T21:48:46.674Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:57-66
Timestamp: 2025-08-15T21:48:46.674Z
Learning: When dealing with timezone offset adjustments for retention periods in CLP, the correct formula is `adjusted_retention_period = retention_period - UTC_offset` (not using absolute value). For negative UTC offsets, this increases the retention period; for positive UTC offsets, this decreases it, properly compensating for how local timestamps are interpreted by the UTC-based garbage collector.

Applied to files:

  • docs/src/user-guide/guides-overview.md
  • docs/src/user-guide/guides-retention.md
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
PR: y-scope/clp#662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • docs/src/user-guide/guides-retention.md
🪛 LanguageTool
docs/src/user-guide/guides-retention.md

[typographical] ~112-~112: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ... then becomes... $$is_expired = (01:01 - 08:00 &gt; 01:00)$$ ... which evaluates to...

(DASH_RULE)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-retention.md

24-24: Link fragments should be valid

(MD051, link-fragments)


85-85: Link fragments should be valid

(MD051, link-fragments)


157-157: Link fragments should be valid

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (3)
docs/src/user-guide/guides-overview.md (2)

16-20: Retention card reads clearly and matches the new guide

Link target, title, and description are concise and consistent with neighbouring cards.


29-31: Link slug verified: Presto guide exists

I’ve confirmed that the bare slug guides-using-presto corresponds to an existing file in docs/src/user-guide (and is listed in the toctree at index.md), so the grid‐item‐card link won’t break.

• docs/src/user-guide/index.md includes guides-using-presto in its toctree
• docs/src/user-guide/guides-using-presto.md exists and renders as expected

No changes are strictly required here. If you’d like to standardize on directory‐style links (e.g. guides-using-presto/index), you can update the link—but it’s purely optional.

docs/src/user-guide/guides-retention.md (1)

73-75: Retention defaults verified

The documented defaults for archive_output.retention_period and results_cache.retention_period match the code’s shipped defaults. No changes are required.

• In components/clp-py-utils/clp_py_utils/clp_config.py (around line 417), ResultsCache.retention_period: Optional[int] = 60 confirms the 60-minute default.
• In components/clp-py-utils/clp_py_utils/clp_config.py (around line 658), ArchiveOutput.retention_period: Optional[int] = None confirms the null default.

Comment on lines +23 to +25
When the garbage collector wakes up, it will scan for and delete any data that satisfies the expiry
criteria shown in [Figure 1](#figure-1):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Use MyST {ref} cross‑refs for Figure 1 (fixes MD051 and is more robust)

Replace hash fragments with {ref} to the existing “(figure-1)=” target.

-criteria shown in [Figure 1](#figure-1):
+criteria shown in {ref}`Figure 1 <figure-1>`:
@@
-For archives, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1)) is the
+For archives, $data\_timestamp$ (in the expiry criteria equation from {ref}`Figure 1 <figure-1>`) is the
@@
-For search results, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1))
+For search results, $data\_timestamp$ (in the expiry criteria equation from {ref}`Figure 1 <figure-1>`)

Also applies to: 85-86, 157-158

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

24-24: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
In docs/src/user-guide/guides-retention.md around lines 23-25 (and similarly at
85-86 and 157-158), replace inline hash-fragment links like "(#figure-1)" with
MyST {ref} cross-refs pointing to the existing target label (figure-1) — e.g.
use {ref}`figure-1` style — so update each occurrence to the MyST {ref} syntax,
preserving surrounding text and punctuation.

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edited the PR title directly.

@haiqi96 haiqi96 merged commit 248c84a into y-scope:main Aug 20, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants