Skip to content

Limit concurrent replays of TSDB on startup #1917

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 15, 2020

Conversation

thorfour
Copy link
Contributor

@thorfour thorfour commented Dec 16, 2019

What this PR does: Experimental TSDB change only. On startup walk the TSDB dir, and open a TSDB under each user dir. Limiting the number of concurrently opening TSDB.

Large memory spikes were observed when Ingesters would startup that had a large number of users with WALs.
image

This was causing oom kills on ingesters that would otherwise be able to run just fine. This change limits the number of TSDBs that can be opening at once, by opening all pre-existing user TSDBs at startup.

It does not open a TSDB for an empty user dir, which means it has no interaction with a transfer, since transfers delete empty dirs, and error out on non-empty user dirs.

Edit: While this change is still useful it doesn't address the memory usage spikes seen above, and because of a missed lock when looking over the code WAL's weren't being replayed in parallel. WAL replay still eats a large amount of memory regardless of how many are happening in parallel.

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch 4 times, most recently from b8e6ef2 to fa17e68 Compare December 16, 2019 21:27
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Thor! I left few comments here and there (just minor things, should be quick to address).

@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch 3 times, most recently from 266678b to 88317d8 Compare December 17, 2019 17:54
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done @thorfour! The change looks clean and safe to me. I just left few nits and then we're good to go to me 🙏

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Thor for addressing my feedback! LGTM!

@thorfour thorfour changed the title Feat/tsdb/limit wal replay Replay TSDB on startup Jan 2, 2020
@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch 3 times, most recently from 9f1b1e5 to 9d79340 Compare January 7, 2020 18:22
@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch from 9d79340 to a220652 Compare January 14, 2020 14:52
@pracucci
Copy link
Contributor

@thorfour May you fix the failing test, please?

@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch from a220652 to 126a194 Compare January 14, 2020 15:02
@thorfour
Copy link
Contributor Author

@pracucci should be fixed.

@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch from 126a194 to 214ecbb Compare January 14, 2020 15:06
@pracucci
Copy link
Contributor

@thorfour Other tests are failing now. May you take a look, please?

@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch 2 times, most recently from a68dc76 to 5c2a5a0 Compare January 14, 2020 15:21
Copy link
Contributor

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nit

@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch from 5c2a5a0 to a7f7776 Compare January 15, 2020 14:55
opening

Signed-off-by: Thor <thansen@digitalocean.com>
@thorfour thorfour force-pushed the feat/tsdb/limit-wal-replay branch from a7f7776 to e810294 Compare January 15, 2020 14:56
@gouthamve gouthamve changed the title Replay TSDB on startup Limit concurrent replays of TSDB on startup Jan 15, 2020
@gouthamve gouthamve merged commit 326828b into cortexproject:master Jan 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants