-
Notifications
You must be signed in to change notification settings - Fork 820
Limit concurrent replays of TSDB on startup #1917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit concurrent replays of TSDB on startup #1917
Conversation
b8e6ef2
to
fa17e68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Thor! I left few comments here and there (just minor things, should be quick to address).
266678b
to
88317d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done @thorfour! The change looks clean and safe to me. I just left few nits and then we're good to go to me 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Thor for addressing my feedback! LGTM!
9f1b1e5
to
9d79340
Compare
9d79340
to
a220652
Compare
@thorfour May you fix the failing test, please? |
a220652
to
126a194
Compare
@pracucci should be fixed. |
126a194
to
214ecbb
Compare
@thorfour Other tests are failing now. May you take a look, please? |
a68dc76
to
5c2a5a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with nit
5c2a5a0
to
a7f7776
Compare
opening Signed-off-by: Thor <thansen@digitalocean.com>
a7f7776
to
e810294
Compare
What this PR does: Experimental TSDB change only. On startup walk the TSDB dir, and open a TSDB under each user dir. Limiting the number of concurrently opening TSDB.
Large memory spikes were observed when Ingesters would startup that had a large number of users with WALs.

This was causing oom kills on ingesters that would otherwise be able to run just fine. This change limits the number of TSDBs that can be opening at once, by opening all pre-existing user TSDBs at startup.
It does not open a TSDB for an empty user dir, which means it has no interaction with a transfer, since transfers delete empty dirs, and error out on non-empty user dirs.
Edit: While this change is still useful it doesn't address the memory usage spikes seen above, and because of a missed lock when looking over the code WAL's weren't being replayed in parallel. WAL replay still eats a large amount of memory regardless of how many are happening in parallel.
Which issue(s) this PR fixes:
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]