-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(wip)raft: introduce term cache #141524
base: master
Are you sure you want to change the base?
(wip)raft: introduce term cache #141524
Conversation
This is a draft commit that adds a metric to record the latency for loading term if there is a cache miss. Also adds separate metrics for raftRntry cache accesses from loadTerm. Running roachtests against this build will help us with understanding the access patterns of raftEntry Cache, and understand how much performance improvement we may get when we introduce a term cache. This commit is only for reference purpose, shouldn't be merged. References: cockroachdb#136296 Epic: None Release note: None
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
bdf7b88
to
369e311
Compare
Prototype for the termCache. And append logic. References: cockroachdb#136296 Epic: None Release note: None
369e311
to
3eef8f5
Compare
Passed all unit tests in kv/kvserver! References: cockroachdb#136296 Epic: None Release note: None
05c0ccb
to
c1ed467
Compare
updates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to also clear/initialize the cache somewhere in applySnapshot
stack - the log is cleared there.
|
||
// Term returns the entry term based on the given entry index | ||
// Returns error if not in the termCache | ||
func (tc *TermCache) Term(index uint64) (term uint64, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably sufficient to return a bool instead of error
. Don't need much of a granularity here - just found or not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on whether we want the termCache to be a best effort attempt.
It does have potential to do more than it is doing rn.
In the case that the index we are asking for is for sure not in the raftLog according to the TermCache, we can return early, no more need to ask raftentry/cache and storage for term.
(think of the scenario: term cache has t1/11, t2/14, t4/20
if we ask for match term entryID: t3/15, it means there is no record of it being in the raftLog. so we should exit early. this is also why i wrote the extra match term function
)
I think asking for raftentry/cache and storage in this case will end up being a worst case access path, we won't find anything in the end. And it will take more time than we want.
I think it is certainly possible for this scenario to happen, when raft layer calls matchTerm() to validate an entry when we have inconsistent logs. Which is why we should implement the error here, for early exit.
Its risky though, if we don't get it right, its hard to reproduce the error, since usually logs are consistent. Its gonna require dedicated unit tests involving network partitions, etc.
pkg/raft/termCache.go
Outdated
for i := 0; i < len(tc.cache)-1; i++ { | ||
if index >= tc.cache[i].index && index < tc.cache[i+1].index { | ||
return tc.cache[i].term, nil | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can scan from the back, and halve the slice accesses:
for i := len(tc.cache)-2; i >= 0; i-- {
if index >= tc.cache[i].index {
return tc.cache[i].term
}
}
|
||
// Match returns whether the entryID is in the TermCache. | ||
// If it is in the termCache, then it is in the raftLog. | ||
func (tc *TermCache) Match(argEntryId entryID) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is unused. Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be useful in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, if a function(ality) is not used immediately in the PR, or there is no clear vision on how it will be used in follow-up PRs, it should not be added.
For this Match
method, do you expect any use? Seems like it's already covered by raftLog.matchTerm
function, and having the Term()
method is enough to support it.
Integrate with metrics from another branch cockroachdb#141159 References: cockroachdb#136296 Epic: None Release note: None
c1ed467
to
40c99ef
Compare
// TODO(hakuuww): actually, we are clearing more than hi here, | ||
// | ||
// which causes TermCache misses in steady state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to fix this. Truncations are happening pretty much all the time, so we should not reduce coverage here.
2 options:
- if a truncation index is in the middle of
tc.cache[i]
, bump thetc.cache[i].index
to match the truncation index. - leave the
tc.cache[i]
intact. There is no harm in remembering terms of entries belowfirstIndex
.
(anthony)
btw just to be clear truncateFrom is not the problem. its clearTo that we are worrying about. (the wording we use is a bit confusing).
truncateFrom is removing suffix of the raftlog.
clearTo is removing prefix of the raftLog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pavel:
say you have a log [firstIndex=10] t1 t1 t1 t2 t2 t2
which is then truncated to start at index 12, so the new log is [firstIndex=12] t1 t2 t2 t2
the term cache before the change was: {index=10,term=1}, {index=13,term=2}
I’m saying that, after the truncation, we shouldn’t shrink the cache to be {index=13,term=2}. Because it’s unable to tell us the term of entry 12 that remains in the log.
Options:
make it {index=12,term=1}, {index=13,term=2}
leave it as {index=10,term=1}, {index=13,term=2}
Both options are able to answer the term for entry 12. Option 2 is slightly better because it doesn’t require rewriting the cache. So, looking forward, it will not require updating itself in storage.
4:55
it’s ok if the term cache remembers “more” history than the log itself. We wouldn’t be reading terms of entries below firstIndex-1
Inside termCache, we can pretty much assume that the caller has already made the necessary bound checks. The caller won’t be trying to access terms outside the log’s first-1/last index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: why don't we just not call clearTo at all, if it is ok to leave truncated entries in the termCache?
TODO(hakuuww):
|
Question: why don't we just not do clearTo at all, if it is ok to leave truncated entries in the termCache? |
…d cold restart References: cockroachdb#136296 Epic: None Release note: None
Don't review. Prototype for the termCache.
More info here https://cockroachlabs.atlassian.net/wiki/x/MgB7AAE
References: #136296
Epic: None
Release note: None