-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle out of order events #5071
Conversation
@@ -115,28 +122,62 @@ func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCa | |||
|
|||
seenMap := map[string]struct{}{} | |||
for _, event := range resp.Events { | |||
// If there is a gap in the event log the missed events for later processing | |||
if event.EventID != a.lastRegistrationEntryEventID+1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to check for zero
a.missedRegistrationEntryEvents[i] = struct{}{} | ||
} | ||
} | ||
|
||
// Skip fetching entries we've already fetched this call | ||
if _, seen := seenMap[event.EntryID]; seen { | ||
a.lastRegistrationEntryEventID = event.EventID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We get the list of events in ascending order
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
1c402e1
to
447802f
Compare
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
func buildAttestedNodesCache(ctx context.Context, ds datastore.DataStore, clk clock.Clock, cache *authorizedentries.Cache) (uint, map[uint]time.Time, error) { | ||
// Gather any events that may have been skipped during restart | ||
var lastEventID uint | ||
missedAttestedNodeEvents := make(map[uint]time.Time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question that I did for entries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to look back in the event history to see if there is any gaps because the event might commit after we get all entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcosDY Yes, it is required. When a long running write transaction that is modify SPIRE Entries doesn't commit because it is hung, it leaves gaps in the auto increment keys of the event table. By tracking which event ids in the event table are missing, we can then go back and explicitly poll the skipped event ids prior to scanning the table for new entry event ids. This is the best that can be done at the moment to ensure that we don't skip event ids when other transactions push the last seen event id beyond those that are stuck in hung database transactions.
require.NoError(t, err) | ||
require.NotNil(t, ef) | ||
|
||
assert.Contains(t, ef.missedRegistrationEntryEvents, uint(2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cna you use equal here? so we are sure there are no unexpected results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you suggest the alternative approach that you intend?
assert.Contains(t, ef.missedAttestedNodeEvents, uint(2)) | ||
assert.Contains(t, ef.missedAttestedNodeEvents, uint(3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cna you use equal here? so we are sure there are no unexpected results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faisal checked for equality of the count of returned items on line 231. Note that the actual items can come in any order (due to SQL and not having an order statement here) so it is correct to check each item is contained in the collection instead of assuming some sort of index based equality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a map, so order does not matter when comparing maps, with that equal will guaranty that you have 2 entries and the expected entries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you suggest the alternative approach you are seeking here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert.ElementsMatch is useful when you don't need exact equality but want to make sure that the set of elements is equal, independent of ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yeah, what marcos said works. Map equality checks are already order independent.
@faisal-memon Can you push a signed copy? This is failing the DCO checks. |
@MarcosDY @faisal-memon @azdagron If you have time, let's see if we can get this reviewed early enough it doesn't upset the 1.10.0 release. |
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
100bcde
to
e3c3518
Compare
dco fixed |
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Co-authored-by: Marcos Yacob <marcosyacob@gmail.com> Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Co-authored-by: Marcos Yacob <marcosyacob@gmail.com> Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@MarcosDY All comments addressed |
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
CacheReloadInterval string `hcl:"cache_reload_interval"` | ||
EventsBasedCache bool `hcl:"events_based_cache"` | ||
PruneEventsOlderThan string `hcl:"prune_events_older_than"` | ||
SQLTransactionTimeout string `hcl:"sql_transaction_timeout"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it really required?
do you have a use case where a timeout that is used for pruning can be useful?
log.Info("Building event-based in-memory entry cache") | ||
cache, lastRegistrationEntryEventID, lastAttestedNodeEventID, err := buildCache(ctx, ds, clk) | ||
cache, receivedFirstRegistrationEntryEvent, lastRegistrationEntryEventID, missedRegistrationEntryEvents, receivedFirstAttestedNodeEvent, lastAttestedNodeEventID, missedAttestedNodeEvents, err := buildCache(ctx, ds, clk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too much for a return
it is better to return a struct here, or reduce the scope of this buildCache
defer a.mu.Unlock() | ||
|
||
for eventID, eventTime := range a.missedRegistrationEntryEvents { | ||
if time.Since(eventTime) > a.pruneEventsOlderThan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use a.clk (to get Now() and then do a Sub)
func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCache(ctx context.Context) error { | ||
// Pocess events skipped over previously |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Pocess events skipped over previously | |
// Process events skipped over previously |
assert.Contains(t, ef.missedAttestedNodeEvents, uint(2)) | ||
assert.Contains(t, ef.missedAttestedNodeEvents, uint(3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert.ElementsMatch is useful when you don't need exact equality but want to make sure that the set of elements is equal, independent of ordering.
defer a.mu.Unlock() | ||
|
||
for eventID, eventTime := range a.missedRegistrationEntryEvents { | ||
if time.Since(eventTime) > a.sqlTransactionTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if time.Since(eventTime) > a.sqlTransactionTimeout { | |
if a.clk.Now().Sub(eventTime) > a.sqlTransactionTimeout { |
defer a.mu.Unlock() | ||
|
||
for eventID, eventTime := range a.missedAttestedNodeEvents { | ||
if time.Since(eventTime) > a.sqlTransactionTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if time.Since(eventTime) > a.sqlTransactionTimeout { | |
if a.clk.Now().Sub(eventTime) > a.sqlTransactionTimeout { |
pkg/server/endpoints/endpoints.go
Outdated
@@ -51,6 +51,9 @@ const ( | |||
|
|||
// This is the default amoount of time events live before they are pruned | |||
defaultPruneEventsOlderThan = 12 * time.Hour | |||
|
|||
// This is the default SQL transaction timeout. This value matches MYSQL's default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, the postgres default is the larger of the two. For safety sake, should the default be the larger value?
@@ -115,32 +157,85 @@ func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCa | |||
|
|||
seenMap := map[string]struct{}{} | |||
for _, event := range resp.Events { | |||
// If there is a gap in the event stream, log the missed events for later processing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long can this gap realistically get? Would it be better to track this via ranges instead of individually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no known limit on the gap, because the gap will contain all of the event ids that are stuck in a transaction. With the bulk transaction APIs, this might be more than just an item or two. That said, we also support single item updates, so stuck transactions on single item updates would only contain one event id.
Also, since the locking on the event id SQL sequence is on a per-row request, and not on a per-transaction request, the numbering of the ids is not guaranteed to be a single range, for example a request to update 4 items might easily contain ids (1001, 1003, 1004, 1006) if some other in-flight update managed to snag 1002 and 1005.
If we want to shift to ranges, it would be a blind optimization on behavior we haven't even started to address. I would argue that we should add metrics (as blocked issues #4836 #4837 and #4720 suggest) and then decide if the routine needs further optimization.
Keep in mind that all polling for event ids will only occur on a stuck transaction, which according to Uber, occurs ~0.02% of the time, and may occur more or less frequently depending on database performance and event update patterns. Additional optimization might not make sense for such a small frequency of occurrences, but fixing the logical errors of missing an event is a very high priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I can buy the argument that using ranges might be premature at this point without some further insight.
Signed-off-by: Marcos Yacob <marcos.yacob@hpe.com>
Pull Request check list
Affected functionality
Events based cache
Description of change
Events can come out of order from sql. If we skip over an event, retry it later.
Which issue this PR fixes
fixes #5021