feat(ipc) avoid new workers attempting to replay evicted events #97

Merged · thibaultcha merged 1 commit into master from feat/new-workers-resume-polling on Nov 18, 2020

Conversation

@thibaultcha (Owner) commented Sep 25, 2020

A follow-up to #88.

As described in #93 and reported a few months ago, mlcache and ipc instances are likely to be created in the `init` phase. If a worker is started much later in the master process' lifetime, the newly forked ipc instance will have an `idx` attribute set to the shm_idx value captured at the time of `init` (likely 0). Such workers will resume polling from already-evicted events, and `poll()` will likely time out indefinitely from there.

For the fix in #88 to work, the mlcache instance has to be instantiated during `init_worker` or later.

This patch proposes an approach that works for instances created in both `init` and `init_worker`: as soon as the ipc shm has started evicting items, we guarantee that future workers will resume polling at the current index, without having to call any method but `poll()`.

Fix #93
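
For illustration, the idea can be sketched roughly as follows (a paraphrase of the approach, not the actual patch; `FORCIBLE_KEY` and `INDEX_KEY` are placeholder names, while `self.dict`, `self.idx`, and `shm_idx` follow the naming used in the diff and discussion below):

    local FORCIBLE_KEY = "ipc:forcible"   -- placeholder name
    local INDEX_KEY    = "ipc:index"      -- placeholder name

    function _M:poll(timeout)
        if self.idx == 0 and self.dict:get(FORCIBLE_KEY) then
            -- the shm has already evicted events: events below the current
            -- index may be gone, so resume from the current index instead
            -- of replaying from the beginning
            self.idx = self.dict:get(INDEX_KEY) or 0
        end

        -- ... existing polling loop, starting at self.idx + 1 ...
    end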

-- take note that eviction has started
-- we repeat this flagging to prevent this key from ever being
-- evicted itself
local ok, err = self.dict:set(FORCIBLE_KEY, true)
@thibaultcha (Owner Author)

A better approach here would be if shm:incr() accepted flags, or maybe even an shm:set_flags() API. In the absence of a way to rely on flags, we rely on an extraneous key, and on it never reaching the LRU tail.
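
For context, the write side of that flag could look something like this (an illustrative sketch, not the actual broadcast code; `FORCIBLE_KEY` reuses the placeholder name from the sketch above and `store_event` is a hypothetical helper):

    local FORCIBLE_KEY = "ipc:forcible"   -- placeholder name

    local function store_event(dict, event_key, payload)
        -- ngx.shared.DICT:set() reports via its third return value whether
        -- it had to forcibly evict other items to make room
        local ok, err, forcible = dict:set(event_key, payload)
        if not ok then
            return nil, err
        end

        if forcible then
            -- take note that eviction has started; re-setting the flag on
            -- every forcible eviction keeps it near the head of the LRU so
            -- it is unlikely to be evicted itself
            local fok, ferr = dict:set(FORCIBLE_KEY, true)
            if not fok then
                return nil, ferr
            end
        end

        return true
    end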

@juancampa left a comment

Wouldn't it be enough to run this on the first call to poll?

if self.idx < 0 then
  self.idx = shm_idx
end

🤔

@thibaultcha (Owner Author)

If we did this, we would introduce a regression and prevent the first workers from polling any early broadcast event; see this test for example. Since what we are worried about is missing (evicted) events, which would cause streaks of timeouts, it should still be safe to preserve the current behavior: if the event at index 1 is still present, then we can just replay them all fast enough.
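
In other words (a paraphrase in code, under the same placeholder names as the sketch in the description above, not the literal patch):

    -- unconditional skip (the suggestion above): new workers never replay
    -- anything, even events that are still present in the shm
    if self.idx < 0 then
        self.idx = shm_idx
    end

    -- conditional skip (what this patch aims for, sketched): only jump
    -- forward once the shm reports that events have been evicted;
    -- otherwise keep replaying from the beginning
    if self.idx == 0 and self.dict:get(FORCIBLE_KEY) then
        self.idx = shm_idx
    end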

@thibaultcha (Owner Author)

Does this still sound good to you, @juancampa? Is it something you'd care to test, maybe? I doubt it's needed for such a minor change, but I'm just checking in, in case.

@juancampa

What I hesitate a bit on is that the behavior differs significantly between workers spawned before the first eviction and workers started after one. A simple example:

Wouldn't a worker started after 10,000 events, but before any eviction happened, go through all 10,000 events? Compare that to a worker started after 10,000 events but after an eviction: this one wouldn't catch up at all.

For mlcache purposes, do we need to replay old events at all? If a worker is new, its L1 cache won't contain the keys being deleted by the broadcast anyway.

@thibaultcha (Owner Author) commented Oct 1, 2020

Well, like I said above, in the current implementation new/fresh workers need to run past events on their first poll(), and I did not want to break that for now. Keeping track of which workers are new and which ones are re-spawned could be done, but I think this is already a worthwhile improvement without too complicated a patch. Better to have a worker replay past events, even if it may time out a few times along the way as it catches up, than never catch up at all.

If there isn't any other way, maybe we can change this behavior in a major release so that every ipc instance's first poll() always skips to the current index, for mlcache purposes only.

@thibaultcha thibaultcha merged commit 1a5bd4a into master Nov 18, 2020
@thibaultcha thibaultcha deleted the feat/new-workers-resume-polling branch November 18, 2020 23:27
bungle added a commit to Kong/kong that referenced this pull request Dec 7, 2020
### Summary

#### [2.5.0]

> Released on: 2020/11/18

##### Added

- `get()` callback functions are now optional. Without a callback, `get()` now
  still performs on-cpu L1/L2 lookups (no yielding). This allows implementing
  new cache lookup patterns guaranteed to be on-cpu for a more constant,
  smoother latency tail end (e.g. values are refreshed in background timers with
  `set()`).
  Thanks Hamish Forbes and Corina Purcarea for proposing this feature and
  participating in its development!
  [#96](thibaultcha/lua-resty-mlcache#96)

##### Fixed

- Improve `update()` robustness to worker crashes. Now, the library behind
  `cache:update()` is much more robust to re-spawned workers when initialized in
  the `init_by_lua` phase.
  [#97](thibaultcha/lua-resty-mlcache#97)
- Document the `peek()` method `stale` argument which was not mentioned, as well
  as the possibility of negative TTL return values for expired items.
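
For the `get()` entry in the Added section above, a usage sketch (hypothetical, not taken from the release notes; `fetch_from_upstream` is an assumed application helper, and a `cache_shm` lua_shared_dict is assumed to be configured):

    local mlcache = require "resty.mlcache"

    local cache = assert(mlcache.new("my_cache", "cache_shm", { lru_size = 1000 }))

    -- a background timer (e.g. started from init_worker_by_lua) refreshes
    -- the value with set()
    ngx.timer.every(5, function(premature)
        if premature then return end
        local value = fetch_from_upstream()          -- assumed helper
        local ok, err = cache:set("my_key", nil, value)
        if not ok then
            ngx.log(ngx.ERR, "failed to set value: ", err)
        end
    end)

    -- request-time lookup without a callback: get() only performs on-cpu
    -- L1/L2 lookups and never yields
    local value, err = cache:get("my_key")
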
bungle added a commit to Kong/kong that referenced this pull request Dec 8, 2020
gszr pushed a commit to Kong/kong that referenced this pull request Dec 8, 2020