Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: resurrect stale values on soft callback errors #52

Merged
merged 2 commits into from
Jun 29, 2018

Conversation

thibaultcha
Copy link
Owner

@thibaultcha thibaultcha commented May 12, 2018

Implement a new opts.resurrect_ttl instance and get() option. When
specified, the get() method behaves in a more resilient fashion.

Context

We currently consider callback-returned errors (nil, err) to be the
cause of I/O failures in the callback (i.e. failure to fetch/refresh
some value due to timeout, connection refused, etc...).

When the callback returns as such, we currently return nil, err from
get() as well. Instead, we could serve the stale data if is it still
available (and desired by the user)

Changes

No changes in behavior when opts.resurrect_ttl is not specified
(backwards-compatible).

If it is specified, and a callback lookup is triggered and returns such
a soft error (nil, err), we resurrect the stale value (if it is still
available) for resurrect_ttl seconds. When resurrect_ttl is reached,
subsequent get() calls will trigger the callback again, and if the
callback happens to fail again, the stale value (if it is still
available) will be resurrected for resurrect_ttl again. The cycle
continues for as long as the value isn't LRU evicted during its
stale-but-non-resurrected lifetime, or when the callback succeeds in
producing a fresh value.

To ensure backwards compatibility, callback soft errors produced when
the opts.resurrect_ttl option is used do not return an error
(data, err). Instead, the error is logged with the WARN level (similarly
to shm retries errors), and we return data, nil. Current users of this
module can therefore simply add the resurrect_ttl option without
modifying the way return values are currently handled.

When stale data is resurrected or subsequently returned by get(), the
hit_lvl return value will be 4.

Lock timeouts occurring when resurrect_ttl is used will also log the
error at the WARN level and return a hit_lvl of 4 (if the value is
still available). However, the stale value will not be resurrected. That
is because another worker is running the callback, and it might succeed,
or fail again (and will thus be responsible for resurrecting the value).

Caveats

  • Due to limitations in the lua-resty-lrucache module, we can only detect
    stale values when hitting the L2 cache (shm). Therefore, a stale value
    retrieved from the L1 cache will produce a hit_lvl of 1. This
    limitation is to be removed in the future if the following PR gets
    accepted upstream:

    feature: implemented a user flags attribute. openresty/lua-resty-lrucache#35

  • It is impossible for a caller of get() to distinguish a resurrected
    value from an already stale one. The same way, it is not possible to
    retrieve the callback error for consumption (as it is internally logged
    as a WARN by get()). Future work on the API of this module will make
    such requirements possible, but will likely be breaking changes.

@thibaultcha
Copy link
Owner Author

thibaultcha commented May 20, 2018

openresty/lua-resty-lrucache#35 will allow us to store a stale flag in the LRU level, and thus efficiently provide a consistent stale return value for resurrected items, both in the shm and LRU levels.

@thibaultcha thibaultcha force-pushed the feat/resurrect-stale branch from 739467e to dec424a Compare May 21, 2018 18:22
@thibaultcha

This comment has been minimized.

@thibaultcha thibaultcha force-pushed the feat/resurrect-stale branch 4 times, most recently from 65a129c to 01ba82a Compare June 29, 2018 00:33
@thibaultcha thibaultcha force-pushed the feat/resurrect-stale branch 3 times, most recently from 0936275 to a23e8cf Compare June 29, 2018 00:43
thibaultcha added a commit that referenced this pull request Jun 29, 2018
Implement a new `opts.resurrect_ttl` instance and `get()` option. When
specified, the `get()` method behaves in a more resilient fashion.

Context
-------

We currently consider callback-returned errors (`nil, err`) to be the
cause of I/O failures in the callback (i.e. failure to fetch/refresh
some value due to timeout, connection refused, etc...).

When the callback returns as such, we currently return `nil, err` from
`get()` as well. Instead, we could serve the stale data if is it still
available (and desired by the user)

Changes
-------

No changes in behavior when `opts.resurrect_ttl` is not specified
(backwards-compatible).

If it is specified, and a callback lookup is triggered and returns such
a soft error (`nil, err`), we resurrect the stale value (if it is still
available) for `resurrect_ttl` seconds. When `resurrect_ttl` is reached,
subsequent `get()` calls will trigger the callback again, and if the
callback happens to fail again, the stale value (if it is still
available) will be resurrected for `resurrect_ttl` again. The cycle
continues for as long as the value isn't LRU evicted during its
stale-but-non-resurrected lifetime, or when the callback succeeds in
producing a fresh value.

To ensure backwards compatibility, callback soft errors produced when
the `opts.resurrect_ttl` option is used **do not** return an error
(`data, err`). Instead, the error is logged with the WARN level (similarly
to shm retries errors), and we return `data, nil`. Current users of this
module can therefore simply add the `resurrect_ttl` option without
modifying the way return values are currently handled.

When stale data is resurrected or subsequently returned by `get()`, the
`hit_lvl` return value will be `4`.

Lock timeouts occurring when `resurrect_ttl` is used will also log the
error at the WARN level and return a `hit_lvl` of `4` (if the value is
still available). However, the stale value will not be resurrected. That
is because another worker is running the callback, and it might succeed,
or fail again (and will thus be responsible for resurrecting the value).

Caveats
-------

- Due to limitations in the lua-resty-lrucache module, we can only detect
stale values when hitting the L2 cache (shm). Therefore, a stale value
retrieved from the L1 cache will produce a `hit_lvl` of `1`. This
limitation is to be removed in the future if the following PR gets
accepted upstream:

    openresty/lua-resty-lrucache#35

- It is impossible for a caller of `get()` to distinguish a resurrected
value from an already stale one. The same way, it is not possible to
retrieve the callback error for consumption (as it is internally logged
as a WARN by `get()`). Future work on the API of this module will make
such requirements possible, but will likely be breaking changes.

See also
--------

Fix #50
From #52
Implement a new `opts.resurrect_ttl` instance and `get()` option. When
specified, the `get()` method behaves in a more resilient fashion.

Context
-------

We currently consider callback-returned errors (`nil, err`) to be the
cause of I/O failures in the callback (i.e. failure to fetch/refresh
some value due to timeout, connection refused, etc...).

When the callback returns as such, we currently return `nil, err` from
`get()` as well. Instead, we could serve the stale data if is it still
available (and desired by the user)

Changes
-------

No changes in behavior when `opts.resurrect_ttl` is not specified
(backwards-compatible).

If it is specified, and a callback lookup is triggered and returns such
a soft error (`nil, err`), we resurrect the stale value (if it is still
available) for `resurrect_ttl` seconds. When `resurrect_ttl` is reached,
subsequent `get()` calls will trigger the callback again, and if the
callback happens to fail again, the stale value (if it is still
available) will be resurrected for `resurrect_ttl` again. The cycle
continues for as long as the value isn't LRU evicted during its
stale-but-non-resurrected lifetime, or when the callback succeeds in
producing a fresh value.

To ensure backwards compatibility, callback soft errors produced when
the `opts.resurrect_ttl` option is used **do not** return an error
(`data, err`). Instead, the error is logged with the WARN level (similarly
to shm retries errors), and we return `data, nil`. Current users of this
module can therefore simply add the `resurrect_ttl` option without
modifying the way return values are currently handled.

When stale data is resurrected or subsequently returned by `get()`, the
`hit_lvl` return value will be `4`.

Lock timeouts occurring when `resurrect_ttl` is used will also log the
error at the WARN level and return a `hit_lvl` of `4` (if the value is
still available). However, the stale value will not be resurrected. That
is because another worker is running the callback, and it might succeed,
or fail again (and will thus be responsible for resurrecting the value).

Caveats
-------

- Due to limitations in the lua-resty-lrucache module, we can only detect
stale values when hitting the L2 cache (shm). Therefore, a stale value
retrieved from the L1 cache will produce a `hit_lvl` of `1`. This
limitation is to be removed in the future if the following PR gets
accepted upstream:

    openresty/lua-resty-lrucache#35

- It is impossible for a caller of `get()` to distinguish a resurrected
value from an already stale one. The same way, it is not possible to
retrieve the callback error for consumption (as it is internally logged
as a WARN by `get()`). Future work on the API of this module will make
such requirements possible, but will likely be breaking changes.

See also
--------

Fix #50
From #52
@thibaultcha thibaultcha force-pushed the feat/resurrect-stale branch from a23e8cf to 29c0902 Compare June 29, 2018 00:58
@thibaultcha thibaultcha merged commit 0c839b3 into master Jun 29, 2018
@thibaultcha thibaultcha deleted the feat/resurrect-stale branch June 29, 2018 23:27
thibaultcha added a commit to Kong/kong that referenced this pull request Jun 30, 2018
This option is based on
thibaultcha/lua-resty-mlcache#52

Documentation for its behaviour can be read in the description test
included in kong.conf.default.
thibaultcha added a commit to Kong/kong that referenced this pull request Jun 30, 2018
This option is based on
thibaultcha/lua-resty-mlcache#52

Documentation for its behaviour can be read in the description test
included in kong.conf.default.
thibaultcha added a commit to Kong/kong that referenced this pull request Jul 5, 2018
This option is based on
thibaultcha/lua-resty-mlcache#52

Documentation for its behaviour can be read in the description test
included in kong.conf.default.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant