Introduce HASH items expiration #2089

ranshid · 2025-05-15T15:29:31Z

Closes #640

Summary

This PR introduces support for field-level expiration in Valkey hash types, making it possible for individual fields inside a hash to expire independently — creating what we call volatile fields.
This is the first of 3 PRs. This PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag.
The second PR, which introduces the new algorithm (volatile-set) to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by the second PR.
The third PR introduces the active expiration and defragmentation jobs.

For more high-level design details, see the RFC PR: valkey-io/valkey-rfc#22.

Major decisions

Some major high-level decisions taken as part of this work:

We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue improving the active expiration job and potentially consider introducing lazy-expiration support later on.
Although the commands that add expiration to hash fields affect memory utilization (by allocating more memory for the expiration time and metadata), we decided to avoid adding the DENYOOM flag to these commands (HSETEX is an exception) in order to be better aligned with key-level commands like expire.
Some hash type commands will produce unexpected results:

HLEN - will still reflect the number of fields that exist in the hash object (whether expired or not).
HRANDFIELD - in some cases we will not be able to randomly select a field which has not already expired. This happens in 2 cases: 1/ when we are asked to provide non-unique fields (i.e. negative count) 2/ when the size of the hash is much bigger than the count and we need to provide unique results. In both cases it is possible that an empty response will be returned to the caller, even if the hash contains fields which are either persistent or not yet expired.

We will ALWAYS load hash fields during RDB load. This means that when a primary reboots with an old snapshot, it will take time to reclaim all the expired fields. However, this simplifies the current logic and avoids a major refactoring that I suspect would otherwise be needed.
When a field is given a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how key-level expiration is handled, we will emit the hexpired keyspace event for that case (instead of hdel). For example, for the case:

HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2

The reported events are:

1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"

New entry type

This PR also modularizes and exposes the internal hashTypeEntry logic as a new standalone entry.c/h module. This new abstraction handles all aspects of field–value–expiry encoding using multiple memory layouts optimized for performance and memory efficiency.

An entry is an abstraction that represents a single field–value pair with optional expiration. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds, which lets us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

An entry can also have an expiration timestamp, which is the Unix timestamp at which it expires.
For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

 +----------------+--------------+---------------+
 | Expiration     | field        | value         |
 | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
 +-----------------------^-------+---------------+
                         |
                         |
                        entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

 +--------------+-------+--------------+
 | Expiration   | value | field        |
 | 1234567890LL | ptr   | hdr "foo" \0 |
 +--------------+--^----+------^-------+
                   |           |
                   |           |
                   |         entry pointer (points to field sds content)
                   |
                  value pointer = value sds

The entry.c/h API provides methods to:

Create, read, write, and update field/value/expiration
Set or clear expiration
Check expiration state
Clone or delete an entry
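
To make the embedded-value layout above more concrete, here is a minimal sketch of the idea of keeping the expiry in front of the header while handing out a pointer to the field bytes. It is not the real entry.c encoding: the one-byte length header, the `toyEntry*` function names, and the `TOY_NO_EXPIRY` sentinel are all hypothetical, and treating a timestamp equal to "now" as expired is an arbitrary choice made for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy layout, NOT the real entry.c encoding:
 *   [int64 expiry][uint8 field_len][field bytes \0][value bytes \0]
 * The returned pointer addresses the field bytes, so the entry can be
 * used directly as a C string, much like the entry pointer doubles as
 * the field sds. */
#define TOY_HDR_SIZE 1
#define TOY_NO_EXPIRY (-1LL)

char *toyEntryCreate(const char *field, const char *value, long long expiry) {
    size_t flen = strlen(field), vlen = strlen(value);
    assert(flen <= UINT8_MAX); /* the 1-byte toy header limits field length */
    char *alloc = malloc(sizeof(int64_t) + TOY_HDR_SIZE + flen + 1 + vlen + 1);
    if (!alloc) return NULL;
    int64_t e = expiry;
    memcpy(alloc, &e, sizeof(e)); /* expiry sits before the header */
    ((unsigned char *)alloc)[sizeof(int64_t)] = (unsigned char)flen;
    char *entry = alloc + sizeof(int64_t) + TOY_HDR_SIZE;
    memcpy(entry, field, flen + 1);
    memcpy(entry + flen + 1, value, vlen + 1); /* value embedded after field */
    return entry;
}

/* Step back over the header to reach the expiry. */
long long toyEntryGetExpiry(const char *entry) {
    int64_t e;
    memcpy(&e, entry - TOY_HDR_SIZE - sizeof(int64_t), sizeof(e));
    return e;
}

/* The embedded value starts right after the field's terminating NUL. */
const char *toyEntryGetValue(const char *entry) {
    unsigned char flen = (unsigned char)entry[-1];
    return entry + flen + 1;
}

int toyEntryIsExpired(const char *entry, long long now_ms) {
    long long e = toyEntryGetExpiry(entry);
    return e != TOY_NO_EXPIRY && e <= now_ms;
}

/* The allocation starts before the entry pointer, so free from there. */
void toyEntryFree(char *entry) {
    free(entry - TOY_HDR_SIZE - sizeof(int64_t));
}
```

Because the entry pointer addresses the field bytes, a caller can pass it to `strcmp` or `printf("%s")` without knowing about the expiry or the value that surround it.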

Supported Commands

This PR introduces new commands and extends existing ones to support field expiration:

Commands

The proposed API is nearly identical to the API provided by Redis (Redis 7.4 + 8.0). This is intentional, in order to avoid breaking client applications that already use hash item TTLs.

HSETEX

Synopsis

HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

NX — Only set the fields if the hash object does NOT exist.
XX — Only set the fields if the hash object does exist.
FNX — Only set the fields if none of them already exist.
FXX — Only set the fields if all of them already exist.
EX seconds — Set the specified expiration time in seconds.
PX milliseconds — Set the specified expiration time in milliseconds.
EXAT unix-time-seconds — Set the specified Unix time in seconds at which the fields will expire.
PXAT unix-time-milliseconds — Set the specified Unix time in milliseconds at which the fields will expire.
KEEPTTL — Retain the TTL associated with the fields.

The EX, PX, EXAT, PXAT, and KEEPTTL options are mutually exclusive.
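
The NX/XX and FNX/FXX conditions listed above can be read as a single all-or-nothing precondition that is checked before any field is written. The sketch below illustrates that reading; the flag names and the `hsetexConditionsHold` helper are hypothetical and are not taken from t_hash.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the names used in t_hash.c. */
enum {
    COND_NX = 1 << 0,  /* hash must not exist */
    COND_XX = 1 << 1,  /* hash must exist */
    COND_FNX = 1 << 2, /* none of the given fields may exist */
    COND_FXX = 1 << 3  /* all of the given fields must exist */
};

/* Returns true when the command may go ahead and set every field.
 * existing_fields is how many of the numfields given fields are
 * already present in the hash. */
bool hsetexConditionsHold(int flags, bool key_exists, int numfields, int existing_fields) {
    assert(existing_fields >= 0 && existing_fields <= numfields);
    /* Each bracketed group in the synopsis is mutually exclusive. */
    assert(!((flags & COND_NX) && (flags & COND_XX)));
    assert(!((flags & COND_FNX) && (flags & COND_FXX)));

    if ((flags & COND_NX) && key_exists) return false;
    if ((flags & COND_XX) && !key_exists) return false;
    if ((flags & COND_FNX) && existing_fields > 0) return false;
    if ((flags & COND_FXX) && existing_fields < numfields) return false;
    return true;
}
```

With this reading, `HSETEX key FXX ... FIELDS 2 a 1 b 2` on a hash that contains only `a` sets nothing, because one of the two fields is missing.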

HGETEX

Synopsis

HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The HGETEX command supports a set of options:

EX seconds — Set the specified expiration time, in seconds.
PX milliseconds — Set the specified expiration time, in milliseconds.
EXAT unix-time-seconds — Set the specified Unix time at which the fields will expire, in seconds.
PXAT unix-time-milliseconds — Set the specified Unix time at which the fields will expire, in milliseconds.
PERSIST — Remove the TTL associated with the fields.

The EX, PX, EXAT, PXAT, and PERSIST options are mutually exclusive.

HEXPIRE

Synopsis

HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including HDEL and HSET commands. This means that all the operations that conceptually alter the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
To clear the TTL of a specific field, use HPERSIST (or HGETEX with the PERSIST option).
Note that calling HEXPIRE/HPEXPIRE with a zero TTL, or HEXPIREAT/HPEXPIREAT with a time in the past, will result in the hash field being deleted immediately.

The HEXPIRE command supports a set of options:

NX — For each specified field, set expiration only when the field has no expiration.
XX — For each specified field, set expiration only when the field has an existing expiration.
GT — For each specified field, set expiration only when the new expiration is greater than current one.
LT — For each specified field, set expiration only when the new expiration is less than current one.
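
The per-field decision implied by these four options can be sketched as one small predicate. The enum, the `NO_EXPIRY` sentinel, and the function name below are hypothetical. One point is not stated in this description and is an assumption here: a field without an expiration is treated as having an infinite TTL for GT and LT, mirroring the documented behavior of EXPIRE on keys.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_EXPIRY (-1LL) /* hypothetical sentinel: field has no TTL */

typedef enum { EXP_NONE, EXP_NX, EXP_XX, EXP_GT, EXP_LT } expireCond;

/* Decide, for a single field, whether the new expiration should be applied.
 * Timestamps are absolute Unix times in milliseconds. */
bool shouldSetExpiry(expireCond cond, long long current_ms, long long new_ms) {
    assert(new_ms >= 0);
    bool has_expiry = current_ms != NO_EXPIRY;
    switch (cond) {
    case EXP_NX: return !has_expiry;
    case EXP_XX: return has_expiry;
    /* Assumption: no expiry counts as an infinite TTL, so GT can never
     * apply to a persistent field and LT always applies to one. */
    case EXP_GT: return has_expiry && new_ms > current_ms;
    case EXP_LT: return !has_expiry || new_ms < current_ms;
    case EXP_NONE:
    default: return true;
    }
}
```

The same predicate would serve HEXPIREAT, HPEXPIRE, and HPEXPIREAT once their arguments have been converted to an absolute millisecond timestamp.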

HEXPIREAT

Synopsis

HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]

HEXPIREAT has the same effect and semantics as HEXPIRE, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The HEXPIREAT command supports a set of options:

NX — For each specified field, set expiration only when the field has no expiration.
XX — For each specified field, set expiration only when the field has an existing expiration.
GT — For each specified field, set expiration only when the new expiration is greater than current one.
LT — For each specified field, set expiration only when the new expiration is less than current one.

HPEXPIRE

Synopsis

HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]

This command works like HEXPIRE, but the expiration of a field is specified in milliseconds instead of seconds.

The HPEXPIRE command supports a set of options:

NX — For each specified field, set expiration only when the field has no expiration.
XX — For each specified field, set expiration only when the field has an existing expiration.
GT — For each specified field, set expiration only when the new expiration is greater than current one.
LT — For each specified field, set expiration only when the new expiration is less than current one.

HPEXPIREAT

Synopsis

HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]

HPEXPIREAT has the same effect and semantics as HEXPIREAT, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

HPERSIST

Synopsis

HPERSIST key FIELDS numfields field [field ...]

Remove the existing expiration on a hash key's field(s), turning the field(s) from volatile (a field with expiration set) to persistent (a field that will never expire as no TTL (time to live) is associated).

HSETEX

Synopsis

HSETEX key [NX] seconds field value [field value ...]

Similar to HSET, but adds one or more hash fields that expire after the specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If the NX option is specified, the field data will not be overwritten. If the key doesn't exist, a new hash key is created.

The HSETEX command supports a set of options:

NX — For each specified field, set expiration only when the field has no expiration.

HTTL

Synopsis

HTTL key FIELDS numfields field [field ...]

Returns the remaining TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

HPTTL

Synopsis

HPTTL key FIELDS numfields field [field ...]

Like HTTL, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.
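
As a rough illustration of the per-field reply values, the sketch below computes what HPTTL and HTTL might report for one field. The `-2` reply for a missing field matches the HTTL example shown earlier in this description. The `-1` reply for a field without expiration and the rounding to the nearest second are assumptions borrowed from how TTL behaves for keys, and the function names and `NO_EXPIRY` sentinel are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_EXPIRY (-1LL) /* hypothetical sentinel: field has no TTL */

/* Remaining time in milliseconds for one field.
 * -2: field does not exist; -1: field exists but has no expiration
 * (the -1 case is an assumption, see the lead-in). */
long long fieldPttl(bool exists, long long expiry_ms, long long now_ms) {
    assert(now_ms >= 0);
    if (!exists) return -2;
    if (expiry_ms == NO_EXPIRY) return -1;
    long long remaining = expiry_ms - now_ms;
    return remaining > 0 ? remaining : 0;
}

/* Same as fieldPttl but in seconds, rounded to the nearest second
 * (the rounding rule is an assumption). */
long long fieldTtl(bool exists, long long expiry_ms, long long now_ms) {
    long long pttl = fieldPttl(exists, expiry_ms, now_ms);
    if (pttl < 0) return pttl; /* pass -1 / -2 through unchanged */
    return (pttl + 500) / 1000;
}
```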

HEXPIRETIME

Synopsis

HEXPIRETIME key FIELDS numfields field [field ...]

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

HPEXPIRETIME

Synopsis

HPEXPIRETIME key FIELDS numfields field [field ...]

HPEXPIRETIME has the same semantics as HEXPIRETIME, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

Keyspace Notifications

This PR introduces new notification events to support field-level expiration:

Event	Trigger
`hexpire`	Field expiration was set
`hexpired`	Field was deleted due to expiration
`hpersist`	Expiration was removed from a field
`del`	Key was deleted after all fields expired

Note that we diverge from Redis in the cases where we emit the hexpired event.
For example, given the following use case:

HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2

Regarding the keyspace notifications, Redis reports:

1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"

However, in our current suggestion, Valkey will emit:

1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"

Propagation and Replication

Expiration-aware commands (HSETEX, HGETEX, etc.) are not propagated as-is.
Instead, Valkey rewrites them into equivalent commands like:
- HDEL (for expired fields)
- HPEXPIREAT (for setting absolute expiration)
- HPERSIST (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.
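
The rewrite described above can be sketched as two steps: normalize whatever time option the client used into an absolute millisecond timestamp, then pick the command to propagate. The enums and function names below are hypothetical and only illustrate the decision, not the actual propagation code. Overflow handling is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { UNIT_EX, UNIT_PX, UNIT_EXAT, UNIT_PXAT } expireUnit;
typedef enum { PROP_HDEL, PROP_HPEXPIREAT, PROP_HPERSIST } propagatedCmd;

/* Convert an EX/PX/EXAT/PXAT argument into an absolute Unix time in
 * milliseconds. Relative options are resolved against now_ms, so that
 * replicas and the AOF see the same absolute deadline. */
long long toAbsoluteMs(expireUnit unit, long long value, long long now_ms) {
    assert(value >= 0);
    switch (unit) {
    case UNIT_EX: return now_ms + value * 1000;
    case UNIT_PX: return now_ms + value;
    case UNIT_EXAT: return value * 1000;
    case UNIT_PXAT:
    default: return value;
    }
}

/* Pick the command to propagate for the affected fields. */
propagatedCmd choosePropagatedCmd(bool persist, long long when_ms, long long now_ms) {
    if (persist) return PROP_HPERSIST;       /* expiration removed */
    if (when_ms <= now_ms) return PROP_HDEL; /* already expired: delete */
    return PROP_HPEXPIREAT;                  /* absolute expiration */
}
```

For example, `HGETEX myhash EX 0 FIELDS 1 f1` resolves to a deadline equal to the current time, so in this sketch it would be propagated as an HDEL for `f1`.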

Performance Comparison

Command Name	QPS Standard	QPS HFE	QPS Diff %	Latency Standard (ms)	Latency HFE (ms)	Latency Diff %
One Large Hash Table
HGET	137988.12	138484.97	+0.36%	0.951	0.949	-0.21%
HSET	138561.73	137343.77	-0.87%	0.948	0.956	+0.84%
HEXISTS	139431.12	138677.02	-0.54%	0.942	0.946	+0.42%
HDEL	140114.89	138966.09	-0.81%	0.938	0.945	+0.74%
Many Hash Tables (100 fields)
HGET	136798.91	137419.27	+0.45%	0.959	0.956	-0.31%
HEXISTS	138946.78	139645.31	+0.50%	0.946	0.941	-0.52%
HGETALL	42194.09	42016.80	-0.42%	0.621	0.625	+0.64%
HSET	137230.69	137249.53	+0.01%	0.959	0.958	-0.10%
HDEL	138985.41	138619.34	-0.26%	0.948	0.949	+0.10%
Many Hash Tables (1000 fields)
HGET	135795.77	139256.36	+2.54%	0.965	0.943	-2.27%
HEXISTS	138121.55	137950.06	-0.12%	0.951	0.952	+0.10%
HGETALL	5885.81	5633.80	-4.28%	2.690	2.841	+5.61%
HSET	137005.08	137400.39	+0.28%	0.959	0.955	-0.41%
HDEL	138293.45	137381.52	-0.65%	0.948	0.955	+0.73%

Accumulated Backlog

[ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
[ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. Note that it might have to require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc...
For this reason I would like to avoid this optimization for the first drop.

codecov · 2025-05-18T09:32:19Z

Codecov Report

❌ Patch coverage is 76.09302% with 514 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.37%. Comparing base (dceb9f3) to head (4319edb).
⚠️ Report is 11 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/vset.c	51.27%	439 Missing ⚠️
src/t_hash.c	94.88%	32 Missing ⚠️
src/aof.c	26.08%	17 Missing ⚠️
src/entry.c	95.40%	8 Missing ⚠️
src/expire.c	95.27%	6 Missing ⚠️
src/module.c	0.00%	5 Missing ⚠️
src/rdb.c	85.18%	4 Missing ⚠️
src/defrag.c	80.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2089      +/-   ##
============================================
- Coverage     71.49%   71.37%   -0.13%     
============================================
  Files           123      125       +2     
  Lines         67487    69207    +1720     
============================================
+ Hits          48251    49395    +1144     
- Misses        19236    19812     +576

Files with missing lines	Coverage Δ
src/anet.c	`72.44% <ø> (ø)`
src/commands.def	`100.00% <ø> (ø)`
src/db.c	`90.47% <100.00%> (+0.47%)`	⬆️
src/hashtable.c	`82.71% <100.00%> (+0.28%)`	⬆️
src/lazyfree.c	`86.39% <100.00%> (+0.28%)`	⬆️
src/object.c	`81.86% <100.00%> (+0.43%)`	⬆️
src/server.c	`88.40% <100.00%> (+0.33%)`	⬆️
src/server.h	`100.00% <ø> (ø)`
src/t_string.c	`96.34% <100.00%> (-0.50%)`	⬇️
src/util.c	`66.21% <100.00%> (+0.41%)`	⬆️
... and 9 more

... and 12 files with indirect coverage changes


zuiderkwast

I did a partial pass on this. I got to the hashtable callback and the entry abstraction. I didn't get to the actual field expiration logic in t_hash and the volatile set though. Need to continue another day.

src/commands/hexpire.json

src/commands/hexpireat.json

src/commands/hexpiretime.json

src/db.c

src/entry.h

src/hashtable.c

src/rdb.h

src/server.h

src/volatile_set.c

src/Makefile

SoftlyRaining

This is a lot of work you've done! I've only had time for a partial review today, but I had a few comments/questions so far. The command schema and entry memory layout looks good to me. It'll be interesting to see perf testing too! 😀

src/hashtable.c

src/hashtable.h

src/volatile_set.h

src/server.h

src/entry.c

ranshid · 2025-06-19T05:24:11Z

This is a lot of work you've done! I've only had time for a partial review today, but I had a few comments/questions so far. The command schema and entry memory layout looks good to me. It'll be interesting to see perf testing too! 😀

We are currently more focused on introducing the functionality and will get to performance testing as soon as possible.

src/hashtable.c

src/entry.c

src/entry.h

ranshid · 2025-06-19T20:43:49Z

src/entry.c

+    zfree(entryAllocPtr(entry));
+}
+
+/* Takes ownership of value, does not take ownership of field */


I will just remove that logic for now. it was meant for sets, but I am not sure it will remain that way.

src/entry.c

src/server.h

src/t_hash.c

src/sds.c

src/t_hash.c

JimB123 · 2025-06-24T18:50:07Z

First comment - 37 changed files??? Dang!

JimB123

still reviewing. Posting Day 1. 😨

src/db.c

src/entry.h

src/entry.c

ranshid · 2025-06-25T07:54:32Z

First comment - 37 changed files??? Dang!

you still have another PR in the oven - it might be bigger than this :(

rjd15372

Hi @ranshid , this is just the review of the entry.c code that you can check while I continue to review the rest of the code.

src/entry.c

ranshid · 2025-07-01T17:51:31Z

Hi @ranshid , this is just the review of the entry.c code that you can check while I continue to review the rest of the code.

Thank you @rjd15372 !

TBH the entry is NOT the main focus of this PR. Most of the entry code is taken from the already existing implementation of hashTypeEntry (with some changes, indeed).

I think the really interesting part is the new commands themselves. This is where the complex logic is introduced (HSETEX, HGETEX, HEXPIRE etc...)

There is also the new volatile set API in t_hash.c (that I do not like that much), but we can focus on this in the PR introducing the volatile set.

src/entry.c

PingXie

sending partial review on entry.* and volatile_set.*. will continue.

src/entry.h

PingXie · 2025-07-13T22:48:10Z

src/entry.h

+#include <stdbool.h>
+
+/*-----------------------------------------------------------------------------
+ * Entry


the name entry is too generic. is there a better name that we could consider? would there be any concern with naming it as hashField? hashEntry also works for me

Old Name	New Recommended Name
entry (type)	hashField
entry.h	hashField.h
entry.c	hashField.c
entryCreate	hashFieldCreate
entryUpdate	hashFieldUpdate
entryFree	hashFieldFree
entryGetField	hashFieldGetName
entryGetValue	hashFieldGetValue
entrySetValue	hashFieldSetValue
entryGetExpiry	hashFieldGetExpiry
entrySetExpiry	hashFieldSetExpiry
entryHasExpiry	hashFieldHasExpiry
entryIsExpired	hashFieldIsExpired
entryMemUsage	hashFieldMemUsage
entryDefrag	hashFieldDefrag
entryDismissMemory	hashFieldDismissMemory
entryHasEmbeddedValue	hashFieldHasEmbeddedValue

I would prefer to avoid making an entry hash related only. I would consider maybe call it mappedEntry or mapEntry?

if the idea is to expand this structure into other data structures like set, what do you think about expirableEntry? basically I am just trying to avoid a very generic name like entry

Yes I can try changing the name to be less "generic" and more specific. However since this is a massive change I would prefer to wait till the other PRs are merged inside.

expirableEntry looks OK for the type, but the filename and function would become a bit long, e.g. expirableEntry.{c,h} and expirableEntryGetValue. It is a little too long to have multiple camelcased words in the prefix itself.

To compare: In an earlier draft, the volatile set used to have volatileSet as the prefix, but with it, it looked like all functions were setters (e.g. volatileSetXyz looks like it is setting something). We changed volatileSet to vset, which is a more concise prefix and filename.

How about ventry = volatile entry? (ventry.c, ventryCreate, ventrySetValue, etc.)

Personally I do not think the entry should be identified by the fact that it is POTENTIALLY volatile.
This is simply a key-value object with optional metadata like expiration and maybe even reference count (like is being suggested in #2299).

I would suggest calling it kventry, although it is possible that it will support non-value entries.

I am holding off on changing it, since I would like to make this change after the other 2 PRs are merged into this one.

This is simply a key-value object with optional metadata like expiration and maybe even reference count

If it's this generic, I actually prefer the current name entry.

Could we potentially use it for various internal hash tables too, like blocked clients or cluster nodes per node-id?

The "kv" in kventry just adds extra junk, especially if it doesn't even always have a value. Also, it incorrectly hints that it's an entry of kvstore. Thus, I don't like kventry very much.

O.K.
Given the API the entry is always a "field" which can be assigned a value. In the current implementation it must be assigned with a value. theoretically we can modify it to be able to work with "NULL" value. IMO this would be more simple in order to manage adding expiration to set elements (rather than creating a new type for sets).
at the time, I also thought about adding expiration to sds instead of entry so to be able to sdswrite field sds with the leading expiration time, but we discussed that it might be too intrusive and extensive to do.
I think that entry which can manage NULL value is fine for future use in sets.

Could we potentially use it for various internal hash tables too, like blocked clients or cluster nodes per node-id?

Theoretically it can be used to store any sds with expiration time and/or any other type of metadata we choose. the original entry implementation in HFE also included the ability to set variable metadata (but I removed this API as there was no use for it back then).

I am O.K with keeping the name entry but got some complaints that it is simply very generic and there are some local variables using this "name" which might be confusing.

Personally I do not want to link the entry name with some of its "optional metadata" characteristics (e.g. ventry for 'volatile'). I would even prefer record or mentry/mapentry (for "mapped entry"), but I am open to other thoughts.

src/entry.c

src/volatile_set.c

PingXie · 2025-07-14T02:03:29Z

src/volatile_set.c

+    raxStart(&it->bucket, set->expiry_buckets);
+}
+
+int volatileSetNext(volatileSetIterator *it, void **entryptr) {


nit - is it possible to use a different name than set? it took me a while to realize that this set is not a verb but noun... would expiryIndex work?

IDK. We had many discussions on names. I have no strong opinion here (I think @zuiderkwast was fond of vset, which is what we eventually took in ranshid#5). I mainly dislike the fact that it is not really a set, as it lacks the protection that guarantees the same element exists only once in the DS (although its usage implies that an element SHOULD exist only once in the set, and the API helps drive this logic).

Yeah, the capitalized "Set" was confusing to me too, but it disappears with the vset prefix, as in vsetNext(). I'd be fine with expiryIndex or something like that too, but a single word is better for the function prefix.

src/hashtable.h

madolson · 2025-07-16T18:43:23Z

src/t_hash.c

-
-    /* HMSET (deprecated) and HSET return value is different. */
-    char *cmdname = c->argv[0]->ptr;
-    if (cmdname[1] == 's' || cmdname[1] == 'S') {


Not related to this PR, but I noticed it while we were talking.

Doesn't this break if you rename HSET?

@madolson I guess so.

PingXie

I have gone through the server code but not the test code. LGTM overall with some high level callouts

1. I think there is a bug in hashTypeIgnoreTTL
2. I still don't like entry - I proposed expirableEntry for your consideration
3. some formatting nits, such as using curly braces consistently and preferring single lines whenever it makes sense.

other than 1), please feel free to address them in separate prs.

this is a hard one. looking forward to the active expiration pr and the new vset refactoring! thanks @ranshid :)

src/commands/hexpire.json

src/commands/hexpireat.json

PingXie · 2025-08-04T03:59:59Z

src/t_hash.c

+/* make any access to the hash object elements ignore the specific elements expiration.
+ * This is mainly in order to be able to access hash elements which are already expired. */
+static inline void hashTypeIgnoreTTL(robj *o, bool ignore) {
+    if (o->encoding == OBJ_ENCODING_HASHTABLE) {


assert?

Suggested change

if (o->encoding == OBJ_ENCODING_HASHTABLE) {

serverAssert(o->encoding == OBJ_ENCODING_HASHTABLE);

Let me tell you what I think: I prefer not to assume the user will not call this based on internal encoding. It is completely valid to call this API on a listpack encoded hash but will just not cause any effect as there are clearly no fields with TTL (at least at THIS stage).
For example, we might (and should) also consider a listpack-like encoding with expiration time; we just excluded it for the time being. When we add something like this, I think the API should not change.

src/t_hash.c

src/volatile_set.h

PingXie · 2025-08-04T04:23:57Z

src/rdb.c

@@ -981,8 +985,17 @@ ssize_t rdbSaveObject(rio *rdb, robj *o, robj *key, int dbid) {
                    return -1;
                }
                nwritten += n;
+                if (add_expiry) {
+                    long long expiry = entryGetExpiry(next);
+                    if ((n = rdbSaveMillisecondTime(rdb, expiry) == -1)) {


curious - is this how redis stores the expiry time as well?

IDK. I am not looking at the redis code.

Closes valkey-io#640 This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**. This is just the first out of 3 PRs. The content of this PR focus on enabling the basic ability to set and modify hash fields expiration as well as persistency (AOF+RDB) and defrag. [The second PR](#5) introduces the new algorithm (volatile-set) to track volatile hash fields is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just s tub implementation and will be replaced by [The second PR](#5) [The third PR](#4) which introduces the active expiration and defragmentation jobs. For more highlevel design details you can track the RFC PR: valkey-io/valkey-rfc#22. --- Some highlevel major decisions which are taken as part of this work: 1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients. 2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue to work on improving the active expiration job and potentially consider introduce lazy-expiration support later on. 3. Although different commands which are adding expiration on hash fields are influencing the memory utilization (by allocating more memory for expiration time and metadata) we decided to avoid adding the DENYOOM for these commands (an exception is HSETEX) in order to be better aligned with highlevel keys commands like `expire` 4. Some hash type commands will produce unexpected results: - HLEN - will still reflect the number of fields which exists in the hash object (either actually expired or not). - HRANDFIELD - in some cases we will not be able to randomly select a field which was not already expired. 
this case happen in 2 cases: 1/ when we are asked to provide a non-uniq fields (i.e negative count) 2/ when the size of the hash is much bigger than the count and we need to provide uniq results. In both cases it is possible that an empty response will be returned to the caller, even in case there are fields in the hash which are either persistent or not expired. 5. For the case were a field is provided with a zero (0) expiration time or expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high level keys are handled, we will emit hexpired keyspace event for that case (instead of hdel). For example: for the case: 6. We will ALWAYS load hash fields during rdb load. This means that when primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoid major refactoring that I suspect will be needed. ``` HSET myhash f1 v1 > 0 HGETEX myhash EX 0 FIELDS 1 f1 > "v1" HTTL myhash FIELDS 1 f1 > -2 ``` The reported events are: ``` 1) "psubscribe" 2) "__keyevent@0__*" 3) (integer) 1 1) "pmessage" 2) "__keyevent@0__*" 3) "__keyevent@0__:hset" 4) "myhash" 1) "pmessage" 2) "__keyevent@0__*" 3) "__keyevent@0__:hexpired" <---------------- note this 4) "myhash" 1) "pmessage" 2) "__keyevent@0__*" 3) "__keyevent@0__:del" 4) "myhash" ``` --- This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency. An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints. The entry pointer is the field sds. Which make us use an entry just like any sds. 
We encode the entry layout type in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to encode this so we use it only for the first layout type. Entry with embedded value, used for small sizes. The value is stored as SDS_TYPE_8. The field can use any SDS type. Entry can also have expiration timestamp, which is the UNIX timestamp for it to be expired. For aligned fast access, we keep the expiry timestamp prior to the start of the sds header. +----------------+--------------+---------------+ | Expiration | field | value | | 1234567890LL | hdr "foo" \0 | hdr8 "bar" \0 | +-----------------------^-------+---------------+ | | entry pointer (points to field sds content) Entry with value pointer, used for larger fields and values. The field is SDS type 8 or higher. +--------------+-------+--------------+ | Expiration | value | field | | 1234567890LL | ptr | hdr "foo" \0 | +--------------+--^----+------^-------+ | | | | | entry pointer (points to field sds content) | value pointer = value sds The `entry.c/h` API provides methods to: - Create, read, and write and Update field/value/expiration - Set or clear expiration - Check expiration state - Clone or delete an entry --- This PR introduces **new commands** and extends existing ones to support field expiration: The proposed API is very much identical to the Redis provided API (Redis 7.4 + 8.0). This is intentionally proposed in order to avoid breaking client applications already opted to use hash items TTL. **Synopsis** ``` HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL] FIELDS numfields field value [field value ...] ``` Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL). The HSETEX command supports the following set of options: * `NX` — Only set the fields if the hash object does NOT exist. * `XX` — Only set the fields if if the hash object doesx exist. 
* `FNX` - Only set the fields if none of them already exist.
* `FXX` - Only set the fields if all of them already exist.
* `EX seconds` - Set the specified expiration time in seconds.
* `PX milliseconds` - Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` - Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` - Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` - Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` - Set the specified expiration time, in seconds.
* `PX milliseconds` - Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` - Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` - Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` - Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.

Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands.
This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.

You can clear the TTL of a specific field by specifying 0 for the `seconds` argument.

Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` - For each specified field, set expiration only when the field has no expiration.
* `XX` - For each specified field, set expiration only when the field has an existing expiration.
* `GT` - For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` - For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` - For each specified field, set expiration only when the field has no expiration.
* `XX` - For each specified field, set expiration only when the field has an existing expiration.
* `GT` - For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` - For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` - For each specified field, set expiration only when the field has no expiration.
* `XX` - For each specified field, set expiration only when the field has an existing expiration.
* `GT` - For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` - For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT`, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after the specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If the `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The `HSETEX` command supports a set of options:

* `NX` - For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

**Synopsis**

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event      | Trigger                                  |
|------------|------------------------------------------|
| `hexpire`  | Field expiration was set                 |
| `hexpired` | Field was deleted due to expiration      |
| `hpersist` | Expiration was removed from a field      |
| `del`      | Key was deleted after all fields expired |

Note that we diverge from Redis in the cases where we emit the `hexpired` event. For example, given the following use case:

```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2
```

Regarding the keyspace notifications, Redis reports:

```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However, in our current suggestion, Valkey will emit:

```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```

---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|--------------|---------|------------|-----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

- [ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
- [ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. Note that this might require some refactoring: 1/ propagate the rdbflags and current time to rdbLoadObject; 2/ consider the case of restore, check_rdb, etc. For this reason I would like to avoid this optimization for the first drop.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
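
For reviewers, the `NX | XX | GT | LT` conditions listed for `HEXPIRE`, `HEXPIREAT`, `HPEXPIRE` and `HPEXPIREAT` in the description above can be pictured as a single per-field predicate. This is a minimal sketch and not the code in this PR: `expire_cond`, `EXPIRY_NONE` and `should_set_expiry` are hypothetical names, and treating a field without a TTL as having an infinite TTL for `GT`/`LT` is an assumption borrowed from the key-level `EXPIRE` semantics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sentinel meaning "the field has no expiration". */
#define EXPIRY_NONE (-1LL)

/* Hypothetical enum mirroring the NX | XX | GT | LT options. */
typedef enum { COND_NONE, COND_NX, COND_XX, COND_GT, COND_LT } expire_cond;

/* Decide whether a requested absolute expiry (in ms) should replace the
 * current one for a single field. */
static inline bool should_set_expiry(expire_cond cond, long long current, long long requested) {
    assert(requested >= 0);
    bool has_ttl = (current != EXPIRY_NONE);
    switch (cond) {
    case COND_NX: return !has_ttl;                         /* only if no TTL yet */
    case COND_XX: return has_ttl;                          /* only if a TTL exists */
    case COND_GT: return has_ttl && requested > current;   /* assumption: no TTL == infinite */
    case COND_LT: return !has_ttl || requested < current;  /* assumption: no TTL == infinite */
    default: return true;                                  /* no condition given */
    }
}
```

With this predicate, `should_set_expiry(COND_GT, 500, 1000)` holds while `should_set_expiry(COND_GT, EXPIRY_NONE, 1000)` does not, since under the stated assumption nothing is greater than "never expires".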

-------------
Overview:
---------
This PR introduces a complete redesign of the 'vset' (stands for volatile set) data structure, creating an adaptive container for expiring entries. The new design is memory-efficient, scalable, and dynamically promotes/demotes its internal representation depending on runtime behavior and volume.

The core concept uses a single tagged pointer (`expiry_buckets`) that encodes one of several internal structures:
- NONE (-1): Empty set
- SINGLE (0x1): One entry
- VECTOR (0x2): Sorted vector of entry pointers
- HT (0x4): Hash table for larger buckets with many entries
- RAX (0x6): Radix tree (keyed by aligned expiry timestamps)

This allows the set to grow and shrink seamlessly while optimizing for both space and performance.

Motivation:
-----------
The previous design lacked flexibility in high-churn environments or workloads with skewed expiry distributions. This redesign enables dynamic layout adjustment based on the time distribution and volume of the inserted entries, while maintaining fast expiry checks and minimal memory overhead.

Key Concepts:
-------------
- All pointers stored in the structure must be odd-aligned to preserve 3 bits for tagging. This is safe with SDS strings (which set the LSB).
- Buckets evolve automatically:
  - Start as NONE.
  - On first insert → become SINGLE.
  - If another entry with similar expiry → promote to VECTOR.
  - If VECTOR exceeds 127 entries → convert to RAX.
  - If a RAX bucket's vector fills and cannot split → promote to HT.
- Each vector bucket is kept sorted by `entry->getExpiry()`.
- Binary search is used for efficient insertion and splitting.

# Coarse Buckets Expiration System for Hash Fields

This PR introduces **coarse-grained expiration buckets** to support per-field expirations in hash types, a feature known as *volatile fields*. It enables scalable expiration tracking by grouping fields into time-aligned buckets instead of individually tracking exact timestamps.
## Motivation

Valkey traditionally supports key-level expiration. However, in many applications, there's a strong need to expire individual fields within a hash (e.g., session keys, token caches, etc.). Tracking these at fine granularity is expensive and potentially unscalable, so this implementation introduces *bucketed expirations* to batch expirations together.

## Bucket Granularity and Timestamp Handling

- Each expiration bucket represents a time slice of fixed width (e.g., 8192 ms).
- Expiring fields are mapped to the **end** of a time slice (not the floor).
- This design facilitates:
  - Efficient *splitting* of large buckets when needed
  - *Downgrading* buckets when fields permit tighter packing
  - Coalescing during lazy cleanup or memory pressure

### Example Calculation

Suppose a field has an expiration time of `1690000123456` ms and the max bucket interval is 8192 ms:

```
BUCKET_INTERVAL_MAX = 8192;
expiry = 1690000123456;
bucket_ts = (expiry & ~(BUCKET_INTERVAL_MAX - 1LL)) + BUCKET_INTERVAL_MAX;
          = (1690000123456 & ~8191) + 8192
          = 1690000121856 + 8192
          = 1690000130048
```

The field is stored in a bucket that **ends at** `1690000130048` ms.

### Bucket Alignment Diagram

```
Time (ms) →        |----------------|----------------|----------------|
8192 ms buckets →  1690000121856    1690000130048
                   ^                ^
                   |                |
              expiry floor     assigned bucket end
```

## Bucket Placement Logic

- If a suitable bucket **already exists** (i.e., its `end_ts > expiry`), the field is added.
- If no bucket covers the `expiry`, a **new bucket** is created at the computed `end_ts`.

## Bucket Downgrade Conditions

Buckets are downgraded to smaller intervals when overpopulated (>127 fields). This happens when **all fields fit into a tighter bucket**.

Downgrade rule:

```
(max_expiry & ~(BUCKET_INTERVAL_MIN - 1LL)) + BUCKET_INTERVAL_MIN < current_bucket_ts
```

If the above holds, all fields can be moved to a tighter bucket interval.
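
The bucket timestamp formula and the downgrade rule above can be sketched in C as follows. This is a minimal sketch rather than the vset implementation: the helper names `bucket_end_ts` and `can_downgrade` are hypothetical, and `BUCKET_INTERVAL_MIN = 1024` is an assumed value chosen for illustration. Both intervals are assumed to be powers of two so that the mask trick is valid.

```c
#include <assert.h>
#include <stdbool.h>

#define BUCKET_INTERVAL_MAX 8192LL
#define BUCKET_INTERVAL_MIN 1024LL /* assumed value, for illustration only */

/* Map an expiry (ms) to the END of its time slice of width `interval`. */
static inline long long bucket_end_ts(long long expiry, long long interval) {
    /* The mask only works for power-of-two intervals. */
    assert(interval > 0 && (interval & (interval - 1)) == 0);
    return (expiry & ~(interval - 1LL)) + interval;
}

/* Downgrade rule: true when the latest expiry in the bucket still fits in a
 * minimum-width bucket that ends before the current bucket does. */
static inline bool can_downgrade(long long max_expiry, long long current_bucket_ts) {
    return bucket_end_ts(max_expiry, BUCKET_INTERVAL_MIN) < current_bucket_ts;
}
```

For instance, `bucket_end_ts(10000, BUCKET_INTERVAL_MAX)` evaluates to `16384`, and `can_downgrade(10000, 16384)` is true because the 1024 ms slice containing 10000 ends at 10240.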
### Downgrade Bucket - Diagram

```
Before downgrade:

  Current Bucket (8192 ms)
  |----------------------------------------|
  | Field A | Field B | Field C | Field D  |
  | exp=+30 | +200    | +500    | +1500    |
  |----------------------------------------|
            ↑
    All expiries fall before tighter boundary

After downgrade to 1024 ms:

  New Bucket (1024 ms)
  |------------------|
  | A | B | C | D    |
  |------------------|
```

### Bucket Split Strategy

If downgrade is not possible, the bucket is **split**:

- Fields are sorted by expiration time.
- A subset that fits in an earlier bucket is moved out.
- Remaining fields stay in the original bucket.

### Split Bucket - Diagram

```
Before split:

  Large Bucket (8192 ms)
  |--------------------------------------------------|
  | A | B | C | D | E | F | G | H | I | J | ... | Z  |
  |---------------- Sorted by expiry ----------------|
            ↑
    Fields A–L can be moved to an earlier bucket

After split:

  Bucket 1 (end=1690000128000)    Bucket 2 (end=1690000130048)
  |------------------------|      |------------------------|
  | A | B | C | ... | L    |      | M | N | O | ... | Z    |
  |------------------------|      |------------------------|
```

## Summary of Bucket Behavior

| Scenario                       | Action Taken                 |
|--------------------------------|------------------------------|
| No bucket covers expiry        | New bucket is created        |
| Existing bucket fits           | Field is added               |
| Bucket overflows (>127 fields) | Downgrade or split attempted |

API Changes:
------------

Create/Free:
    void vsetInit(vset *set);
    void vsetClear(vset *set);

Mutation:
    bool vsetAddEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
    bool vsetRemoveEntry(vset *set, vsetGetExpiryFunc getExpiry, void *entry);
    bool vsetUpdateEntry(vset *set, vsetGetExpiryFunc getExpiry, void *old_entry,
                         void *new_entry, long long old_expiry, long long new_expiry);

Expiry Retrieval:
    long long vsetEstimatedEarliestExpiry(vset *set, vsetGetExpiryFunc getExpiry);
    size_t vsetPopExpired(vset *set, vsetGetExpiryFunc getExpiry, vsetExpiryFunc expiryFunc,
                          mstime_t now, size_t max_count, void *ctx);

Utilities:
    bool vsetIsEmpty(vset *set);
    size_t vsetMemUsage(vset *set);

Iteration:
    void vsetStart(vset *set, vsetIterator *it);
    bool vsetNext(vsetIterator *it, void **entryptr);
    void vsetStop(vsetIterator *it);

Entry Requirements:
-------------------
All entries must conform to the following interface via `volatileEntryType`:

    sds entryGetKey(const void *entry);          // for deduplication
    long long getExpiry(const void *entry);      // used for bucketing
    int expire(void *db, void *o, void *entry);  // used for expiration callbacks

Diagrams:
---------

1. Tagged Pointer Representation
-----------------------------
Lower 3 bits of `expiry_buckets` encode bucket type:

    +------------------------------+
    |     pointer       | TAG (3b) |
    +------------------------------+
                          ↑
                          masked via VSET_PTR_MASK

    TAG values:
      0x1 → SINGLE
      0x2 → VECTOR
      0x4 → HT
      0x6 → RAX

2. Evolution of the Bucket
------------------------

*Volatile set top-level structure:*

```
+--------+     +--------+     +--------+     +--------+
| NONE   | --> | SINGLE | --> | VECTOR | --> | RAX    |
+--------+     +--------+     +--------+     +--------+
```

*If the top-level element is a RAX, it has child buckets of type:*

```
+--------+     +--------+     +-----------+
| SINGLE | --> | VECTOR | --> | HASHTABLE |
+--------+     +--------+     +-----------+
```

*Vectors can split into multiple vectors and shrink into SINGLE buckets. A RAX with only one element is collapsed by replacing the RAX with its single element on the top level (except for HASHTABLE buckets which are not allowed on the top level).*

3. RAX Structure with Expiry-Aligned Keys
--------------------------------------
Buckets in RAX are indexed by aligned expiry timestamps:

    +------------------------------+
    | RAX key (bucket_ts) → Bucket |
    +------------------------------+
    | 0x00000020 → VECTOR          |
    | 0x00000040 → VECTOR          |
    | 0x00000060 → HT              |
    +------------------------------+

4. Bucket Splitting (Inside RAX)
-----------------------------
If a vector bucket in a RAX fills:
- Binary search for best split point.
- Use `getExpiry(entry)` + `get_bucket_ts()` to find transition.
- Create 2 new buckets and update RAX.

    Original:
      [entry1, entry2, ..., entryN]   ← bucket_ts = 64ms

    After split:
      [entry1, ..., entryK]           → bucket_ts = 32ms
      [entryK+1, ..., entryN]         → bucket_ts = 64ms

If all entries share same bucket_ts → promote to HT.

5. Shrinking Behavior
------------------
On deletion:
- HT may shrink to VECTOR.
- VECTOR with 1 item → becomes SINGLE.
- If RAX has only one key left, it's promoted up.

Summary:
--------
This redesign provides:
✓ Fine-grained memory control
✓ High scalability for bursty TTL data
✓ Fast expiry checks via windowed organization
✓ Minimal overhead for sparse sets
✓ Flexible binary-search-based sorting and bucketing

It also lays the groundwork for future enhancements, including metrics, prioritized expiry policies, or segmented cleaning.
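
The tagged pointer representation described under "Tagged Pointer Representation" can be illustrated with a small C sketch. It is an assumption-laden illustration, not the vset code: `TAG_MASK`, `tag_pointer`, `pointer_tag` and `untag_pointer` are hypothetical names, only the tag values come from the description, and the sketch assumes container pointers that are at least 8-byte aligned so the low 3 bits are free. How SINGLE entries (SDS pointers with the LSB already set) are stored is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Tag values as listed in the description. */
enum { TAG_SINGLE = 0x1, TAG_VECTOR = 0x2, TAG_HT = 0x4, TAG_RAX = 0x6 };

/* Hypothetical mask covering the low 3 bits used for the tag. */
#define TAG_MASK ((uintptr_t)0x7)

/* Combine an 8-byte-aligned pointer with a 3-bit tag. */
static inline void *tag_pointer(void *ptr, unsigned tag) {
    assert(((uintptr_t)ptr & TAG_MASK) == 0); /* low bits must be free */
    assert((tag & ~TAG_MASK) == 0);           /* tag must fit in 3 bits */
    return (void *)((uintptr_t)ptr | tag);
}

/* Extract the bucket type from a tagged pointer. */
static inline unsigned pointer_tag(const void *tagged) {
    return (unsigned)((uintptr_t)tagged & TAG_MASK);
}

/* Recover the original pointer by masking out the tag bits. */
static inline void *untag_pointer(void *tagged) {
    return (void *)((uintptr_t)tagged & ~TAG_MASK);
}
```

Keeping the type in the pointer itself means an empty or single-entry set costs no allocation beyond the pointer word, which matches the "minimal overhead for sparse sets" goal stated in the summary.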
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

This change adds support for active expiration of hash fields with TTLs (Hash Field Expiration), building on the existing key-level expiry system.

Field TTL metadata is tracked in volatile sets associated with each hash key. Expired fields are reclaimed incrementally by the active expiration loop, using a new job type to alternate between key expiry and field expiry within the same logic and effort budget.

Both key and field expiration now share the same scheduler infrastructure. Alternating job types ensures fairness and avoids starvation, while keeping CPU usage predictable.

    +-----------------+
    |       DB        |
    +-----------------+
             |
             v
    +---------------------+
    |       myhash        |   (key with TTL)
    +---------------------+
             |
             v
    +------------------------------------+
    |         fields (hashType)          |
    |  - field1                          |
    |  - field2                          |
    |  - fieldN                          |
    +------------------------------------+
             |
             v
    +------------------------------------+
    |   volatile set (field-level TTL)   |
    |  - field1 expires at T1            |
    |  - field5 expires at T5            |
    +------------------------------------+

No new configuration was introduced; the existing active-expire-effort and time budget are reused for both key and field expiry.

Active defrag for volatile sets is also added.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
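
The idea of alternating between key expiry and field expiry under one shared budget can be sketched as below. This is only an illustration of the alternation described in the commit message, not the scheduler in the server: every name (`expire_cycle`, `expire_step_fn`, `run_expire_cycle`, `demo_step`) is hypothetical, and the budget is simplified to an item count, whereas the commit message refers to the existing effort and time budget.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { JOB_KEYS, JOB_FIELDS, JOB_TYPE_COUNT } expire_job_type;

typedef struct {
    expire_job_type next_job;         /* job type to run on the next step */
    size_t reclaimed[JOB_TYPE_COUNT]; /* items reclaimed per job type */
} expire_cycle;

/* Hypothetical callback: examine up to `max_items` of one job type and
 * return how many were actually expired. */
typedef size_t (*expire_step_fn)(expire_job_type type, size_t max_items, void *ctx);

/* Spend `budget` items in slices of `step_size`, switching job type after
 * every slice so neither keys nor fields can starve the other. */
static inline void run_expire_cycle(expire_cycle *c, expire_step_fn step,
                                    size_t budget, size_t step_size, void *ctx) {
    assert(step_size > 0);
    while (budget > 0) {
        size_t quota = budget < step_size ? budget : step_size;
        c->reclaimed[c->next_job] += step(c->next_job, quota, ctx);
        budget -= quota;
        c->next_job = (c->next_job == JOB_KEYS) ? JOB_FIELDS : JOB_KEYS;
    }
}

/* Demo callback that pretends every examined item was expired. */
static inline size_t demo_step(expire_job_type type, size_t max_items, void *ctx) {
    (void)type;
    (void)ctx;
    return max_items;
}
```

Because the job type is stored in the cycle state, a later invocation resumes with whichever type did not run last, which is one simple way to keep the split between keys and fields even across calls.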

This is needed due to changes presented in #2089 --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

enjoy-binbin · 2025-08-06T02:39:40Z

It would be great if next time, when doing a rebase merge, we include the PR number (like #2089) in the commit message title. (I usually locate the PR web page based on the commit message title.)

Following the new API introduced in #2089, we might access out-of-bounds memory on some illegal command input.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

xbasel added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label May 25, 2025

ranshid marked this pull request as ready for review June 4, 2025 11:37

ranshid force-pushed the ttl-poc-new branch from ea6ed7c to 2c4c312 Compare June 4, 2025 12:37

zuiderkwast mentioned this pull request Jun 16, 2025

Hash TTL tests - [don't review] #2136

Draft

zuiderkwast reviewed Jun 16, 2025


ranshid force-pushed the ttl-poc-new branch from f63d829 to de675bc Compare June 18, 2025 10:12

zuiderkwast reviewed Jun 18, 2025



ranshid force-pushed the ttl-poc-new branch 2 times, most recently from 65eeb1d to 8ecd584 Compare June 18, 2025 15:12

SoftlyRaining reviewed Jun 18, 2025


ranshid force-pushed the ttl-poc-new branch from af11752 to 2d5c653 Compare June 19, 2025 06:06

zuiderkwast reviewed Jun 19, 2025


