Hash Field Expiration RFC #22

ranshid · 2025-05-07T13:10:56Z

No description provided.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

HFE.md

zuiderkwast

I didn't read all of it, but don't spend too much time on the document. I'd like to see a working implementation. :)

HFE.md

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

madolson

Glad to see us making progress on our top feature request!

HFE.md

madolson · 2025-05-07T16:37:52Z

HFE.md

+    NOTE that for some cases (e.g HSETEX, there will be 2 events issued: `HSET` and `EXPIRE`)
+* A new `hpersist` event will be issued whenever an item is persisted. this can be when `HPERSIST` was issued.
+* A new `hexpired` event will be issued whenever an item is actually being expired (either actively or lazily) 
+    NOTE 1 - for the initial implementation the plan is to emit `hexpired` event for each field expiry, however it might be a valid future performance optimization to batch multiple expirations on the same key into a single event reporting.  


Sounds like something we shouldn't optimize for later if we think it's important. That would require client changes.

I am not sure which is better. The event carries no indication of which field expired, BUT if we batch into a single event, the application will not receive as many events as there are expired items. So it is a trade-off between performance and functionality. It is true that batching later will probably be a breaking change (but if it does break something, that also means some applications logically depend on this separation :) )

Are there any other examples of keyspace events which mention multiple keys (fields)? For instance, does HMSET generate a single event or multiple?

If there are no examples of events spanning multiple keys/fields, I think it's better to maintain consistency and simplicity. It's very probable that nobody will subscribe to these events, which makes performance a non-issue.

> Are there any other examples of keyspace events which mention multiple keys (fields)? For instance, does HMSET generate a single event or multiple?

Most commands generate a single event per command and not per field. For example, `hdel` is reported once per command. In this proposal the `hexpire` event will ALSO be reported once per command and not per item. The proposal DOES state that the new `hexpired` event will be reported once per item. However, I can think of ways to improve that if we decide to. For example, we can identify at the start/end of the access context that items were expired and issue the event when the access context closes.
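To make the trade-off in this thread concrete, here is a minimal sketch of the two reporting strategies. The function names and the `(event, key)` tuple shape are illustrative assumptions, not Valkey internals; the only fact taken from the discussion is that the event names the key and not the field.

```python
def per_field_events(key, expired_fields):
    """Initial plan: one `hexpired` event for every field that was deleted."""
    return [("hexpired", key) for _ in expired_fields]


def batched_events(key, expired_fields):
    """Possible optimization: at most one `hexpired` event per key per access context."""
    return [("hexpired", key)] if expired_fields else []


# Three fields of the same hash expire within one access context.
fields = ["f1", "f2", "f3"]
print(per_field_events("myhash", fields))  # three identical events
print(batched_events("myhash", fields))    # a single event
```

Because the events are indistinguishable from one another, a subscriber can only count them; batching changes that count, which is why switching later would be observable to clients.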

madolson · 2025-05-08T15:46:42Z

HFE.md

+* As item expiration will produce replication content, in some cases we will avoid applying full expiration logic.
+    the following cases will avoid lazy expiring items:
+     - during HSCAN, HGETALL, Copy (when duplicating an element) and RDB/EOF loading.


I assume also client pause and coordinated failovers.

Yes, true. I mentioned elsewhere that the expiration context will apply the same logic as generic key expiration (i.e. when expirations are paused, on a replica, etc.)

Said another way... You're saying that READ commands (like HGETALL) shouldn't perform modifications (WRITES) to the data structure. Right?

Modification of the structure (even deleting logically expired data) should be avoided during a "READ" operation.

> Modification of the structure (even deleting logically expired data) should be avoided during a "READ" operation.

No. Saying that would basically mean we do not support lazy expiration. I am saying that there is common expiration logic which is applied to generic keys. Currently this logic can be seen in `expireIfNeededWithDictIndex`, which performs various checks on importing mode, replication client, eviction pause, etc. We will comply with the same logic when deciding whether to expire hash items.
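The distinction drawn here, between a field being logically expired and the server being allowed to delete it right now, can be sketched as follows. This is a simplified illustration, not the actual `expireIfNeededWithDictIndex` logic: `ServerState` and its flags are hypothetical names for the conditions mentioned in the thread, and treating a field as expired once the clock has passed its timestamp is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerState:
    # Hypothetical flags standing in for the checks named in the discussion.
    is_replica: bool = False         # replicas wait for the primary to propagate deletions
    expiration_paused: bool = False  # e.g. client pause or coordinated failover
    import_mode: bool = False


def is_logically_expired(expire_at_ms: Optional[int], now_ms: int) -> bool:
    """A field without a TTL (None) never expires."""
    return expire_at_ms is not None and now_ms > expire_at_ms


def may_delete_expired_field(state: ServerState, expire_at_ms: Optional[int], now_ms: int) -> bool:
    """Decide whether lazy expiration may physically delete the field right now."""
    if not is_logically_expired(expire_at_ms, now_ms):
        return False
    if state.is_replica or state.expiration_paused or state.import_mode:
        # Still logically expired for the caller, but not deleted in this context.
        return False
    return True


print(may_delete_expired_field(ServerState(), 100, 200))                 # True
print(may_delete_expired_field(ServerState(is_replica=True), 100, 200))  # False
```

Under this model a read command can still hide an expired field from its reply while leaving the deletion (and the replication traffic it produces) to a context where it is allowed.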

HFE.md

madolson · 2025-05-08T15:47:37Z

HFE.md

+* As item expiration will produce replication content, in some cases we will avoid applying full expiration logic.
+    the following cases will avoid lazy expiring items:
+     - during HSCAN, HGETALL, Copy (when duplicating an element) and RDB/EOF loading.


I assume also client pause and coordinated failovers.

I also don't follow why HGETALL wouldn't emit deletion events? KEYS * does?

AFAIK the KEYS command does NOT expire keys. It just avoids adding them to the response.

Yes, we don't want READ commands to be generating replication traffic.

> Yes, we don't want READ commands to be generating replication traffic.

Well, we probably don't want it because of the implementation complexity; however, lazily expiring items is basically what this feature provides. We can decide to skip lazy expiration in READ commands if we want to.

madolson · 2025-05-08T15:53:28Z

HFE.md

+
+### Volatile hash entry memory layout
+
+Currently a field is always an SDS in Valkey. Although it is possible to match a key with external metadata (e.g. TTL) by mapping the key to the relevant metadata, this incurs extra memory to hold the mapping and extra CPU cycles to locate the TTL on each query. Some dictionaries use objects with embedded keys, where the metadata can be set as part of the object. However, that would require every dictionary which needs TTL support to use objects with embedded keys, and might significantly complicate existing code paths as well as require extra memory to hold the object metadata.


I think the folks from VSS want to replace hash values with VSS indexed positions as well, I think that's covered with your referenced value, but wanted to comment.

HFE.md

madolson · 2025-05-08T15:59:47Z

HFE.md

+    Cons:
+     - Error prone - there are many cases where an item is accessed  
+     - Might require extensive code changes.
+     - In some cases can lead to performance degradation on the good-path - It is possible that in order to avoid code complexity we would consider to apply the `itemExpireIfNeeded` logic by first searching the item (provided in the command arguments) and then proceed with the normal implementation. Since double searching the element would probably maintain cache locality for the second search, in reality we observed 2-3% degradation by applying double search on items in the command processing.


We might consider having a dedicated "HashWithExpire" type that also includes the number of expirable items. That way we can efficiently skip any logic if we are operating on the special type.

It is possible; however, I wanted to check whether we can make the change with a minimal footprint, without sacrificing other things.

Regarding dynamically changing the hashtable type: it is doable, probably a small change. Let's evaluate during the PR review?

> that also includes the number of expirable items. That way we can efficiently skip any logic if we are operating on the special type.

That one is planned and will be part of the initial draft.
We will have both hashTypeHasVolatileItems() and hashTypeNumVolatileItems().
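As a rough illustration of why a per-hash volatile counter helps, the sketch below keeps a count of fields with a TTL and skips the expiry check entirely when the count is zero. The class and its Python method names are hypothetical; only the idea of "has volatile items" and "number of volatile items" comes from the thread.

```python
class HashWithVolatileCount:
    """Toy hash that tracks how many of its fields carry an expiration time."""

    def __init__(self):
        self._fields = {}       # field -> (value, expire_at_ms or None)
        self._num_volatile = 0  # number of fields with a TTL

    def num_volatile_items(self):
        return self._num_volatile

    def has_volatile_items(self):
        return self._num_volatile > 0

    def set(self, field, value, expire_at_ms=None):
        old = self._fields.get(field)
        if old is not None and old[1] is not None:
            self._num_volatile -= 1  # overwriting drops the old TTL
        if expire_at_ms is not None:
            self._num_volatile += 1
        self._fields[field] = (value, expire_at_ms)

    def delete(self, field):
        old = self._fields.pop(field, None)
        if old is None:
            return False
        if old[1] is not None:
            self._num_volatile -= 1
        return True

    def get(self, field, now_ms):
        entry = self._fields.get(field)
        if entry is None:
            return None
        value, expire_at_ms = entry
        if not self.has_volatile_items():
            return value  # fast path: no field in this hash can be expired
        if expire_at_ms is not None and now_ms > expire_at_ms:
            return None   # logically expired, treated as missing
        return value
```

A hash that never uses field TTLs only pays for one integer comparison per access, which is the "efficiently skip any logic" property discussed above.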

madolson · 2025-05-08T16:02:35Z

HFE.md

+     - Minimal code changes.
+     - Less error prone - since this will be applied in every hashtable access we will reduce the risk of missing an item being accessed in some flows.
+Cons:
+      - Additional check for `accessElement` existence in the generic hashtable implementation (we have not yet evaluated if


My intuition is that this will be more degrading than we expect, but it's hard to intuit this without actually seeing the code.

Yeah. The draft will be issued this week (if all goes well).

hwware · 2025-05-08T18:28:09Z

HFE.md

+* Support Redis compatible API to set, get and manipulate Hash field TTL.
+* Support both Lazy and active expiry mechanisms for Hash field expirations.
+* Support Replication of elements TTL as well as expired element replication.
+* Extended support for the same functionality with Sets and Sorted sets.


I just went through several parts quickly, and Sets and Sorted Sets are not mentioned in this RFC. Can we just remove these words? Secondly, is there any reason a List cannot have a TTL for its elements?

Sets and Sorted Sets are something we should plan for IMO. I am just not sure we will be able to make the 9.0 timeline with both of them, so we prioritised Hashes first. Sets are the next target.

> is there any reason a List cannot have a TTL for its elements

No special reason aside from the fact that there was no clear request for it in the community. You could use a set or sorted set for that, right?

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

xbasel · 2025-05-11T09:40:09Z

HFE.md

+### Tenets
+
+* **memory efficiency** - At the highest priority we target a minimal metadata overhead used in order to store and manage items TTL. While the optimal overhead to maintain item TTL is 8 bytes (could be less if we allow keeping offsets from the existing epoch diff time), we understand that maintaining active expiry logic will require use of more bytes for metadata. We will make our top priority effort to minimize this overhead.
+* **Latency** - After memory efficiency considerations we will require a solution which provides low latency for hash operations. Hash objects are expected to operate in O(1) for all single access operations (get, set, delete etc...) and we will not break this promise even for items with expiry.


"provides low latency" is vague. Do you mean 'doesn’t regress existing latency by a noticeable margin'?

IMO I was clear about asymptotic guarantees.

xbasel · 2025-05-11T10:46:23Z

HFE.md

+1. VALID - meaning the item is to be treated as any existing item and be included in replies as well as operated on during actions and mutations.     
+2. INVALID - meaning the item should be treated as “not exists” and is NOT to be included in any operations and/or replies.
+3. INVALIDATED(deleted) - meaning the item is to-be-removed immediately (i.e expired) and thus does not exist anymore. 


The terms VALID, INVALID, and INVALIDATED are too similar and not intuitive. INVALID suggests corruption, but an expired field is still well-formed and valid. Consider clearer terms like LIVE/NORMAL, EXPIRED, and DELETED/PURGED to reflect TTL state accurately.

I agree. In general, I think we should use the word "expired" to mean that the time has passed for a key, whether it still exists in memory or not. An expired key can exist in the database if we haven't deleted it yet.

We should not use this word as a transitive verb as in "to expire a key". Instead we should say "to delete an expired key".

I agree about this terminology in the scope of lazy expiration (i.e. in the RFC). I used these terms since these are the states recognized by the hashtable, which I would like to keep out of the scope of "expiration" and "volatile".
I will change it in the RFC.

@xbasel / @zuiderkwast fixed. Does it look closer to your point now?

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

zuiderkwast · 2025-05-11T20:45:29Z

HFE.md

+NOTE - we can also consider adding keys_with_volatile_items statistic to track how many objects have 
+volatile items. eg:
+
+```
+db0:keys=1,expires=0,avg_ttl=0,volatile_items=16,keys_with_volatile_items=1


We don't have anything for items currently so it seems fine to me to skip the info about volatile items initially.

If we want it, then I guess we should also add the total number of items (including non-volatile) and the number of keys with items (including non-volatile items).

Regarding the naming, shouldn't it match the naming used for keys? Here "expires" means the number of volatile keys. To match that, we could use "items", "item_expires" and "item_avg_ttl".

Later, if we add expiration on set elements, sorted set elements, etc. then all these are considered items too, right? Isn't it more useful to have the number of hash fields with expire, set elements with expire and sorted set elements with expire as separate metrics? The full picture starts to look like Wen's KEYSIZES fields: valkey-io/valkey#1967

> The full picture starts to look like Wen's KEYSIZES fields: valkey-io/valkey#1967

Exactly. I think we will be able to add items statistics and such later on as part of the new KEYS observability

Again, I suggest a new line. Also, in the new line, you can't use a prefix like db0: as this might be searched for explicitly.

I just checked what redis does. From their docs of the INFO command, we can see they added "subexpiry":

# Keyspace
db0:keys=112125,expires=456,avg_ttl=31368299122246,subexpiry=0

xbasel · 2025-05-12T12:43:59Z

HFE.md

+1. volatile_items will be added to the Keyspace section per-db line. eg:
+
+```
+db0:keys=1,expires=0,avg_ttl=0,volatile_items=16


volatile_items is technically clear, but slightly ambiguous, it might be confused with expires. Maybe:
expire_fields
expiring_fields
volatile_fields

I recommend against modification of this existing INFO line.
This line - containing the number of keys - is one of the most likely lines in INFO to be parsed by a client application (or even a client library). Altering this line has the opportunity to break a large number of existing client applications.

I suggest you add an additional line (or lines) containing the new information.

We can hope that clients that parse this split by comma and can handle extra fields. It may break a few but the alternative (a new line per db) may be uglier.

Redis added "subexpiry=0". Can we do the same? We copy their command API...
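The compatibility concern in this thread depends on how clients parse the per-db line. The sketch below contrasts a strict parser that assumes exactly the three existing fields with a tolerant one that splits on commas; both functions are illustrative and not taken from any real client library.

```python
import re

# Strict: assumes the line has exactly keys, expires and avg_ttl, in that order.
STRICT = re.compile(r"db(\d+):keys=(\d+),expires=(\d+),avg_ttl=(\d+)")


def parse_strict(line):
    match = STRICT.fullmatch(line)
    return None if match is None else tuple(int(g) for g in match.groups())


def parse_tolerant(line):
    """Split on ':' and ',' so that unknown extra fields are simply carried along."""
    db, _, stats = line.partition(":")
    values = {}
    for pair in stats.split(","):
        name, _, value = pair.partition("=")
        values[name] = int(value)
    return db, values


old_line = "db0:keys=1,expires=0,avg_ttl=0"
new_line = old_line + ",volatile_items=16"

print(parse_strict(old_line))    # (0, 1, 0, 0)
print(parse_strict(new_line))    # None: the strict client breaks on the extra field
print(parse_tolerant(new_line))  # ('db0', {'keys': 1, 'expires': 0, 'avg_ttl': 0, 'volatile_items': 16})
```

Clients written like `parse_tolerant` are unaffected by an appended field such as `volatile_items` or `subexpiry`; clients written like `parse_strict` are the ones at risk.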

JimB123 · 2025-05-16T22:04:49Z

HFE.md

+* **Latency** - After memory efficiency considerations we will require a solution which provides low latency for hash operations. Hash objects are expected to operate in O(1) for all single access operations (get, set, delete etc...) and we will not break this promise even for items with expiry.
+* **CPU efficiency** - After latency we prioritize system CPU efficiency. For example, we would like to avoid high CPU utilization caused by the need to perform null active expiry checks during cron runs.
+* **Compatibility** - We will avoid breaking clients which are already using the HFE API provided by other providers.
+* **Coherency** -  We would like the reported system state to match the logical state as much as possible. For example the reported number of keys per DB is expected to match the number of keys which are NOT logically expired.  


I'm not completely sure what you're saying here. Are you comparing items in a hash to items in the DB?

From the DB perspective, when we perform INFO, the number of keys reported in the last line will include keys which have passed the expiration time but have not yet been physically deleted, right?

Are you suggesting that HASHes should behave "properly"? and only report the number of unexpired items?

> Are you suggesting that HASHes should behave "properly"? and only report the number of unexpired items?

I was mainly setting a tenet here. We will not fully support this at this point (for example, HLEN will return the number of items even though some of them have already expired). In some implementations, though, we can know EXACTLY how many items are expired. For example, if we track all volatile hash items in a rax (like the client's timeout rax), we can provide an O(1) report of the number of items which are already expired. However, this would cost a lot of memory to maintain, so it is currently avoided; the fact that this tenet is lower priority than memory efficiency justifies that decision.
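The gap between the reported and the logical count can be shown with a small sketch. The dictionary model (field mapped to an expiry timestamp or `None`) and both function names are assumptions for illustration only.

```python
def hlen(fields):
    """Physical count: every stored field, expired or not (what HLEN reports here)."""
    return len(fields)


def live_count(fields, now_ms):
    """Logical count: an O(N) scan that skips fields whose expiry time has passed."""
    return sum(
        1
        for expire_at_ms in fields.values()
        if expire_at_ms is None or now_ms <= expire_at_ms
    )


# field -> expire_at_ms (None means no TTL)
fields = {"a": None, "b": 100, "c": 500}
print(hlen(fields))             # 3
print(live_count(fields, 200))  # 2, because "b" expired at 100 but is still stored
```

Getting the logical count without the scan requires extra bookkeeping per volatile item, which is the memory cost the reply above declines to pay.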

JimB123 · 2025-05-16T22:12:56Z

HFE.md

+    NOTE that for some cases (e.g HSETEX, there will be 2 events issued: `HSET` and `EXPIRE`)
+* A new `hpersist` event will be issued whenever an item is persisted. this can be when `HPERSIST` was issued.
+* A new `hexpired` event will be issued whenever an item is actually being expired (either actively or lazily) 
+    NOTE 1 - for the initial implementation the plan is to emit `hexpired` event for each field expiry, however it might be a valid future performance optimization to batch multiple expirations on the same key into a single event reporting.  


Are there any other examples of keyspace events which mention multiple keys(fields)? For instance, does HMSET, generate a single event? or multiple?

If there are no examples of events spanning multiple keys/fields, I think it's better to maintain consistency & simplicity. It's very probable that nobody will subscribe to these events, which make performance a non-issue.

JimB123 · 2025-05-16T22:19:33Z

HFE.md

+1. volatile_items will be added to the Keyspace section per-db line. eg:
+
+```
+db0:keys=1,expires=0,avg_ttl=0,volatile_items=16


I recommend against modification of this existing INFO line.
This line - containing the number of keys - is one of the most likely lines in INFO to be parsed by a client application (or even a client library). Altering this line has the opportunity to break a large number of existing client applications.

I suggest you add an additional line (or lines) containing the new information.

JimB123 · 2025-05-16T22:20:59Z

HFE.md

+NOTE - we can also consider adding keys_with_volatile_items statistic to track how many objects have 
+volatile items. eg:
+
+```
+db0:keys=1,expires=0,avg_ttl=0,volatile_items=16,keys_with_volatile_items=1


Again, I suggest a new line. Also, in the new line, you can't use a prefix like db0: as this might be searched for explicitly.

JimB123 · 2025-05-16T22:23:41Z

HFE.md

+* As item expiration will produce replication content, in some cases we will avoid applying full expiration logic.
+    the following cases will avoid lazy expiring items:
+     - during HSCAN, HGETALL, Copy (when duplicating an element) and RDB/EOF loading.


Said another way... You're saying that READ commands (like HGETALL) shouldn't perform modifications (WRITES) to the data structure. Right?

Modification of the structure (even deleting logically expired data) should be avoided during a "READ" operation.

JimB123 · 2025-05-16T22:24:21Z

HFE.md

+* As item expiration will produce replication content, in some cases we will avoid applying full expiration logic.
+    the following cases will avoid lazy expiring items:
+     - during HSCAN, HGETALL, Copy (when duplicating an element) and RDB/EOF loading.


Yes, we don't want READ commands to be generating replication traffic.

zuiderkwast · 2025-05-17T09:42:56Z

HFE.md

+    The same `hexpire` event will be issued for all different commands which manipulate item TTL (e.g.  `HEXPIRE`, `HEXPIREAT`, `HPEXPIRE` etc...)
+    NOTE that for some cases (e.g HSETEX, there will be 2 events issued: `HSET` and `HEXPIRE`)
+* A new `hpersist` event will be issued whenever an item is persisted. this can be when `HPERSIST` was issued.
+* A new `hexpired` event will be issued whenever an item is actually being expired (either actively or lazily) 


Regarding "being expired" terminology, I think the word expired should always refer to logically expired.

Suggested change

* A new `hexpired` event will be issued whenever an item is actually being expired (either actively or lazily)

* A new `hexpired` event will be issued whenever an expired item is detected and deleted (either actively or lazily)

xbasel · 2025-06-24T20:31:57Z

HFE.md

+### Active expiry cycle and credits 
+
+We plan to introduce a new type of active expiry cycle (in addition to `ACTIVE_EXPIRE_CYCLE_FAST` and `ACTIVE_EXPIRE_CYCLE_SLOW` ): `ACTIVE_EXPIRE_CYCLE_ITEMS`.
+The new expiry cycle will use the same overall logic as the regular active expiry cycle with the following adjustments:  


Key active expiry is built around iterating over databases, and for each DB it scans db->expires using a hit-or-miss approach over keys with expiry. Hash field expiry doesn’t follow the same model, it’s not based on random sampling, and its pacing is entirely different.

I think trying to couple field expiry with key expiry would require a significant refactor, and I don’t see a clear benefit. Expiring keys and fields are logically unrelated and should remain decoupled. I don’t see how field expiry fits naturally into the existing loops in activeExpireCycle(). Entry expiry should have its own database iterator IMO.

Using the same logic for both means combining two almost entirely independent mechanisms into one, which adds complexity without meaningful gain. Is it really worth it?

Separate sounds good to me, unless Ran has some good motivation for coupling them.

Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently, creating what we call **volatile fields**.

This is the first of 3 PRs. This PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag. [The second PR](#5), which introduces the new algorithm (volatile-set) to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by [the second PR](#5). [The third PR](#4) introduces the active expiration and defragmentation jobs.

For more high-level design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some major high-level decisions taken as part of this work:

1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy expiration at this point, in order to reduce complexity, and to rely only on active expiration for memory reclamation. This will require us to continue improving the active expiration job and potentially consider introducing lazy expiration support later on.
3. Although the commands which add expiration to hash fields influence memory utilization (by allocating more memory for the expiration time and metadata), we decided to avoid adding DENYOOM to these commands (HSETEX is an exception) in order to be better aligned with high-level key commands like `expire`.
4. Some hash type commands will produce unexpected results:
   - HLEN - will still reflect the number of fields which exist in the hash object (whether actually expired or not).
   - HRANDFIELD - in some cases we will not be able to randomly select a field which has not already expired. This happens in 2 cases: (1) when we are asked to provide non-unique fields (i.e. negative count); (2) when the size of the hash is much bigger than the count and we need to provide unique results. In both cases it is possible that an empty response will be returned to the caller, even if there are fields in the hash which are either persistent or not expired.
5. When a field is given a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high-level keys are handled, we will emit the `hexpired` keyspace event for that case (instead of `hdel`). For example, for the case:

   ```
   HSET myhash f1 v1
   > 0
   HGETEX myhash EX 0 FIELDS 1 f1
   > "v1"
   HTTL myhash FIELDS 1 f1
   > -2
   ```

   The reported events are:

   ```
   1) "psubscribe"
   2) "__keyevent@0__*"
   3) (integer) 1
   1) "pmessage"
   2) "__keyevent@0__*"
   3) "__keyevent@0__:hset"
   4) "myhash"
   1) "pmessage"
   2) "__keyevent@0__*"
   3) "__keyevent@0__:hexpired" <---------------- note this
   4) "myhash"
   1) "pmessage"
   2) "__keyevent@0__*"
   3) "__keyevent@0__:del"
   4) "myhash"
   ```
6. We will ALWAYS load hash fields during RDB load. This means that when a primary reboots with an old snapshot, it will take time to reclaim all the expired fields. However, this simplifies the current logic and avoids a major refactoring that I suspect would otherwise be needed.

---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds, which lets us use an entry just like any sds.
We encode the entry layout type in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to encode this, so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as SDS_TYPE_8. The field can use any SDS type. An entry can also have an expiration timestamp, which is the UNIX timestamp at which it expires. For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

    +----------------+--------------+---------------+
    | Expiration     | field        | value         |
    | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
    +----------------+------^-------+---------------+
                            |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS type 8 or higher.

    +--------------+-------+--------------+
    | Expiration   | value | field        |
    | 1234567890LL | ptr   | hdr "foo" \0 |
    +--------------+--^----+------^-------+
                      |           |
                      |           |
                      |           entry pointer (points to field sds content)
                      |
                      value pointer = value sds

The `entry.c/h` API provides methods to:

- Create, read, write and update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is nearly identical to the API provided by Redis (Redis 7.4 + 8.0). This is intentional, in order to avoid breaking client applications that have already opted to use hash item TTL.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL] FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX`: Only set the fields if the hash object does NOT exist.
* `XX`: Only set the fields if the hash object does exist.
* `FNX`: Only set the fields if none of them already exist.
* `FXX`: Only set the fields if all of them already exist.
* `EX seconds`: Set the specified expiration time in seconds.
* `PX milliseconds`: Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds`: Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds`: Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL`: Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds`: Set the specified expiration time, in seconds.
* `PX milliseconds`: Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds`: Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds`: Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST`: Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.

Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands.
This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.

You can clear the TTL of a specific field by specifying 0 for the 'seconds' argument.

Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.
* `XX`: For each specified field, set expiration only when the field has an existing expiration.
* `GT`: For each specified field, set expiration only when the new expiration is greater than current one.
* `LT`: For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.
* `XX`: For each specified field, set expiration only when the field has an existing expiration.
* `GT`: For each specified field, set expiration only when the new expiration is greater than current one.
* `LT`: For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.
* `XX`: For each specified field, set expiration only when the field has an existing expiration.
* `GT`: For each specified field, set expiration only when the new expiration is greater than current one.
* `LT`: For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT`, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after the specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If the `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The HSETEX command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

**Synopsis**

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                   |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases where we emit the `hexpired` event. For example, given the following use case:

```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2
```

Regarding the keyspace notifications, Redis reports:

```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However, in our current suggestion, Valkey will emit:

```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```

---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like: - `HDEL` (for expired fields) - `HPEXPIREAT` (for setting absolute expiration) - `HPERSIST` (for removing expiration) This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior. --- | Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % | |--------------|-------------|---------|------------|----------------------|------------------|----------------| | **One Large Hash Table** | | HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% | | HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% | | HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% | | HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% | | **Many Hash Tables (100 fields)** | | HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% | | HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% | | HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% | | HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% | | HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% | | **Many Hash Tables (1000 fields)** | | HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% | | HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% | | HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** | | HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% | | HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% | [ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash [ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. 
Note that it might have to require some refactoring: 1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc... For this reason I would like to avoid this optimizationfor the first drop.

Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently, creating what we call **volatile fields**.

This is just the first of 3 PRs. This PR focuses on enabling the basic ability to set and modify hash field expiration, as well as persistence (AOF+RDB) and defrag.
[The second PR](#5), which introduces the new algorithm (volatile-set) to track volatile hash fields, is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just a stub implementation and will be replaced by [the second PR](#5).
[The third PR](#4) introduces the active expiration and defragmentation jobs.

For more high-level design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some major high-level decisions taken as part of this work:

1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to keep improving the active expiration job and potentially consider introducing lazy-expiration support later on.
3. Although the various commands that add expiration to hash fields affect memory utilization (by allocating more memory for the expiration time and metadata), we decided not to add DENYOOM to these commands (HSETEX is an exception), in order to be better aligned with top-level key commands like `expire`.
4. Some hash type commands will produce unexpected results:
   - HLEN - will still reflect the number of fields that exist in the hash object (whether actually expired or not).
   - HRANDFIELD - in some cases we will not be able to randomly select a field which has not already expired.
     This happens in 2 cases: 1/ when we are asked to provide non-unique fields (i.e. a negative count); 2/ when the size of the hash is much bigger than the count and we need to provide unique results. In both cases it is possible that an empty response will be returned to the caller, even when the hash contains fields that are either persistent or not yet expired.
5. For the case where a field is given a zero (0) expiration time or an expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how top-level keys are handled, we will emit the `hexpired` keyspace event for that case (instead of `hdel`).
6. We will ALWAYS load hash fields during RDB load. This means that when a primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However, this simplifies the current logic and avoids a major refactoring that I suspect would otherwise be needed.

As an example of decision 5, for the case:

```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
> -2
```

The reported events are:

```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```

---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds, which lets us use an entry just like any sds.
We encode the entry layout type in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to encode this, so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as SDS_TYPE_8. The field can use any SDS type.

An entry can also have an expiration timestamp, which is the UNIX timestamp at which it expires. For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

    +----------------+--------------+---------------+
    | Expiration     | field        | value         |
    | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
    +-----------------------^-------+---------------+
                            |
                            |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS type 8 or higher.

    +--------------+-------+--------------+
    | Expiration   | value | field        |
    | 1234567890LL | ptr   | hdr "foo" \0 |
    +--------------+--^----+------^-------+
                      |           |
                      |           |
                      |           entry pointer (points to field sds content)
                      |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, write, and update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is nearly identical to the API provided by Redis (Redis 7.4 + 8.0). This is intentional, in order to avoid breaking client applications that have already opted to use hash item TTLs.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL] FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The `HSETEX` command supports the following set of options:

* `NX`: Only set the fields if the hash object does NOT exist.
* `XX`: Only set the fields if the hash object does exist.
* `FNX`: Only set the fields if none of them already exist.
* `FXX`: Only set the fields if all of them already exist.
* `EX seconds`: Set the specified expiration time in seconds.
* `PX milliseconds`: Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds`: Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds`: Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL`: Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds`: Set the specified expiration time, in seconds.
* `PX milliseconds`: Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds`: Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds`: Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST`: Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands.
This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field by specifying 0 for the `seconds` argument.
Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.
* `XX`: For each specified field, set expiration only when the field has an existing expiration.
* `GT`: For each specified field, set expiration only when the new expiration is greater than the current one.
* `LT`: For each specified field, set expiration only when the new expiration is less than the current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.
* `XX`: For each specified field, set expiration only when the field has an existing expiration.
* `GT`: For each specified field, set expiration only when the new expiration is greater than the current one.
* `LT`: For each specified field, set expiration only when the new expiration is less than the current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.
* `XX`: For each specified field, set expiration only when the field has an existing expiration.
* `GT`: For each specified field, set expiration only when the new expiration is greater than the current one.
* `LT`: For each specified field, set expiration only when the new expiration is less than the current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT`, but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after the specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If the `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The `HSETEX` command supports a set of options:

* `NX`: For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

**Synopsis**

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                   |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases where we emit the `hexpired` event. For example, given the following use case:

```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
HTTL myhash FIELDS 1 f1
1) (integer) -2
```

Regarding the keyspace notifications, Redis reports:

```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However, in our current suggestion, Valkey will emit:

```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```

---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** | | | | | | |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** | | | | | | |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** | | | | | | |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

- [ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
- [ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL.
  Note that this might require some refactoring: 1/ propagate the rdbflags and current time to rdbLoadObject; 2/ consider the case of restore and check_rdb etc. For this reason I would like to avoid this optimization for the first drop.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
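
The propagation rule described above (expiration-aware commands are rewritten into `HDEL`, `HPEXPIREAT`, or `HPERSIST` rather than propagated as-is) can be illustrated with a small sketch. This is not the Valkey implementation: the function name `rewrite_hgetex_for_propagation`, its parameters, and the use of Python are assumptions made purely for illustration, and the sketch covers `HGETEX` only. It further assumes that relative `EX`/`PX` values are normalized to an absolute millisecond timestamp, that a timestamp at or before the current time results in `HDEL`, and that a plain read propagates nothing.

```python
from typing import List, Optional


def rewrite_hgetex_for_propagation(
    key: str,
    fields: List[str],
    option: Optional[str],
    value: Optional[int],
    now_ms: int,
) -> Optional[List[str]]:
    """Return the command to propagate for an HGETEX call, or None.

    Hypothetical helper for illustration only; not part of Valkey.
    """
    if option is None:
        # Assumption: a plain read changes nothing, so nothing is propagated.
        return None
    if option == "PERSIST":
        return ["HPERSIST", key, "FIELDS", str(len(fields)), *fields]
    if value is None:
        raise ValueError(f"option {option} requires a value")

    # Normalize every time option to an absolute Unix time in milliseconds,
    # so replicas and the AOF do not depend on their own clock at replay time.
    if option == "EX":
        expire_at_ms = now_ms + value * 1000
    elif option == "PX":
        expire_at_ms = now_ms + value
    elif option == "EXAT":
        expire_at_ms = value * 1000
    elif option == "PXAT":
        expire_at_ms = value
    else:
        raise ValueError(f"unsupported option: {option}")

    if expire_at_ms <= now_ms:
        # Zero or past expiration: the fields are deleted immediately.
        return ["HDEL", key, *fields]
    return ["HPEXPIREAT", key, str(expire_at_ms), "FIELDS", str(len(fields)), *fields]
```

Under these assumptions, `HGETEX myhash EX 10 FIELDS 1 f1` issued at `now_ms = 1000` would be rewritten to `HPEXPIREAT myhash 11000 FIELDS 1 f1`, while `HGETEX myhash EX 0 FIELDS 1 f1` would be rewritten to `HDEL myhash f1`.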

Note that it might have to require some refactoring: 1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc... For this reason I would like to avoid this optimizationfor the first drop. Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
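For readers skimming the command descriptions above, the `NX | XX | GT | LT` conditions can be sketched as a single predicate. This is only an illustrative sketch, not the Valkey implementation: the function name `should_set_expiration` and its parameters are hypothetical, and the handling of a field with no expiration under `GT`/`LT` (treated as an infinite TTL, as the key-level `EXPIRE` command does) is an assumption that the text above does not state.

```python
from typing import Optional


def should_set_expiration(flag: Optional[str],
                          current_ms: Optional[int],
                          new_ms: int) -> bool:
    """Decide whether a field's expiration should be updated.

    flag       -- one of None, "NX", "XX", "GT", "LT" (hypothetical encoding)
    current_ms -- current absolute expiration in ms, or None if the field
                  is persistent (has no expiration)
    new_ms     -- requested absolute expiration in ms
    """
    if flag is None:
        # No condition given: always apply the new expiration.
        return True
    if flag == "NX":
        # Only when the field has no expiration.
        return current_ms is None
    if flag == "XX":
        # Only when the field has an existing expiration.
        return current_ms is not None
    if flag == "GT":
        # Only when the new expiration is greater than the current one.
        # ASSUMPTION: a persistent field counts as infinite TTL, so GT never applies.
        return current_ms is not None and new_ms > current_ms
    if flag == "LT":
        # Only when the new expiration is less than the current one.
        # ASSUMPTION: a persistent field counts as infinite TTL, so LT always applies.
        return current_ms is None or new_ms < current_ms
    raise ValueError(f"unknown flag: {flag!r}")
```

The check is applied per field, so a single `HPEXPIREAT` call over several fields may update some of them and skip others.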

Hash Field Expiration RFC

63b66b5

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

ranshid marked this pull request as draft May 7, 2025 13:14

madolson reviewed May 7, 2025

zuiderkwast reviewed May 7, 2025

refactor to better match the RFC template

a96c297

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

madolson reviewed May 8, 2025

hwware reviewed May 8, 2025

Update HFE.md

a58e32d

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

xbasel reviewed May 11, 2025

ranshid added 3 commits May 11, 2025 18:42

add AOF, RDB and configuration sections

e9b9775

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

fix PR review comment about lazy expiration terminology

9c7ba43

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

Add Observability section

3ba5a46

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

ranshid marked this pull request as ready for review May 11, 2025 16:51

Add RFC header

2f16944

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

zuiderkwast reviewed May 11, 2025

xbasel reviewed May 12, 2025

This was referenced May 13, 2025

Support field level expire/TTL for hash, set and sorted set valkey-io/valkey#640

Closed

Introduce HASH items expiration valkey-io/valkey#2089

Merged

JimB123 reviewed May 16, 2025

zuiderkwast reviewed May 17, 2025

xbasel reviewed Jun 24, 2025

		### Volatile hash entry memory layout

		Currently, a field is always an SDS in Valkey. Although it is possible to associate a key with external metadata (e.g. TTL) by mapping the key to the relevant metadata, this incurs extra memory to hold the mapping and extra CPU cycles to locate the TTL on each query. Some dictionaries use objects with embedded keys, where the metadata can be set as part of the object. However, that would require every dictionary that needs TTL support to use objects with embedded keys, and it might significantly complicate existing code paths as well as require extra memory in order to hold the object metadata.

	* A new `hexpired` event will be issued whenever an item is actually being expired (either actively or lazily)
	* A new `hexpired` event will be issued whenever an expired item is detected and deleted (either actively or lazily)

ranshid commented May 7, 2025

zuiderkwast left a comment

madolson left a comment
