
Conversation

@tejaslodayadd

@tejaslodayadd tejaslodayadd commented Nov 24, 2025

Implements per-field expiration support for the HASH data type, following the Redis 7.4 specification.

Commands:

HEXPIRE key seconds FIELDS numfields field [field ...]
HTTL key FIELDS numfields field [field ...]
HPERSIST key FIELDS numfields field [field ...]

Implementation Details

  • Uses backward-compatible flag-based encoding for field values
  • Legacy format (no expiration): [raw value]
  • New format (with expiration): [0xFF][flags][8-byte timestamp][value]
  • Existing data works without migration
  • Expired fields are filtered out on read operations (HGET, HMGET, HGETALL, etc.)
  • Compaction filter cleans up expired fields with 5-minute lazy delete buffer
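
For illustration only, here is a minimal, self-contained C++ sketch of how such flag-based decoding could look. DecodedHashField and DecodeHashFieldSketch are hypothetical names (the PR's actual type is HashFieldValue), and the little-endian timestamp layout is an assumption; note, as later review comments point out, that a legacy binary value starting with 0xFF would be misread by any scheme like this.

// Hypothetical sketch only: layout assumed from the PR description,
// [0xFF][1-byte flags][8-byte ms timestamp][value]; legacy values are raw bytes.
#include <cstdint>
#include <cstring>
#include <string>

struct DecodedHashField {
  uint64_t expire_ms = 0;  // 0 means "no expiration"
  std::string value;       // user-visible field value
};

inline void DecodeHashFieldSketch(const std::string &raw, DecodedHashField *out) {
  constexpr size_t kHeaderSize = 1 /*marker*/ + 1 /*flags*/ + 8 /*timestamp*/;
  if (raw.size() < kHeaderSize || static_cast<uint8_t>(raw[0]) != 0xFF) {
    // Treated as the legacy format: the raw bytes are the value itself.
    out->expire_ms = 0;
    out->value = raw;
    return;
  }
  uint64_t ts = 0;
  std::memcpy(&ts, raw.data() + 2, sizeof(ts));  // assumes little-endian on-disk encoding
  out->expire_ms = ts;
  out->value = raw.substr(kHeaderSize);
}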

This extends the existing key-level expiration (HSETEXPIRE) added in #2750 to support per-field granularity, which is a common feature request similar to Redis 7.4's hash field expiration support.

Reference: https://redis.io/docs/latest/develop/data-types/hashes/

@tejaslodayadd tejaslodayadd changed the title feat(hash): add hash per-field expiration feat(hash): add per-field expiration support (HEXPIRE, HTTL, HPERSIST) Nov 24, 2025
@aleksraiden
Contributor

@torwig could you please check this PR? As I remember, we talked about this feature. Thanks

@aleksraiden
Contributor

@tejaslodayadd lots of thanks, cool feature, we need support for this

@tejaslodayadd tejaslodayadd marked this pull request as ready for review November 24, 2025 22:05
Comment on lines 158 to 177
// Check for hash field expiration
if (metadata.Type() == kRedisHash) {
// First check if metadata (key-level) is expired
if (IsMetadataExpired(ikey, metadata)) {
return true;
}
// Then check per-field expiration
// Use lazy deletion with 5 minute buffer similar to metadata expiration
uint64_t lazy_expired_ts = util::GetTimeStampMS() - 300000;
HashFieldValue field_value;
if (!HashFieldValue::Decode(value.ToString(), &field_value)) {
// Failed to decode, keep the field
return false;
}
// Check if field has expiration and if it's expired (with lazy delete buffer)
if (field_value.expire > 0 && field_value.expire <= lazy_expired_ts) {
return true;
}
return false;
}
Author


The key logic for hash field expiration cleanup during compaction -- when the metadata type is kRedisHash, the filter:

  • First checks if the entire key is expired (key-level expiration)
  • Then decodes the field value using HashFieldValue::Decode()
  • Checks if the field has a non-zero expiration timestamp
  • Uses a 5-minute lazy delete buffer (util::GetTimeStampMS() - 300000) to avoid race conditions between the HEXPIRE command and compaction (same pattern used for key-level expiration)

@tejaslodayadd
Author

tejaslodayadd commented Nov 25, 2025

After discussing this internally, there's a problem:

Currently, HLEN returns the field count by reading metadata.size directly - an O(1) operation. However, with per-field expiration:

  • Expired fields remain in storage until compaction cleans them up
  • The metadata.size counter doesn't know when individual fields expire
  • Result: A hash with 5 fields where 1 has expired will still report HLEN = 5 instead of 4

Options:

  1. Accept approximate count: Keep the current O(1) behavior and document that HLEN returns an approximate count when field TTLs are used; the count becomes accurate after compaction (eventually consistent)
  2. Scan-based count: HLEN scans all fields and counts the non-expired ones. Con: O(n) instead of O(1), expensive for large hashes
  3. Optimized scan with metadata tracking: Add a flag in HashMetadata to track whether any field has an expiration set. HLEN returns metadata.size directly if no fields have a TTL -- O(1) -- and only scans when the hash has fields with expiration -- O(n). A rough sketch follows this list.
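
A rough, self-contained sketch of option 3, assuming a hypothetical any_field_has_ttl flag in the hash metadata (no such flag exists in the codebase today):

#include <cstdint>
#include <functional>

struct HashMetaSketch {
  uint64_t size = 0;               // field count maintained on HSET/HDEL
  bool any_field_has_ttl = false;  // proposed flag, set the first time HEXPIRE touches the hash
};

// count_live_fields would scan the subkeys and skip expired ones (O(n)).
inline uint64_t HashLenSketch(const HashMetaSketch &meta,
                              const std::function<uint64_t()> &count_live_fields) {
  if (!meta.any_field_has_ttl) return meta.size;  // fast path: O(1), exact
  return count_live_fields();                     // slow path: only when TTLs are in play
}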

Let us know which way you'd want us to move forward @PragmaTwice @torwig @aleksraiden

@git-hulk
Member

@tejaslodayadd Thanks for your contribution.

I just had a glance at this PR and have two questions about the design:

  • When would the size in metadata be updated? It seems the size won't be corrected after the compaction.
  • Should we also take care of the HSCAN command? Let's filter out the expired fields.

From my personal perspective, it's acceptable to sacrifice the short-term correctness of a command like HLEN to improve performance.

@git-hulk git-hulk self-requested a review November 25, 2025 03:15
@PragmaTwice
Member

PragmaTwice commented Nov 25, 2025

Hi, thank you for your contribution. I skimmed through and found some issues in this design:

  • "New format (with expiration): [0xFF][flags][8-byte timestamp][value]". Hmm the hash field value can actually be binary, so even in the old encoding, the first byte of the value CAN BE 0xff. How to distinguish them? I think the answer is negative.
  • -2 = field doesn't exist, 1 = expiration set, 0 = expiration not set (e.g., invalid expire time). Usually this comment indicates that you should use enum classes instead of integers.
  • the size field in metatdata seems not well mantained, in compaction filters.
  • it seems HashFieldValue can be a view (zero-copy)?
  • The code seems vibe-coded. It is fine but there are lots of useless code comment and the PR author should understand details of the llm-written code (so that the review process can be meaningful).

@ltagliamonte-dd
Contributor

ltagliamonte-dd commented Dec 2, 2025

@git-hulk @PragmaTwice FYI I'm going to help @tejaslodayadd with this PR.
I've addressed a few of the comments you folks left.

I'm currently debating how to proceed with the compaction/size problem.
During compaction of the default CF we can't update the metadata CF (where the hash size is stored), so I think we could go for a lazy repair pattern:

  • during compaction, if a field is expired we just update its value to a STUB value (0xFE) so we immediately reclaim the disk space
  • during read operations like HGET, HGETALL, HSCAN we do a synchronous/asynchronous Repair (sketched after this list):
    • Create an Atomic WriteBatch.
    • Delete the key from defaultCF.
    • Update/Merge the count in metadataCF.
    • Commit.
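
A hedged sketch of that repair write using the plain RocksDB API; the column-family handles, key slices, and the merge-operand encoding for the size decrement are placeholders, and a real implementation would go through the kvrocks storage layer rather than calling RocksDB directly:

#include <rocksdb/db.h>
#include <rocksdb/slice.h>
#include <rocksdb/write_batch.h>

// Deletes the expired field and adjusts the stored hash size in one atomic batch.
// Assumes a merge operator is registered on the metadata CF to apply size_delta.
rocksdb::Status RepairExpiredFieldSketch(rocksdb::DB *db,
                                         rocksdb::ColumnFamilyHandle *subkey_cf,
                                         rocksdb::ColumnFamilyHandle *metadata_cf,
                                         const rocksdb::Slice &field_key,
                                         const rocksdb::Slice &ns_key,
                                         const rocksdb::Slice &size_delta) {
  rocksdb::WriteBatch batch;                     // one batch spanning both column families
  batch.Delete(subkey_cf, field_key);            // drop the expired field from the subkey CF
  batch.Merge(metadata_cf, ns_key, size_delta);  // update/merge the count in the metadata CF
  return db->Write(rocksdb::WriteOptions(), &batch);  // commit atomically
}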

What do you folks think about this approach?

@ltagliamonte-dd
Contributor

ltagliamonte-dd commented Dec 2, 2025

When it comes to the encoding:

I've implemented a new format that uses a two-byte magic marker (0xFF 0xFE) instead of a single byte; let me know if this works for you @PragmaTwice

New format: [[0xFF][0xFE][1-byte flags][8-byte timestamp if flag set][value]]
Old format: [[0xFF][1-byte flags][8-byte timestamp if flag set][value]]
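
For illustration, a hypothetical check for the two-byte marker could look like the snippet below; it still cannot distinguish a legacy binary value that genuinely starts with 0xFF 0xFE, which is exactly the concern raised in the following comments.

#include <cstdint>
#include <string>

// Hypothetical helper: returns true if `raw` looks like the new encoding.
inline bool LooksLikeNewFormatSketch(const std::string &raw) {
  return raw.size() >= 3 &&                     // marker (2 bytes) + flags (1 byte) at minimum
         static_cast<uint8_t>(raw[0]) == 0xFF &&
         static_cast<uint8_t>(raw[1]) == 0xFE;  // still ambiguous for binary legacy values
}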

@ltagliamonte-dd
Contributor

@PragmaTwice @git-hulk Just pushed the lazy repair approach discussed before, looking forward to your feedback!
Thanks!

subkey_opts.disable_auto_compactions = config_->rocks_db.disable_auto_compactions;
subkey_opts.table_properties_collector_factories.emplace_back(
NewCompactOnExpiredTableCollectorFactory(std::string(kPrimarySubkeyColumnFamilyName), 0.3));
subkey_opts.merge_operator = std::make_shared<HashMergeOperator>();
Member


The merge operator would be invoked when iterating over any keys in this column family, and it could significantly affect performance.

return rocksdb::Status::OK();
}

void Hash::asyncRepairHash(const std::string &ns_key, const Slice &field, const HashMetadata &metadata) const {
Member


It's not a good idea to spawn a thread per expired hash field. Perhaps we can reuse the existing task runner, or have a fixed thread pool for this purpose.

@dashjay

dashjay commented Jan 5, 2026

When it comes to the encoding:

I've implemented a new format that uses a two-byte magic marker (0xFF 0xFE) instead of a single byte; let me know if this works for you @PragmaTwice

New format: [[0xFF][0xFE][1-byte flags][8-byte timestamp if flag set][value]] Old format: [[0xFF][1-byte flags][8-byte timestamp if flag set][value]]

Just walking by here~ I have a question: if the user's value happened to be set to [0xFF][0xFE][1-byte flags][8-byte timestamp if flag set]..... before the update, will this key be treated as expired at once?

@git-hulk
Member

git-hulk commented Jan 5, 2026

Just walking by here~ I have a question: if the user's value happened to be set to [0xFF][0xFE][1-byte flags][8-byte timestamp if flag set]..... before the update, will this key be treated as expired at once?

@dashjay I think so if we set the timestamp for the expiration field only.

@PragmaTwice
Member

Just walking by here~ I have a question: if the user's value happened to be set to [0xFF][0xFE][1-byte flags][8-byte timestamp if flag set]..... before the update, will this key be treated as expired at once?

Yup, I pointed it out in the first review and it hasn't been addressed yet.

@ltagliamonte-dd
Contributor

@aleksraiden @git-hulk @PragmaTwice Happy New Year!
Could I please get some help/direction with the CI? It's not clear to me what it's complaining about.

@ltagliamonte-dd
Contributor

@dashjay @PragmaTwice @git-hulk Good point on the marker.
I agree that using value prefixes (even multiple bytes) is not a collision-free solution because the hash field value is opaque binary data.
Based on your experience, what would you suggest?
Could we do a major release where we introduce a new format for hash fields that has a metadata object encoded within the hash field value?

@dashjay

dashjay commented Jan 6, 2026

@dashjay @PragmaTwice @git-hulk Good point on the marker. I agree that using value prefixes (even multiple bytes) is not a collision-free solution because the hash field value is opaque binary data. Based on your experience, what would you suggest? Could we do a major release where we introduce a new format for hash fields that has a metadata object encoded within the hash field value?

@ltagliamonte-dd I had a good dream lol.

I used to work on an overlay key-value DB (a distributed MVCC key-value store) built on encoded keys/values: we encoded the timestamp from the TSO into the user key for MVCC, and every user GET became a scan, lol. That sounds very expensive, but it works. So we could store the value under the key {user_key}:{TTL}; if there is no TTL, keeping it at maxint64 makes it sort higher in lexicographical order. If we did it like this, every GET would be a scan and would need to create a new iterator.

But I have read the kvrocks read layer and it is quite simple; its read path remains free from interference by any complex features.


It occurred to me that maybe we could use a new key with the following format:

escaped_user_key = escaping(user_key)
rocksdb_key = escaped_user_key + escape_codes
rocksdb_value = 8 bytes (timestamp)

Every GET operation would need two reads; an old key-value has no such rocksdb_key, so we know it has no TTL.

But this escaped_user_key may also collide with another user's key, so consider storing them in another CF... or, another (worse) option is to use metadata.version + a magic number to put the TTL keys in a separate key space.

I will try to provide a formal proposal.

@ltagliamonte-dd
Contributor

ltagliamonte-dd commented Jan 6, 2026

@dashjay I understand the motivation behind encoding the TTL in the key, but I believe it introduces significant performance regressions for several reasons:

  • Breaks Point Lookups ($O(1)$): If the TTL is part of the sub key, the exact key is no longer known. To find a field, we would have to use an iterator to scan a range of keys rather than a direct Get. This turns every read into a much more expensive operation.
  • Expensive TTL Updates: In RocksDB, keys are immutable. To update only the TTL, we would be forced to Delete the old key and Put a new one. This doubles the write amplification and creates unnecessary tombstones.
  • Key Bloat: Sub-keys (fields) are already repeated for every entry. Adding an 8-byte timestamp to every key significantly increases the index size and memory pressure on the Block Cache/Memtables.
  • Data Consistency: It becomes harder to ensure atomicity. If a TTL update fails or is interrupted, you could potentially end up with two versions of the same field (the old TTL and the new TTL) existing simultaneously in the keyspace.

@git-hulk @PragmaTwice to fix the ambiguity, what about adding a new command option to HSET/HMSET, etc., that sets an ExHash flag in the key metadata? The flag will work ONLY on new hash keys.
When set, all field values will have the TTL encoded: [0x01] [Timestamp] [Value]
The only downside of this is that HEXPIRE will work only on new hashes and will not be backward compatible with existing hash keys.
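
For context, a minimal sketch of how parsing could branch on such a metadata-level ExHash flag; all names and the exact layout are illustrative, not an agreed design:

#include <cstdint>
#include <cstring>
#include <string>

struct ParsedField {
  uint64_t expire_ms = 0;  // 0 => no TTL
  std::string value;
};

// Only hashes created with the ExHash option parse the [flags][timestamp][value]
// layout; legacy hashes keep treating the raw bytes as the value, so arbitrary
// binary values can never be misread.
inline ParsedField ParseFieldSketch(bool is_exhash, const std::string &raw) {
  if (!is_exhash || raw.empty()) return {0, raw};
  const uint8_t flags = static_cast<uint8_t>(raw[0]);
  if ((flags & 0x01) == 0 || raw.size() < 1 + sizeof(uint64_t)) {
    return {0, raw.substr(1)};  // no TTL flag (or truncated record): the rest is the value
  }
  uint64_t ts = 0;
  std::memcpy(&ts, raw.data() + 1, sizeof(ts));
  return {ts, raw.substr(1 + sizeof(ts))};
}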

@dashjay

dashjay commented Jan 7, 2026

@ltagliamonte-dd

  • For Data Consistency: RocksDB should take care of this; although we are not using transactions here, a WriteBatch can apply everything atomically.
  • For the Key Bloat problem: this is a trade-off between compatibility and space. The kvrocks engine already flattens all hash fields into individual keys to reduce write amplification; adding one extra key per TTL (not for every field) seems acceptable to me, but it needs more benchmarking.
  • For Expensive TTL Updates: if we store each TTL as its own key and put the TTL in the value rather than encoding it into the key, then updating a TTL is just a Put of the same key instead of (first scan, delete old, put new).
  • For Breaks Point Lookups: this must be the biggest problem; no matter where the TTL is put, it is a problem. To solve it without breaking the original semantics, I believe that in addition to metadata, other indexes need to be introduced.

@ltagliamonte-dd
Contributor

@PragmaTwice @git-hulk would love to hear from you as well, in order to focus the development in the right direction.
