Decrease SST metadata size #955

Closed
1 of 3 tasks
jiacai2050 opened this issue May 31, 2023 · 0 comments · Fixed by #1120
Labels
feature New feature or request

jiacai2050 commented May 31, 2023

Describe This Problem

We currently use Parquet as our underlying file format, and we have found that the metadata of a single SST can be very large. For example:

size:348.675M, metadata:40.142M, kv:38.175M, filter:28.525M, row_num:7038949

Ideally, an SST's metadata should be cached to improve query performance, so we need to find ways to decrease the metadata size and avoid OOM in production.

Proposal

Additional Context

Xor filter size for different key counts (`len` is in bytes):

key_num:0, len:54, byte_per_key:54
key_num:1, len:57, byte_per_key:57
key_num:5, len:63, byte_per_key:12.6
key_num:10, len:69, byte_per_key:6.9
key_num:100, len:177, byte_per_key:1.77
key_num:1000, len:1284, byte_per_key:1.284
key_num:2000, len:2514, byte_per_key:1.257
key_num:3000, len:3744, byte_per_key:1.248
key_num:4000, len:4974, byte_per_key:1.2435
key_num:8000, len:9894, byte_per_key:1.23675

This snippet shows that when there are fewer than 10 keys, the per-key cost is relatively high. Maybe we could store the keys as hash(u16) values instead.
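
To make the trade-off concrete, here is a minimal sketch that compares the measured Xor filter sizes above with a hypothetical alternative that stores one 2-byte (`u16`) hash per key. The 2-bytes-per-key cost with no header, and the names `u16_hash_list_len` and `cheaper_encoding`, are assumptions for illustration only; the issue does not specify an encoding.

```python
# Measured Xor filter sizes from the table above: key_num -> serialized length in bytes.
XOR_FILTER_LEN = {
    0: 54, 1: 57, 5: 63, 10: 69, 100: 177,
    1000: 1284, 2000: 2514, 3000: 3744, 4000: 4974, 8000: 9894,
}

# Assumption: one u16 hash per key, with no header or padding.
U16_HASH_BYTES = 2


def u16_hash_list_len(key_num):
    """Size in bytes of a plain list of u16 hashes (hypothetical encoding)."""
    return key_num * U16_HASH_BYTES


def cheaper_encoding(key_num):
    """Return which encoding is smaller for one of the measured key counts."""
    xor_len = XOR_FILTER_LEN[key_num]
    return "u16_hashes" if u16_hash_list_len(key_num) < xor_len else "xor_filter"


if __name__ == "__main__":
    for key_num, xor_len in XOR_FILTER_LEN.items():
        print(
            f"key_num:{key_num}, xor_filter:{xor_len}, "
            f"u16_hashes:{u16_hash_list_len(key_num)}, "
            f"smaller:{cheaper_encoding(key_num)}"
        )
```

Under these assumptions the hash list is smaller for every measured count up to 10 keys and larger from 100 keys on, so the break-even point lies somewhere between 10 and 100 keys. This compares size only; the false-positive rate of each encoding would need to be checked separately.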

jiacai2050 added the feature label May 31, 2023
ShiKaiWi pushed a commit that referenced this issue Jun 2, 2023
## Rationale
Part of #955

## Detailed Changes
- Ignore filters for null, double, float, varbinary, and boolean columns
- Fix metadata parsing in `meta_from_sst`
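
The first change can be pictured as a per-column check that decides whether a filter is built at all. The sketch below is hypothetical: the ignored types mirror the list above, but `ColumnType`, `should_build_filter`, and the extra types (`STRING`, `INT64`, `TIMESTAMP`) are illustrative names, not the project's actual API.

```python
from enum import Enum, auto


class ColumnType(Enum):
    """Illustrative column types; not the project's real type enum."""
    NULL = auto()
    DOUBLE = auto()
    FLOAT = auto()
    VARBINARY = auto()
    BOOLEAN = auto()
    # Assumed examples of types that still get a filter.
    STRING = auto()
    INT64 = auto()
    TIMESTAMP = auto()


# Types listed in the commit message as ignored when building filters.
FILTER_IGNORED_TYPES = frozenset({
    ColumnType.NULL,
    ColumnType.DOUBLE,
    ColumnType.FLOAT,
    ColumnType.VARBINARY,
    ColumnType.BOOLEAN,
})


def should_build_filter(column_type):
    """Return True if a filter should be built for a column of this type."""
    return column_type not in FILTER_IGNORED_TYPES


if __name__ == "__main__":
    for column_type in ColumnType:
        print(f"{column_type.name}: build_filter={should_build_filter(column_type)}")
```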

## Test Plan
- Before: max_seq:132734153, size:348.675M, metadata:40.142M, kv:38.175M, filter:28.525M, row_num:7038949
- After: max_seq:132734153, size:334.045M, metadata:25.512M, kv:23.545M, filter:17.562M, row_num:7038949
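
For readers who want the savings as percentages, the following sketch parses the two stat lines from this test plan and computes the relative reduction of each field. The helper names (`parse_stats`, `reduction_pct`, `share_pct`) are made up for this example, and the `M` suffix is simply stripped as printed.

```python
def parse_stats(line):
    """Parse comma-separated 'key:value' pairs; a trailing 'M' unit is dropped."""
    stats = {}
    for field in line.split(","):
        key, _, value = field.strip().partition(":")
        stats[key] = float(value.rstrip("M"))
    return stats


BEFORE = parse_stats(
    "max_seq:132734153, size:348.675M, metadata:40.142M, "
    "kv:38.175M, filter:28.525M, row_num:7038949"
)
AFTER = parse_stats(
    "max_seq:132734153, size:334.045M, metadata:25.512M, "
    "kv:23.545M, filter:17.562M, row_num:7038949"
)


def reduction_pct(key):
    """Relative reduction of one field, in percent of its 'before' value."""
    return (BEFORE[key] - AFTER[key]) / BEFORE[key] * 100


def share_pct(stats, part, whole="size"):
    """Share of one field in the total SST size, in percent."""
    return stats[part] / stats[whole] * 100


if __name__ == "__main__":
    for key in ("size", "metadata", "kv", "filter"):
        print(f"{key}: {BEFORE[key]:.3f}M -> {AFTER[key]:.3f}M "
              f"({reduction_pct(key):.1f}% smaller)")
    print(f"metadata share of SST: {share_pct(BEFORE, 'metadata'):.1f}% -> "
          f"{share_pct(AFTER, 'metadata'):.1f}%")
```

With these numbers, metadata shrinks by roughly 36% and the filter by roughly 38%, and the metadata's share of the SST drops from about 11.5% to about 7.6%. The total size falls by the same 14.630M as the metadata.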
jiacai2050 added a commit that referenced this issue Aug 24, 2023
## Rationale
Close #955
see title
## Detailed Changes
- Introduce another independent file to store metadata of sst

## Test Plan
- UT
  - `test_parquet_build_and_read` tests write and read.
  - `test_arrow_meta_data` tests compatibility with older versions.
- Manually
  - Upgrade from old deployments
  - Start a new deploy, and run tsbs

---------

Co-authored-by: jiacai2050 <dev@liujiacai.net>