Vectorized hash grouping by a single text column #7586
Conversation
Use the UMASH hashes, which have a guaranteed upper bound on the collision probability, as the hash table keys.
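In outline, the approach could look like the following sketch (assuming the `umash_fprint`/`struct umash_fp` API from the UMASH library; the function names and seed here are illustrative, not taken from the patch):

```c
#include <stdbool.h>
#include <string.h>
#include "umash.h"

/*
 * Sketch: use a 128-bit UMASH fingerprint of the text key as the hash table
 * key. Because the fingerprint has a proven collision bound, two distinct
 * texts are overwhelmingly unlikely to collide, so key comparisons can be
 * done on fingerprints instead of the variable-length strings.
 */
static struct umash_params params; /* derived once, e.g. with umash_params_derive() */

static struct umash_fp
text_key_fingerprint(const void *data, size_t len)
{
	return umash_fprint(&params, /* seed = */ 0, data, len);
}

static bool
fingerprints_equal(struct umash_fp a, struct umash_fp b)
{
	return memcmp(&a, &b, sizeof(a)) == 0;
}
```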
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #7586      +/-   ##
==========================================
+ Coverage   80.06%   81.44%    +1.38%
==========================================
  Files         190      245       +55
  Lines       37181    44976     +7795
  Branches     9450    11217     +1767
==========================================
+ Hits        29770    36632     +6862
- Misses       2997     3954      +957
+ Partials     4414     4390       -24
```

View full report in Codecov by Sentry.
LGTM. I added some suggestions for minor improvements. I also have some general questions.
```c
typedef struct BytesView
{
	const uint8 *data;
	uint32 len;
} BytesView;
```
Instead of defining a new type, can we use StringInfo and initReadOnlyStringInfo? Just an idea.
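For reference, a minimal sketch of that alternative (it assumes PostgreSQL 16+, where `initReadOnlyStringInfo` is available; the wrapper name is invented):

```c
#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Wrap an existing, externally owned buffer in a StringInfo without copying.
 * initReadOnlyStringInfo() just points the StringInfo at the buffer, so the
 * data must stay alive and unmodified while the view is in use.
 */
static void
bytes_as_stringinfo(StringInfoData *view, const uint8 *data, uint32 len)
{
	initReadOnlyStringInfo(view, unconstify(char *, (const char *) data), (int) len);
	/* view->data and view->len now play the roles of BytesView.data / .len. */
}
```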
Well, it's not a complex type, and StringInfo has some unrelated things, so I'd keep it as is.
```c
BytesView *restrict output_key = (BytesView *) output_key_ptr;
HASH_TABLE_KEY_TYPE *restrict hash_table_key = (HASH_TABLE_KEY_TYPE *) hash_table_key_ptr;
```

```c
if (unlikely(params.single_grouping_column.decompression_type == DT_Scalar))
```
Question for my own understanding (not asking for any changes in this PR):

This is deep into vector aggregation, so I would expect this function to only be passed an arrow array. But now we need to check for different non-array formats, including impossible cases (e.g., DT_Iterator). If we only passed in arrays, these checks would not be necessary.

The arrow array format already supports everything we need. Even scalar/segmentby values can be represented by arrow arrays (e.g., run-end encoded).

Now we need this extra code to check for different formats/cases everywhere we reach into the data. Some of them shouldn't even be possible here.

IMO, the API to retrieve a value should be something like:

```c
Datum d = arrow_array_get_value_at(array, rownum, &isnull, &valuelen);
```

This function can easily check the encoding of the array (dict, run-end, etc.) to retrieve the requested value.
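For illustration, such an accessor might dispatch on the encoding roughly like this (all struct, field, and enum names below are invented for the sketch; they are not the actual TimescaleDB or Arrow structures):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Invented for this sketch: a text column view plus its encoding tag. */
typedef enum { ENC_PLAIN, ENC_DICTIONARY } ArrowEncoding;

typedef struct ArrowTextView
{
	ArrowEncoding encoding;
	const uint64_t *validity; /* null bitmap, one bit per row */
	const uint32_t *offsets;  /* n_values + 1 offsets into `bytes` */
	const uint8_t *bytes;     /* concatenated string payload */
	const int16_t *codes;     /* per-row dictionary indexes (ENC_DICTIONARY) */
} ArrowTextView;

static const uint8_t *
arrow_array_get_value_at(const ArrowTextView *array, int row, bool *isnull, uint32_t *valuelen)
{
	if ((array->validity[row / 64] & (UINT64_C(1) << (row % 64))) == 0)
	{
		*isnull = true;
		*valuelen = 0;
		return NULL;
	}
	*isnull = false;
	/* Dictionary encoding adds one indirection before the offsets lookup. */
	const int i = (array->encoding == ENC_DICTIONARY) ? array->codes[row] : row;
	*valuelen = array->offsets[i + 1] - array->offsets[i];
	return array->bytes + array->offsets[i];
}
```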
Yeah, I have already come to regret supporting scalar values throughout aggregation. Technically it should perform better, because it avoids creating e.g. an arrow array with the same constant value for every row, and sometimes we perform the computations in a different way for scalar values. But the implementation complexity might be a little too high. Maybe I should look into removing this, at least for the key column, and always materializing the values into arrow arrays. I'm going to consider this after we merge the multi-column aggregation.

The external interface might still turn out to be more complex than what you suggest, and closer to the current CompressedColumnValues, because sometimes we have to statically generate a function that works specifically with e.g. dictionary encoding, and that won't be possible if we determine the encoding inside an opaque callback. We can't call an opaque callback (i.e. a non-inlinable dynamic function) for every row, because that produces significantly less performant code.
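To illustrate the inlining argument with a toy example (names invented; this is not the actual generated code): the compiler can fully optimize a loop whose per-row access is a visible static function, but not one that calls through a function pointer.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*hash_row_fn)(const void *ctx, size_t row);

/* Opaque variant: the callback cannot be inlined, so the compiler cannot
 * hoist the encoding dispatch out of the loop or vectorize across rows. */
static void
hash_rows_generic(uint64_t *out, size_t n, hash_row_fn fn, const void *ctx)
{
	for (size_t row = 0; row < n; row++)
		out[row] = fn(ctx, row);
}

/* Specialized variant for one concrete layout: the per-row access is a
 * static inline function, so the whole loop body can be inlined,
 * constant-folded, and vectorized. */
static inline uint64_t
hash_plain_value(const uint32_t *values, size_t row)
{
	return values[row] * UINT64_C(0x9E3779B97F4A7C15); /* toy hash mix */
}

static void
hash_rows_plain(uint64_t *out, const uint32_t *values, size_t n)
{
	for (size_t row = 0; row < n; row++)
		out[row] = hash_plain_value(values, row);
}
```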
> This function can easily check the encoding of the array (dict, run-end, etc.) to retrieve the requested value.

This is not enough, though: the arrow arrays don't know their value size/type, for example. They should always be accompanied by some metadata.
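i.e., something along the lines of this illustrative pairing (the struct and its fields are assumptions for the sketch, not the actual code):

```c
#include "postgres.h"

/* Illustrative: an arrow array together with the metadata it lacks. */
typedef struct ArrowColumnWithMetadata
{
	const void *arrow_array; /* the encoded buffers + null bitmap */
	Oid value_type;          /* PostgreSQL type OID of the values */
	int16 value_bytes;       /* fixed width in bytes, or -1 for varlena */
} ArrowColumnWithMetadata;
```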
```c
const int total_bytes = output_key.len + VARHDRSZ;
text *restrict stored = (text *) MemoryContextAlloc(policy->hashing.key_body_mctx, total_bytes);
SET_VARSIZE(stored, total_bytes);
memcpy(VARDATA(stored), output_key.data, output_key.len);
```
Suggest making use of the PostgreSQL builtin function:

```diff
-const int total_bytes = output_key.len + VARHDRSZ;
-text *restrict stored = (text *) MemoryContextAlloc(policy->hashing.key_body_mctx, total_bytes);
-SET_VARSIZE(stored, total_bytes);
-memcpy(VARDATA(stored), output_key.data, output_key.len);
+MemoryContext oldmcxt = MemoryContextSwitchTo(policy->hashing.key_body_mctx);
+text *stored = cstring_to_text_with_len((const char *) output_key.data, output_key.len);
+MemoryContextSwitchTo(oldmcxt);
```
I think it's better to have something that can be inlined here, because it's part of the hot loop that builds the hash table.
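For context, `cstring_to_text_with_len` performs essentially the same allocation and copy, only behind an external call that the compiler cannot inline into this loop; paraphrased from PostgreSQL's `varlena.c`:

```c
text *
cstring_to_text_with_len(const char *s, int len)
{
	text	   *result = (text *) palloc(len + VARHDRSZ);

	SET_VARSIZE(result, len + VARHDRSZ);
	memcpy(VARDATA(result), s, len);

	return result;
}
```

Since it allocates in CurrentMemoryContext, the suggestion above also needs the surrounding MemoryContextSwitchTo calls.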
```diff
@@ -59,7 +59,7 @@ jobs:
       build_type: ${{ fromJson(needs.config.outputs.build_type) }}
       ignores: ["chunk_adaptive metadata telemetry"]
       tsl_ignores: ["compression_algos"]
-      tsl_skips: ["bgw_db_scheduler bgw_db_scheduler_fixed"]
+      tsl_skips: ["vector_agg_text vector_agg_groupagg bgw_db_scheduler bgw_db_scheduler_fixed"]
```
Why do we need to skip these tests on Windows? Is it also because of UMASH?
Right, I didn't get it to compile there, so I decided to disable it for now.
Co-authored-by: Erik Nordström <819732+erimatnor@users.noreply.github.com>
Signed-off-by: Alexander Kuzmenkov <36882414+akuzm@users.noreply.github.com>
## 2.19.0 (2025-03-18)

This release contains performance improvements and bug fixes since the 2.18.2 release. We recommend that you upgrade at the next available opportunity.

* Improved concurrency of INSERT, UPDATE, and DELETE operations on the columnstore by no longer blocking DML statements during the recompression of a chunk.
* Improved system performance during Continuous Aggregate refreshes by breaking them into smaller batches, which reduces system pressure and minimizes the risk of spilling to disk.
* Faster and more up-to-date results for queries against Continuous Aggregates by materializing the most recent data first (vs. old data first in prior versions).
* Faster analytical queries with SIMD vectorization of aggregations over text columns and GROUP BY over multiple columns.
* Enabled optimizing chunk size for faster query performance on the columnstore by adding support for merging columnstore chunks to the merge_chunk API.

**Deprecation warning**

This is the last minor release supporting PostgreSQL 14. Starting with the next minor version of TimescaleDB, only PostgreSQL 15, 16, and 17 will be supported.

**Downgrading of 2.19.0**

This release introduces custom bool compression. If you enable this feature via `enable_bool_compression` and must downgrade to a previous version, please use the [following script](https://github.com/timescale/timescaledb-extras/blob/master/utils/2.19.0-downgrade_new_compression_algorithms.sql) to convert the columns back to their previous state. TimescaleDB versions prior to 2.19.0 do not know how to handle this new type.

**Features**
* [#7586](#7586) Vectorized aggregation with grouping by a single text column
* [#7632](#7632) Optimize recompression for chunks without segmentby
* [#7655](#7655) Support vectorized aggregation on Hypercore TAM
* [#7669](#7669) Add support for merging compressed chunks
* [#7701](#7701) Implement a custom compression algorithm for bool columns. It is experimental and can undergo backwards-incompatible changes. For testing, enable it using timescaledb.enable_bool_compression = on.
* [#7707](#7707) Support ALTER COLUMN SET NOT NULL on compressed chunks
* [#7765](#7765) Allow tsdb as alias for timescaledb in WITH and SET clauses
* [#7786](#7786) Show warning for inefficient compress_chunk_time_interval configuration
* [#7788](#7788) Add callback to mem_guard for background workers
* [#7789](#7789) Do not recompress segmentwise when the default order by is empty
* [#7790](#7790) Add configurable incremental CAgg refresh policy

**Bugfixes**
* [#7665](#7665) Block merging of frozen chunks
* [#7673](#7673) Don't abort additional INSERTs when hitting the first conflict
* [#7714](#7714) Fix a wrong result when compressed NULL values were confused with default values. This happened in very special circumstances: ALTER TABLE adding a new column with a default value, an update, and compression, in a very particular order.
* [#7747](#7747) Block TAM rewrites with incompatible GUC setting
* [#7748](#7748) Fix a crash in the segmentwise recompression
* [#7764](#7764) Fix compression settings handling in Hypercore TAM
* [#7768](#7768) Remove costing index scan of hypertable parent
* [#7799](#7799) Handle DEFAULT table access method name in ALTER TABLE

**GUCs**
* `enable_bool_compression`: enable the bool compression algorithm, default: `OFF`
* `enable_exclusive_locking_recompression`: enable exclusive locking during recompression (legacy mode), default: `OFF`

**Thanks**
* @bjornuppeke for reporting a problem with INSERT INTO ... ON CONFLICT DO NOTHING on compressed chunks
* @kav23alex for reporting a segmentation fault on ALTER TABLE with DEFAULT

---

Signed-off-by: Philip Krauss <35487337+philkra@users.noreply.github.com>
Signed-off-by: Ramon Guiu <ramon@timescale.com>
Co-authored-by: Ramon Guiu <ramon@timescale.com>
Use the UMASH hashes, which have a guaranteed upper bound on the collision probability, as the hash table keys.
Up to 70% improvement in tsbench: https://grafana.ops.savannah-dev.timescale.com/d/fasYic_4z/compare-akuzm?orgId=1&var-branch=All&var-run1=4078&var-run2=4080&var-threshold=0.02&var-use_historical_thresholds=true&var-threshold_expression=2%20%2A%20percentile_cont%280.90%29&var-exact_suite_version=false&from=now-2d&to=now