enhancement(clickhouse sink): Add ArrowStream format
#24075
Open
benjamin-awd wants to merge 40 commits into vectordotdev:master from benjamin-awd:add-ch-arrow
+2,668
−5
Labels
domain: external docs
Anything related to Vector's external, public documentation
domain: sinks
Anything related to Vector's sinks
Summary
This PR adds the ArrowStream format option for the ClickHouse sink. ArrowStream is a binary format that ingests log data into ClickHouse more efficiently than the existing JSON formats, with better performance at high throughput.
Vector configuration
How did you test this PR?
Tested locally and in a development environment, with data arriving at a rate of a few hundred thousand rows per second.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changelog label to this PR.
References
Closes #24074
Notes
Internal comparison between formats (pointing Vector at two identical tables, the only difference being format)
WITH
    a_log AS
    (
        SELECT `table`, format, rows, bytes, flush_query_id
        FROM system.asynchronous_insert_log
        WHERE (status = 'Ok')
            AND (`table` IN ('t1', 't2'))
            AND (event_time >= (now() - toIntervalMinute(15)))
    ),
    q_log AS
    (
        SELECT query_id, query_duration_ms
        FROM system.query_log
        PREWHERE (type = 'QueryFinish')
            AND (query_kind = 'AsyncInsertFlush')
            AND (event_time >= (now() - toIntervalMinute(15)))
    )
SELECT
    a.`table`,
    a.format,
    count() AS total_flushes,
    sum(a.rows) AS total_rows_inserted,
    formatReadableSize(sum(a.bytes)) AS total_data_inserted,
    sum(q.query_duration_ms) AS total_flush_time_ms,
    sum(a.rows) / sum(q.query_duration_ms / 1000.) AS avg_rows_per_second,
    concat(formatReadableSize(sum(a.bytes) / sum(q.query_duration_ms / 1000.)), '/s') AS avg_bytes_per_second,
    sum(a.rows) / count() AS avg_rows_per_flush
FROM a_log AS a
INNER JOIN q_log AS q ON a.flush_query_id = q.query_id
GROUP BY a.`table`, a.format
ORDER BY a.`table` ASC, a.format ASC

Query id: d41311c0-cb10-403c-8d4d-7f6cf6cb8f13

Row 1:
──────
table:                jsoneachrow_table
format:               JSONEachRow
total_flushes:        34084
total_rows_inserted:  42829429 -- 42.83 million
total_data_inserted:  65.39 GiB
total_flush_time_ms:  14707745 -- 14.71 million
avg_rows_per_second:  2912.0323339845772
avg_bytes_per_second: 4.55 MiB/s
avg_rows_per_flush:   1256.5845851425888

Row 2:
──────
table:                arrowstream_table
format:               ArrowStream
total_flushes:        35934
total_rows_inserted:  45153872 -- 45.15 million
total_data_inserted:  17.27 GiB
total_flush_time_ms:  3356282 -- 3.36 million
avg_rows_per_second:  13453.539362902164
avg_bytes_per_second: 5.27 MiB/s
avg_rows_per_flush:   1256.5779484610675
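
To make the comparison easier to read, the sketch below recomputes the per-format rates from the totals reported above and derives two ratios that the query does not output directly. It is illustrative arithmetic only, not part of the PR. It assumes the `bytes` column reflects the size of the payload as submitted, and it uses the rounded GiB totals from the output, so the per-row sizes are approximate. The names `results` and `derive` are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Totals copied from the two result rows above (GiB values are already rounded).
results = {
    "JSONEachRow": {"rows": 42_829_429, "gib": 65.39, "flush_ms": 14_707_745},
    "ArrowStream": {"rows": 45_153_872, "gib": 17.27, "flush_ms": 3_356_282},
}


def derive(r):
    """Recompute throughput and average row size from the reported totals."""
    seconds = r["flush_ms"] / 1000.0
    total_bytes = r["gib"] * GIB
    return {
        "rows_per_second": r["rows"] / seconds,
        "mib_per_second": total_bytes / seconds / MIB,
        "bytes_per_row": total_bytes / r["rows"],
    }


derived = {fmt: derive(r) for fmt, r in results.items()}
json_m = derived["JSONEachRow"]
arrow_m = derived["ArrowStream"]

# How many times more rows ArrowStream flushes per second of flush time.
row_rate_ratio = arrow_m["rows_per_second"] / json_m["rows_per_second"]
# How many times larger the average JSONEachRow row is on the wire.
size_ratio = json_m["bytes_per_row"] / arrow_m["bytes_per_row"]

for fmt, m in derived.items():
    print(
        f"{fmt}: {m['rows_per_second']:.0f} rows/s, "
        f"{m['mib_per_second']:.2f} MiB/s, "
        f"{m['bytes_per_row']:.0f} bytes/row"
    )
print(f"row rate ratio (ArrowStream / JSONEachRow): {row_rate_ratio:.2f}")
print(f"bytes-per-row ratio (JSONEachRow / ArrowStream): {size_ratio:.2f}")
```

Under these assumptions, the byte rates of the two formats are close (4.55 vs 5.27 MiB/s), while ArrowStream carries roughly a quarter of the bytes per row, which accounts for most of the roughly 4.6x difference in rows flushed per second.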