@benjamin-awd benjamin-awd commented Oct 25, 2025

Summary

This PR adds an ArrowStream format option to the ClickHouse sink. ArrowStream is a binary format that ingests log data into ClickHouse more efficiently than the existing JSON formats, especially at high throughput.
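
As a rough illustration of why a columnar binary encoding tends to be smaller than row-oriented JSON, the sketch below encodes the same rows both ways using only the Python standard library. The row schema and field names are hypothetical, and the packed layout is only a stand-in for the idea; it is not the Arrow IPC stream format and not what the sink actually emits.

```python
import json
import struct

# Hypothetical log rows; the field names are illustrative only.
rows = [
    {"timestamp": 1761401700 + i, "status": 200, "latency_ms": 12.5 + i}
    for i in range(1000)
]

# Row-oriented JSON: every row repeats every key and encodes numbers as text.
json_payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# Column-oriented binary: each column is packed once as fixed-width values.
# NOTE: this is NOT the Arrow IPC format, only a simplified stand-in.
n = len(rows)
columnar_payload = (
    struct.pack(f"<{n}q", *(r["timestamp"] for r in rows))     # int64 column
    + struct.pack(f"<{n}i", *(r["status"] for r in rows))      # int32 column
    + struct.pack(f"<{n}d", *(r["latency_ms"] for r in rows))  # float64 column
)

print(f"JSON bytes:     {len(json_payload)}")
print(f"columnar bytes: {len(columnar_payload)}")
```

The real format also carries a schema and per-batch metadata, so actual savings depend on the data; the measured numbers in the Notes section are the better guide.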

Vector configuration

  sinks:
    clickhouse:
      type: clickhouse
      endpoint: http://localhost:8123
      database: mydatabase
      table: logs
      format: arrow_stream  # New format option (defaults to JSONEachRow)
      compression: gzip
      auth:
        strategy: basic
        user: default
        password: "${CLICKHOUSE_PASSWORD}"

How did you test this PR?

Tested locally and in a development environment, ingesting data at a rate of a few hundred thousand rows per second.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Closes #24074

Notes

Internal comparison between the two formats, with Vector writing to two identical tables that differ only in the insert format:

WITH
    a_log AS
    (
        SELECT
            `table`,
            format,
            rows,
            bytes,
            flush_query_id
        FROM system.asynchronous_insert_log
        WHERE (status = 'Ok') AND (`table` IN ('t1', 't2')) AND (event_time >= (now() - toIntervalMinute(15)))
    ),
    q_log AS
    (
        SELECT
            query_id,
            query_duration_ms
        FROM system.query_log
        PREWHERE (type = 'QueryFinish') AND (query_kind = 'AsyncInsertFlush') AND (event_time >= (now() - toIntervalMinute(15)))
    )
SELECT
    a.`table`,
    a.format,
    count() AS total_flushes,
    sum(a.rows) AS total_rows_inserted,
    formatReadableSize(sum(a.bytes)) AS total_data_inserted,
    sum(q.query_duration_ms) AS total_flush_time_ms,
    sum(a.rows) / sum(q.query_duration_ms / 1000.) AS avg_rows_per_second,
    concat(formatReadableSize(sum(a.bytes) / sum(q.query_duration_ms / 1000.)), '/s') AS avg_bytes_per_second,
    sum(a.rows) / count() AS avg_rows_per_flush
FROM a_log AS a
INNER JOIN q_log AS q ON a.flush_query_id = q.query_id
GROUP BY
    a.`table`,
    a.format
ORDER BY
    a.`table` ASC,
    a.format ASC

Query id: d41311c0-cb10-403c-8d4d-7f6cf6cb8f13

Row 1:
──────
table:                jsoneachrow_table
format:               JSONEachRow
total_flushes:        34084
total_rows_inserted:  42829429 -- 42.83 million
total_data_inserted:  65.39 GiB
total_flush_time_ms:  14707745 -- 14.71 million
avg_rows_per_second:  2912.0323339845772
avg_bytes_per_second: 4.55 MiB/s
avg_rows_per_flush:   1256.5845851425888

Row 2:
──────
table:                arrowstream_table
format:               ArrowStream
total_flushes:        35934
total_rows_inserted:  45153872 -- 45.15 million
total_data_inserted:  17.27 GiB
total_flush_time_ms:  3356282 -- 3.36 million
avg_rows_per_second:  13453.539362902164
avg_bytes_per_second: 5.27 MiB/s
avg_rows_per_flush:   1256.5779484610675
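
To make the comparison easier to read, the arithmetic behind the two result rows can be reproduced in a few lines. This is only a sketch: the figures are copied from the output above, the helper names are made up for illustration, and the GiB values are the rounded ones printed by formatReadableSize, so the per-row sizes are approximate.

```python
GIB = 1024 ** 3

# Figures copied from the two result rows above (GiB values are rounded).
results = {
    "JSONEachRow": {"rows": 42_829_429, "gib": 65.39, "flush_ms": 14_707_745},
    "ArrowStream": {"rows": 45_153_872, "gib": 17.27, "flush_ms": 3_356_282},
}

def rows_per_second(r):
    # Same ratio-of-sums as the query: total rows / total flush seconds.
    return r["rows"] / (r["flush_ms"] / 1000.0)

def bytes_per_row(r):
    # Approximate average inserted bytes per row.
    return r["gib"] * GIB / r["rows"]

json_r, arrow_r = results["JSONEachRow"], results["ArrowStream"]
speedup = rows_per_second(arrow_r) / rows_per_second(json_r)
size_ratio = bytes_per_row(json_r) / bytes_per_row(arrow_r)

for name, r in results.items():
    print(f"{name}: {rows_per_second(r):.0f} rows/s, {bytes_per_row(r):.0f} bytes/row")
print(f"rows/s ratio (ArrowStream / JSONEachRow): {speedup:.2f}")
print(f"bytes/row ratio (JSONEachRow / ArrowStream): {size_ratio:.2f}")
```

On these numbers ArrowStream flushes roughly 4.6 times as many rows per second of flush time while carrying roughly a quarter of the bytes per row. Since the bytes-per-second figures are close (4.55 MiB/s vs 5.27 MiB/s), most of the rows-per-second difference in this run appears to come from the smaller per-row size.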

@benjamin-awd benjamin-awd requested a review from a team as a code owner October 25, 2025 14:15
@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Oct 25, 2025
@benjamin-awd benjamin-awd requested a review from a team as a code owner October 25, 2025 14:30
@github-actions github-actions bot added the domain: external docs Anything related to Vector's external, public documentation label Oct 25, 2025
