Add ArrowStream format to Clickhouse sink #24074

@benjamin-awd

Description

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

The ClickHouse sink for Vector is currently limited to the following formats: JSONEachRow, JSONAsObject, and JSONAsString. While convenient, these formats are computationally expensive for ClickHouse to parse and do not scale well, because they are (1) row-based rather than columnar and (2) text-based rather than binary.
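To make the row-based/text versus columnar/binary distinction concrete, here is a rough sketch that encodes the same rows both ways using only the Python standard library. The column names and sample data are made up, and the columnar encoder is a simplified stand-in, not the actual Arrow wire format.

```python
import json
import struct

# Hypothetical sample rows; the column names are made up for illustration.
rows = [{"ts": 1_700_000_000 + i, "status": 200, "bytes": 512 + i} for i in range(1000)]


def encode_json_each_row(rows):
    """Row-based and text-based: one JSON object per line, keys repeated in every row."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in rows).encode("utf-8")


def encode_columnar(rows):
    """Column-based and binary: each column packed as one contiguous fixed-width array.

    This is NOT the Arrow format, only a stand-in for the general idea.
    """
    n = len(rows)
    ts = struct.pack(f"<{n}q", *(r["ts"] for r in rows))          # int64 column
    status = struct.pack(f"<{n}H", *(r["status"] for r in rows))  # uint16 column
    size = struct.pack(f"<{n}I", *(r["bytes"] for r in rows))     # uint32 column
    return ts + status + size


json_payload = encode_json_each_row(rows)
columnar_payload = encode_columnar(rows)
print(len(json_payload), len(columnar_payload))
```

The text payload repeats every key and spells out every number as digits, so the server has to tokenize and convert each value; the packed columns can be read as typed arrays directly. Payload size is only a proxy here and says nothing precise about ClickHouse's actual parsing cost.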

This has become a significant concern for us, since we are ingesting hundreds of thousands of rows per second via Vector. Despite adding compression and batching via asynchronous inserts, we are still experiencing significant overhead.

Based on the official benchmarks, JSONEachRow is roughly 4-5x less efficient than the ArrowStream or Native formats.

Attempted Solutions

The ideal solution would be to use the clickhouse-rs crate, but unfortunately it does not yet implement the Native format.

Implementing the Native format from scratch is quite tricky, and the result would probably be hard to maintain.

Adding support for ArrowStream is attractive because Arrow's columnar, zero-copy design makes it very efficient, and the encoder could potentially be reused by current and future sinks (thinking of Snowflake and DuckDB, among others).

Proposal

Add a sink-level encoder that automatically fetches the target table schema at sink initialization:

  • Queries system.columns to get column names and ClickHouse types
  • Maps ClickHouse types to equivalent Arrow types
  • Builds an Arrow schema used for encoding batches
  • Sends data using ClickHouse's ArrowStream format endpoint
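The schema-fetch and type-mapping steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the function names are hypothetical, Arrow types are represented as plain strings instead of a real Arrow library, and only a handful of scalar types are covered (the choice of `utf8` for `String` is itself an assumption).

```python
import re

# Query the sink might run at initialization. `system.columns` exposes
# `database`, `table`, `name`, `type` and `position`; the {db}/{table}
# parameters are placeholders to be bound by the client.
SCHEMA_QUERY = (
    "SELECT name, type FROM system.columns "
    "WHERE database = {db:String} AND table = {table:String} "
    "ORDER BY position"
)

# Partial, illustrative mapping from ClickHouse scalar types to Arrow type names.
SCALAR_TYPES = {
    "Int8": "int8", "Int16": "int16", "Int32": "int32", "Int64": "int64",
    "UInt8": "uint8", "UInt16": "uint16", "UInt32": "uint32", "UInt64": "uint64",
    "Float32": "float32", "Float64": "float64",
    "Bool": "bool",
    "String": "utf8",  # assumption: treat String as UTF-8 text
}

WRAPPER = re.compile(r"^(Nullable|LowCardinality)\((.*)\)$")


def to_arrow_field(name, ch_type):
    """Return (name, arrow_type, nullable) for one ClickHouse column (hypothetical helper)."""
    nullable = False
    # Peel off Nullable(...) and LowCardinality(...) wrappers, remembering nullability.
    while (m := WRAPPER.match(ch_type)):
        if m.group(1) == "Nullable":
            nullable = True
        ch_type = m.group(2)
    arrow_type = SCALAR_TYPES.get(ch_type)
    if arrow_type is None:
        # Fail loudly at startup rather than silently mis-encoding a column.
        raise ValueError(f"unsupported ClickHouse type for column {name!r}: {ch_type}")
    return (name, arrow_type, nullable)


def build_schema(columns):
    """Build an ordered list of fields from (name, type) rows returned by SCHEMA_QUERY."""
    return [to_arrow_field(name, ch_type) for name, ch_type in columns]
```

For example, `build_schema([("id", "UInt64"), ("msg", "LowCardinality(Nullable(String))")])` yields one non-nullable `uint64` field and one nullable `utf8` field. Rejecting unknown types at initialization is a design choice of this sketch; a real encoder would also need to cover dates, decimals, arrays, and other composite types.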

References

https://clickhouse.com/docs/en/interfaces/formats#arrowstream

Version

No response

Labels

type: feature (a value-adding code addition that introduces new functionality)