Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
The ClickHouse sink for Vector is currently limited to the following formats: JSONEachRow, JSONAsObject, and JSONAsString. While convenient, these formats are computationally expensive for ClickHouse to parse and do not scale well, since they are (1) row-based and (2) text-based rather than binary.
This has become a significant concern for us, since we are ingesting hundreds of thousands of rows per second via Vector. Despite adding compression and batching via asynchronous inserts, we are still seeing substantial parsing overhead on the ClickHouse side.
Based on the official benchmarks, JSONEachRow is roughly 4-5x slower to ingest than the ArrowStream or Native formats.
Attempted Solutions
The ideal solution would be to use the clickhouse-rs crate, but unfortunately it does not yet implement the Native format.
Implementing the Native format from scratch is quite tricky and would likely be hard to maintain.
Adding support for ArrowStream is attractive: it is extremely performant thanks to zero-copy decoding, and the encoder could potentially be reused by current and future sinks (thinking of Snowflake and DuckDB, among others).
Proposal
Add a sink-level encoder that automatically fetches the target table schema at sink initialization:
- Queries system.columns to get column names and ClickHouse types
- Maps ClickHouse types to equivalent Arrow types
- Builds an Arrow schema used for encoding batches
- Sends data using ClickHouse's ArrowStream format endpoint
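The steps above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the function names, the subset of types covered, and the fallback behavior are all hypothetical, and a real encoder would build actual `arrow` `DataType` values rather than type names.

```rust
// Hypothetical sketch of the schema-fetching step.
// Assumptions (not from the issue): function names, the exact type
// coverage, and returning Arrow type names as strings for illustration.

/// Build the query the sink would run once at initialization to
/// discover the target table's columns and ClickHouse types.
fn schema_query(database: &str, table: &str) -> String {
    format!(
        "SELECT name, type FROM system.columns \
         WHERE database = '{database}' AND table = '{table}' \
         ORDER BY position"
    )
}

/// Map a ClickHouse type (as reported by system.columns) to an
/// equivalent Arrow type. Coverage here is illustrative only.
fn clickhouse_to_arrow(ch_type: &str) -> Option<&'static str> {
    match ch_type {
        "UInt8" => Some("UInt8"),
        "UInt64" => Some("UInt64"),
        "Int64" => Some("Int64"),
        "Float64" => Some("Float64"),
        "String" => Some("Utf8"),
        "DateTime" => Some("Timestamp(Second)"),
        // Arrow tracks nullability on the field, not the type,
        // so Nullable(T) maps to the same Arrow type as T.
        t if t.starts_with("Nullable(") && t.ends_with(')') => {
            clickhouse_to_arrow(&t["Nullable(".len()..t.len() - 1])
        }
        // Unsupported type: the sink could error at startup or
        // fall back to JSONEachRow.
        _ => None,
    }
}
```

The resulting `(name, Arrow type)` pairs would then be assembled into an Arrow schema, and each batch encoded with it and POSTed to ClickHouse as `INSERT INTO ... FORMAT ArrowStream`.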
References
https://clickhouse.com/docs/en/interfaces/formats#arrowstream
Version
No response