Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
The ClickHouse sink for Vector is currently limited to the following formats: JSONEachRow, JSONAsObject, and JSONAsString. While convenient, these formats are computationally expensive for ClickHouse to parse and do not scale well, since they are (1) row-based and (2) text-based rather than binary.
This has become a significant concern for us, since we are ingesting hundreds of thousands of rows per second via Vector. Despite adding compression and batching via asynchronous inserts, we are still seeing substantial parsing overhead on the ClickHouse side.
Based on the official benchmarks, JSONEachRow is roughly 4-5x slower to ingest than the ArrowStream or Native formats.
Attempted Solutions
The ideal solution would be to use the clickhouse-rs crate, but unfortunately it does not yet implement the Native format.
Implementing the Native format from scratch is quite tricky and would likely be hard to maintain.
Adding support for ArrowStream is attractive: it is extremely performant thanks to zero-copy decoding, and the encoder could potentially be reused by current and future sinks (thinking of Snowflake and DuckDB, among others).
Proposal
Add a sink-level encoder that automatically fetches the target table schema at sink initialization:
- Queries system.columns to get column names and ClickHouse types
- Maps ClickHouse types to equivalent Arrow types
- Builds an Arrow schema used for encoding batches
- Sends data using ClickHouse's ArrowStream format endpoint
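The steps above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the function names, the subset of types covered, and the fallback behavior are all hypothetical, and a real encoder would build actual `arrow` `DataType` values rather than type names.

```rust
// Hypothetical sketch of the schema-fetching step.
// Assumptions (not from the issue): function names, the exact type
// coverage, and returning Arrow type names as strings for illustration.

/// Build the query the sink would run once at initialization to
/// discover the target table's columns and ClickHouse types.
fn schema_query(database: &str, table: &str) -> String {
    format!(
        "SELECT name, type FROM system.columns \
         WHERE database = '{database}' AND table = '{table}' \
         ORDER BY position"
    )
}

/// Map a ClickHouse type (as reported by system.columns) to an
/// equivalent Arrow type. Coverage here is illustrative only.
fn clickhouse_to_arrow(ch_type: &str) -> Option<&'static str> {
    match ch_type {
        "UInt8" => Some("UInt8"),
        "UInt64" => Some("UInt64"),
        "Int64" => Some("Int64"),
        "Float64" => Some("Float64"),
        "String" => Some("Utf8"),
        "DateTime" => Some("Timestamp(Second)"),
        // Arrow tracks nullability on the field, not the type,
        // so Nullable(T) maps to the same Arrow type as T.
        t if t.starts_with("Nullable(") && t.ends_with(')') => {
            clickhouse_to_arrow(&t["Nullable(".len()..t.len() - 1])
        }
        // Unsupported type: the sink could error at startup or
        // fall back to JSONEachRow.
        _ => None,
    }
}
```

The resulting `(name, Arrow type)` pairs would then be assembled into an Arrow schema, and each batch encoded with it and POSTed to ClickHouse as `INSERT INTO ... FORMAT ArrowStream`.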
References
https://clickhouse.com/docs/en/interfaces/formats#arrowstream
Version
No response