Description
I've been testing ClickHouse as a potential database for work and was excited to see a Rust client for it. I wrote a small program that randomly generates data matching our schema and then inserts large amounts of it into the database, with the intent of both getting to know ClickHouse better and seeing whether it meets our insert and query needs.
Running my test, though, I'm hitting memory errors that look like this issue on the ClickHouse repo. I've been trying to troubleshoot, but I'm just not familiar enough with ClickHouse yet to nail down what's causing the problem or what exactly the problem is.
Here's my little test program:
use std::{thread, time};

use clickhouse::{error::Result, Client};

#[tokio::main]
async fn main() -> Result<()> {
    let row_size = std::mem::size_of::<DragonflyRow>();
    let bytes_in_billion_rows = 1_000_000_000 * row_size;
    // insert one billion rows in batches of 10,000
    // I've done this in various batch sizes from 10 to 10,000
    let total_rows_to_insert = 1_000_000_000;
    let batch_size = 10_000;
    // start a clickhouse client
    let client = Client::default().with_url("http://localhost:8123");
    // create an "inserter"
    let mut inserter = client
        .inserter("dragonfly")? // table name
        .with_max_entries(10_000);
    let mut inserted_so_far = 0;
    for i in 1..((total_rows_to_insert / batch_size) + 1) {
        for j in 1..(batch_size + 1) {
            // the row is a randomly generated/populated struct that matches the db schema
            inserter.write(&DragonflyRow::rand_new()).await?;
            inserted_so_far = i * j;
        }
        inserter.commit().await?;
        // sleep two seconds to not potentially overwhelm clickhouse
        thread::sleep(time::Duration::from_secs(2));
    }
    // close the inserter
    inserter.end().await?;
    Ok(())
}

My table is very simple: no nested objects, and the engine is just a MergeTree ordered by a timestamp column.
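DragonflyRow itself isn't shown above; for context, it's just a flat struct that derives the crate's Row trait and serde::Serialize. A minimal sketch is below, but the field names, types, and the rand_new() body are made up for illustration here, not our real schema.

use clickhouse::Row;
use rand::Rng;
use serde::Serialize;

// Hypothetical stand-in for our real row type; the actual struct has more
// columns, but the shape is the same: a flat struct of simple column types.
#[derive(Row, Serialize)]
struct DragonflyRow {
    timestamp: u64, // the column the MergeTree table is ordered by
    source_id: u32,
    value: f64,
}

impl DragonflyRow {
    // randomly populate a row for the insert test
    fn rand_new() -> Self {
        let mut rng = rand::thread_rng();
        Self {
            timestamp: rng.gen(),
            source_id: rng.gen(),
            value: rng.gen(),
        }
    }
}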
When I run this with batch sizes under 1,000 rows, I get this error:
Error: BadResponse("Code: 33. DB::Exception: Cannot read all data. Bytes read: 582754. Bytes expected: 1838993.: (at row 1)\n: While executing BinaryRowInputFormat. (CANNOT_READ_ALL_DATA) (version 21.11.3.6 (official build))")
When I run this with a batch size of 10,000, I get this instead:
Error: BadResponse("Code: 49. DB::Exception: Too large size (9223372036854775808) passed to allocator. It indicates an error.: While executing BinaryRowInputFormat. (LOGICAL_ERROR) (version 21.11.3.6 (official build))")
Based on the similar ClickHouse issues I've found, I suspect something is going wrong in how the BinaryRowInputFormat inserts are being executed, but being newer to ClickHouse I'm not very confident about that. Today I hope to follow up with a similar test that skips this client library entirely: I'll connect to the port directly and send raw HTTP requests, and see whether I hit the same issues.
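For what it's worth, the follow-up I have in mind is roughly the sketch below. It assumes the reqwest and serde_json crates, my dragonfly table, and the same made-up column names as above; it posts an INSERT over ClickHouse's HTTP interface using JSONEachRow rather than the RowBinary path that seems to be involved in the errors.

use serde_json::json;

// Sketch of the raw-HTTP follow-up test: POST an INSERT directly to
// ClickHouse's HTTP interface on port 8123, bypassing the client library.
async fn raw_http_insert() -> Result<(), Box<dyn std::error::Error>> {
    let http = reqwest::Client::new();

    // build a small JSONEachRow payload (one JSON object per line)
    let mut body = String::new();
    for i in 0..10_000u64 {
        body.push_str(&json!({ "timestamp": i, "source_id": 1, "value": 0.5 }).to_string());
        body.push('\n');
    }

    let resp = http
        .post("http://localhost:8123")
        .query(&[("query", "INSERT INTO dragonfly FORMAT JSONEachRow")])
        .body(body)
        .send()
        .await?;

    println!("status: {}", resp.status());
    println!("{}", resp.text().await?);
    Ok(())
}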
Similar ClickHouse issues I've found:
- Consuming big amount of memory when loading via ODBC bridge. ClickHouse#23778
- Cannot read all data. Bytes read: xxx. Bytes expected:xxx ClickHouse#23719
- Frequent DB::Exception: Memory limit (total) exceeded while inserting ClickHouse#22437
- MemoryTracker wrong total: Memory limit (total) exceeded, but no real usage ClickHouse#15932
I'm on Ubuntu 20.04 with 4 CPU cores, 10 GB RAM, and 32 GB of disk.
Looking at htop output while the program is running, I don't see much that helps aside from a lot of clickhouse-server threads (maybe about 50).
For reference, I have no problem inserting a 4 GB JSON file with clickhouse-client --query "<insert statement>" < file.json.
I'm happy to help if there are more questions about this issue.