Ability to control the maximum batch size inserted to ClickHouse #418

Closed
aadant opened this issue Dec 15, 2023 · 3 comments · Fixed by #414
Labels: enhancement (New feature or request), GA-1 (release scheduled Dec 2023), high-priority, qa-verified

Comments

@aadant (Collaborator) commented Dec 15, 2023

The number of rows inserted into ClickHouse per INSERT query varies widely:

SELECT
    event_time,
    written_rows
FROM system.query_log
WHERE (event_time > (now() - toIntervalDay(7))) AND (type = 'QueryFinish') AND (query ILIKE 'insert into `table`(%')
ORDER BY written_rows DESC
LIMIT 10

Query id: e22904f5-173f-40f7-9e97-59cc6a1af3f0

┌──────────event_time─┬─written_rows─┐
│ 2023-12-15 11:44:45 │       735393 │
│ 2023-12-15 08:57:55 │       144627 │
│ 2023-12-15 07:49:59 │       113469 │
│ 2023-12-15 05:02:37 │        58897 │
│ 2023-12-14 20:39:10 │        50021 │
│ 2023-12-15 06:10:03 │        40449 │
│ 2023-12-15 11:10:42 │        37999 │
│ 2023-12-15 08:58:06 │        35819 │
│ 2023-12-15 12:53:22 │        30021 │
│ 2023-12-15 12:56:12 │        30006 │
└─────────────────────┴──────────────┘

10 rows in set. Elapsed: 0.240 sec. Processed 9.62 million rows, 4.62 GB (40.14 million rows/s., 19.27 GB/s.)

I think we should have a maximum batch size configurable at the sink-connector level.

By default, ClickHouse forms insert blocks of at most roughly 1M rows and 256 MB, controlled by:

--min_insert_block_size_bytes https://clickhouse.com/docs/en/operations/settings/settings#min-insert-block-size-bytes
--min_insert_block_size_rows https://clickhouse.com/docs/en/operations/settings/settings#min-insert-block-size-rows
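
For illustration only (not from the original report; values are placeholders), these thresholds can also be lowered per session so that the server squashes incoming data into smaller blocks:

SET min_insert_block_size_rows = 500000;
SET min_insert_block_size_bytes = 134217728; -- 128 MiB

The same settings can also be attached to an individual INSERT via its SETTINGS clause.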

SELECT
    event_time,
    database,
    table,
    rows,
    duration_ms,
    size_in_bytes
FROM system.part_log
WHERE (event_time > (now() - toIntervalDay(4))) AND (database = 'db') AND (table = 'table') AND (event_type = 'NewPart')
ORDER BY rows DESC
LIMIT 20

Query id: 8691057c-acc4-45bb-9462-a8d5224398c0

┌──────────event_time─┬─database─────┬─table─┬────rows─┬─duration_ms─┬─size_in_bytes─┐
│ 2023-12-13 00:29:31 │ db           │ table │ 1048449 │        1040 │     139016156 │
│ 2023-12-13 00:24:02 │ db           │ table │ 1048449 │         958 │     133973070 │
│ 2023-12-13 00:23:58 │ db           │ table │ 1048449 │        1045 │     134714867 │
│ 2023-12-12 17:46:41 │ db           │ table │ 1048449 │         950 │     149667612 │
│ 2023-12-12 17:21:08 │ db           │ table │ 1048449 │         946 │     150494029 │
│ 2023-12-12 17:25:20 │ db           │ table │ 1048449 │        1011 │     151809027 │

More importantly, the replication position that gets persisted should never be ahead of the minimum position across all in-flight batches. But that's a different issue; see #230.

@aadant added the enhancement and high-priority labels Dec 15, 2023
@subkanthi (Collaborator) commented:

                .notifying((records, committer) -> {
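                    // Debezium embedded-engine ChangeConsumer (excerpt): the engine delivers
                    // change events in batches; each record is acknowledged with markProcessed()
                    // and the batch as a whole is completed with markBatchFinished().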

                    for (ChangeEvent<String, String> r : records) {
                        assertThat(r.key()).isNotNull();
                        assertThat(r.value()).isNotNull();
                        try {
                            final Document key = DocumentReader.defaultReader().read(r.key());
                            final Document value = DocumentReader.defaultReader().read(r.value());
                            assertThat(key.getInteger("id")).isEqualTo(1);
                            assertThat(value.getDocument("after").getInteger("id")).isEqualTo(1);
                            assertThat(value.getDocument("after").getString("val")).isEqualTo("value1");
                        }
                        catch (IOException e) {
                            throw new IllegalStateException(e);
                        }
                        allLatch.countDown();
                        committer.markProcessed(r);
                    }
                    committer.markBatchFinished();

@subkanthi added the GA-1 label Dec 18, 2023
@subkanthi (Collaborator) commented:


# Max number of records for the flush buffer.
#buffer.max.records: "10000"
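# Illustrative override (example value, not from this issue): lowering the flush
# buffer limits how many records end up in a single insert batch.
buffer.max.records: "5000"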

@Selfeer (Collaborator) commented Feb 8, 2024

The issue was manually verified by the Altinity QA team and marked as qa-verified.

Build used for testing: altinityinfra/clickhouse-sink-connector:443-14a14ba5c18fb3bc9f57c48d9a5079d4d6fe15e8-lt

We've verified that modifying the buffer.max.records configuration for clickhouse-sink-connector-lt controls the maximum number of records per insert batch.

@Selfeer added the qa-verified label Feb 8, 2024