Add support for hive partition style reads and writes #76802


Open: wants to merge 156 commits into base: master

Conversation

@arthurpassos (Contributor) commented Feb 26, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add support for hive partition style reads and writes

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos (Contributor, Author)

I'll add tests soon

@alexey-milovidov (Member) commented Feb 26, 2025

This is good, and it could replace a problematic feature: #23051
Change to "New Feature".

Let's make sure names and values are URL-encoded.
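For illustration, a minimal sketch (not ClickHouse's implementation) of percent-encoding hive partition names and values, using Python's `urllib`:

```python
from urllib.parse import quote

def hive_partition_fragment(columns):
    """Build a hive-style path fragment, percent-encoding names and values."""
    # safe="" also encodes "/" so a value cannot escape into a parent path.
    return "/".join(
        f"{quote(str(name), safe='')}={quote(str(value), safe='')}"
        for name, value in columns
    )

print(hive_partition_fragment([("year", 2025), ("country", "a b/c")]))
# -> year=2025/country=a%20b%2Fc
```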

One potential problem is memory usage when writing to many partitions at the same time. Let's define a limit so that we create at most that many buffers for concurrent writes to S3.
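A hypothetical sketch of the bounded-buffers idea (class and callback names are invented, not ClickHouse code): keep at most N in-memory partition buffers, and flush the largest one through a sink whenever the cap is exceeded.

```python
class BoundedPartitionWriter:
    """Sketch: cap concurrent per-partition write buffers to bound memory."""

    def __init__(self, max_buffers, sink):
        self.max_buffers = max_buffers
        self.sink = sink          # callable(partition_key, rows)
        self.buffers = {}         # partition_key -> list of buffered rows

    def write(self, partition_key, row):
        self.buffers.setdefault(partition_key, []).append(row)
        if len(self.buffers) > self.max_buffers:
            # Evict the partition holding the most rows.
            biggest = max(self.buffers, key=lambda p: len(self.buffers[p]))
            self.sink(biggest, self.buffers.pop(biggest))

    def close(self):
        # Flush whatever is still buffered.
        for partition_key, rows in self.buffers.items():
            self.sink(partition_key, rows)
        self.buffers.clear()
```

A real implementation would flush into (multipart) S3 uploads rather than a callback; the eviction policy here is only one possible choice.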

Do we control where exactly the path fragment with partition goes in the URL?
Do we control if the partition columns will be written or omitted from the files?

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Feb 26, 2025

clickhouse-gh bot commented Feb 26, 2025

Workflow [PR], commit [27e26d9]

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Feb 26, 2025
@bharatnc bharatnc changed the title Add suport for hive partition style writes Add support for hive partition style writes Feb 26, 2025
@arthurpassos (Contributor, Author) commented Feb 27, 2025

> Change to "New Feature".

Done

> Let's make sure names and values are URL-encoded.

Ack

@arthurpassos (Contributor, Author) commented Feb 27, 2025

> Do we control where exactly the path fragment with partition goes in the URL?

As of now, it is up to the user to choose where in the path the partition goes, by using the {_partition_id} macro/placeholder when creating the table.

Maybe we can make it simpler: the user is responsible only for defining the table root. ClickHouse generates the rest (partition key location, filename and file extension).

`engine = s3('bucket/table_root') partition by (year, country)` -> `'bucket/table_root/year=2025/country=spain/<generated_uuid>.<format from table>'`

If the user specifies the {_partition_id} placeholder and uses use_hive=1, we throw an exception.
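A sketch of the proposed layout (the function, and the format-to-extension mapping, are hypothetical, not ClickHouse code): the user supplies only the table root, and the rest of the object key is generated.

```python
import uuid

# Illustrative extension mapping; the real one would come from the table's format.
_EXTENSIONS = {"Parquet": "parquet", "CSV": "csv", "JSONEachRow": "jsonl"}

def build_object_key(table_root, partition_values, fmt):
    """table_root + hive fragment + generated uuid + format extension."""
    fragment = "/".join(f"{name}={value}" for name, value in partition_values)
    ext = _EXTENSIONS.get(fmt, fmt.lower())
    return f"{table_root}/{fragment}/{uuid.uuid4()}.{ext}"

key = build_object_key("bucket/table_root", [("year", 2025), ("country", "spain")], "Parquet")
# e.g. bucket/table_root/year=2025/country=spain/<uuid>.parquet
```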

What do you think? @alexey-milovidov

@arthurpassos (Contributor, Author)

> Do we control if the partition columns will be written or omitted from the files?

I suppose that could be implemented, but perhaps we should leave it for a follow-up PR?

@alexey-milovidov (Member)

> Maybe we can make it simpler: user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) clickhouse will generate.

Yes, this is a great idea!

This PR is good, but before merging it I'd also like to see the memory consumption problem with PARTITION BY fixed. It's an old flaw of the current mechanism. This new feature will make it more frequently used, and users will run into the problem more often.

@arthurpassos (Contributor, Author) commented Feb 28, 2025

> > Maybe we can make it simpler: user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) clickhouse will generate.
>
> Yes, this is a great idea!
>
> This PR is good, but what I'd like to see in addition, before merging it, is fixing the memory consumption problem with PARTITION BY. It's an old flaw of the current mechanism. Having this new feature will make it more frequently used, and the users will bump into this problem more frequently.

Is there an issue that describes this problem in depth? I could look into it.

@clickhouse-gh clickhouse-gh bot added pr-feature Pull request with new product feature and removed pr-improvement Pull request with some product improvements labels Mar 3, 2025
@arthurpassos (Contributor, Author) commented on the diff:

```cpp
@@ -42,13 +42,29 @@ Names extractPartitionRequiredColumns(const ASTPtr & partition_by, const Block &
    return exp_analyzer->getRequiredColumns();
}

static std::string formatToFileExtension(const std::string & format)
```

needs to be completed
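Since the C++ helper is marked as incomplete, here is a hedged Python sketch of what such a mapping might look like; the table of extensions is an assumption for illustration, not ClickHouse's actual mapping.

```python
# Assumed mapping for illustration only; not ClickHouse's actual table.
FORMAT_TO_EXTENSION = {
    "Parquet": "parquet",
    "CSV": "csv",
    "CSVWithNames": "csv",
    "TSV": "tsv",
    "JSONEachRow": "jsonl",
    "ORC": "orc",
    "Arrow": "arrow",
    "Native": "native",
}

def format_to_file_extension(fmt):
    # Fall back to the lowercased format name for anything unlisted.
    return FORMAT_TO_EXTENSION.get(fmt, fmt.lower())
```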

@arthurpassos (Contributor, Author)

Messed up, need to re-think a couple of things

@arthurpassos (Contributor, Author)

The awkward part of this is keeping backwards compatibility with {_partition_id}

@arthurpassos (Contributor, Author)

@alexey-milovidov, a couple of questions:

  1. The existing use_hive_partitioning setting is used for something else, and it can be changed between table creation and data insertion. We need a new variable to control the partitioning style at the table level. Should it be a new setting or a new argument to table engines, e.g. partitioning_strategy=['hive' | 'simple', others in the future]? If we vote for an argument, it should be implemented for the S3, File and URL table engines.
  2. We have settled on asking the user to specify only the table root, and we generate the rest (partition-style path, filename and file extension). That said, should we forbid creating a table with the {_partition_id} macro when the partition strategy is hive style?

@arthurpassos (Contributor, Author)

Once I am done with this PR, I'll look into the max threads/streams thing

@arthurpassos (Contributor, Author)

@kssenii I have resolved most of the TODOs and your comments. The pending comments are: #76802 (comment) and #76802 (comment).

From a feature perspective, I think there are a few things pending:

  1. Decide whether we are going to support it for other engines as well (i.e. File, URL and Azure)
  2. Implement partition expression validation. I think we should allow only primitive types to be hive partitioned, not complex expressions.
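A sketch of the proposed validation (the type list, tuple shape, and function name are assumptions for illustration): hive partition keys must be plain columns of primitive types, not expressions.

```python
# Assumed set of allowed primitive type names, for illustration only.
PRIMITIVE_TYPES = {"Int8", "Int32", "Int64", "UInt32", "UInt64", "String", "Date", "DateTime"}

def validate_partition_key(columns):
    """columns: iterable of (name, type_name, is_plain_column) tuples."""
    for name, type_name, is_plain in columns:
        if not is_plain:
            raise ValueError(f"hive partitioning requires a plain column, got an expression for '{name}'")
        if type_name not in PRIMITIVE_TYPES:
            raise ValueError(f"hive partitioning does not support type {type_name} for column '{name}'")
```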

Could you please re-review and share your thoughts?

@ClickHouse deleted 13 comments from the clickhouse-gh bot, Jun 20, 2025
Labels: can be tested (allows running workflows for external contributors), pr-feature (pull request with new product feature)
6 participants