Add support for hive partition style reads and writes #76802


Open: wants to merge 156 commits into base: master

Conversation

@arthurpassos (Contributor) commented Feb 26, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add support for hive partition style reads and writes

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos (Contributor, Author)

I'll add tests soon

@alexey-milovidov (Member) commented Feb 26, 2025

This is good, and it could replace a problematic feature: #23051
Change to "New Feature".

Let's make sure names and values are URL-encoded.
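For illustration, a minimal sketch (not ClickHouse's implementation) of percent-encoding hive partition names and values, using Python's `urllib`:

```python
from urllib.parse import quote

def hive_partition_fragment(columns):
    """Build a hive-style path fragment, percent-encoding names and values."""
    # safe="" also encodes "/" so a value cannot escape into a parent path.
    return "/".join(
        f"{quote(str(name), safe='')}={quote(str(value), safe='')}"
        for name, value in columns
    )

print(hive_partition_fragment([("year", 2025), ("country", "a b/c")]))
# -> year=2025/country=a%20b%2Fc
```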

One potential problem is memory usage when writing to many partitions at the same time. Let's define a limit so that we create at most that many buffers for concurrent writes to S3.
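A hypothetical sketch of the bounded-buffers idea (class and callback names are invented, not ClickHouse code): keep at most N in-memory partition buffers, and flush the largest one through a sink whenever the cap is exceeded.

```python
class BoundedPartitionWriter:
    """Sketch: cap concurrent per-partition write buffers to bound memory."""

    def __init__(self, max_buffers, sink):
        self.max_buffers = max_buffers
        self.sink = sink          # callable(partition_key, rows)
        self.buffers = {}         # partition_key -> list of buffered rows

    def write(self, partition_key, row):
        self.buffers.setdefault(partition_key, []).append(row)
        if len(self.buffers) > self.max_buffers:
            # Evict the partition holding the most rows.
            biggest = max(self.buffers, key=lambda p: len(self.buffers[p]))
            self.sink(biggest, self.buffers.pop(biggest))

    def close(self):
        # Flush whatever is still buffered.
        for partition_key, rows in self.buffers.items():
            self.sink(partition_key, rows)
        self.buffers.clear()
```

A real implementation would flush into (multipart) S3 uploads rather than a callback; the eviction policy here is only one possible choice.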

Do we control where exactly the path fragment with partition goes in the URL?
Do we control if the partition columns will be written or omitted from the files?

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Feb 26, 2025

clickhouse-gh bot commented Feb 26, 2025

Workflow [PR], commit [27e26d9]

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Feb 26, 2025
@bharatnc bharatnc changed the title Add suport for hive partition style writes Add support for hive partition style writes Feb 26, 2025
@arthurpassos (Contributor, Author) commented Feb 27, 2025

> Change to "New Feature".

Done

> Let's make sure names and values are URL-encoded.

Ack

@arthurpassos (Contributor, Author) commented Feb 27, 2025

> Do we control where exactly the path fragment with partition goes in the URL?

As of now, it is up to the user to choose where in the path the partition goes, by using the {_partition_id} macro/placeholder when creating the table.

Maybe we can make it simpler: the user is responsible only for defining the table root. ClickHouse generates the rest (partition key location, filename and file extension).

`engine = s3('bucket/table_root') partition by (year, country)` -> `'bucket/table_root/year=2025/country=spain/<generated_uuid>.<format from table>'`

If the user specifies the {_partition_id} placeholder and uses use_hive=1, we throw an exception.
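A sketch of the proposed layout (the function, and the format-to-extension mapping, are hypothetical, not ClickHouse code): the user supplies only the table root, and the rest of the object key is generated.

```python
import uuid

# Illustrative extension mapping; the real one would come from the table's format.
_EXTENSIONS = {"Parquet": "parquet", "CSV": "csv", "JSONEachRow": "jsonl"}

def build_object_key(table_root, partition_values, fmt):
    """table_root + hive fragment + generated uuid + format extension."""
    fragment = "/".join(f"{name}={value}" for name, value in partition_values)
    ext = _EXTENSIONS.get(fmt, fmt.lower())
    return f"{table_root}/{fragment}/{uuid.uuid4()}.{ext}"

key = build_object_key("bucket/table_root", [("year", 2025), ("country", "spain")], "Parquet")
# e.g. bucket/table_root/year=2025/country=spain/<uuid>.parquet
```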

What do you think? @alexey-milovidov

@arthurpassos (Contributor, Author)

> Do we control if the partition columns will be written or omitted from the files?

I suppose that could be implemented, but perhaps we should leave it for a follow-up PR?

@alexey-milovidov (Member)

> Maybe we can make it simpler: user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) clickhouse will generate.

Yes, this is a great idea!

This PR is good, but before merging it I'd also like to see the memory consumption problem with PARTITION BY fixed. It's an old flaw of the current mechanism. This new feature will make it more frequently used, and users will run into the problem more often.

@arthurpassos (Contributor, Author) commented Feb 28, 2025

> > Maybe we can make it simpler: user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) clickhouse will generate.
>
> Yes, this is a great idea!
>
> This PR is good, but what I'd like to see in addition, before merging it, is fixing the memory consumption problem with PARTITION BY. It's an old flaw of the current mechanism. Having this new feature will make it more frequently used, and the users will bump into this problem more frequently.

Is there an issue that describes this problem in depth? I could look into it.

@clickhouse-gh clickhouse-gh bot added pr-feature Pull request with new product feature and removed pr-improvement Pull request with some product improvements labels Mar 3, 2025
@arthurpassos (Contributor, Author) commented on the diff:

```cpp
@@ -42,13 +42,29 @@ Names extractPartitionRequiredColumns(const ASTPtr & partition_by, const Block &
    return exp_analyzer->getRequiredColumns();
}

static std::string formatToFileExtension(const std::string & format)
```

needs to be completed
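Since the C++ helper is marked as incomplete, here is a hedged Python sketch of what such a mapping might look like; the table of extensions is an assumption for illustration, not ClickHouse's actual mapping.

```python
# Assumed mapping for illustration only; not ClickHouse's actual table.
FORMAT_TO_EXTENSION = {
    "Parquet": "parquet",
    "CSV": "csv",
    "CSVWithNames": "csv",
    "TSV": "tsv",
    "JSONEachRow": "jsonl",
    "ORC": "orc",
    "Arrow": "arrow",
    "Native": "native",
}

def format_to_file_extension(fmt):
    # Fall back to the lowercased format name for anything unlisted.
    return FORMAT_TO_EXTENSION.get(fmt, fmt.lower())
```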

@arthurpassos (Contributor, Author)

Messed up, need to re-think a couple of things

@arthurpassos (Contributor, Author)

The awkward part of this is keeping backwards compatibility with {_partition_id}

@arthurpassos (Contributor, Author)

@alexey-milovidov, a couple of questions:

  1. The existing use_hive_partitioning setting is used for something else, and it can be changed between table creation and data insertion. We need a new variable to control the partitioning style at the table level. Should it be a new setting or a new argument to table engines, e.g. partitioning_strategy=['hive' | 'simple', others in the future]? If we vote for an argument, it should be implemented for the S3, File and URL table engines.
  2. We have settled on asking the user to specify only the table root, and we generate the rest (partition-style path, filename and file extension). That said, should we forbid creating a table with the {_partition_id} macro when the partition strategy is hive style?

@arthurpassos (Contributor, Author)

Once I am done with this PR, I'll look into the max threads/streams thing

@arthurpassos (Contributor, Author)

@kssenii I have resolved most of the TODOs and your comments. The pending comments are: #76802 (comment) and #76802 (comment).

From a feature perspective, I think there are a few things pending:

  1. Decide whether we are going to support it for other engines as well (i.e. File, URL and Azure)
  2. Implement partition expression validation. I think we should allow only primitive types to be hive partitioned, not complex expressions.
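A sketch of the proposed validation (the type list, tuple shape, and function name are assumptions for illustration): hive partition keys must be plain columns of primitive types, not expressions.

```python
# Assumed set of allowed primitive type names, for illustration only.
PRIMITIVE_TYPES = {"Int8", "Int32", "Int64", "UInt32", "UInt64", "String", "Date", "DateTime"}

def validate_partition_key(columns):
    """columns: iterable of (name, type_name, is_plain_column) tuples."""
    for name, type_name, is_plain in columns:
        if not is_plain:
            raise ValueError(f"hive partitioning requires a plain column, got an expression for '{name}'")
        if type_name not in PRIMITIVE_TYPES:
            raise ValueError(f"hive partitioning does not support type {type_name} for column '{name}'")
```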

Could you please re-review and share your thoughts?

@ClickHouse deleted 13 comments from the clickhouse-gh bot, Jun 20, 2025
Labels: can be tested (allows running workflows for external contributors), pr-feature (pull request with new product feature)
6 participants