Add {_snowflake_id} wildcard support to object storage #789

Closed
wants to merge 4 commits into from

Conversation

arthurpassos
Collaborator

Add {_snowflake_id} wildcard support to object storage paths. On write, ClickHouse generates a snowflake ID on the fly and substitutes it for the wildcard. This helps with parallel and concurrent writes to object storage, because each write gets its own file name.
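
For illustration, the write-side substitution could look roughly like the sketch below. It is not the code from this PR: `generateSnowflakeIdSketch` and `replaceSnowflakeIdWildcard` are hypothetical names, and the ID layout (41-bit millisecond timestamp, 10-bit machine id, 12-bit sequence) is an assumption modelled on the usual snowflake scheme.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the server's snowflake ID generator.
// Assumed layout: 41 bits of millisecond timestamp, 10 bits of machine id, 12 bits of sequence.
// Simplification: the sequence is a process-wide counter that is never reset per millisecond.
inline uint64_t generateSnowflakeIdSketch(uint64_t machine_id = 0)
{
    static std::atomic<uint64_t> sequence{0};
    assert(machine_id < 1024);  // must fit into 10 bits

    const auto now_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    const uint64_t timestamp = static_cast<uint64_t>(now_ms) & ((1ULL << 41) - 1);

    return (timestamp << 22) | (machine_id << 12) | (sequence.fetch_add(1) & 0xFFF);
}

// Replace every occurrence of the wildcard in `path` with the decimal form of `id`.
inline std::string replaceSnowflakeIdWildcard(std::string path, uint64_t id)
{
    static const std::string SNOWFLAKE_ID_WILDCARD = "{_snowflake_id}";
    const std::string replacement = std::to_string(id);

    std::size_t pos = 0;
    while ((pos = path.find(SNOWFLAKE_ID_WILDCARD, pos)) != std::string::npos)
    {
        path.replace(pos, SNOWFLAKE_ID_WILDCARD.size(), replacement);
        pos += replacement.size();  // continue after the inserted digits
    }
    return path;
}
```

With this sketch, `replaceSnowflakeIdWildcard("path_to_table_root/{_snowflake_id}.parquet", generateSnowflakeIdSketch())` yields a fresh object key per write, so concurrent writers do not have to coordinate on file names.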

Also introduce a new setting object_storage_treat_key_related_wildcards_as_star to allow symmetrical reads & writes using a single table. Why is it needed? Consider the following:

CREATE TABLE ... s3('path_to_table_root/**.parquet')

Ok, we can select from it, but how do we write? How do we name the files? In which directory?

Therefore, we introduced the snowflake id.

CREATE TABLE ... s3('path_to_table_root/{_snowflake_id}.parquet') - we can now write to it because we know the file location and we know how to name it.

But how do we read now? The path isn't globbed anymore. That's what the setting is for.
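
A minimal sketch of the read side, assuming the setting behaves like a plain textual rewrite of `{_snowflake_id}` into `*` before listing objects (the description above does not spell out the exact mechanism). Both function names are hypothetical, and the matcher is a reduced glob that only supports `*`, matching any run of characters except `/`.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Rewrite the write-side wildcard into a read-side glob.
// Assumption: the setting is equivalent to a textual substitution.
inline std::string treatSnowflakeIdWildcardAsStar(std::string path)
{
    static const std::string SNOWFLAKE_ID_WILDCARD = "{_snowflake_id}";
    std::size_t pos = 0;
    while ((pos = path.find(SNOWFLAKE_ID_WILDCARD, pos)) != std::string::npos)
    {
        path.replace(pos, SNOWFLAKE_ID_WILDCARD.size(), "*");
        ++pos;  // step over the inserted '*'
    }
    return path;
}

// Reduced glob matcher: '*' matches any (possibly empty) run of characters except '/'.
inline bool matchesGlob(std::string_view pattern, std::string_view key)
{
    constexpr std::size_t npos = std::string_view::npos;
    std::size_t p = 0;
    std::size_t k = 0;
    std::size_t star = npos;   // position of the last '*' seen in the pattern
    std::size_t resume = 0;    // first key character not yet consumed by that '*'

    while (k < key.size())
    {
        if (p < pattern.size() && pattern[p] == '*')
        {
            star = p++;
            resume = k;
        }
        else if (p < pattern.size() && pattern[p] == key[k])
        {
            ++p;
            ++k;
        }
        else if (star != npos && key[resume] != '/')
        {
            // Backtrack: let the last '*' swallow one more character.
            p = star + 1;
            k = ++resume;
        }
        else
        {
            return false;
        }
    }

    while (p < pattern.size() && pattern[p] == '*')
        ++p;
    return p == pattern.size();
}
```

Under these assumptions a table declared with `path_to_table_root/{_snowflake_id}.parquet` is read as `path_to_table_root/*.parquet`, which picks up every file written through the wildcard but not files in nested directories.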

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add {_snowflake_id} wildcard support to object storage paths. Also add a new setting object_storage_treat_key_related_wildcards_as_star to allow symmetrical reads & writes using a single table.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos
Collaborator Author

Cheap alternative to #697

The StorageObjectStorage::Configuration::path handling is messy because, when it was originally implemented, the path was assumed to never change. That is no longer true, and the configuration object is spread across 50+ classes. Doing this properly would require a big refactoring.

@arthurpassos changed the title from "Add {_snowflake_id} wildcard support to object s torage" to "Add {_snowflake_id} wildcard support to object storage" on May 20, 2025
@@ -5951,6 +5951,9 @@ This only affects operations performed on the client side, in particular parsing
Normally this setting should be set in user profile (users.xml or queries like `ALTER USER`), not through the client (client command line arguments, `SET` query, or `SETTINGS` section of `SELECT` query). Through the client it can be changed to false, but can't be changed to true (because the server won't send the settings if user profile has `apply_settings_from_server = false`).

Note that initially (24.12) there was a server setting (`send_settings_to_client`), but latter it got replaced with this client setting, for better usability.
)", 0) \
DECLARE(Bool, object_storage_treat_key_wildcard_as_star, false, R"(
Collaborator Author

Three options here:

  1. Off by default
  2. On by default
  3. No setting at all, default behavior

@ianton-ru left a comment

A pair of minor comments.

@@ -373,21 +374,35 @@ void StorageObjectStorage::read(
if (update_configuration_on_read)
configuration->update(object_storage, local_context);

if (partition_by && configuration->withPartitionWildcard())
auto config_clone = configuration->clone();

We clone the configuration on every read, but only modify it in specific cases.
Maybe keep a smart pointer to the original config here and clone only when required?
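
One way to read this suggestion, as a hedged sketch rather than the actual refactoring: hold a shared pointer to the original configuration and clone only on the branch that rewrites the path. `ConfigurationSketch` and `configurationForRead` are hypothetical, heavily reduced stand-ins for the real classes, and the sketch rewrites only the first occurrence of the wildcard.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical, heavily reduced stand-in for StorageObjectStorage::Configuration.
struct ConfigurationSketch
{
    std::string path;

    std::shared_ptr<ConfigurationSketch> clone() const
    {
        return std::make_shared<ConfigurationSketch>(*this);
    }

    bool withSnowflakeIdWildcard() const
    {
        return path.find("{_snowflake_id}") != std::string::npos;
    }
};

using ConfigurationPtr = std::shared_ptr<const ConfigurationSketch>;

// Returns the original pointer untouched unless the path has to be rewritten for reading.
inline ConfigurationPtr configurationForRead(const ConfigurationPtr & original, bool treat_wildcards_as_star)
{
    if (!treat_wildcards_as_star || !original->withSnowflakeIdWildcard())
        return original;  // common case: no copy is made

    // Rare case: copy, then rewrite the first wildcard occurrence in the copy only.
    static const std::string SNOWFLAKE_ID_WILDCARD = "{_snowflake_id}";
    auto copy = original->clone();
    const std::size_t pos = copy->path.find(SNOWFLAKE_ID_WILDCARD);
    copy->path.replace(pos, SNOWFLAKE_ID_WILDCARD.size(), "*");
    return copy;
}
```

The caller always receives a pointer it can use uniformly, while the shared original stays unmodified and is copied only when the path actually changes.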

@@ -659,13 +674,29 @@ bool StorageObjectStorage::Configuration::withPartitionWildcard() const
|| getNamespace().find(PARTITION_ID_WILDCARD) != String::npos;
}

bool StorageObjectStorage::Configuration::withSnowflakeIdWildcard() const
{
static const String PARTITION_ID_WILDCARD = "{_snowflake_id}";

SNOWFLAKE_ID_WILDCARD

@@ -5951,6 +5951,9 @@ This only affects operations performed on the client side, in particular parsing
Normally this setting should be set in user profile (users.xml or queries like `ALTER USER`), not through the client (client command line arguments, `SET` query, or `SETTINGS` section of `SELECT` query). Through the client it can be changed to false, but can't be changed to true (because the server won't send the settings if user profile has `apply_settings_from_server = false`).

Note that initially (24.12) there was a server setting (`send_settings_to_client`), but latter it got replaced with this client setting, for better usability.
)", 0) \
DECLARE(Bool, object_storage_treat_key_related_wildcards_as_star, false, R"(
Collaborator Author

Three options here:

  1. make it the default behavior, not behind a setting
  2. make the setting on by default
  3. keep it off by default

@vzakaznikov
Collaborator

From a usability point of view, ChatGPT suggests considering a write_path_template setting. Here is an example:

CREATE TABLE my_table
(
    id UInt64
)
ENGINE = S3('s3://bucket/data/*.parquet', 'Parquet')
SETTINGS write_path_template = '{_snowflake_id}.parquet'

@arthurpassos
Collaborator Author

> From a usability point of view, ChatGPT suggests considering a write_path_template setting. Here is an example:
>
> CREATE TABLE my_table
> (
>     id UInt64
> )
> ENGINE = S3('s3://bucket/data/*.parquet', 'Parquet')
> SETTINGS write_path_template = '{_snowflake_id}.parquet'

That sounds like a terrible idea
