
Conversation


@Bill-hbrhbr Bill-hbrhbr commented Apr 17, 2025

Description

  • Add a function to create the clp_datasets table.
  • Move creation logic for various tables into a reusable Python utility.
  • Add a helper function to create the following set of tables, with an optional dataset specification that defaults to None (see the sketch after this list):
    • clp_{dataset}_archives
    • clp_{dataset}_tags
    • clp_{dataset}_archive_tags
    • clp_{dataset}_files
    • If dataset is specified, also create clp_{dataset}_column_metadata
    • If not, create the default versions of the tables (used by clp):
      • clp_archives, clp_tags, clp_archive_tags, clp_files
  • Pass the --storage-engine flag from start-clp.py all the way through to initialize-clp-metadata-db.py, and create clp_datasets if the storage engine is clp-s.
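
Below is a minimal sketch of the helper's shape, assuming heavily simplified schemas; the real implementation (with full column lists, indexes, and foreign keys) lives in clp_py_utils/clp_metadata_db_utils.py, and only the table-naming logic here tracks the PR:

from __future__ import annotations


def _create_table(db_cursor, table_name: str, columns_sql: str) -> None:
    # Idempotent creation, mirroring the PR's per-table helpers.
    db_cursor.execute(f"CREATE TABLE IF NOT EXISTS `{table_name}` ({columns_sql})")


def create_metadata_db_tables(db_cursor, table_prefix: str, dataset: str | None = None) -> None:
    if dataset is not None:
        # Dataset-specific names, e.g. "clp_" -> "clp_web_logs_archives".
        table_prefix = f"{table_prefix}{dataset}_"
        # Only dataset-specific setups get a column-metadata table; the
        # composite (name, type) key follows the review notes below.
        _create_table(
            db_cursor,
            f"{table_prefix}column_metadata",
            "name VARCHAR(512) NOT NULL, type TINYINT NOT NULL, PRIMARY KEY (name, type)",
        )

    # Placeholder columns throughout; only the table names are meaningful here.
    _create_table(db_cursor, f"{table_prefix}archives", "id VARCHAR(64) NOT NULL, PRIMARY KEY (id)")
    _create_table(db_cursor, f"{table_prefix}tags", "tag_id INT AUTO_INCREMENT, tag_name VARCHAR(255), PRIMARY KEY (tag_id)")
    _create_table(db_cursor, f"{table_prefix}archive_tags", "archive_id VARCHAR(64), tag_id INT")
    _create_table(db_cursor, f"{table_prefix}files", "id VARCHAR(64) NOT NULL, PRIMARY KEY (id)")

Called as create_metadata_db_tables(cursor, "clp_"), this yields clp_archives, clp_tags, clp_archive_tags, and clp_files; with dataset="web_logs", it yields clp_web_logs_archives and friends, plus clp_web_logs_column_metadata.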

This PR succeeds #819 and #839, and covers the following parts of the dataset feature implementation plan:

  • The part of Step 2 regarding creating the clp_datasets table.
  • It will also help with Step 3, when we create new tables upon seeing a new dataset.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Tested with clp-package. Tables are created successfully as usual. With CLP_S, the clp_datasets table is also created.

For CLP_S:
[screenshot of the created tables]

Summary by CodeRabbit

  • New Features
    • Introduced new utilities for creating and managing metadata tables in MariaDB or MySQL databases.
  • Refactor
    • Streamlined the metadata database initialization process by consolidating multiple table creation steps into a single function call.
  • Chores
    • Added new string constants for table suffixes to improve table naming consistency.
    • Enhanced command-line interfaces to require and propagate storage engine configuration for database setup.


coderabbitai bot commented Apr 17, 2025

Walkthrough

This update introduces a new module containing utility functions for creating metadata tables in a MariaDB or MySQL database. Several string constants representing table suffixes are added to the configuration file. The table creation logic in an initialization script is refactored to use a new consolidated function from the utility module, replacing multiple explicit SQL statements with a single function call. The new module defines individual functions for creating each table and a coordinating function to create all required tables with optional dataset-specific behaviour. Additionally, command-line argument handling is enhanced in related scripts to include a required --storage-engine parameter, which influences table creation behaviour.

Changes

File(s) Change Summary
components/clp-py-utils/clp_py_utils/clp_config.py Added five string constants for table suffixes: ARCHIVE_TAGS_TABLE_SUFFIX, COLUMN_METADATA_TABLE_SUFFIX, DATASETS_TABLE_SUFFIX, FILES_TABLE_SUFFIX, and TAGS_TABLE_SUFFIX.
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py New module introducing functions for creating metadata tables, including helpers for individual tables and a coordinator function.
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py Refactored to use create_metadata_db_tables function replacing multiple explicit SQL CREATE TABLE statements; added required --storage-engine argument; converts config path to pathlib.Path; calls create_datasets_table if storage engine is CLP_S.
components/clp-py-utils/clp_py_utils/create-db-tables.py Added required --storage-engine command-line argument validated against StorageEngine enum; passes this argument to subprocess call of initialization script.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py Updated invocation of create-db-tables.py script to include the --storage-engine argument with configured storage engine value.
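
As a rough sketch of the flag handling summarized for create-db-tables.py above (hypothetical; the inline StorageEngine stand-in substitutes for the real enum in clp_py_utils.clp_config):

import argparse
import subprocess
import sys


class StorageEngine:
    # Stand-in for the real enum; only the values matter for this sketch.
    CLP = "clp"
    CLP_S = "clp-s"
    ALL = (CLP, CLP_S)


def main() -> int:
    parser = argparse.ArgumentParser(description="Creates CLP's database tables.")
    parser.add_argument("--storage-engine", required=True, choices=StorageEngine.ALL)
    args = parser.parse_args()

    # Propagate the validated flag to the metadata-DB initialization script.
    return subprocess.run(
        [sys.executable, "initialize-clp-metadata-db.py", "--storage-engine", args.storage_engine],
        check=False,
    ).returncode


if __name__ == "__main__":
    sys.exit(main())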

Sequence Diagram(s)

sequenceDiagram
    participant StartScript as start_clp.py
    participant CreateScript as create-db-tables.py
    participant InitScript as initialize-clp-metadata-db.py
    participant Utils as clp_metadata_db_utils.py
    participant DB as Database

    StartScript->>CreateScript: Run with --storage-engine argument
    CreateScript->>InitScript: Run with --storage-engine argument
    InitScript->>Utils: create_metadata_db_tables(metadata_db_cursor, table_prefix, dataset?)
    Utils->>DB: CREATE TABLE IF NOT EXISTS datasets
    Utils->>DB: CREATE TABLE IF NOT EXISTS archives
    Utils->>DB: CREATE TABLE IF NOT EXISTS tags
    Utils->>DB: CREATE TABLE IF NOT EXISTS archive_tags
    Utils->>DB: CREATE TABLE IF NOT EXISTS files
    alt If dataset specified (e.g., for CLP_S storage engine)
        Utils->>DB: CREATE TABLE IF NOT EXISTS column_metadata
    end
    InitScript->>DB: Commit transaction
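
For illustration, a guess at the datasets-table creation step from the diagram (hypothetical schema; the real create_datasets_table may carry more columns):

def create_datasets_table(db_cursor, datasets_table_name: str) -> None:
    # Global registry of known datasets. Using name alone as the primary
    # key follows the review note below; the column width is an assumption.
    db_cursor.execute(
        f"""
        CREATE TABLE IF NOT EXISTS `{datasets_table_name}` (
            name VARCHAR(255) NOT NULL,
            PRIMARY KEY (name)
        )
        """
    )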


@Bill-hbrhbr Bill-hbrhbr changed the title Add dataset tables feat(package): Add --dataset option to compression scripts and create dataset-specific tables for clp-s compression jobs. Apr 17, 2025
@Bill-hbrhbr Bill-hbrhbr changed the title feat(package): Add --dataset option to compression scripts and create dataset-specific tables for clp-s compression jobs. feat(clp-package): Add --dataset option to compression scripts and create dataset-specific tables for clp-s compression jobs. Apr 17, 2025
@Bill-hbrhbr Bill-hbrhbr marked this pull request as ready for review April 17, 2025 09:51
@Bill-hbrhbr Bill-hbrhbr requested a review from a team as a code owner April 17, 2025 09:51
@Bill-hbrhbr Bill-hbrhbr requested a review from gibber9809 April 17, 2025 09:51

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffbd533 and 3f19234.

📒 Files selected for processing (7)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (5 hunks)
  • components/job-orchestration/job_orchestration/scheduler/job_config.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • Database (76-155)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (38-44)
components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (2)
  • create_archives_table (1-18)
  • create_column_metadata_table (21-30)
🔇 Additional comments (17)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

40-41: Constants for dataset-related table suffixes look good

The addition of these constants follows the established pattern for table naming and will provide consistency when constructing dataset-specific table names throughout the codebase.

components/job-orchestration/job_orchestration/scheduler/job_config.py (2)

28-28: Dataset field added with sensible default

The addition of the dataset field with a default value of "default" maintains backward compatibility while enabling the new dataset feature.


34-34: Dataset field added with sensible default

The addition of the dataset field with a default value of "default" maintains backward compatibility while enabling the new dataset feature.

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (4)

9-16: Updated imports for new constants and utilities

The import statements have been updated to include the new constants and utility functions, which is appropriate.


44-44: Good refactoring to use the utility function

Replacing inline SQL with a call to the utility function improves maintainability and ensures consistency in table schemas.


90-93: Simplified primary key for datasets table

The primary key for the datasets table has been simplified to use only the name column, which is appropriate since dataset names should be unique.


97-99: Good refactoring to use the utility function

Using the create_column_metadata_table utility function instead of inline SQL improves maintainability and ensures consistency.

components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (2)

1-18: Well-structured utility function for archives table creation

This utility function encapsulates the SQL for creating an archives table, making it reusable across the codebase. The table structure includes all necessary fields and proper indexing.


21-30: Well-structured utility function for column metadata table creation

This utility function encapsulates the SQL for creating a column metadata table, which will be reused when creating dataset-specific metadata tables. The composite primary key on name and type is appropriate.

components/clp-package-utils/clp_package_utils/scripts/compress.py (2)

85-87: Dataset flag handling looks good!

The conditional check ensures the --dataset flag is only added when the dataset argument has a value, which is a good practice.


151-153: Implementation of the --dataset parameter looks good!

The default value "default" is appropriate for maintaining backward compatibility, and the help text clearly communicates the purpose of this parameter.

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)

143-143: Dataset parameter inclusion in FsInputConfig looks good!

The parameter is properly passed to the input configuration object, enabling dataset-specific compression for filesystem inputs.


156-156: Dataset parameter inclusion in S3InputConfig looks good!

The parameter is properly passed to the input configuration object, enabling dataset-specific compression for S3 inputs.


208-210: Implementation of the --dataset parameter looks good!

The default value "default" matches the implementation in compress.py, maintaining consistency across the codebase.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)

156-158: Function parameter update looks good!

The search_and_schedule_new_tasks function is properly updated to accept the storage engine parameter, which is needed for the dataset-specific logic.


411-412: Storage engine extraction looks good!

The storage engine is properly extracted from the configuration to be passed to the task scheduling function.


418-418: Parameter passing looks good!

The storage engine is correctly passed to the search_and_schedule_new_tasks function.

Comment on lines 218 to 228
dataset = input_config.dataset
if StorageEngine.CLP_S == clp_storage_engine and dataset not in seen_datasets:
    seen_datasets.add(dataset)
    query = f'INSERT INTO {table_prefix}{DATASETS_TABLE_SUFFIX} (name) VALUES ("{dataset}")'
    db_cursor.execute(query)
    create_archives_table(db_cursor, f"{table_prefix}{dataset}_{ARCHIVES_TABLE_SUFFIX}")
    create_column_metadata_table(
        db_cursor, f"{table_prefix}{dataset}_{COLUMN_METADATA_TABLE_SUFFIX}"
    )
    db_conn.commit()


@coderabbitai coderabbitai bot Apr 17, 2025


⚠️ Potential issue

Fix SQL injection vulnerability and add error handling

There are several issues with the dataset registration implementation:

  1. SQL injection vulnerability: The dataset name is directly inserted into the query without parameter binding
  2. Missing error handling for table creation failures
  3. No validation of dataset names (length, special characters, etc.)

Implement parameter binding and add error handling:

 dataset = input_config.dataset
 if StorageEngine.CLP_S == clp_storage_engine and dataset not in seen_datasets:
     seen_datasets.add(dataset)
-    query = f'INSERT INTO {table_prefix}{DATASETS_TABLE_SUFFIX} (name) VALUES ("{dataset}")'
-    db_cursor.execute(query)
-    create_archives_table(db_cursor, f"{table_prefix}{dataset}_{ARCHIVES_TABLE_SUFFIX}")
-    create_column_metadata_table(
-        db_cursor, f"{table_prefix}{dataset}_{COLUMN_METADATA_TABLE_SUFFIX}"
-    )
-    db_conn.commit()
+    try:
+        # Validate dataset name (add appropriate validation logic)
+        if not re.match(r'^[a-zA-Z0-9_-]+$', dataset):
+            logger.error(f"Invalid dataset name: {dataset}. Using 'default' instead.")
+            dataset = "default"
+        
+        # Use parameter binding to prevent SQL injection
+        query = f"INSERT INTO {table_prefix}{DATASETS_TABLE_SUFFIX} (name) VALUES (%s)"
+        db_cursor.execute(query, (dataset,))
+        
+        # Create dataset-specific tables
+        create_archives_table(db_cursor, f"{table_prefix}{dataset}_{ARCHIVES_TABLE_SUFFIX}")
+        create_column_metadata_table(
+            db_cursor, f"{table_prefix}{dataset}_{COLUMN_METADATA_TABLE_SUFFIX}"
+        )
+        db_conn.commit()
+        logger.info(f"Created tables for new dataset: {dataset}")
+    except Exception as ex:
+        logger.error(f"Failed to create tables for dataset {dataset}: {ex}")
+        db_conn.rollback()

Don't forget to add the required import at the top of the file:

import re

Contributor Author

Added a TODO for checking that dataset names are valid for SQL tables.

Contributor

@Bill-hbrhbr Thank you for adding the TODO to check that dataset names are valid for SQL tables. That's a good step toward addressing the issue.

While the TODO will remind you to implement validation in the future, you might also want to consider addressing the SQL injection vulnerability sooner rather than later by using parameterized queries. This is a security best practice that prevents attackers from executing malicious SQL commands through user input.

A simple fix would be to change:

query = f'INSERT INTO {table_prefix}{DATASETS_TABLE_SUFFIX} (name) VALUES ("{dataset}")'
db_cursor.execute(query)

To:

query = f"INSERT INTO {table_prefix}{DATASETS_TABLE_SUFFIX} (name) VALUES (%s)"
db_cursor.execute(query, (dataset,))

Would you like me to show a more complete implementation that includes both validation and parameterized queries?


✏️ Learnings added
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.759Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


scheduled_jobs = {}

seen_datasets = set()

@coderabbitai coderabbitai bot Apr 17, 2025


🛠️ Refactor suggestion

Consider using database checks instead of in-memory tracking

The seen_datasets global set will be lost if the process restarts, potentially causing redundant table creation attempts for datasets that were already processed.

Consider querying the database to check if a dataset exists instead of relying on in-memory tracking:

-seen_datasets = set()

Then modify the dataset check logic to query the database instead:

-if StorageEngine.CLP_S == clp_storage_engine and dataset not in seen_datasets:
-    seen_datasets.add(dataset)
+if StorageEngine.CLP_S == clp_storage_engine:
+    # Check if dataset exists in the database
+    query = f"SELECT name FROM {table_prefix}{DATASETS_TABLE_SUFFIX} WHERE name = %s"
+    db_cursor.execute(query, (dataset,))
+    if not db_cursor.fetchone():

Committable suggestion skipped: line range outside the PR's diff.

Contributor Author


We've addressed the issue by updating the in-memory cache upon startup.

Contributor


@Bill-hbrhbr That's a great approach! Initializing the cache at startup gives you the performance benefits of in-memory checks while ensuring you don't lose track of datasets between restarts. Thanks for explaining how you addressed it.

🐰👍


✏️ Learnings added
Learnt from: Bill-hbrhbr
PR: y-scope/clp#831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:06.632Z
Learning: In the compression scheduler, the team prefers initializing in-memory caches from the database at startup rather than performing repeated database queries for efficiency reasons. This approach maintains both performance and reliability across process restarts.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
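
A minimal sketch of this startup-seeding pattern, assuming a tuple-returning cursor and the PR's table-naming conventions (not the merged code):

from __future__ import annotations


def load_existing_datasets(db_cursor, table_prefix: str) -> set[str]:
    # Seed the scheduler's in-memory cache once, at startup, so process
    # restarts don't retrigger table creation for known datasets.
    db_cursor.execute(f"SELECT name FROM `{table_prefix}datasets`")
    return {row[0] for row in db_cursor.fetchall()}


# At scheduler startup:
# seen_datasets = load_existing_datasets(db_cursor, table_prefix)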

@kirkrodrigues
Member

Why are we exposing the --dataset option to the user? I thought we discussed during our meeting yesterday that we don't want to do that until the complete feature is ready, right?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ffcf92 and 9d9d0d6.

📒 Files selected for processing (3)
  • components/clp-py-utils/clp_py_utils/clp_config.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/clp-py-utils/clp_py_utils/clp_config.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • Database (80-159)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (38-44)
components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (4)
  • create_archives_table (1-18)
  • create_tags_table (21-31)
  • create_archive_tags_table (34-45)
  • create_files_table (48-66)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build (macos-latest)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (1)

105-123: 🛠️ Refactor suggestion

create_metadata_db_tables never records the dataset in clp_datasets

The PR description emphasises the new clp_datasets registry table, yet this helper never invokes create_datasets_table. Consequently, compression jobs that rely solely on this helper will create dataset‑specific tables without inserting the dataset into the global registry, leaving the schema in an inconsistent state.

Consider:

@@
     db_cursor, table_prefix: str, dataset: typing.Optional[str] = None
 ) -> None:
     if dataset is not None:
         table_prefix = f"{table_prefix}{dataset}_"
         create_column_metadata_table(db_cursor, f"{table_prefix}{COLUMN_METADATA_TABLE_SUFFIX}")
+        # Ensure the dataset itself is recorded
+        create_datasets_table(db_cursor, f"{table_prefix}{DATASETS_TABLE_SUFFIX}")

(N.B. you will also need to import DATASETS_TABLE_SUFFIX alongside the earlier import fix.)

This change keeps the registry in sync with the per‑dataset tables.

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (1)

60-62: ⚠️ Potential issue

Still using the wrong cursor variable – will raise NameError

The earlier review already pointed this out, but the fix has not been applied.
metadata_db_cursor is undefined; use the function argument db_cursor.

-    metadata_db_cursor.execute(
+    db_cursor.execute(
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01618f8 and 45258ce.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/sql_table_schema_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: lint-check (ubuntu-latest)

@Bill-hbrhbr Bill-hbrhbr changed the title refactor(clp-py-utils): Move CLP metadata DB table-creation logic into utility module; Add creation function for the datasets table. refactor(clp-py-utils): Move CLP metadata DB table-creation logic into utility module; Conditionally create the datasets table when using clp-s. Apr 22, 2025

@kirkrodrigues kirkrodrigues left a comment


A few more suggestions.

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1)

114-143: 🛠️ Refactor suggestion

Consider adding error handling to the table creation process

The function is well-documented and implements the functionality as described in the PR objectives. However, there's no error handling for database operations which could lead to incomplete table creation if an operation fails.

Consider adding try-except blocks to handle potential database errors:

def create_metadata_db_tables(db_cursor, table_prefix: str, dataset: str | None = None) -> None:
    """
    Creates the standard set of tables for CLP's metadata.

    :param db_cursor: The database cursor to execute the table creations.
    :param table_prefix: A string to prepend to all table names.
    :param dataset: If set, all tables will be named in a dataset-specific manner.
    """
    if dataset is not None:
        table_prefix = f"{table_prefix}{dataset}_"

    archives_table_name = f"{table_prefix}{ARCHIVES_TABLE_SUFFIX}"
    tags_table_name = f"{table_prefix}{TAGS_TABLE_SUFFIX}"
    archive_tags_table_name = f"{table_prefix}{ARCHIVE_TAGS_TABLE_SUFFIX}"

    # TODO: Update this to
    # {table_prefix}{CLP_DEFAULT_DATASET_NAME}_{COLUMN_METADATA_TABLE_SUFFIX} when we can also
    # change the indexer to match.
    column_metadata_table_name = (
        f"{table_prefix}{COLUMN_METADATA_TABLE_SUFFIX}_{CLP_DEFAULT_DATASET_NAME}"
    )

+    try:
        _create_archives_table(db_cursor, archives_table_name)
        _create_tags_table(db_cursor, tags_table_name)
        _create_archive_tags_table(
            db_cursor, archive_tags_table_name, archives_table_name, tags_table_name
        )
        _create_files_table(db_cursor, table_prefix)
        _create_column_metadata_table(db_cursor, column_metadata_table_name)
+    except Exception as e:
+        # Log the error or handle it appropriately
+        raise Exception(f"Error creating metadata tables: {str(e)}") from e
🧹 Nitpick comments (5)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (5)

14-32: Add docstring to helper function

The _create_archives_table function lacks documentation. Consider adding a docstring to explain its purpose, parameters, and return value.

def _create_archives_table(db_cursor, archives_table_name: str) -> None:
+    """
+    Creates a table for storing archive metadata.
+
+    :param db_cursor: The database cursor to execute the table creation.
+    :param archives_table_name: The name of the table to create.
+    """
    db_cursor.execute(
        f"""

34-45: Add docstring to helper function

The _create_tags_table function lacks documentation. Consider adding a docstring to explain its purpose, parameters, and return value.

def _create_tags_table(db_cursor, tags_table_name: str) -> None:
+    """
+    Creates a table for storing tag information.
+
+    :param db_cursor: The database cursor to execute the table creation.
+    :param tags_table_name: The name of the table to create.
+    """
    db_cursor.execute(
        f"""

47-61: Add docstring to helper function

The _create_archive_tags_table function lacks documentation. Consider adding a docstring to explain its purpose, parameters, and return value.

def _create_archive_tags_table(
    db_cursor, archive_tags_table_name: str, archives_table_name: str, tags_table_name: str
) -> None:
+    """
+    Creates a table for storing relationships between archives and tags.
+
+    :param db_cursor: The database cursor to execute the table creation.
+    :param archive_tags_table_name: The name of the relation table to create.
+    :param archives_table_name: The name of the archives table to reference.
+    :param tags_table_name: The name of the tags table to reference.
+    """
    db_cursor.execute(
        f"""

63-82: Add docstring and standardize parameters

The _create_files_table function lacks documentation and uses table_prefix instead of a full table name like the other helper functions, which may cause confusion.

def _create_files_table(db_cursor, table_prefix: str) -> None:
+    """
+    Creates a table for storing file metadata.
+
+    :param db_cursor: The database cursor to execute the table creation.
+    :param table_prefix: A string to prepend to the files table suffix.
+    """
    db_cursor.execute(
        f"""

For consistency with other helper methods, consider refactoring to accept the full table name:

-def _create_files_table(db_cursor, table_prefix: str) -> None:
+def _create_files_table(db_cursor, files_table_name: str) -> None:
    """
    Creates a table for storing file metadata.

    :param db_cursor: The database cursor to execute the table creation.
-    :param table_prefix: A string to prepend to the files table suffix.
+    :param files_table_name: The name of the table to create.
    """
    db_cursor.execute(
        f"""
-        CREATE TABLE IF NOT EXISTS `{table_prefix}{FILES_TABLE_SUFFIX}` (
+        CREATE TABLE IF NOT EXISTS `{files_table_name}` (

Then update the call in create_metadata_db_tables:

-    _create_files_table(db_cursor, table_prefix)
+    files_table_name = f"{table_prefix}{FILES_TABLE_SUFFIX}"
+    _create_files_table(db_cursor, files_table_name)

84-94: Add docstring to helper function

The _create_column_metadata_table function lacks documentation. Consider adding a docstring to explain its purpose, parameters, and return value.

def _create_column_metadata_table(db_cursor, table_name: str) -> None:
+    """
+    Creates a table for storing column metadata.
+
+    :param db_cursor: The database cursor to execute the table creation.
+    :param table_name: The name of the table to create.
+    """
    db_cursor.execute(
        f"""
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc5ee6e and 9a82570.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (3)

1-11: Excellent use of future imports!

The inclusion of from __future__ import annotations is perfect for supporting the Python 3.8 compatibility with the | syntax in type hints. This is aligned with a past review comment and ensures backward compatibility.


96-112: Well-documented function implementation

The create_datasets_table function is well-documented and properly implements the functionality described in the PR objectives for creating the datasets table.


129-134: Ensure the TODO comment is tracked

The TODO comment indicates future work needed on the column metadata table naming. Make sure this is tracked in your issue tracking system to avoid it being forgotten.

@kirkrodrigues kirkrodrigues changed the title refactor(clp-py-utils): Move CLP metadata DB table-creation logic into utility module; Conditionally create the datasets table when using clp-s. refactor(clp-py-utils): Move CLP metadata DB table-creation logic into utility module; Conditionally create the datasets table when using the CLP_S storage engine. Apr 23, 2025
@Bill-hbrhbr Bill-hbrhbr merged commit 6300fa5 into y-scope:main Apr 23, 2025
7 checks passed
@Bill-hbrhbr Bill-hbrhbr deleted the add-dataset-tables branch April 23, 2025 20:00
anlowee pushed a commit to anlowee/clp that referenced this pull request Apr 25, 2025
…o utility module; Conditionally create the datasets table when using the `CLP_S` storage engine. (y-scope#831)

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>