Update database.py so that the sample database will use bigint for file_size #2603

Open
wants to merge 1 commit into master

Conversation

markykchua

Update database.py so that the sample database will use bigint for file_size.

Users upgrading CAPEv2 with an existing database need to alter the database schema before the upgrade:

ALTER TABLE samples
ALTER COLUMN file_size TYPE BIGINT;

And restart the related services.
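
To confirm that the column now uses BIGINT, an optional PostgreSQL sanity check (not part of this change) is:

SELECT data_type
FROM information_schema.columns
WHERE table_name = 'samples' AND column_name = 'file_size';
-- expected result: bigint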

Contributor

gemini-code-assist bot left a comment


Hello @markykchua, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.

This pull request aims to address a potential limitation in the database schema by updating the data type used for storing file sizes. Currently, the file_size column in the samples table uses an Integer type, which might not be sufficient for very large files. This change updates the column type to BigInteger to accommodate larger values, preventing potential overflow issues.

For users who are upgrading CAPEv2 and have an existing database, a manual schema alteration step is required before applying the upgrade. The necessary SQL command (ALTER TABLE samples ALTER COLUMN file_size TYPE BIGINT;) is provided in the pull request description.

Highlights

  • Database Schema Update: The primary change is updating the file_size column in the samples database table from Integer to BigInteger to support larger file sizes.
  • Manual Migration Required: Users with existing CAPEv2 databases will need to manually run an ALTER TABLE SQL command to update their schema before applying this change.

Changelog

  • lib/cuckoo/core/database.py
    • Added BigInteger to the list of imported SQLAlchemy types.
    • Changed the data type of the file_size column in the Sample model from Integer() to BigInteger() (see the sketch after this list).
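
A minimal sketch of the resulting model, assuming SQLAlchemy 1.4+ for the declarative_base import (only the BigInteger import and the file_size column come from this PR; the rest is illustrative):

from sqlalchemy import BigInteger, Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Sample(Base):
    __tablename__ = "samples"

    id = Column(Integer(), primary_key=True)
    # BigInteger avoids overflow for file sizes above 2**31 - 1 bytes.
    file_size = Column(BigInteger(), nullable=False)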
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Size of file grows,
Integer just won't quite fit,
BigInt saves the day.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

gemini-code-assist bot left a comment


Code Review

The pull request correctly updates the file_size column in the Sample model to use BigInteger, which is essential for supporting larger files. The core code change is good.

My main concern, detailed in the review comment, revolves around the database schema migration strategy. The project appears to use Alembic for migrations, but this PR relies on a manual ALTER TABLE command provided in the PR description. This approach can be risky and lead to inconsistencies if not handled carefully by all users. An Alembic migration script would be a more robust and standardized way to manage this schema change.

The PR description helpfully includes the ALTER TABLE command for PostgreSQL. If a manual DDL remains the chosen path (despite the preference for Alembic), it would be beneficial to either provide equivalent commands for other supported databases (like MySQL, SQLite) or advise users to consult their database's documentation for the correct syntax.
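
For example, the MySQL/MariaDB equivalent of the PostgreSQL command would be roughly the following (a sketch, not taken from the PR); SQLite does not support changing a column type via ALTER TABLE, so SQLite users would need a table rebuild or a migration tool:

ALTER TABLE samples
MODIFY COLUMN file_size BIGINT NOT NULL;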

Summary of Findings

  • Database Schema Migration Strategy: The PR modifies the database schema by changing the file_size column type. However, it relies on a manual ALTER TABLE command instead of an Alembic migration script, which is the established tool in this project. This can lead to inconsistencies and runtime errors if users miss the manual step. An Alembic migration is strongly recommended.
  • SQLAlchemy Import Order: In lib/cuckoo/core/database.py, the BigInteger import could be placed in alphabetical order within its group (e.g., before Boolean). This is a minor stylistic point and was not commented on directly due to review settings. (Severity: low)

Merge Readiness

This pull request addresses an important issue by allowing larger file sizes. However, due to the high severity concern regarding the database migration strategy (the lack of an Alembic migration script), I recommend that these changes not be merged until this is addressed. Using Alembic for schema changes will ensure better consistency, safety, and maintainability for users upgrading their Cuckoo instances. As a reviewer, I am not authorized to approve pull requests, so please ensure further review and approval from maintainers after addressing the feedback.

@@ -307,7 +308,7 @@ class Sample(Base):
     __tablename__ = "samples"

     id = Column(Integer(), primary_key=True)
-    file_size = Column(Integer(), nullable=False)
+    file_size = Column(BigInteger(), nullable=False)
Contributor


Severity: high

The change from Integer to BigInteger for file_size is correct for supporting larger files. However, this is a database schema modification, and the project uses Alembic for managing schema migrations (as evidenced by the AlembicVersion table and existing migration scripts).

Could you clarify why an Alembic migration script was not included for this change? Relying on a manual ALTER TABLE command (as mentioned in the PR description) has a few potential drawbacks:

  1. Risk of Missed Steps: Users might upgrade the Cuckoo codebase but forget to apply the manual DDL, leading to runtime errors when encountering large files that don't fit the old INTEGER column type.
  2. Inconsistency: Alembic provides a standardized, version-controlled way to apply schema changes. Manual DDLs bypass this.
  3. SCHEMA_VERSION Management: The SCHEMA_VERSION constant in this file is typically updated with Alembic revisions. Without an Alembic migration and a corresponding SCHEMA_VERSION bump, the Cuckoo startup check (if last.version_num != SCHEMA_VERSION) won't be able to verify if this specific schema change has been applied.
  4. Database Portability: The ALTER TABLE command in the PR description is specific to PostgreSQL. Alembic handles generating the correct DDL for different database backends.

It's generally recommended to use Alembic to generate a new migration script for schema changes like this. This script would contain operations like op.alter_column('samples', 'file_size', type_=sa.BigInteger(), existing_type=sa.Integer()).

If there's a specific reason Alembic can't be used here (e.g., performance concerns with ALTER COLUMN TYPE on very large tables), could that be elaborated upon? In such a scenario, bumping SCHEMA_VERSION and instructing users to manually update their alembic_version table after the DDL would be a safer alternative to ensure consistency.
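
For reference, a minimal sketch of such an Alembic migration script (the revision identifiers below are placeholders, not values from the CAPEv2 migration history):

from alembic import op
import sqlalchemy as sa

# Placeholder identifiers; a real script would use the IDs Alembic generates.
revision = "replace_with_generated_id"
down_revision = "replace_with_previous_revision_id"
branch_labels = None
depends_on = None


def upgrade():
    # Widen samples.file_size so sizes above 2**31 - 1 bytes fit.
    op.alter_column(
        "samples",
        "file_size",
        type_=sa.BigInteger(),
        existing_type=sa.Integer(),
        existing_nullable=False,
    )


def downgrade():
    # Revert to Integer; larger values would no longer fit.
    op.alter_column(
        "samples",
        "file_size",
        type_=sa.Integer(),
        existing_type=sa.BigInteger(),
        existing_nullable=False,
    )

A migration along these lines would also go together with the SCHEMA_VERSION bump mentioned above, so the startup check can detect whether the change has been applied.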


@doomedraven
Collaborator

Hello, thank you. The fix is partial; we need an Alembic migration to properly make users aware of it, since we can't just hope that they will check the README. I will implement it once I'm back in June.

@markykchua
Author

Hello, thank you. The fix is partial; we need an Alembic migration to properly make users aware of it, since we can't just hope that they will check the README. I will implement it once I'm back in June.

perfect, thanks a lot!
