Update database.py so that the sample database will use bigint for file_size #2603

Open
wants to merge 1 commit into master

Conversation

markykchua

Update database.py so that the sample database will use bigint for file_size.

Users upgrading CAPEv2 with an existing database need to alter the database schema before the upgrade:

ALTER TABLE samples
ALTER COLUMN file_size TYPE BIGINT;

And restart the related services.
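
To confirm that the column now uses BIGINT, an optional PostgreSQL sanity check (not part of this change) is:

SELECT data_type
FROM information_schema.columns
WHERE table_name = 'samples' AND column_name = 'file_size';
-- expected result: bigint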

Contributor

gemini-code-assist bot left a comment


Hello @markykchua, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.

This pull request aims to address a potential limitation in the database schema by updating the data type used for storing file sizes. Currently, the file_size column in the samples table uses an Integer type, which might not be sufficient for very large files. This change updates the column type to BigInteger to accommodate larger values, preventing potential overflow issues.

For users who are upgrading CAPEv2 and have an existing database, a manual schema alteration step is required before applying the upgrade. The necessary SQL command (ALTER TABLE samples ALTER COLUMN file_size TYPE BIGINT;) is provided in the pull request description.

Highlights

  • Database Schema Update: The primary change is updating the file_size column in the samples database table from Integer to BigInteger to support larger file sizes.
  • Manual Migration Required: Users with existing CAPEv2 databases will need to manually run an ALTER TABLE SQL command to update their schema before applying this change.

Changelog

  • lib/cuckoo/core/database.py
    • Added BigInteger to the list of imported SQLAlchemy types.
    • Changed the data type of the file_size column in the Sample model from Integer() to BigInteger() (see the sketch after this list).
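
A minimal sketch of the resulting model, assuming SQLAlchemy 1.4+ for the declarative_base import (only the BigInteger import and the file_size column come from this PR; the rest is illustrative):

from sqlalchemy import BigInteger, Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Sample(Base):
    __tablename__ = "samples"

    id = Column(Integer(), primary_key=True)
    # BigInteger avoids overflow for file sizes above 2**31 - 1 bytes.
    file_size = Column(BigInteger(), nullable=False)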
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Size of file grows,
Integer just won't quite fit,
BigInt saves the day.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

gemini-code-assist bot left a comment


Code Review

The pull request correctly updates the file_size column in the Sample model to use BigInteger, which is essential for supporting larger files. The core code change is good.

My main concern, detailed in the review comment, revolves around the database schema migration strategy. The project appears to use Alembic for migrations, but this PR relies on a manual ALTER TABLE command provided in the PR description. This approach can be risky and lead to inconsistencies if not handled carefully by all users. An Alembic migration script would be a more robust and standardized way to manage this schema change.

The PR description helpfully includes the ALTER TABLE command for PostgreSQL. If a manual DDL remains the chosen path (despite the preference for Alembic), it would be beneficial to either provide equivalent commands for other supported databases (like MySQL, SQLite) or advise users to consult their database's documentation for the correct syntax.
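
For example, the MySQL/MariaDB equivalent of the PostgreSQL command would be roughly the following (a sketch, not taken from the PR); SQLite does not support changing a column type via ALTER TABLE, so SQLite users would need a table rebuild or a migration tool:

ALTER TABLE samples
MODIFY COLUMN file_size BIGINT NOT NULL;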

Summary of Findings

  • Database Schema Migration Strategy: The PR modifies the database schema by changing the file_size column type. However, it relies on a manual ALTER TABLE command instead of an Alembic migration script, which is the established tool in this project. This can lead to inconsistencies and runtime errors if users miss the manual step. An Alembic migration is strongly recommended.
  • SQLAlchemy Import Order: In lib/cuckoo/core/database.py, the BigInteger import could be placed in alphabetical order within its group (e.g., before Boolean). This is a minor stylistic point and was not commented on directly due to review settings. (Severity: low)

Merge Readiness

This pull request addresses an important issue by allowing larger file sizes. However, due to the high severity concern regarding the database migration strategy (the lack of an Alembic migration script), I recommend that these changes not be merged until this is addressed. Using Alembic for schema changes will ensure better consistency, safety, and maintainability for users upgrading their Cuckoo instances. As a reviewer, I am not authorized to approve pull requests, so please ensure further review and approval from maintainers after addressing the feedback.

@@ -307,7 +308,7 @@ class Sample(Base):
     __tablename__ = "samples"

     id = Column(Integer(), primary_key=True)
-    file_size = Column(Integer(), nullable=False)
+    file_size = Column(BigInteger(), nullable=False)
Contributor


Severity: high

The change from Integer to BigInteger for file_size is correct for supporting larger files. However, this is a database schema modification, and the project uses Alembic for managing schema migrations (as evidenced by the AlembicVersion table and existing migration scripts).

Could you clarify why an Alembic migration script was not included for this change? Relying on a manual ALTER TABLE command (as mentioned in the PR description) has a few potential drawbacks:

  1. Risk of Missed Steps: Users might upgrade the Cuckoo codebase but forget to apply the manual DDL, leading to runtime errors when encountering large files that don't fit the old INTEGER column type.
  2. Inconsistency: Alembic provides a standardized, version-controlled way to apply schema changes. Manual DDLs bypass this.
  3. SCHEMA_VERSION Management: The SCHEMA_VERSION constant in this file is typically updated with Alembic revisions. Without an Alembic migration and a corresponding SCHEMA_VERSION bump, the Cuckoo startup check (if last.version_num != SCHEMA_VERSION) won't be able to verify if this specific schema change has been applied.
  4. Database Portability: The ALTER TABLE command in the PR description is specific to PostgreSQL. Alembic handles generating the correct DDL for different database backends.

It's generally recommended to use Alembic to generate a new migration script for schema changes like this. This script would contain operations like op.alter_column('samples', 'file_size', type_=sa.BigInteger(), existing_type=sa.Integer()).

If there's a specific reason Alembic can't be used here (e.g., performance concerns with ALTER COLUMN TYPE on very large tables), could that be elaborated upon? In such a scenario, bumping SCHEMA_VERSION and instructing users to manually update their alembic_version table after the DDL would be a safer alternative to ensure consistency.
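
For reference, a minimal sketch of such an Alembic migration script (the revision identifiers below are placeholders, not values from the CAPEv2 migration history):

from alembic import op
import sqlalchemy as sa

# Placeholder identifiers; a real script would use the IDs Alembic generates.
revision = "replace_with_generated_id"
down_revision = "replace_with_previous_revision_id"
branch_labels = None
depends_on = None


def upgrade():
    # Widen samples.file_size so sizes above 2**31 - 1 bytes fit.
    op.alter_column(
        "samples",
        "file_size",
        type_=sa.BigInteger(),
        existing_type=sa.Integer(),
        existing_nullable=False,
    )


def downgrade():
    # Revert to Integer; larger values would no longer fit.
    op.alter_column(
        "samples",
        "file_size",
        type_=sa.Integer(),
        existing_type=sa.BigInteger(),
        existing_nullable=False,
    )

A migration along these lines would also go together with the SCHEMA_VERSION bump mentioned above, so the startup check can detect whether the change has been applied.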


@doomedraven
Collaborator

Hello, thank you. The fix is partial; we need an Alembic migration to properly make users aware of it, since we can't just hope that they will check the README. I will implement it once I'm back in June.

@markykchua
Author

Hello, thank you. The fix is partial; we need an Alembic migration to properly make users aware of it, since we can't just hope that they will check the README. I will implement it once I'm back in June.

perfect, thanks a lot!
