@swest50 swest50 commented Jan 28, 2026

Reworks the Python-side database interface to use the SQLModel ORM.

This should make the datasets easier to use without writing, reading, or parsing raw SQL commands, both for us when creating new datasets and for users of the open-source datasets who are unfamiliar with SQL or would rather not use it.

The model definitions should also make it easier to see which tables exist and what data they contain, since the syntax is similar to a dataclass. For example:

class Package(SQLModel, table=True):
    id: Optional[int]
    package_name: str
    last_serial: int

    # Relationships
    imports: list["PackageImport"]
    files: list["PackageFile"]

class PackageImport(SQLModel, table=True):
    id: Optional[int]
    package_id: Optional[int]
    import_as: str

    # Relationships
    package: "Package"

class PackageFile(SQLModel, table=True):
    id: Optional[int]
    package_id: Optional[int]
    file_name: str
    normalized_file_name: str
    file_path: PurePosixPath
    mime_type: str
    magic_string: str

    # Relationships
    package: "Package"

Compare the model definitions above with reading the following SQL commands to work out what data exists, what type each field has, and how the tables relate, and with trying not to make mistakes when writing the commands in the first place:

            create_table_cmd = """
                CREATE TABLE
                IF NOT EXISTS packages(
                    id INTEGER PRIMARY KEY,
                    package_name TEXT,
                    last_serial INTEGER
                )
            """
            cursor.execute(create_table_cmd)

            create_table_cmd = """
                CREATE TABLE
                IF NOT EXISTS package_imports(
                    id INTEGER PRIMARY KEY,
                    package_id INTEGER,
                    import_as TEXT,
                    FOREIGN KEY (package_id) REFERENCES packages(id) ON DELETE CASCADE
                )
            """
            cursor.execute(create_table_cmd)
            create_index_cmd = """
                CREATE INDEX
                IF NOT EXISTS idx_package_id
                ON package_imports(package_id);
            """
            cursor.execute(create_index_cmd)
            create_index_cmd = """
                CREATE INDEX
                IF NOT EXISTS idx_import_as
                ON package_imports(import_as);
            """
            cursor.execute(create_index_cmd)

            create_view_cmd = """
                CREATE VIEW
                IF NOT EXISTS v_package_imports
                AS 
                    SELECT package_name, import_as
                    FROM packages
                    JOIN package_imports
                    ON packages.id = package_imports.package_id
            """
            cursor.execute(create_view_cmd)

            create_table_cmd = """
            CREATE TABLE
            IF NOT EXISTS package_files(
                id INTEGER PRIMARY KEY,
                package_id INTEGER,
                file_name TEXT,
                normalized_file_name TEXT,
                file_path TEXT,
                mime_type TEXT,
                magic_string TEXT,
                FOREIGN KEY (package_id) REFERENCES packages(id) ON DELETE CASCADE
            )
            """
            cursor.execute(create_table_cmd)

            create_view_cmd = """
                CREATE VIEW
                IF NOT EXISTS v_package_files
                AS 
                    SELECT package_name, normalized_file_name, file_name, file_path, mime_type, magic_string
                    FROM packages
                    JOIN package_files
                    ON packages.id = package_files.package_id
            """
            cursor.execute(create_view_cmd)

            create_table_cmd = """
                CREATE TABLE
                IF NOT EXISTS dataset_version(
                    version INTEGER PRIMARY KEY,
                    format TEXT,
                    timestamp INTEGER
                )
            """
            cursor.execute(create_table_cmd)
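
For reference, this is roughly what a raw-SQL consumer of that schema looks like. It is a sketch using only the standard-library `sqlite3` module and a trimmed copy of the statements above; the helper names `build_demo_db` and `imports_for` and the sample row are made up for illustration.

```python
import sqlite3

# Trimmed copy of the schema: two tables and the view that joins them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS packages(
    id INTEGER PRIMARY KEY,
    package_name TEXT,
    last_serial INTEGER
);
CREATE TABLE IF NOT EXISTS package_imports(
    id INTEGER PRIMARY KEY,
    package_id INTEGER,
    import_as TEXT,
    FOREIGN KEY (package_id) REFERENCES packages(id) ON DELETE CASCADE
);
CREATE VIEW IF NOT EXISTS v_package_imports AS
    SELECT package_name, import_as
    FROM packages
    JOIN package_imports ON packages.id = package_imports.package_id;
"""


def build_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with one hypothetical package row."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces ON DELETE CASCADE when foreign keys are
    # enabled on the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO packages(package_name, last_serial) VALUES (?, ?)",
        ("example-pkg", 1),
    )
    conn.execute(
        "INSERT INTO package_imports(package_id, import_as) VALUES (?, ?)",
        (cur.lastrowid, "example_pkg"),
    )
    conn.commit()
    return conn


def imports_for(conn: sqlite3.Connection, package_name: str) -> list[str]:
    """Return the import names recorded for a package, read through the view."""
    rows = conn.execute(
        "SELECT import_as FROM v_package_imports "
        "WHERE package_name = ? ORDER BY import_as",
        (package_name,),
    ).fetchall()
    return [row[0] for row in rows]
```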

The format of the database itself should be identical, which keeps it backwards compatible with everything we already have.
All of the old code also remains for now: the old database interface is located under databases, while the new interface/code is under databases_v2.
If we prefer, we can later deprecate the old version and replace it with this one, but for now both exist.

Reworked the database creation scripts to use the new databases_v2 interface. The databases they produce should remain identical, so they are still backwards compatible.
The previous database creation scripts have been moved to a directory labeled "deprecated" for now. They can be removed now if desired, or at a later date once they are no longer needed.
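
One possible way to spot-check that the old and new scripts produce the same layout is to compare the table and view names and their column names in the two output files. The sketch below uses only the standard-library `sqlite3` module; the helper names are hypothetical, and it deliberately ignores declared column types, indexes, and constraints, so it is a coarse check rather than proof that the files are identical.

```python
import sqlite3


def table_columns(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Map each user table or view name to its ordered list of column names."""
    names = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    layout = {}
    for (name,) in names:
        if name.startswith("sqlite_"):
            continue  # skip SQLite's internal tables
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        layout[name] = [col[1] for col in cols]
    return layout


def layout_diff(old: sqlite3.Connection, new: sqlite3.Connection) -> list[str]:
    """Return the tables/views that are missing on one side or whose columns differ."""
    a, b = table_columns(old), table_columns(new)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `layout_diff(sqlite3.connect(old_path), sqlite3.connect(new_path))` means both files expose the same tables, views, and column names, where `old_path` and `new_path` stand for the outputs of the deprecated and reworked scripts.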

Future TODO: Apply the same refactoring to the NuGet .NET DB script, as it is currently the only create_* script that has not been reworked.

Added a script to generate the MinGW database.

@swest50 swest50 requested a review from nightlark January 28, 2026 01:21