
INTEGRITY: Continuing set.dat's processing #32


Open · wants to merge 17 commits into base: integrity from integrity_gsoc_2025_2

Conversation

ShivangNagta

No description provided.

@ShivangNagta ShivangNagta changed the title from "Continuing set.dat's processing" to "INTEGRITY: Continuing set.dat's processing" on Jul 4, 2025
@ShivangNagta ShivangNagta force-pushed the integrity_gsoc_2025_2 branch from 93d01e2 to 1fee585 on July 7, 2025 17:42
@ShivangNagta ShivangNagta force-pushed the integrity_gsoc_2025_2 branch from 094d152 to 898ffd0 on July 10, 2025 11:03

@rvanlaar rvanlaar left a comment


Hi, I have less and less feedback, good job.

Parameterize all your sql queries, and limit the line length :-)

@@ -73,9 +77,87 @@ def get_dirs_at_depth(directory, depth):
    if depth == num_sep_this - num_sep:
        yield root

def read_be_32(byte_stream):

def my_escape_string(s: str) -> str:


What's up with this function name?

Author


This name was given by the previous developer, most probably to avoid a conflict with pymysql's escape_string function (from pymysql.converters import escape_string). That import is not used in this particular file, though, so there should be no issue renaming it to escape_string to match the dumper companion's implementation.


If that is the case, a solution would be to alias the import with as. An example: from pymysql.converters import escape_string as pymysql_escape_string
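For illustration, a minimal sketch of that aliasing, assuming the file still wants its own escape_string helper:

```python
# Alias the pymysql helper so the local name stays free.
from pymysql.converters import escape_string as pymysql_escape_string


def escape_string(s: str) -> str:
    # Local wrapper matching the dumper companion's naming.
    return pymysql_escape_string(s)
```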

return False


def split_path_recursive(path):


Is path a string?
If so use: [i for i in path.split(os.sep) if i]
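A minimal sketch of that suggestion, assuming path is an ordinary string using the OS separator:

```python
import os


def split_path_recursive(path: str) -> list[str]:
    # Split on the OS separator and drop empty components,
    # e.g. "games/demo/readme.txt" -> ["games", "demo", "readme.txt"] on POSIX.
    return [part for part in path.split(os.sep) if part]
```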


def appledouble_get_resfork_data(file_byte_stream):
""" Returns the resource fork's data section as bytes of an appledouble file as well as its size """
""" Returns the resource fork's data section as bytes, size of resource fork (size-r) and size of data section of resource fork (size-rd) of an appledouble file"""


Did you run the formatter? This line seems long.

Author


There were some components in this file, like CRC16_XMODEM_TABLE, that were getting messed up after running the formatter. So after talking with sev, we decided not to run the formatter on compute_hash.py and clear.py.


data = f"name \"{filename}\" size {filesize}"
for filename, (hashes, size, size_r, size_rd, timestamp) in hash_of_dir.items():
filename = encode_path_components(filename)
data = f"name \"{filename}\" size {size} size-r {size_r} size-rd {size_rd} timestamp {timestamp}"


Use triple quotes around the string; then it doesn't need escaping inside.
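A sketch of the lines above with triple quotes, so the inner double quotes need no backslashes:

```python
for filename, (hashes, size, size_r, size_rd, timestamp) in hash_of_dir.items():
    filename = encode_path_components(filename)
    data = f"""name "{filename}" size {size} size-r {size_r} size-rd {size_rd} timestamp {timestamp}"""
```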

@@ -144,7 +163,7 @@ def insert_fileset(
return (existing_entry, True)

# $game and $key should not be parsed as a mysql string, hence no quotes
query = f"INSERT INTO fileset (game, status, src, `key`, megakey, `timestamp`) VALUES ({game}, '{status}', '{src}', {key}, {megakey}, FROM_UNIXTIME(@fileset_time_last))"
query = f"INSERT INTO fileset (game, status, src, `key`, megakey, `timestamp`, set_dat_metadata) VALUES ({game}, '{status}', '{src}', {key}, {megakey}, FROM_UNIXTIME(@fileset_time_last), '{escape_string(set_dat_metadata)}')"


As I said before, use parameterized queries and don't let the lines get too long.
Where's the formatter?

Author


Yes, I do use parameterized queries whenever I write a new one. I'll definitely fix the older ones one day. Though I did run the formatter on this file.
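For illustration, a sketch of how the INSERT above could look parameterized, assuming game and key can be passed as plain values (pymysql maps None to NULL), which would also remove the need for escape_string() on set_dat_metadata:

```python
query = """
    INSERT INTO fileset
        (game, status, src, `key`, megakey, `timestamp`, set_dat_metadata)
    VALUES
        (%s, %s, %s, %s, %s, FROM_UNIXTIME(@fileset_time_last), %s)
"""
cursor.execute(query, (game, status, src, key, megakey, set_dat_metadata))
```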

db_functions.py Outdated
cursor.execute(
"SELECT status FROM fileset WHERE id = %s", (matched_fileset_id,)
)
status = cursor.fetchone()["status"]
if status == "detection":
update_fileset_status(cursor, matched_fileset_id, "partial")
update_fileset_status(cursor, matched_fileset_id, "parital")


Should this be "partial", not "parital"?

db_functions.py Outdated
AND f.name = %s
AND f.size = %s
"""
cursor.execute(query, (parent_fileset, file["name"], file["size"]))


Since you're only checking if there is a result, you could also do a COUNT() in the query.

Member


Would it be more performant to use LIMIT 1 instead of COUNT?
This would make the SQL engine stop as soon as a result is found.
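For illustration, a sketch of the existence check with LIMIT 1 (the leading fileset condition is assumed; the diff only shows the trailing ones):

```python
query = """
    SELECT 1
    FROM file f
    WHERE f.fileset = %s
      AND f.name = %s
      AND f.size = %s
    LIMIT 1
"""
cursor.execute(query, (parent_fileset, file["name"], file["size"]))
already_present = cursor.fetchone() is not None
```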

return "".join(random.choices(string.ascii_letters + string.digits, k=length))

cursor.execute("ALTER TABLE file ADD COLUMN detection_type VARCHAR(20);")
except Exception:


How come this is a check for a generic exception and not a specific one?

Author


You are right, I should catch a more specific exception related to the existing column. Or I can directly check if the column exists instead of relying on error handling.
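For illustration, a sketch of that direct check against information_schema, assuming the same dictionary-style cursor used elsewhere in the file:

```python
cursor.execute(
    """
    SELECT COUNT(*) AS cnt
    FROM information_schema.COLUMNS
    WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME = 'file'
      AND COLUMN_NAME = 'detection_type'
    """
)
if cursor.fetchone()["cnt"] == 0:
    cursor.execute("ALTER TABLE file ADD COLUMN detection_type VARCHAR(20);")
```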

@ShivangNagta
Author

Hi, I have less and less feedback, good job.

Parameterize all your sql queries, and limit the line length :-)

Thank you, and yes, some of the older queries are still left to be parameterized.
