
INTEGRITY: Continuing set.dat's processing #32


Open · wants to merge 17 commits into base: integrity from integrity_gsoc_2025_2

Conversation

ShivangNagta

No description provided.

@ShivangNagta ShivangNagta changed the title from "Continuing set.dat's processing" to "INTEGRITY: Continuing set.dat's processing" on Jul 4, 2025
@ShivangNagta ShivangNagta force-pushed the integrity_gsoc_2025_2 branch from 93d01e2 to 1fee585 on July 7, 2025 17:42
@ShivangNagta ShivangNagta force-pushed the integrity_gsoc_2025_2 branch from 094d152 to 898ffd0 on July 10, 2025 11:03

@rvanlaar rvanlaar left a comment


Hi, I have less and less feedback, good job.

Parameterize all your sql queries, and limit the line length :-)

@@ -73,9 +77,87 @@ def get_dirs_at_depth(directory, depth):
    if depth == num_sep_this - num_sep:
        yield root

def read_be_32(byte_stream):

def my_escape_string(s: str) -> str:


What's up with this function name?

Author


This name was given by the previous developer, most probably to avoid a conflict with pymysql's escape_string function (from pymysql.converters import escape_string). That import is not used in this particular file, though, so there should be no issue renaming it to escape_string to match the dumper companion's implementation.


If that is the case, a solution would be to alias the import with as. An example: from pymysql.converters import escape_string as pymysql_escape_string
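For illustration, a minimal sketch of that aliasing, assuming the file still wants its own escape_string helper:

```python
# Alias the pymysql helper so the local name stays free.
from pymysql.converters import escape_string as pymysql_escape_string


def escape_string(s: str) -> str:
    # Local wrapper matching the dumper companion's naming.
    return pymysql_escape_string(s)
```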

return False


def split_path_recursive(path):


Is path a string?
If so use: [i for i in path.split(os.sep) if i]
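A minimal sketch of that suggestion, assuming path is an ordinary string using the OS separator:

```python
import os


def split_path_recursive(path: str) -> list[str]:
    # Split on the OS separator and drop empty components,
    # e.g. "games/demo/readme.txt" -> ["games", "demo", "readme.txt"] on POSIX.
    return [part for part in path.split(os.sep) if part]
```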


def appledouble_get_resfork_data(file_byte_stream):
""" Returns the resource fork's data section as bytes of an appledouble file as well as its size """
""" Returns the resource fork's data section as bytes, size of resource fork (size-r) and size of data section of resource fork (size-rd) of an appledouble file"""


Did you run the formatter? This line seems long.

Author


There were some components in this file, like CRC16_XMODEM_TABLE, that were getting messed up after running the formatter. So after talking with sev, we decided not to run the formatter on compute_hash.py and clear.py.


data = f"name \"{filename}\" size {filesize}"
for filename, (hashes, size, size_r, size_rd, timestamp) in hash_of_dir.items():
filename = encode_path_components(filename)
data = f"name \"{filename}\" size {size} size-r {size_r} size-rd {size_rd} timestamp {timestamp}"


Use triple quotes around the string; then it doesn't need escaping inside.
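A sketch of the lines above with triple quotes, so the inner double quotes need no backslashes:

```python
for filename, (hashes, size, size_r, size_rd, timestamp) in hash_of_dir.items():
    filename = encode_path_components(filename)
    data = f"""name "{filename}" size {size} size-r {size_r} size-rd {size_rd} timestamp {timestamp}"""
```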

@@ -144,7 +163,7 @@ def insert_fileset(
return (existing_entry, True)

# $game and $key should not be parsed as a mysql string, hence no quotes
query = f"INSERT INTO fileset (game, status, src, `key`, megakey, `timestamp`) VALUES ({game}, '{status}', '{src}', {key}, {megakey}, FROM_UNIXTIME(@fileset_time_last))"
query = f"INSERT INTO fileset (game, status, src, `key`, megakey, `timestamp`, set_dat_metadata) VALUES ({game}, '{status}', '{src}', {key}, {megakey}, FROM_UNIXTIME(@fileset_time_last), '{escape_string(set_dat_metadata)}')"


As I said before, use parameterized queries and don't let the lines get too long.
Where's the formatter?

Author


Yes, I do use parameterized queries whenever I write a new one. I'll definitely fix the older ones one day. Though I did run the formatter on this file.
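For illustration, a sketch of how the INSERT above could look parameterized, assuming game and key can be passed as plain values (pymysql maps None to NULL), which would also remove the need for escape_string() on set_dat_metadata:

```python
query = """
    INSERT INTO fileset
        (game, status, src, `key`, megakey, `timestamp`, set_dat_metadata)
    VALUES
        (%s, %s, %s, %s, %s, FROM_UNIXTIME(@fileset_time_last), %s)
"""
cursor.execute(query, (game, status, src, key, megakey, set_dat_metadata))
```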

db_functions.py Outdated
cursor.execute(
"SELECT status FROM fileset WHERE id = %s", (matched_fileset_id,)
)
status = cursor.fetchone()["status"]
if status == "detection":
update_fileset_status(cursor, matched_fileset_id, "partial")
update_fileset_status(cursor, matched_fileset_id, "parital")


Should this be "partial", not "parital"?

db_functions.py Outdated
AND f.name = %s
AND f.size = %s
"""
cursor.execute(query, (parent_fileset, file["name"], file["size"]))


Since you're only checking if there is a result, you could also do a COUNT() in the query.

Member


Would it be more performant to use LIMIT 1 instead of COUNT?
This would make the SQL engine stop as soon as a result is found.
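For illustration, a sketch of the existence check with LIMIT 1 (the leading fileset condition is assumed; the diff only shows the trailing ones):

```python
query = """
    SELECT 1
    FROM file f
    WHERE f.fileset = %s
      AND f.name = %s
      AND f.size = %s
    LIMIT 1
"""
cursor.execute(query, (parent_fileset, file["name"], file["size"]))
already_present = cursor.fetchone() is not None
```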

return "".join(random.choices(string.ascii_letters + string.digits, k=length))

cursor.execute("ALTER TABLE file ADD COLUMN detection_type VARCHAR(20);")
except Exception:


How come this is a check for a generic exception and not a specific one?

Author


You are right, I should catch a more specific exception related to the existing column. Or I can directly check if the column exists instead of relying on error handling.
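For illustration, a sketch of that direct check against information_schema, assuming the same dictionary-style cursor used elsewhere in the file:

```python
cursor.execute(
    """
    SELECT COUNT(*) AS cnt
    FROM information_schema.COLUMNS
    WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME = 'file'
      AND COLUMN_NAME = 'detection_type'
    """
)
if cursor.fetchone()["cnt"] == 0:
    cursor.execute("ALTER TABLE file ADD COLUMN detection_type VARCHAR(20);")
```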

@ShivangNagta
Author

Hi, I have less and less feedback, good job.

Parameterize all your sql queries, and limit the line length :-)

Thank you, and yes, some of the older queries are still left to be parameterized.
