Deduplicate files between runs. #756
Conversation
Force-pushed from 70be9a4 to 6ca63af.
Force-pushed from 6cfbe5c to 0030536.
add test cases
         return NeedFileResult(needed, f.id)

     @timeit
-    def addFileContent(self, fid, content, encoding):
+    def addFileContent(self, content_hash, content, encoding):
Do not trust the client to calculate the content_hash correctly. A malicious client could corrupt the whole file storage. Verify the content hash here and throw an exception if an invalid hash was sent.
I think it is redundant. We already trust the client in other places, and hashing is not cheap.
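For reference, a minimal sketch of the suggested server-side check (the encoding handling and the exception type are placeholders; the real handler would raise the Thrift shared.RequestFailed):

import base64
from hashlib import sha256


def addFileContent(self, content_hash, content, encoding):
    # Decode first so the hash covers the raw bytes; the encoding
    # handling here is an assumption, not taken from the PR.
    raw = base64.b64decode(content) if encoding == 'base64' else content

    # Recompute the hash server-side instead of trusting the client;
    # a forged content_hash could otherwise corrupt the shared storage.
    if sha256(raw).hexdigest() != content_hash:
        raise ValueError('Content hash mismatch for ' + content_hash)
    ...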
api/report_server.thrift (outdated)
@@ -363,12 +363,12 @@ service codeCheckerDBAccess {
       throws (1: shared.RequestFailed requestError),

   NeedFileResult needFileContent(
-    1: i64 run_id,
-    2: string filepath)
+    1: string filepath,
Add documentation to the API function. The intended behaviour, as sketched in code below:
- If a matching filepath, content_hash pair is found in the DB, return the corresponding, existing NeedFileResult.fileId with NeedFileResult.needed=false.
- If only the content_hash matches, a new NeedFileResult.fileId is returned and NeedFileResult.needed=false.
- If the filepath matches but the content_hash does not, a new file_id is generated and NeedFileResult.needed=true.
- If neither the content_hash nor the filepath matches, a new file_id is generated and NeedFileResult.needed=true in the response.
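A sketch of that behaviour (hypothetical session/model names, loosely following the SQLAlchemy models used elsewhere in this PR):

def needFileContent(self, filepath, content_hash):
    # Case 1: filepath and content_hash both match an existing record.
    f = session.query(File).filter(
        File.filepath == filepath,
        File.content_hash == content_hash).first()
    if f:
        return NeedFileResult(False, f.id)

    # Is the content itself already stored under this hash?
    content_known = session.query(FileContent).filter(
        FileContent.content_hash == content_hash).count() > 0

    # Remaining cases: register a new file record; the content only
    # has to be uploaded when the hash is unknown on the server.
    new_file = File(filepath, content_hash)
    session.add(new_file)
    session.commit()
    return NeedFileResult(not content_known, new_file.id)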
Force-pushed from 88fd70a to a981c4e.
There are some issues with the database layout and migration, and some minor nits.
On a small project, I can confirm this feature works well.
Thank you, this is a valuable blow to database sizes. 😉
def downgrade():
    pass
Let's put sys.exit(1) here just in case, to make this akin to the other migration scripts, with a note of # no downgrading supported.
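That would make the body (a sketch, mirroring the described convention):

import sys


def downgrade():
    # no downgrading supported
    sys.exit(1)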
@@ -8,12 +8,11 @@
 import base64
 import codecs
 import os
-import traceback
+from hashlib import sha256
Imports are alphabetically sorted on the package name: base64, codecs, hashlib, os.
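Sorted on the package name, the block would read:

import base64
import codecs
from hashlib import sha256  # hashlib sorts between codecs and os
import os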
@@ -13,13 +13,15 @@
 import random
 import string
 import unittest
+from hashlib import sha256
Import ordering is bad here too.
libcodechecker/orm_model.py (outdated)
-        self.run_id, self.filepath = run_id, filepath
-        self.inc_count = 0
+    __table_args__ = (UniqueConstraint('filepath', 'content_hash',
+                                       name='_filpath_content_hash_uc'),)
The name of the UNIQUE KEY should be created by SQLAlchemy per the CC_META naming convention. There isn't a need to specify a name here.
Make sure to update the migration script to the observed name when you change this! I think the observed name will be uq_file_contents_filepath, but do check beforehand!
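For context, a naming convention of the kind CC_META presumably carries might look like this sketch (the project's actual template may differ):

from sqlalchemy import MetaData

# A 'uq' template like this auto-names the unique constraint on
# file_contents(filepath, ...) as uq_file_contents_filepath.
CC_META = MetaData(naming_convention={
    'uq': 'uq_%(table_name)s_%(column_0_name)s'
})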
libcodechecker/orm_model.py (outdated)
@@ -89,25 +87,31 @@ def __init__(self, run_id, checker_name, attribute, value):
     self.checker_name, self.run_id = checker_name, run_id


+class FileContent(Base):
+    __tablename__ = 'file_content'
The usual convention is a singular class name and a plural table name. (Yes, I know this is already violated in some places, but there's no need to violate it further; I have added an action point to my meta-issue #745.)
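Applied here, as a sketch:

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class FileContent(Base):             # singular class name
    __tablename__ = 'file_contents'  # plural table name
    content_hash = Column(String, primary_key=True)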
def upgrade():
    content = 'dummy content, backward incompatible update'
Let's enclose non-machine strings in " instead of '.
op.drop_column('files', 'inc_count')
op.drop_column('files', 'run_id')
op.add_column('files',
              sa.Column('content_hash', sa.String(),
The FOREIGN KEY relation is missing from this table, and thus the resulting schema doesn't match the ORM.
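Restoring it in the migration might look like this sketch (constraint and column names assumed from the diffs above):

# Assumes files.content_hash should reference file_contents.content_hash.
op.create_foreign_key('fk_files_content_hash',
                      'files', 'file_contents',
                      ['content_hash'], ['content_hash'])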
Another thing: while this massively helps deduplication when adding new data, orphaning is not handled at all.
Note to self: A FOREIGN KEY ON DELETE CASCADE relation subjugates the table you are defining it in to the table the reference points at, i.e. if the referenced key is deleted (in the remote table), all rows referencing that value (in the table in which the FOREIGN KEY is defined) are removed.
Our problem is the other way around. If a run gets deleted, the files referred to by its bug path elements and the contents referred to by said files are kept, resulting in massive dangling data.
Possibly we should check, when a run gets deleted, which files are referenced by that run's content, delete the files that became dangling, and for those files do the same with the hashes and the actual content (see the sketch below).
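A sketch of that cleanup, assuming the File, FileContent and BugPathEvent models from this PR (the exact set of referencing tables is an assumption):

from sqlalchemy import select

# Files no longer referenced by any bug path event are dangling.
referenced_files = select([BugPathEvent.file_id]).distinct()
session.query(File) \
    .filter(File.id.notin_(referenced_files)) \
    .delete(synchronize_session=False)

# Contents no longer referenced by any file are dangling too.
referenced_hashes = select([File.content_hash]).distinct()
session.query(FileContent) \
    .filter(FileContent.content_hash.notin_(referenced_hashes)) \
    .delete(synchronize_session=False)

session.commit()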
Force-pushed from 43ab7ab to 57927c0.
Insertion works well.
Deletion doesn't.
session.delete(run_to_delete)
session.commit()

s1 = select([BugPathEvent.file_id])
Let's make this code more verbose. It's really clunky to read, especially the ~something instead of not something.
But if this code doesn't do anything (seemingly it does not...), we shouldn't introduce it in the first place.
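For illustration, the terse and the explicit spellings in SQLAlchemy (s1 as in the snippet above):

# Terse: bitwise ~ acts as logical NOT on SQLAlchemy clause elements.
session.query(File).filter(~File.id.in_(s1))

# Explicit equivalent, easier to read.
session.query(File).filter(File.id.notin_(s1))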
At this point, when I try to delete a run from the database, I get the following error in the server console:
[ERROR] [12:05] - 'DeclarativeMeta' object is not iterable
And while the run is deleted, the files and file_contents tables are left intact.
                  primary_key=True),
        sa.Column('content', sa.Binary(),
                  nullable=False))
op.bulk_insert(content_table,
This bulk_insert should check whether there are any files in the database. For a freshly created database, the migration script automatically inserts this dummy dead weight, which is not good.
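A sketch of such a guard (content_table, content and its hash as defined earlier in the script):

conn = op.get_bind()
file_count = conn.execute(sa.text('SELECT COUNT(*) FROM files')).scalar()

# Insert the placeholder row only when existing file rows need
# something to point at; fresh databases skip it entirely.
if file_count:
    op.bulk_insert(content_table,
                   [{'content_hash': content_hash, 'content': content}])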
Force-pushed from afaa3e1 to 0e5c519.
If two files have the same SHA256 hash, store the content only once. The chance of a hash collision is very small, so we do not handle that case.
Since several runs might share files with the same content, this update should reduce the database size significantly.
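The core idea as a sketch (hypothetical helper, not the PR's code):

from hashlib import sha256

def store_deduplicated(storage, content):
    # Identical bytes hash to the same key, so the content is kept once
    # no matter how many files or runs share it.
    key = sha256(content).hexdigest()
    if key not in storage:
        storage[key] = content
    return key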
Closes #728.