
Deduplicate files between runs. #756

Merged: 2 commits merged into Ericsson:version6 on Jul 27, 2017
Conversation

@Xazax-hun (Contributor):

If two files have the same SHA256 hash, store the content only once. The chance of a hash collision is vanishingly small, so we do not handle that case.

Since several runs might share files with the same content, this update should reduce the database size significantly.
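
A minimal sketch of the scheme (an illustrative in-memory model, not the PR's actual code):

from hashlib import sha256

# Contents are keyed by their SHA-256 digest, so identical files across
# runs share a single stored entry.
file_contents = {}   # content_hash -> raw bytes ('file_contents' table)
files = []           # (filepath, content_hash) rows ('files' table)

def store_file(filepath, content):
    content_hash = sha256(content).hexdigest()
    # Store the content only once per unique hash.
    file_contents.setdefault(content_hash, content)
    files.append((filepath, content_hash))

store_file('a/main.cpp', b'int main() {}')
store_file('b/main.cpp', b'int main() {}')  # same bytes: stored once
assert len(file_contents) == 1 and len(files) == 2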

Closes #728.

@whisperity added this to the 6.0 pre2 milestone on Jul 25, 2017
@Xazax-hun force-pushed the DedupFiles branch 8 times, most recently from 70be9a4 to 6ca63af on July 25, 2017 10:24
@Xazax-hun added the WIP 💣 Work In Progress label on Jul 25, 2017
@Xazax-hun force-pushed the DedupFiles branch 3 times, most recently from 6cfbe5c to 0030536 on July 25, 2017 11:15
@dkrupp (Member) left a comment:

add test cases

    return NeedFileResult(needed, f.id)

    @timeit
-   def addFileContent(self, fid, content, encoding):
+   def addFileContent(self, content_hash, content, encoding):
@dkrupp (Member) commented:

Do not trust the client to calculate the content_hash correctly; a malicious client could corrupt the whole file storage. Verify the content hash here and throw an exception if an invalid hash was sent.
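
A minimal sketch of such a check (assuming the handler can raise the shared.RequestFailed error declared in the Thrift API's throws clause; the exact error-code value is an assumption):

from hashlib import sha256

def addFileContent(self, content_hash, content, encoding):
    # Recompute the digest server-side instead of trusting the client.
    computed_hash = sha256(content).hexdigest()
    if computed_hash != content_hash:
        # RequestFailed is the error type the API already throws;
        # ErrorCode.GENERAL is assumed here.
        raise shared.ttypes.RequestFailed(
            shared.ttypes.ErrorCode.GENERAL,
            'Content hash mismatch for uploaded file.')
    ...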

@Xazax-hun (Contributor, Author) replied:

I think it is redundant. We already trust the client in other places, and hashing is not cheap.

@@ -363,12 +363,12 @@ service codeCheckerDBAccess {
throws (1: shared.RequestFailed requestError),

    NeedFileResult needFileContent(
-       1: i64 run_id,
-       2: string filepath)
+       1: string filepath,
@dkrupp (Member) commented:

Add documentation to the API function. The expected semantics (see the sketch after this list):

  • If a matching (filepath, content_hash) pair is found in the DB, the corresponding existing NeedFileResult.fileId is returned with NeedFileResult.needed=false.
  • If only the content_hash matches, a new NeedFileResult.fileId is returned with NeedFileResult.needed=false.
  • If the filepath matches but the content_hash does not, a new file_id is generated and NeedFileResult.needed=true.
  • If neither the content_hash nor the filepath matches, a new file_id is generated and NeedFileResult.needed=true in the response.
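
A hypothetical sketch of that lookup logic (the ORM and session names are illustrative, not the PR's actual code; NeedFileResult takes (needed, fileId) as in the diff above):

def needFileContent(self, filepath, content_hash):
    session = self.__session
    file_record = session.query(File) \
        .filter(File.filepath == filepath,
                File.content_hash == content_hash) \
        .one_or_none()
    if file_record:
        # Exact (filepath, content_hash) match: reuse the existing row.
        return NeedFileResult(False, file_record.id)

    content_known = session.query(FileContent) \
        .filter(FileContent.content_hash == content_hash) \
        .count() > 0
    # A new file row is created; the content itself only needs to be
    # uploaded when its hash is not in the database yet.
    new_file = File(filepath, content_hash)
    session.add(new_file)
    session.flush()  # assigns new_file.id
    return NeedFileResult(not content_known, new_file.id)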

@Xazax-hun force-pushed the DedupFiles branch 10 times, most recently from 88fd70a to a981c4e on July 26, 2017 12:44
@Xazax-hun removed the WIP 💣 Work In Progress label on Jul 26, 2017
@whisperity modified the milestones: 6.0 pre1, 6.0 pre2 on Jul 26, 2017
@whisperity (Contributor) left a comment:

There are some issues with the database layout and migration, and some minor nits.

On a small project, I can confirm this feature works well.
Thank you, this is a valuable blow to database sizes. 😉



def downgrade():
    pass
@whisperity (Contributor) commented on Jul 26, 2017:

Let's put sys.exit(1) here just in case, to make this akin to the other migration scripts, with a note of # no downgrading supported.
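
i.e., a minimal sketch:

import sys

def downgrade():
    # no downgrading supported
    sys.exit(1)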

@@ -8,12 +8,11 @@
import base64
import codecs
import os
import traceback
from hashlib import sha256
@whisperity (Contributor) commented:

Imports are alphabetically sorted on the package name: base64, codecs, hashlib, os.
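
i.e., sorted by package name:

import base64
import codecs
from hashlib import sha256
import os
import traceback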

@@ -13,13 +13,15 @@
import random
import string
import unittest
from hashlib import sha256
@whisperity (Contributor) commented:

Import ordering is bad here too.

self.run_id, self.filepath = run_id, filepath
self.inc_count = 0
__table_args__ = (UniqueConstraint('filepath', 'content_hash',
                                   name='_filpath_content_hash_uc'),)
@whisperity (Contributor) commented on Jul 26, 2017:

The name of the UNIQUE KEY should be created by SQLAlchemy per the CC_META naming convention. There isn't a need to specify a name here.

Make sure to update the migration script to the observed name when you change this! I think the observed name will be uq_file_contents_filepath, but do check beforehand!
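
i.e., drop the explicit name and let the convention generate it:

__table_args__ = (UniqueConstraint('filepath', 'content_hash'),)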

@@ -89,25 +87,31 @@ def __init__(self, run_id, checker_name, attribute, value):
self.checker_name, self.run_id = checker_name, run_id


class FileContent(Base):
    __tablename__ = 'file_content'
@whisperity (Contributor) commented:

The usual convention is singular class name and plural table name. (Yes, I know this is already violated in some places, but there's no need to violate it further; I have added an action point to my meta-issue #745.)
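
i.e., a singular class mapped to a plural table:

class FileContent(Base):
    __tablename__ = 'file_contents'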



def upgrade():
    content = 'dummy content, backward incompatible update'
@whisperity (Contributor) commented:

Let's enclose non-machine strings in " instead of '.
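
i.e.:

content = "dummy content, backward incompatible update"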

op.drop_column('files', 'inc_count')
op.drop_column('files', 'run_id')
op.add_column('files',
              sa.Column('content_hash', sa.String(),
@whisperity (Contributor) commented:

The FOREIGN KEY relation is missing from this table, and thus the resulting schema doesn't match the ORM.
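
A minimal sketch of adding it in the migration (the constraint name and the referenced table/column names here are assumptions; match them to what the ORM generates):

op.create_foreign_key('fk_files_content_hash',
                      'files', 'file_contents',
                      ['content_hash'], ['content_hash'])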

@whisperity (Contributor) left a comment:

Another thing: while this massively helps deduplication when adding new data, orphaning is not handled at all.

Note to self: a FOREIGN KEY ON DELETE CASCADE relation subjugates the table you are defining it in to the table the reference points at, i.e. if the referenced key is deleted (in the remote table), all rows referencing that value (in the table in which the FOREIGN KEY is defined) are removed.

Our problem is the other way around. If a run gets deleted, the files referred to by its bug path elements and the contents referred to by those files are kept, resulting in massive dangling data.

Possibly we should check, when a run gets deleted, which files are referenced by that run's content, delete the files which became dangling, and for those files do the same with the hashes and the actual content.
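
A hypothetical sketch of such a cleanup pass (ORM names are illustrative, not necessarily the actual model):

# After deleting a run: drop files no longer referenced by any bug path
# element, then drop contents no longer referenced by any file.
session.query(File) \
    .filter(File.id.notin_(session.query(BugPathEvent.file_id))) \
    .delete(synchronize_session=False)
session.query(FileContent) \
    .filter(FileContent.content_hash.notin_(
        session.query(File.content_hash))) \
    .delete(synchronize_session=False)
session.commit()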

@Xazax-hun force-pushed the DedupFiles branch 2 times, most recently from 43ab7ab to 57927c0 on July 27, 2017 09:16
@whisperity (Contributor) left a comment:

Insertion works well.
Deletion doesn't.

session.delete(run_to_delete)
session.commit()

s1 = select([BugPathEvent.file_id])
@whisperity (Contributor) commented:

Let's make this code more verbose. It's really clunky to read, especially the ~something instead of not something.

But if this code doesn't do anything (seemingly it does not...), we shouldn't introduce it in the first place.

At this point, when I try to delete a run from the database, I get the following error in the server console:

[ERROR] [12:05] - 'DeclarativeMeta' object is not iterable

And while the run is deleted, the files and file_contents tables are left intact.
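
For example, spelling out the negation explicitly (a sketch reusing the s1 subquery above):

# clearer than ~File.id.in_(s1)
unreferenced_files = session.query(File).filter(File.id.notin_(s1))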

primary_key=True),
sa.Column('content', sa.Binary(),
nullable=False))
op.bulk_insert(content_table,
@whisperity (Contributor) commented:

This bulk_insert should check whether there are any files in the database. For freshly created databases, the migration script automatically inserts this dummy dead weight, which is not good.
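
A minimal sketch of guarding the insert (the column names follow the diff above; the primary-key column being content_hash and its placeholder value are assumptions):

conn = op.get_bind()
file_count = conn.execute(sa.text('SELECT COUNT(*) FROM files')).scalar()
if file_count:
    # Only migrated databases with existing file rows need the dummy row.
    op.bulk_insert(content_table,
                   [{'content_hash': 'dummy',
                     'content': b'dummy content, backward incompatible update'}])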

@Xazax-hun force-pushed the DedupFiles branch 5 times, most recently from afaa3e1 to 0e5c519 on July 27, 2017 12:12
@whisperity merged commit ace1b53 into Ericsson:version6 on Jul 27, 2017
@Xazax-hun deleted the DedupFiles branch on July 31, 2017 13:44