
Deduplicate files between runs. #756

Merged: 2 commits merged into Ericsson:version6 on Jul 27, 2017
Conversation

@Xazax-hun (Contributor):

If two files have the same SHA256 hash, store the content only once. The chance of a hash collision is vanishingly small, so we do not handle that case.

Since several runs might share files with the same content, this update should reduce the database size significantly.
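
A minimal sketch of the scheme (an illustrative in-memory model, not the PR's actual code):

from hashlib import sha256

# Contents are keyed by their SHA-256 digest, so identical files across
# runs share a single stored entry.
file_contents = {}   # content_hash -> raw bytes ('file_contents' table)
files = []           # (filepath, content_hash) rows ('files' table)

def store_file(filepath, content):
    content_hash = sha256(content).hexdigest()
    # Store the content only once per unique hash.
    file_contents.setdefault(content_hash, content)
    files.append((filepath, content_hash))

store_file('a/main.cpp', b'int main() {}')
store_file('b/main.cpp', b'int main() {}')  # same bytes: stored once
assert len(file_contents) == 1 and len(files) == 2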

Closes #728.

@whisperity added this to the 6.0 pre2 milestone on Jul 25, 2017
@Xazax-hun force-pushed the DedupFiles branch 8 times, most recently from 70be9a4 to 6ca63af on July 25, 2017 10:24
@Xazax-hun added the WIP 💣 Work In Progress label on Jul 25, 2017
@Xazax-hun force-pushed the DedupFiles branch 3 times, most recently from 6cfbe5c to 0030536 on July 25, 2017 11:15
@dkrupp (Member) left a comment:

add test cases

    return NeedFileResult(needed, f.id)

    @timeit
-   def addFileContent(self, fid, content, encoding):
+   def addFileContent(self, content_hash, content, encoding):
@dkrupp (Member) commented:

Do not trust the client to calculate the content_hash correctly; a malicious client could corrupt the whole file storage. Verify the content hash here and throw an exception if an invalid hash was sent.
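
A minimal sketch of such a check (assuming the handler can raise the shared.RequestFailed error declared in the Thrift API's throws clause; the exact error-code value is an assumption):

from hashlib import sha256

def addFileContent(self, content_hash, content, encoding):
    # Recompute the digest server-side instead of trusting the client.
    computed_hash = sha256(content).hexdigest()
    if computed_hash != content_hash:
        # RequestFailed is the error type the API already throws;
        # ErrorCode.GENERAL is assumed here.
        raise shared.ttypes.RequestFailed(
            shared.ttypes.ErrorCode.GENERAL,
            'Content hash mismatch for uploaded file.')
    ...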

@Xazax-hun (Contributor, Author) replied:

I think it is redundant. We already trust the client in other places, and hashing is not cheap.

@@ -363,12 +363,12 @@ service codeCheckerDBAccess {
throws (1: shared.RequestFailed requestError),

    NeedFileResult needFileContent(
-       1: i64 run_id,
-       2: string filepath)
+       1: string filepath,
@dkrupp (Member) commented:

Add documentation to the API function. The expected semantics (see the sketch after this list):

  • If a matching (filepath, content_hash) pair is found in the DB, the corresponding existing NeedFileResult.fileId is returned with NeedFileResult.needed=false.
  • If only the content_hash matches, a new NeedFileResult.fileId is returned with NeedFileResult.needed=false.
  • If the filepath matches but the content_hash does not, a new file_id is generated and NeedFileResult.needed=true.
  • If neither the content_hash nor the filepath matches, a new file_id is generated and NeedFileResult.needed=true in the response.
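
A hypothetical sketch of that lookup logic (the ORM and session names are illustrative, not the PR's actual code; NeedFileResult takes (needed, fileId) as in the diff above):

def needFileContent(self, filepath, content_hash):
    session = self.__session
    file_record = session.query(File) \
        .filter(File.filepath == filepath,
                File.content_hash == content_hash) \
        .one_or_none()
    if file_record:
        # Exact (filepath, content_hash) match: reuse the existing row.
        return NeedFileResult(False, file_record.id)

    content_known = session.query(FileContent) \
        .filter(FileContent.content_hash == content_hash) \
        .count() > 0
    # A new file row is created; the content itself only needs to be
    # uploaded when its hash is not in the database yet.
    new_file = File(filepath, content_hash)
    session.add(new_file)
    session.flush()  # assigns new_file.id
    return NeedFileResult(not content_known, new_file.id)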

@Xazax-hun force-pushed the DedupFiles branch 10 times, most recently from 88fd70a to a981c4e on July 26, 2017 12:44
@Xazax-hun removed the WIP 💣 Work In Progress label on Jul 26, 2017
@whisperity modified the milestones: 6.0 pre1, 6.0 pre2 on Jul 26, 2017
@whisperity (Contributor) left a comment:

There are some issues with the database layout and migration, and some minor nits.

On a small project, I can confirm this feature works well.
Thank you, this is a valuable blow to database sizes. 😉



def downgrade():
    pass
@whisperity (Contributor) commented on Jul 26, 2017:

Let's put sys.exit(1) here just in case, to make this akin to the other migration scripts, with a note of # no downgrading supported.
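
i.e., a minimal sketch:

import sys

def downgrade():
    # no downgrading supported
    sys.exit(1)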

@@ -8,12 +8,11 @@
import base64
import codecs
import os
import traceback
from hashlib import sha256
@whisperity (Contributor) commented:

Imports are alphabetically sorted on the package name: base64, codecs, hashlib, os.
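
i.e., sorted by package name:

import base64
import codecs
from hashlib import sha256
import os
import traceback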

@@ -13,13 +13,15 @@
import random
import string
import unittest
from hashlib import sha256
@whisperity (Contributor) commented:

Import ordering is bad here too.

self.run_id, self.filepath = run_id, filepath
self.inc_count = 0
__table_args__ = (UniqueConstraint('filepath', 'content_hash',
                                   name='_filpath_content_hash_uc'),)
@whisperity (Contributor) commented on Jul 26, 2017:

The name of the UNIQUE KEY should be created by SQLAlchemy per the CC_META naming convention. There isn't a need to specify a name here.

Make sure to update the migration script to the observed name when you change this! I think the observed name will be uq_file_contents_filepath, but do check beforehand!
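
i.e., drop the explicit name and let the convention generate it:

__table_args__ = (UniqueConstraint('filepath', 'content_hash'),)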

@@ -89,25 +87,31 @@ def __init__(self, run_id, checker_name, attribute, value):
self.checker_name, self.run_id = checker_name, run_id


class FileContent(Base):
    __tablename__ = 'file_content'
@whisperity (Contributor) commented:

The usual convention is singular class name and plural table name. (Yes, I know this is already violated in some places, but there's no need to violate it further; I have added an action point to my meta-issue #745.)
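
i.e., a singular class mapped to a plural table:

class FileContent(Base):
    __tablename__ = 'file_contents'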



def upgrade():
    content = 'dummy content, backward incompatible update'
@whisperity (Contributor) commented:

Let's enclose non-machine strings in " instead of '.
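
i.e.:

content = "dummy content, backward incompatible update"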

op.drop_column('files', 'inc_count')
op.drop_column('files', 'run_id')
op.add_column('files',
              sa.Column('content_hash', sa.String(),
@whisperity (Contributor) commented:

The FOREIGN KEY relation is missing from this table, and thus the resulting schema doesn't match the ORM.
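
A minimal sketch of adding it in the migration (the constraint name and the referenced table/column names here are assumptions; match them to what the ORM generates):

op.create_foreign_key('fk_files_content_hash',
                      'files', 'file_contents',
                      ['content_hash'], ['content_hash'])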

@whisperity (Contributor) left a comment:

Another thing: while this massively helps deduplication when adding new data, orphaning is not handled at all.

Note to self: a FOREIGN KEY ON DELETE CASCADE relation subjugates the table you are defining it in to the table the reference points at, i.e. if the referenced key is deleted (in the remote table), all rows referencing that value (in the table in which the FOREIGN KEY is defined) are removed.

Our problem is the other way around. If a run gets deleted, the files referred to by its bug path elements and the contents referred to by those files are kept, resulting in massive dangling data.

Possibly we should check, when a run gets deleted, which files are referenced by that run's content, delete the files which became dangling, and for those files do the same with the hashes and the actual content.
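
A hypothetical sketch of such a cleanup pass (ORM names are illustrative, not necessarily the actual model):

# After deleting a run: drop files no longer referenced by any bug path
# element, then drop contents no longer referenced by any file.
session.query(File) \
    .filter(File.id.notin_(session.query(BugPathEvent.file_id))) \
    .delete(synchronize_session=False)
session.query(FileContent) \
    .filter(FileContent.content_hash.notin_(
        session.query(File.content_hash))) \
    .delete(synchronize_session=False)
session.commit()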

@Xazax-hun force-pushed the DedupFiles branch 2 times, most recently from 43ab7ab to 57927c0 on July 27, 2017 09:16
@whisperity (Contributor) left a comment:

Insertion works well.
Deletion doesn't.

session.delete(run_to_delete)
session.commit()

s1 = select([BugPathEvent.file_id])
@whisperity (Contributor) commented:

Let's make this code more verbose. It's really clunky to read, especially the ~something instead of not something.

But if this code doesn't do anything (seemingly it does not...), we shouldn't introduce it in the first place.

At this point, when I try to delete a run from the database, I get the following error in the server console:

[ERROR] [12:05] - 'DeclarativeMeta' object is not iterable

And while the run is deleted, the files and file_contents tables are left intact.
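
For example, spelling out the negation explicitly (a sketch reusing the s1 subquery above):

# clearer than ~File.id.in_(s1)
unreferenced_files = session.query(File).filter(File.id.notin_(s1))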

primary_key=True),
sa.Column('content', sa.Binary(),
nullable=False))
op.bulk_insert(content_table,
@whisperity (Contributor) commented:

This bulk_insert should check whether there are any files in the database. For freshly created databases, the migration script automatically inserts this dummy dead weight, which is not good.
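
A minimal sketch of guarding the insert (the column names follow the diff above; the primary-key column being content_hash and its placeholder value are assumptions):

conn = op.get_bind()
file_count = conn.execute(sa.text('SELECT COUNT(*) FROM files')).scalar()
if file_count:
    # Only migrated databases with existing file rows need the dummy row.
    op.bulk_insert(content_table,
                   [{'content_hash': 'dummy',
                     'content': b'dummy content, backward incompatible update'}])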

@Xazax-hun force-pushed the DedupFiles branch 5 times, most recently from afaa3e1 to 0e5c519 on July 27, 2017 12:12
@whisperity merged commit ace1b53 into Ericsson:version6 on Jul 27, 2017
@Xazax-hun deleted the DedupFiles branch on July 31, 2017 13:44