Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File extension logic cleanup #29

Merged
merged 9 commits into from
Sep 21, 2018

Conversation

michaelweiser
Copy link
Contributor

Hi, I took a crack at guessing file extensions from mime types.

This lead to various cleanup in preparation to it, e.g. avoiding a trailing dot if we have no extension and only creating the workdir and symlink if we do have one to work with.

The guesswork itself is somewhat overenthusiastic in that it returns e.g. .ksh for text/plain. Otherwise it works fine, coming back with e.g. .py for text/x-python.

So I put this up for testing and comments for the time being.

Copy link

@SebastianDeiss SebastianDeiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks.

@michaelweiser
Copy link
Contributor Author

I have one quirm about this: Isn't it a problem that file extension .ksh is guessed for all text/plain parts? In practice it doesn't matter right now because we ignore text/plain and don't hand it to cuckoo anyway. But if that ever changed, we might get a lot of false positives from text containing what looks like shell code that triggers cuckoo but would never pose a risk to any mail client.

Other mime types might have similar problems.

Any thoughts on that?

@michaelweiser michaelweiser added this to the 1.7 milestone Sep 6, 2018
peekaboo/sample.py Show resolved Hide resolved
peekaboo/sample.py Show resolved Hide resolved
peekaboo/toolbox/files.py Outdated Show resolved Hide resolved
If the declared name conained no dots we would use the whole name as
file extension: name_declared == foo -> file_extension == foo -> ln -s
p001 -> <hash>.foo. Use os.path.splitext() similar to the filename case.
Always return the attribute value if known and remove override logic
because there is no reason why a sample would and should change its file
extension over its lifetime now that we've removed possibly delayed
loading of a meta info file. Fold actual extension splitting and
attribute setting into one routine.
Do not append a single dot and no actual file extension if we can't
figure out any.
There's no point in creating a workdir and symlink in it if we don't
know any file extension anyway.
@michaelweiser michaelweiser changed the title RFC: Guess file extensions and various cleanup around it File extension logic cleanup Sep 21, 2018
More selectively catch exceptions only from the actual attempt of
deleting the temporary directory instead of the decision and logging
logic as well - which shouldn't error, particularly with OSError.
Ignoring OSError was introducted in commit 737eaa3 likely to ignore
EEXISTS errors when another thread had already created an identical
symlink due to a race condition.

While this was never good because the underlying race condition would
still be there, it also made us blind to other errors such as missing
permissions or exhaustion of inodes on the target file system. Using
tempfile.mkdtemp to create the workdir should reliably avoid any
race condition now and make this selective blindspot unnecessary (as
long as the same sample isn't analysed by multiple threads - whoa).
The set of mime types is an actual set of unique elements. No need to
convert back and forth from list.
Reorder mime type detection logic to avoid redoing work whose result
won't be used. Makes it clear that we detect only once and then use the
attribute's value. The logic of merging in newly detected types hinted
at by a comment has long been gone. Remove that comment.
Leaving an empty "else" execution path, requiring the reader to know
that the default return value is None, seems needlessly confusing.
@michaelweiser
Copy link
Contributor Author

@Jack28: Rebased to master and left out the actual file extension guessing for now. The code is parked in a branch in my fork. Can you give this another quick once-over so I can merge?

@Jack28
Copy link
Member

Jack28 commented Sep 21, 2018

LGTM (haven't done any additional testing though)

@michaelweiser
Copy link
Contributor Author

Let's roll with it. ;)

@michaelweiser michaelweiser merged commit a575158 into scVENUS:master Sep 21, 2018
@michaelweiser michaelweiser modified the milestones: 1.7, 1.6.2 Nov 30, 2018
@michaelweiser michaelweiser deleted the guess-fileext branch February 17, 2019 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants