-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
File extension logic cleanup #29
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Thanks.
I have one quirm about this: Isn't it a problem that file extension .ksh is guessed for all text/plain parts? In practice it doesn't matter right now because we ignore text/plain and don't hand it to cuckoo anyway. But if that ever changed, we might get a lot of false positives from text containing what looks like shell code that triggers cuckoo but would never pose a risk to any mail client. Other mime types might have similar problems. Any thoughts on that? |
d7233d3
to
a225cd1
Compare
If the declared name conained no dots we would use the whole name as file extension: name_declared == foo -> file_extension == foo -> ln -s p001 -> <hash>.foo. Use os.path.splitext() similar to the filename case.
Always return the attribute value if known and remove override logic because there is no reason why a sample would and should change its file extension over its lifetime now that we've removed possibly delayed loading of a meta info file. Fold actual extension splitting and attribute setting into one routine.
Do not append a single dot and no actual file extension if we can't figure out any.
There's no point in creating a workdir and symlink in it if we don't know any file extension anyway.
More selectively catch exceptions only from the actual attempt of deleting the temporary directory instead of the decision and logging logic as well - which shouldn't error, particularly with OSError.
Ignoring OSError was introducted in commit 737eaa3 likely to ignore EEXISTS errors when another thread had already created an identical symlink due to a race condition. While this was never good because the underlying race condition would still be there, it also made us blind to other errors such as missing permissions or exhaustion of inodes on the target file system. Using tempfile.mkdtemp to create the workdir should reliably avoid any race condition now and make this selective blindspot unnecessary (as long as the same sample isn't analysed by multiple threads - whoa).
The set of mime types is an actual set of unique elements. No need to convert back and forth from list.
Reorder mime type detection logic to avoid redoing work whose result won't be used. Makes it clear that we detect only once and then use the attribute's value. The logic of merging in newly detected types hinted at by a comment has long been gone. Remove that comment.
Leaving an empty "else" execution path, requiring the reader to know that the default return value is None, seems needlessly confusing.
a225cd1
to
7da2a9a
Compare
@Jack28: Rebased to master and left out the actual file extension guessing for now. The code is parked in a branch in my fork. Can you give this another quick once-over so I can merge? |
LGTM (haven't done any additional testing though) |
Let's roll with it. ;) |
Hi, I took a crack at guessing file extensions from mime types.
This lead to various cleanup in preparation to it, e.g. avoiding a trailing dot if we have no extension and only creating the workdir and symlink if we do have one to work with.
The guesswork itself is somewhat overenthusiastic in that it returns e.g. .ksh for text/plain. Otherwise it works fine, coming back with e.g. .py for text/x-python.
So I put this up for testing and comments for the time being.