File extension logic cleanup #29

michaelweiser · 2018-08-31T15:21:17Z

Hi, I took a crack at guessing file extensions from mime types.

This led to various cleanups in preparation for it, e.g. avoiding a trailing dot if we have no extension and only creating the workdir and symlink if we do have one to work with.

The guesswork itself is somewhat overenthusiastic in that it returns e.g. .ksh for text/plain. Otherwise it works fine, coming back with e.g. .py for text/x-python.
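
To illustrate the ambiguity: Python's standard `mimetypes` table maps several extensions to `text/plain`, so a plain lookup has to pick one of them. The following is only a sketch, assuming the guessing is done with the `mimetypes` module (this description does not say). `PREFERRED_EXTENSIONS` and `guess_extension()` are hypothetical names, not code from this PR.

```python
import mimetypes

# Hypothetical override table: pin ambiguous types to a single extension.
PREFERRED_EXTENSIONS = {
    "text/plain": ".txt",
}


def guess_extension(mime_type):
    """Return an extension (with leading dot) for mime_type, or None."""
    if mime_type in PREFERRED_EXTENSIONS:
        return PREFERRED_EXTENSIONS[mime_type]

    # guess_all_extensions() returns every known candidate. Sorting makes
    # the pick independent of the table's internal ordering.
    candidates = sorted(mimetypes.guess_all_extensions(mime_type))
    if not candidates:
        return None
    return candidates[0]
```

With an override in place, `guess_extension("text/plain")` yields `.txt`, and types unknown to the table yield `None` instead of a guess.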

So I put this up for testing and comments for the time being.

SebastianDeiss

Great work! Thanks.

michaelweiser · 2018-09-05T12:34:23Z

I have one qualm about this: Isn't it a problem that file extension .ksh is guessed for all text/plain parts? In practice it doesn't matter right now because we ignore text/plain and don't hand it to cuckoo anyway. But if that ever changed, we might get a lot of false positives from text containing what looks like shell code that triggers cuckoo but would never pose a risk to any mail client.

Other mime types might have similar problems.

Any thoughts on that?

peekaboo/sample.py

peekaboo/toolbox/files.py

If the declared name contained no dots we would use the whole name as file extension: name_declared == foo -> file_extension == foo -> ln -s p001 -> <hash>.foo. Use os.path.splitext() similar to the filename case.
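
A minimal sketch of the difference, not the actual code from `peekaboo/sample.py`; `extension_from_name()` is a hypothetical helper. Splitting on dots returns the whole name when there is no dot, while `os.path.splitext()` returns an empty extension.

```python
import os.path


def extension_from_name(name):
    """Return the extension of name without the dot, or None if it has none."""
    if not name:
        return None
    # splitext("foo") -> ("foo", ""), splitext("a.tar.gz") -> ("a.tar", ".gz")
    ext = os.path.splitext(name)[1]
    # Strip the leading dot; an empty result means "no extension".
    return ext[1:] or None
```

For comparison, `"foo".split(".")[-1]` evaluates to `"foo"`, which is the behaviour described above.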

Always return the attribute value if known and remove override logic because there is no reason why a sample would and should change its file extension over its lifetime now that we've removed possibly delayed loading of a meta info file. Fold actual extension splitting and attribute setting into one routine.

Do not append a single dot and no actual file extension if we can't figure out any.

There's no point in creating a workdir and symlink in it if we don't know any file extension anyway.
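
The two points above can be sketched as a single decision: without an extension there is no name to build and therefore nothing to create. `submit_name()` is a hypothetical helper for illustration, not the function used in the code base.

```python
def submit_name(sha256sum, file_extension):
    """Return '<hash>.<ext>' or None if no extension is known."""
    if not file_extension:
        # No extension: the caller skips workdir and symlink creation and
        # never ends up with a name like '<hash>.' with a dangling dot.
        return None
    return "%s.%s" % (sha256sum, file_extension)
```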

More selectively catch exceptions only from the actual attempt of deleting the temporary directory instead of the decision and logging logic as well - which shouldn't error, particularly with OSError.
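
As a sketch of the narrower wrapping (`cleanup_workdir()` and its `keep` parameter are assumptions for illustration): the decision and logging code sits outside the `try`, so only the deletion itself can be caught.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def cleanup_workdir(workdir, keep=False):
    """Remove workdir; return True only if it was actually deleted."""
    # Decision and logging logic stays outside the try block.
    if workdir is None:
        return False
    if keep:
        logger.debug("Keeping work directory %s", workdir)
        return False

    try:
        shutil.rmtree(workdir)
    except OSError as error:
        # Only errors from the deletion attempt end up here.
        logger.error("Failed to remove %s: %s", workdir, error)
        return False
    return True
```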

Ignoring OSError was introduced in commit 737eaa3 likely to ignore EEXIST errors when another thread had already created an identical symlink due to a race condition. While this was never good because the underlying race condition would still be there, it also made us blind to other errors such as missing permissions or exhaustion of inodes on the target file system. Using tempfile.mkdtemp to create the workdir should reliably avoid any race condition now and make this selective blindspot unnecessary (as long as the same sample isn't analysed by multiple threads - whoa).
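
A minimal sketch of the idea, with hypothetical names (`create_submit_link()`, the `sample-` prefix): `tempfile.mkdtemp()` creates a uniquely named directory on every call, so the symlink path inside it cannot already exist and `OSError` no longer needs to be swallowed.

```python
import os
import tempfile


def create_submit_link(sample_path, link_name, base_dir=None):
    """Create a fresh workdir with a symlink to sample_path inside it."""
    # mkdtemp() picks a unique name and creates the directory itself,
    # so two threads never compete for the same path.
    workdir = tempfile.mkdtemp(prefix="sample-", dir=base_dir)
    link_path = os.path.join(workdir, link_name)
    # No try/except here: permission problems, a full file system and
    # similar errors propagate to the caller instead of being ignored.
    os.symlink(sample_path, link_path)
    return workdir, link_path
```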

The set of mime types is an actual set of unique elements. No need to convert back and forth from list.

Reorder mime type detection logic to avoid redoing work whose result won't be used. Makes it clear that we detect only once and then use the attribute's value. The logic of merging in newly detected types hinted at by a comment has long been gone. Remove that comment.
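
The detect-once pattern, together with keeping the types as a set, might look roughly like this. The class below is a stand-in for illustration, not the real `Sample` class, and its attribute names are assumptions.

```python
import mimetypes


class Sample:
    """Illustrative stand-in holding a lazily detected set of mime types."""

    def __init__(self, filename, declared_type=None):
        self.filename = filename
        self.declared_type = declared_type
        self._mime_types = None  # None means: not detected yet

    @property
    def mime_types(self):
        # Return the stored value first so no detection work is redone.
        if self._mime_types is not None:
            return self._mime_types

        detected = set()  # a set deduplicates on its own
        if self.declared_type:
            detected.add(self.declared_type)
        guessed = mimetypes.guess_type(self.filename)[0]
        if guessed:
            detected.add(guessed)

        self._mime_types = detected
        return self._mime_types
```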

Leaving an empty "else" execution path, requiring the reader to know that the default return value is None, seems needlessly confusing.
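
For illustration, a guessing function with both outcomes spelled out. The function name is hypothetical, and using `mimetypes.guess_type()` here is an assumption.

```python
import mimetypes


def guess_mime_type_from_filename(filename):
    """Return the mime type suggested by the file name, or None."""
    guessed = mimetypes.guess_type(filename)[0]
    if guessed:
        return guessed

    # Explicit instead of falling off the end of the function.
    return None
```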

michaelweiser · 2018-09-21T14:20:35Z

@Jack28: Rebased to master and left out the actual file extension guessing for now. The code is parked in a branch in my fork. Can you give this another quick once-over so I can merge?

Jack28 · 2018-09-21T14:46:43Z

LGTM (haven't done any additional testing though)

michaelweiser · 2018-09-21T15:10:57Z

Let's roll with it. ;)

michaelweiser added the enhancement label Aug 31, 2018

michaelweiser requested a review from SebastianDeiss August 31, 2018 15:22

SebastianDeiss approved these changes Sep 5, 2018

michaelweiser added this to the 1.7 milestone Sep 6, 2018

Jack28 reviewed Sep 6, 2018

michaelweiser mentioned this pull request Sep 21, 2018

Look into other ways than file extension to give Cuckoo file-type hints #47

Open

michaelweiser force-pushed the guess-fileext branch from d7233d3 to a225cd1 on September 21, 2018 14:06

michaelweiser added 4 commits September 21, 2018 15:07

Fix file extension extraction from declared name

b7be138

Do not append empty file extension

595c324

Only create a workdir and submit symlink if we have a file extension

e38db17

michaelweiser changed the title ~~RFC: Guess file extensions and various cleanup around it~~ File extension logic cleanup Sep 21, 2018

michaelweiser added 5 commits September 21, 2018 15:11

More closely wrap exception around temp dir deletion

a246726

Switch mime types to set for good

cb18250

Make mime type guessing functions more reader-friendly

7da2a9a

michaelweiser force-pushed the guess-fileext branch from a225cd1 to 7da2a9a on September 21, 2018 14:16

michaelweiser mentioned this pull request Sep 21, 2018

SMIME signature exception does not seem to work #39

Closed

michaelweiser merged commit a575158 into scVENUS:master Sep 21, 2018

michaelweiser modified the milestones: 1.7, 1.6.2 Nov 30, 2018

michaelweiser deleted the guess-fileext branch February 17, 2019 22:55
