Add method for unknown extensions in binary extractions #343
Comments
I noticed when I was filtering for URLs ending in ".doc" that I was getting a lot of non-doc formats (HTML and text formats). I think incorrect file extensions are less likely for the other binary formats we're targeting, but if we want a generic algorithm for determining the extension, I'd cover the .doc case by switching steps 1 and 2. I'd also use the plural method. We're already getting the MimeType with Tika, which is the only expensive operation in this process. My method:
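A minimal Python sketch of the ordering described above (detected MIME type first, URL extension second, a placeholder last). This is not the method from the comment. The names `MIME_EXTENSIONS`, `url_extension`, and `guess_extension` are hypothetical, and the hard-coded table stands in for whatever extension list the MIME detector would supply.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical MIME type -> known extensions table (assumption for this
# sketch); a real implementation would get these from the MIME detector.
MIME_EXTENSIONS = {
    "application/msword": ["doc", "dot"],
    "application/pdf": ["pdf"],
    "text/html": ["html", "htm"],
    "text/plain": ["txt", "text"],
}


def url_extension(url):
    """Return the lowercase extension of the URL path, or '' if there is none."""
    path = urlparse(url).path  # drops the query string and fragment
    _, ext = posixpath.splitext(path)
    return ext[1:].lower()


def guess_extension(url, mime_type, default="unknown"):
    """Pick an extension, trusting the detected MIME type over the URL."""
    url_ext = url_extension(url)
    candidates = MIME_EXTENSIONS.get(mime_type, [])
    if candidates:
        # Keep the URL's extension only when it agrees with the MIME type;
        # otherwise use the first extension listed for that type.
        return url_ext if url_ext in candidates else candidates[0]
    # No extensions known for this MIME type: fall back to the URL,
    # then to the placeholder.
    return url_ext or default
```

With this ordering, a URL ending in ".doc" whose content is detected as `text/html` gets `html` rather than `doc`, which is the mislabelled case described above.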
I have an idea, since I noticed these methods as I'm hacking on #346: use what you have above, and combine it there. Maybe do something consistent with what we have with
Not sure what you mean by this line, unless you're just saying to put the method I wrote above in that section of
@jrwiebe yep! That, plus potentially mimicking those two existing functions as well. Does that make sense?
Actually, now that I'm trying it I realize we don't want a
Let's do a separate one after #346. I want to make sure it comes in under your name 😃
With the implementation of #302, #306, and #307, we will occasionally get extracted binaries that do not have file extensions. We should create a method/helper to account for this:
UNKNOWN
or something else as the extension.