Add method for unknown extensions in binary extractions #343

Closed
ruebot opened this issue Aug 13, 2019 · 7 comments

@ruebot
Member

ruebot commented Aug 13, 2019

With the implementation of #302, #306, and #307, we will occasionally extract binaries that have no file extension. We should create a method/helper to account for this:

  1. Try existing method
  2. Try to guess file extension from MimeType (@jrwiebe is working on this in get-extension)
  3. If both fail, use UNKNOWN or something else as the extension.
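
The three steps above can be sketched as a chain of fallbacks. This is only an illustration, not the project's code: `ExtensionFallback`, `extFromUrl`, `guessExtension`, and the small `mimeToExt` table are hypothetical names, and the table stands in for a real MIME type lookup such as Tika's.

```scala
object ExtensionFallback {
  // Hypothetical lookup table standing in for a real MIME type detector.
  val mimeToExt: Map[String, String] = Map(
    "application/pdf" -> "pdf",
    "image/png" -> "png",
    "application/msword" -> "doc"
  )

  // Step 1: naive URL parsing. Drop the query string and fragment, then take
  // whatever follows the last dot in the final path segment.
  def extFromUrl(url: String): Option[String] = {
    val path = url.takeWhile(c => c != '?' && c != '#')
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0 && dot < name.length - 1) Some(name.substring(dot + 1).toLowerCase)
    else None
  }

  // Steps 1 to 3 in order: URL extension, then MIME type guess, then "unknown".
  def guessExtension(url: String, mimeType: String): String =
    extFromUrl(url)
      .orElse(mimeToExt.get(mimeType))
      .getOrElse("unknown")
}
```

With this ordering the URL always wins when it carries an extension, so a mislabelled URL is trusted even when the MIME type disagrees.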
@jrwiebe
Contributor

jrwiebe commented Aug 13, 2019

I noticed when I was filtering for URLs ending in ".doc" that I was getting a lot of non-doc formats (HTML and text formats). I think it's less likely there will be such incorrect file extensions for the other binary formats we're targeting, but if we want a generic algorithm for determining the extension, I'd cover the .doc case by switching steps 1 and 2. I'd also use the plural method getExtensions and if the extension returned by FilenameUtils is contained in this list, I'd select that one.

We're already getting the MimeType with Tika, which is the only expensive operation in this process.

My method:

  import org.apache.commons.io.FilenameUtils

  def getExt(mimeType: String, url: String): String = {
    // Tika returns extensions with a leading dot, e.g. ".doc"
    val tikaExtensions = DetectMimeTypeTika.getExtensions(mimeType)
    var ext = "unknown"
    // Tika method: a single candidate is unambiguous
    if (tikaExtensions.size == 1) {
      ext = tikaExtensions(0).substring(1)
    } else {
      // FilenameUtils method
      val fnuExt = FilenameUtils.getExtension(url)
      if (fnuExt != null && !fnuExt.isEmpty) {
        // Reconcile Tika list and FilenameUtils extension
        if (tikaExtensions.size > 1) {
          // Prepend the dot so the comparison matches Tika's format
          if (tikaExtensions.contains("." + fnuExt)) {
            ext = fnuExt
          } else {
            ext = tikaExtensions(0).substring(1)
          }
        } else { // tikaExtensions.size == 0 && fnuExt exists
          ext = fnuExt
        }
      } // else => unknown
    }
    ext
  }
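
To make the branches above easier to check without Tika on the classpath, here is a sketch of the same reconciliation with the Tika lookup replaced by a plain list argument. `ReconcileExtension` and `reconcile` are hypothetical names used only for illustration, and the sketch assumes candidate extensions arrive with a leading dot.

```scala
object ReconcileExtension {
  // mimeExts: candidate extensions for the detected MIME type, with leading
  //           dots (e.g. ".jpg"); assumed to come from a Tika-style lookup.
  // urlExt:   extension parsed from the URL, without a dot; may be empty.
  def reconcile(mimeExts: Seq[String], urlExt: String): String = mimeExts match {
    // Exactly one candidate: trust the MIME type.
    case Seq(only) => only.stripPrefix(".")
    // No candidates: fall back to the URL extension if there is one.
    case Seq() => if (urlExt.nonEmpty) urlExt else "unknown"
    // Several candidates: keep the URL extension only if it is one of them.
    case many =>
      if (urlExt.isEmpty) "unknown"
      else if (many.contains("." + urlExt)) urlExt
      else many.head.stripPrefix(".")
  }
}
```

For the ".doc" case described above, a URL ending in ".doc" whose content is detected as HTML would call `reconcile(Seq(".html", ".htm"), "doc")`, and the result is `html` rather than `doc`.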

@ruebot
Member Author

ruebot commented Aug 16, 2019

I have an idea, since I noticed these methods while hacking on #346:

https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/package.scala#L307-L313

https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/package.scala#L374-L380

Use what you have above, and combine it there. Maybe do something consistent with what we have for getMimeType, which uses the web server MIME type, alongside keepTikaMimeTypes and discardTikaMimeTypes. It could clean up a whole lot of what we have put in there over the last couple of days.

@jrwiebe
Contributor

jrwiebe commented Aug 16, 2019

> Use what you have above, and combine it there.

Not sure what you mean by this line, unless you're just saying to put the method I wrote above in that section of package.scala. (Which I was about to do.)

@ruebot
Member Author

ruebot commented Aug 16, 2019

@jrwiebe yep! That, plus potentially mimicking those two existing functions as well. Does that make sense?

@jrwiebe
Contributor

jrwiebe commented Aug 16, 2019

Actually, now that I'm trying it I realize we don't want a getExtension method that applies to RDDs. I'm putting it in the matchbox.

@jrwiebe
Contributor

jrwiebe commented Aug 16, 2019

Did it.

@ruebot Want to integrate this into PR #346? Or I could make a separate one after that's merged.

@ruebot
Member Author

ruebot commented Aug 16, 2019

Let's do a separate one after #346. I want to make sure it comes in under your name 😃


2 participants