Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342

Closed
jrwiebe opened this issue Aug 13, 2019 · 9 comments · Fixed by #344
Comments

@jrwiebe
Contributor

jrwiebe commented Aug 13, 2019

I wasn't sure whether to add this as a comment in #306 (or one of the other extraction method threads), or to make it an issue.

I observed that while we're using DetectMimeTypeTika in methods like extractPDFDetailsDF, extractVideoDetailsDF, etc., we're using ArchiveRecord.getMimeType for reporting the MIME type. I think Tika's MIME type will tend to be more accurate than what is set by the web server. Do we want to go for accurate meta-description of the contents of our archive records, or do we want fidelity to what was collected from the web?

If we wanted to use what we're getting from Tika, a method like extractAudioDetailsDF() could look something like this:

def extractAudioDetailsDF(rdd: RDD[ArchiveRecord]): DataFrame = {
  val records = rdd
    .map(r =>
      (r, DetectMimeTypeTika(r.getContentString)) // <------- (record, mime_type) tuple
    )
    .filter(r => r._2.startsWith("audio/")) // for example
    .map(r => {
      val bytes = r._1.getBinaryBytes
      val hash = new String(Hex.encodeHex(MessageDigest.getInstance("MD5").digest(bytes)))
      val encodedBytes = Base64.getEncoder.encodeToString(bytes)
      (r._1.getUrl, r._2, hash, encodedBytes)
    })
    .map(t => Row(t._1, t._2, t._3, t._4))

  val schema = new StructType()
    .add(StructField("url", StringType, true))
    .add(StructField("mime_type", StringType, true))
    .add(StructField("md5", StringType, true))
    .add(StructField("encodedBytes", StringType, true))

  val sparkSession = SparkSession.builder().getOrCreate()
  sparkSession.createDataFrame(records, schema)
}

I'm not sure about the memory implications of the r => (r, DetectMimeTypeTika(r.getContentString)) mapping.

@ruebot
Member

ruebot commented Aug 13, 2019

I agree with moving to Tika for all the MimeType detection for identification, but I also think we should tweak the DF columns to be mime_type_tika and mime_type_web_server since there might be a use case for comparing the two. Fun research study about how awful web servers are at identifying MimeTypes?
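The comparison use case above can be sketched without Spark at all. This is a minimal, hypothetical example (the URLs and types are made up, not from the real test data) showing how having both a mime_type_web_server and a mime_type_tika column makes mismatch counting trivial:

```scala
// Sketch only: hypothetical sample tuples standing in for real DataFrame rows.
// Each tuple is (url, mime_type_web_server, mime_type_tika).
object MimeMismatchSketch {
  val rows = Seq(
    ("http://example.com/a.mp3", "application/octet-stream", "audio/mpeg"),
    ("http://example.com/b.pdf", "application/pdf", "application/pdf"),
    ("http://example.com/c.gif", "text/plain", "image/gif")
  )

  // Count records where the server-reported type disagrees with Tika's detection.
  def mismatches: Int = rows.count { case (_, server, tika) => server != tika }

  def main(args: Array[String]): Unit =
    println(s"$mismatches of ${rows.length} records mismatched")
}
```

With two columns in place, the same count would be a one-line filter on the DataFrame.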

@ruebot
Member

ruebot commented Aug 13, 2019

Interesting, if I switch all the ExtractMediaDetails from r.getMimeType to DetectMimeTypeTika(r.getContentString), for the columns, the tests fail because the identification seems to get worse:

Results :

Tests in error: 
  Image DF extraction(io.archivesunleashed.ExtractImageDetailsTest): "[image/gif]" did not equal "[application/octet-stream]"
  Audio DF extraction(io.archivesunleashed.ExtractAudioDetailsTest): "a[udio/mpeg]" did not equal "a[pplication/octet-stream]"
  Video DF extraction(io.archivesunleashed.ExtractVideoDetailsTest): "[video/mp4]" did not equal "[application/octet-stream]"
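One plausible reading of these failures (an assumption on my part, not confirmed in the thread) is that passing getContentString hands Tika a lossy string round trip of the binary payload: magic bytes that aren't valid UTF-8 get replaced during decoding, so Tika can no longer see them and falls back to application/octet-stream. A minimal sketch of that byte corruption:

```scala
import java.nio.charset.StandardCharsets

// Sketch: why detecting a MIME type from a decoded String can degrade
// binary identification. An MP3 frame header starts with sync bytes
// 0xFF 0xFB, which are not valid UTF-8.
object ByteRoundTripSketch {
  val magic: Array[Byte] = Array(0xFF.toByte, 0xFB.toByte, 0x90.toByte, 0x00.toByte)

  // Decoding replaces each invalid sequence with U+FFFD, so re-encoding
  // cannot recover the original magic bytes.
  val roundTripped: Array[Byte] =
    new String(magic, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit =
    println(s"survived round trip: ${roundTripped.sameElements(magic)}")
}
```

Detecting from the raw bytes instead (getBinaryBytes) avoids the lossy decode entirely.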

ruebot added a commit that referenced this issue Aug 13, 2019
@ruebot
Member

ruebot commented Aug 13, 2019

@jrwiebe Is this what you're thinking? 54c1643

@jrwiebe
Contributor Author

jrwiebe commented Aug 13, 2019

extractPDFDetailsDF is what I was thinking, but I see that the audio and video methods don't use the same approach (i.e., the map(r => (r, DetectMimeTypeTika(r.getBinaryBytes)))). Is that intentional?

@ruebot
Member

ruebot commented Aug 14, 2019

Yeah intentional. I just picked one out of the three to implement before I got a 👍 or 👎 from you 😄

@jrwiebe
Contributor Author

jrwiebe commented Aug 14, 2019

👍

@jrwiebe
Contributor Author

jrwiebe commented Aug 14, 2019

I wish I knew more about how Spark runs this code. I wrote it this way to avoid calling Tika twice, but it's very possible the return value is cached and read from cache later on.
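For what it's worth, Spark transformations are lazy and an RDD's lineage is recomputed on every action unless it is explicitly persisted, so nothing is cached automatically. The same behaviour can be sketched Spark-free with a Scala view, using a stub detector (a stand-in I made up for DetectMimeTypeTika) that counts its own invocations:

```scala
// Spark-free sketch of the recomputation question: like an unpersisted RDD,
// a Scala view re-runs its map function every time it is traversed.
object LazyRecomputeSketch {
  var detectorCalls = 0

  // Stand-in for DetectMimeTypeTika: records how often it is invoked.
  def detect(s: String): String = { detectorCalls += 1; s"type-of-$s" }

  val records = Seq("a", "b", "c")
  val lazyMapped = records.view.map(r => (r, detect(r)))

  def main(args: Array[String]): Unit = {
    lazyMapped.foreach(_ => ()) // first traversal: detect runs 3 times
    lazyMapped.foreach(_ => ()) // second traversal: 3 more, nothing was cached
    println(s"detector invoked $detectorCalls times")
  }
}
```

By the same logic, mapping to a (record, mime_type) tuple once and reusing the tuple downstream does avoid a second Tika call within a single pass, but a .cache() or .persist() would be needed to avoid recomputation across multiple actions.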

@ruebot
Member

ruebot commented Aug 14, 2019

Cool, I'll update the code later on this morning or afternoon, and compare the output to the last job I ran on #341 testing.

@ruebot
Member

ruebot commented Aug 14, 2019

Same numbers. I'm going to do a time test now on HEAD on master, and on what I'll push up here in a second.

4809 audio files
644 PDF files
232 video files
5685 total

ruebot added a commit that referenced this issue Aug 14, 2019
- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342
ruebot added a commit that referenced this issue Aug 14, 2019
- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342
ianmilligan1 pushed a commit that referenced this issue Aug 14, 2019
…344)

- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342