Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342

Closed
jrwiebe opened this issue Aug 13, 2019 · 9 comments · Fixed by #344
Comments

@jrwiebe
Contributor

jrwiebe commented Aug 13, 2019

I wasn't sure whether to add this as a comment in #306 (or one of the other extraction method threads), or to make it an issue.

I observed that while we're using DetectMimeTypeTika in methods like extractPDFDetailsDF, extractVideoDetailsDF, etc., we're using ArchiveRecord.getMimeType for reporting the MIME type. I think Tika's MIME type will tend to be more accurate than what is set by the web server. Do we want to go for accurate meta-description of the contents of our archive records, or do we want fidelity to what was collected from the web?

If we wanted to use what we're getting from Tika, a method like extractAudioDetailsDF() could look something like this:

def extractAudioDetailsDF(rdd: RDD[ArchiveRecord]): DataFrame = {
  val records = rdd
    .map(r =>
      (r, DetectMimeTypeTika(r.getContentString)) // <------- (record, mime_type) tuple
    )
    .filter(r => r._2.startsWith("audio/")) // for example
    .map(r => {
      val bytes = r._1.getBinaryBytes
      val hash = new String(Hex.encodeHex(MessageDigest.getInstance("MD5").digest(bytes)))
      val encodedBytes = Base64.getEncoder.encodeToString(bytes)
      (r._1.getUrl, r._2, hash, encodedBytes)
    })
    .map(t => Row(t._1, t._2, t._3, t._4))

  val schema = new StructType()
    .add(StructField("url", StringType, true))
    .add(StructField("mime_type", StringType, true))
    .add(StructField("md5", StringType, true))
    .add(StructField("encodedBytes", StringType, true))

  val sparkSession = SparkSession.builder().getOrCreate()
  sparkSession.createDataFrame(records, schema)
}

I'm not sure about the memory implications of the r => (r, DetectMimeTypeTika(r.getContentString)) mapping.

@ruebot
Member

ruebot commented Aug 13, 2019

I agree with moving to Tika for all the MimeType detection for identification, but I also think we should tweak the DF columns to be mime_type_tika and mime_type_web_server since there might be a use case for comparing the two. Fun research study about how awful web servers are at identifying MimeTypes?
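The comparison use case above can be sketched without Spark at all. This is a minimal, hypothetical example (the URLs and types are made up, not from the real test data) showing how having both a mime_type_web_server and a mime_type_tika column makes mismatch counting trivial:

```scala
// Sketch only: hypothetical sample tuples standing in for real DataFrame rows.
// Each tuple is (url, mime_type_web_server, mime_type_tika).
object MimeMismatchSketch {
  val rows = Seq(
    ("http://example.com/a.mp3", "application/octet-stream", "audio/mpeg"),
    ("http://example.com/b.pdf", "application/pdf", "application/pdf"),
    ("http://example.com/c.gif", "text/plain", "image/gif")
  )

  // Count records where the server-reported type disagrees with Tika's detection.
  def mismatches: Int = rows.count { case (_, server, tika) => server != tika }

  def main(args: Array[String]): Unit =
    println(s"$mismatches of ${rows.length} records mismatched")
}
```

With two columns in place, the same count would be a one-line filter on the DataFrame.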

@ruebot
Member

ruebot commented Aug 13, 2019

Interesting, if I switch all the ExtractMediaDetails from r.getMimeType to DetectMimeTypeTika(r.getContentString), for the columns, the tests fail because the identification seems to get worse:

Results :

Tests in error: 
  Image DF extraction(io.archivesunleashed.ExtractImageDetailsTest): "[image/gif]" did not equal "[application/octet-stream]"
  Audio DF extraction(io.archivesunleashed.ExtractAudioDetailsTest): "a[udio/mpeg]" did not equal "a[pplication/octet-stream]"
  Video DF extraction(io.archivesunleashed.ExtractVideoDetailsTest): "[video/mp4]" did not equal "[application/octet-stream]"
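One plausible reading of these failures (an assumption on my part, not confirmed in the thread) is that passing getContentString hands Tika a lossy string round trip of the binary payload: magic bytes that aren't valid UTF-8 get replaced during decoding, so Tika can no longer see them and falls back to application/octet-stream. A minimal sketch of that byte corruption:

```scala
import java.nio.charset.StandardCharsets

// Sketch: why detecting a MIME type from a decoded String can degrade
// binary identification. An MP3 frame header starts with sync bytes
// 0xFF 0xFB, which are not valid UTF-8.
object ByteRoundTripSketch {
  val magic: Array[Byte] = Array(0xFF.toByte, 0xFB.toByte, 0x90.toByte, 0x00.toByte)

  // Decoding replaces each invalid sequence with U+FFFD, so re-encoding
  // cannot recover the original magic bytes.
  val roundTripped: Array[Byte] =
    new String(magic, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit =
    println(s"survived round trip: ${roundTripped.sameElements(magic)}")
}
```

Detecting from the raw bytes instead (getBinaryBytes) avoids the lossy decode entirely.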

ruebot added a commit that referenced this issue Aug 13, 2019
@ruebot
Member

ruebot commented Aug 13, 2019

@jrwiebe Is this what you're thinking? 54c1643

@jrwiebe
Contributor Author

jrwiebe commented Aug 13, 2019

extractPDFDetailsDF is what I was thinking, but I see that the audio and video methods don't use the same approach (i.e., the map(r => (r, DetectMimeTypeTika(r.getBinaryBytes)))). Is that intentional?

@ruebot
Member

ruebot commented Aug 14, 2019

Yeah intentional. I just picked one out of the three to implement before I got a 👍 or 👎 from you 😄

@jrwiebe
Contributor Author

jrwiebe commented Aug 14, 2019

👍

@jrwiebe
Contributor Author

jrwiebe commented Aug 14, 2019

I wish I knew more about how Spark runs this code. I wrote it this way to avoid calling Tika twice, but it's very possible the return value is cached and read from cache later on.
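For what it's worth, Spark transformations are lazy and an RDD's lineage is recomputed on every action unless it is explicitly persisted, so nothing is cached automatically. The same behaviour can be sketched Spark-free with a Scala view, using a stub detector (a stand-in I made up for DetectMimeTypeTika) that counts its own invocations:

```scala
// Spark-free sketch of the recomputation question: like an unpersisted RDD,
// a Scala view re-runs its map function every time it is traversed.
object LazyRecomputeSketch {
  var detectorCalls = 0

  // Stand-in for DetectMimeTypeTika: records how often it is invoked.
  def detect(s: String): String = { detectorCalls += 1; s"type-of-$s" }

  val records = Seq("a", "b", "c")
  val lazyMapped = records.view.map(r => (r, detect(r)))

  def main(args: Array[String]): Unit = {
    lazyMapped.foreach(_ => ()) // first traversal: detect runs 3 times
    lazyMapped.foreach(_ => ()) // second traversal: 3 more, nothing was cached
    println(s"detector invoked $detectorCalls times")
  }
}
```

By the same logic, mapping to a (record, mime_type) tuple once and reusing the tuple downstream does avoid a second Tika call within a single pass, but a .cache() or .persist() would be needed to avoid recomputation across multiple actions.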

@ruebot
Member

ruebot commented Aug 14, 2019

Cool, I'll update the code later on this morning or afternoon, and compare the output to the last job I ran on #341 testing.

@ruebot
Member

ruebot commented Aug 14, 2019

Same numbers. I'm going to do a time test now on HEAD on master, and on what I'll push up here in a second.

4809 audio files
644 PDF files
232 video files
5685 total

ruebot added a commit that referenced this issue Aug 14, 2019
- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342
ruebot added a commit that referenced this issue Aug 14, 2019
- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342
ianmilligan1 pushed a commit that referenced this issue Aug 14, 2019
…344)

- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342