Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
Comments
I agree with moving to Tika for all the MimeType detection for identification, but I also think we should tweak the DF columns to be |
Interesting, if I switch all the ExtractMediaDetails from
Yeah intentional. I just picked one out of the three to implement before I got a 👍 or 👎 from you 😄
👍
I wish I knew more about how Spark runs this code. I wrote it this way to avoid calling Tika twice, but it's very possible the return value is cached and read from cache later on.
Cool, I'll update the code later on this morning or afternoon, and compare the output to the last job I ran on #341 testing.
Same numbers. I'm going to do a time test now on HEAD on master, and on what I'll push up here in a second. 4809 audio files |
- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342
I wasn't sure whether to add this as a comment in #306 (or one of the other extraction method threads), or to make it an issue.
I observed that while we're using `DetectMimeTypeTika` in methods like `extractPDFDetailsDF`, `extractVideoDetailsDF`, etc., we're using `ArchiveRecord.getMimeType` for reporting the MIME type. I think Tika's MIME type will tend to be more accurate than what is set by the web server. Do we want to go for accurate meta-description of the contents of our archive records, or do we want fidelity to what was collected from the web?

If we wanted to use what we're getting from Tika, a method like `extractAudioDetailsDF()` could look something like this:

I'm not sure about the memory implications of the `r => (r, DetectMimeTypeTika(r.getContentString))` mapping.
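
For anyone following along, here is a rough sketch of the tuple-map idea on plain Scala collections rather than Spark. Everything in it is a stand-in: `Record`, `detectMimeType`, and `extractAudioDetails` are hypothetical names for illustration, not the toolkit's `ArchiveRecord`, `DetectMimeTypeTika`, or `extractAudioDetailsDF()`, and the toy detector only looks at leading bytes. It pairs each record with its detected type once, filters on that, and reports both the web server's type and the detected one.

```scala
// Hypothetical stand-in for an archive record; NOT the toolkit's ArchiveRecord.
final case class Record(url: String, webServerMimeType: String, content: String)

object MimeSketch {
  // Counts detector invocations so the "call it once per record" claim can be checked.
  var detectorCalls = 0

  // Toy detector standing in for a Tika-based call; it only inspects leading bytes.
  def detectMimeType(content: String): String = {
    detectorCalls += 1
    if (content.startsWith("ID3")) "audio/mpeg"
    else if (content.startsWith("%PDF")) "application/pdf"
    else "application/octet-stream"
  }

  // Pair each record with its detected type, filter on the detected type,
  // then emit (url, web server MIME type, detected MIME type).
  def extractAudioDetails(records: Seq[Record]): Seq[(String, String, String)] =
    records
      .map(r => (r, detectMimeType(r.content)))
      .filter { case (_, detected) => detected.startsWith("audio/") }
      .map { case (r, detected) => (r.url, r.webServerMimeType, detected) }
}
```

On an in-memory `Seq` the detector runs exactly once per record, because the detected value travels along in the tuple. Whether Spark recomputes the mapping when the lineage is evaluated more than once is a separate question that this sketch does not answer.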