Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ghostscript backup when validating a PDF #16

Closed
wants to merge 2 commits into from
Closed

Add ghostscript backup when validating a PDF #16

wants to merge 2 commits into from

Conversation

alexcu
Copy link
Contributor

@alexcu alexcu commented Apr 30, 2016

Problem

Whenever PDFs are validated using pdftk (under OS X El Capitan) there seems to be some stalling issues where pdftk fails (refer to this SO post).

E.g., under OS X 10.11 try:

$ pdftk some_pdf.pdf  output /dev/null dont_ask

It stalls for every El Capitan install using pdftk 2.02. Instead, the following SO post shows how we can do this speedily using ghostscript:

$ gs -o /dev/null -sDEVICE=nullpage -r36x36 -dNOPAUSE -q some_pdf.pdf

TL;DR

This PR adds ghostscript validation when pdftk fails. As an alternative to pdftk for validation, we could use ghostscript instead for development purposes, but depends on what you think RE this @macite...

@macite
Copy link
Member

macite commented Apr 30, 2016

What about switching to gs and removing use of pdftk. It isn't used for anything else now.

Also need to wrap all system calls in timeouts so that if there is a bug in the other software and it doesn't return in a short time we keep working. Need to check if the terminator code kills gs or if you also need to use the terminate script.

@alexcu
Copy link
Contributor Author

alexcu commented Apr 30, 2016

@macite Thought about removing pdftk from the validation altogether but thought I'd check with you first.

We still use pdtk to aggregate PDFs together. I can try and swapping this with ghostscript also.

I could add a system call helper that will wrap all system calls with the terminate script? What do you think?

@macite
Copy link
Member

macite commented Apr 30, 2016

Not any more. Task and portfolio pdfs are created using Latex. The old code is still there, but needs to be removed.

@alexcu
Copy link
Contributor Author

alexcu commented Apr 30, 2016

Ah right. Well all updated there... not sure how much is actually used now after the LaTeX updates but we can always remove it.

@macite
Copy link
Member

macite commented May 4, 2016

Just did a quick time check and gs is about 3x slower than PDFtk at this task 😞

We should think about this further... we also had issues with gs not terminating with some PDFs in the compression. This also wasn't killed by terminator, so we added the timeout script. So if we do use this then need to include the timeout in all locations.

Also thinking about including: https://github.com/yob/pdf-reader/tree/master/lib/pdf/reader

We could test that as an alternative to system calls for checking pdf integrity. We also need this to check page count etc.

@alexcu
Copy link
Contributor Author

alexcu commented May 4, 2016

I added an extension to the try_within extension that adds the timeout script to all executions 😊

I think there are some other options that we can add that reduces the time. Quality reduction etc. I'll look into it as soon as I can.

I'll look into PDF-reader too 😊

On 4 May 2016, at 7:13 PM, Andrew Cain notifications@github.com wrote:

Just did a quick time check and gs is about 3x slower than PDFtk at this task 😞

We should think about this further... we also had issues with gs not terminating with some PDFs in the compression. This also wasn't killed by terminator, so we added the timeout script. So if we do use this then need to include the timeout in all locations.

Also thinking about including: https://github.com/yob/pdf-reader/tree/master/lib/pdf/reader

We could test that as an alternative to system calls for checking pdf integrity. We also need this to check page count etc.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub

@macite
Copy link
Member

macite commented May 4, 2016

system_try_within looks good 😄 didn't notice it before.

I tested using unix time command.

@alexcu
Copy link
Contributor Author

alexcu commented May 4, 2016

Thanks I'll try with time also and see how they compare.

Do you remember what the benchmark was for pdftk?

On 4 May 2016, at 7:39 PM, Andrew Cain notifications@github.com wrote:

system_try_within looks good 😄 didn't notice it before.

I tested using unix time command.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub

@macite
Copy link
Member

macite commented May 4, 2016

It was about 100ms-150ms... gs was 300-400ms but that depends on the machine...

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

@macite Looking into to resolving this PR. Noticed that FileHelper.aggregate is still being used to aggregate portfolio files, but you said that it's no longer used? (see app/models/project.rb:817 and :850). Can you confirm if we're still using PDFtk to aggregate? I thought we'd switched to LaTeX like you'd said... but reading the code I'm a little confused now...

I'm going to replace the gs commands with pdf-reader — looks better. And I'll make sure it fits within the right benchmark.

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

@macite Sorry! Ignore the above... realised I hadn't been up-to-date with develop.

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

Getting some issues with thread_safe dependencies once installing the pdf-gem. Updating gems seems to work. Seeing that https://github.com/doubtfire-lms/doubtfire-api/tree/fix/update-gems may resolve the issue, I will continue investigating this once that branch is merged.

@macite
Copy link
Member

macite commented Jul 5, 2016

The latest changes do this using a simple file read and scan for %%END in the last 1024 characters. See http://stackoverflow.com/questions/28156467/fastest-way-to-check-that-a-pdf-is-corrupted-or-just-missing-eof-in-ruby

Unless we need the PDF reader for other purposes, I think the file scanning option should be sufficient.

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

@macite Yeah I bumped into this after updating... cool I agree. If that suffices, then we don't need it :)

@macite
Copy link
Member

macite commented Jul 5, 2016

@alexcu Ok, lets close this pull request.

@macite macite closed this Jul 5, 2016
@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

@macite I think something went wrong with GitHub that isn't showing other the additions made in this PR that haven't been merged in (e.g., removing PDFtk from the documentation). See develop...final-year-project:enhance/validate-pdfs-using-ghostscript. I think these should be merged in still? Shall I create another PR with only these changes/removal of references to PDFtk in documentation etc.?

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

We killed Ghostscript, right? I'll remove it from the documentation if so...

@macite
Copy link
Member

macite commented Jul 5, 2016

The others look like changes I made and are in develop already.

Ghostscript is used to compress the PDFs... or did we remove that as well?

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

Ah my bad... yes Ghostscript is still used to compress PDFs. Thought we had switched to something else.

Not quite... the documentation is still out of date on develop (still makes reference to PDFtk not Ghostscript). develop...final-year-project:enhance/validate-pdfs-using-ghostscript shows that the branch is 6 commits ahead. I remember adding the system_try_within which helped clean up a bit of the processing, which is also missing from develop.

Let me double check. I'll make sure is_valid_pdf? works using the new method you wrote.

@macite
Copy link
Member

macite commented Jul 5, 2016

Ah yes, I think the valid_pdf? is only in new branch with updated ruby version and gems.

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

Ahh... makes sense. Was trying to find it in develop but could not find it anywhere! I'll copy just that portion of the code to this branch as it is a related change, and then create a new PR with these related changes. I'll double check to make sure that nothing has changed—if anything it's just a QUALITY for some minor refactoring to help clean up the file_helper a bit, as well as the updated README 😄

@alexcu
Copy link
Contributor Author

alexcu commented Jul 5, 2016

@macite Created #21 to resolve the discussion around this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants