Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance #26

Closed
sirupsen opened this issue Feb 18, 2020 · 7 comments · Fixed by #29
Closed

Performance #26

sirupsen opened this issue Feb 18, 2020 · 7 comments · Fixed by #29

Comments

@sirupsen
Copy link
Contributor

This is not overly a nuisance for me because I run the script async, but it does make it a little less pleasant to use. This 41 page PDF takes over one minute!

dotfiles $ time p2r https://www.nber.org/papers/w26752.pdf --filename test

real    1m4.323s
user    0m47.352s
sys     0m3.790s
@GjjvdBurg
Copy link
Owner

Hi @sirupsen, thanks for the bug reports! Yes, performance is a bit of an issue. The majority of the time is spent on cropping the pages of the pdf to remove excess white space. After some profiling it seems that the expensive part of that operation is creating an image of a pdf page using the pdfplumber package and I don't see a way to avoid that (the reason this is done in Python at all is that the bounding box returned by pdfcrop isn't very accurate (4cb3af8)).

I did just push a commit (a2e833a) that speeds up using the --center flag, but you don't use that in your example.

Alternatively, I could add a --no-crop flag, which would speed up the process but lead to less nice results. Would that help?

@sirupsen
Copy link
Contributor Author

Might it be possible to remove whitespace without converting to an image first? 👂

@GjjvdBurg
Copy link
Owner

I'm not sure, it might be necessary to render the pdf to figure out what the page looks like. That said, there is a pdfparser package that links directly to libpoppler, which seems a lot faster but is not a drop-in replacement for the current method with pdfplumber. I'll look into this a bit more and see what I can do.

@GjjvdBurg
Copy link
Owner

@sirupsen Thanks again for reporting this issue! I've just pushed some changes that give about an 8x speedup. I'll prepare a new release of the package soon.

@sirupsen
Copy link
Contributor Author

Excellent!!! I’ll upgrade as soon as it’s released.

I do think a no crop flag would be useful. Some may prefer to have tools (change pen, highlight, etc) visible at all times — or perhaps the crop could leave just enough space to always have the tools available on the left side?

@GjjvdBurg
Copy link
Owner

I do think a no crop flag would be useful. Some may prefer to have tools (change pen, highlight, etc) visible at all times — or perhaps the crop could leave just enough space to always have the tools available on the left side?

Try version 0.5.4, both a --no-crop and a --right option are now available! :)

@sirupsen
Copy link
Contributor Author

😍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants