perf: optimize figure parser #7392

Open: wants to merge 3 commits into base: main

liuzhenghua
Contributor

@liuzhenghua liuzhenghua commented Apr 28, 2025

What problem does this PR solve?

When parsing documents containing images, the current code uses a single-threaded approach to call the VL model, resulting in extremely slow parsing speed (e.g., parsing a Word document with dozens of images takes over 20 minutes).

Calling the VL model from multiple threads instead brings the parsing time down to an acceptable level.
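A minimal sketch of the idea, not the code in this PR: it assumes each image can be described independently, and it uses a hypothetical `fake_vl_describe` function in place of the real VL model call. `describe_figures` and `max_workers` are illustrative names as well.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def describe_figures(
    images: Sequence[bytes],
    describe: Callable[[bytes], str],
    max_workers: int = 10,
) -> List[str]:
    """Call `describe` on every image concurrently, preserving input order."""
    if not images:
        return []
    # Never start more threads than there are images to process.
    workers = min(max_workers, len(images))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so each description
        # still lines up with its figure.
        return list(pool.map(describe, images))


def fake_vl_describe(image: bytes) -> str:
    # Hypothetical stand-in for a network call to a vision-language model.
    return f"figure with {len(image)} bytes"


descriptions = describe_figures([b"a", b"bb", b"ccc"], fake_vl_describe)
```

Threads suit this case because the work is dominated by waiting on the model's response rather than by local computation. The worker count should stay within whatever rate limit the model endpoint enforces.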

Type of change

  • Performance Improvement

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 28, 2025
@yingfeng yingfeng added the ci Continue Integration label Apr 28, 2025
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Apr 28, 2025
@KevinHuSh KevinHuSh requested a review from asiroliu April 29, 2025 02:28
@asiroliu
Contributor

@liuzhenghua
Could you provide a test Word document?

@liuzhenghua
Contributor Author

@liuzhenghua Could you provide a test Word document?

@asiroliu It’s a simple Word document serving as an operation manual, containing around 90 screenshots. Sorry, I can’t provide the file as it contains sensitive company data.

@asiroliu
Contributor

asiroliu commented Apr 30, 2025

@liuzhenghua
Thank you for your response. Based on your suggestion, I can use the following search terms on Google to find a suitable document:

filetype:doc OR filetype:docx "User Manual" OR "Instruction Manual" OR "Operation Guide"

@liuzhenghua
Contributor Author

@liuzhenghua Thank you for your response. Based on your suggestion, I can use the following search terms on Google to find a suitable document:

filetype:doc OR filetype:docx "User Manual" OR "Instruction Manual" OR "Operation Guide"

@asiroliu Sorry for my previous response — the document I mentioned belongs to the company and can't be shared. You'll need to create a Microsoft Word document yourself, including some text and around 90+ images.

@asiroliu
Contributor

@liuzhenghua
I've compared the multi-image document parsing performance between the nightly build and your latest commit. There doesn't appear to be any noticeable efficiency improvement.

  • Test Word document: https://irtfweb.ifa.hawaii.edu/~tcs3/oldstuff/osprey/userman.doc

  • nightly (78b00d61fd59, 2025_04_29): 207 secs

  • your latest commit (3281d47): 215 secs

@liuzhenghua
Contributor Author

  • https://irtfweb.ifa.hawaii.edu/~tcs3/oldstuff/osprey/userman.doc

@asiroliu
The test document you used didn't contain any images. The one I tested had around 90+ images. Before the optimization, it took 20 minutes to parse the images and another 20 minutes to upload them to MinIO. After the changes, both steps now only take 2 minutes each.

@liuzhenghua
Contributor Author

@asiroliu My local version is 0.17.2. I debugged the step that follows the log message "Visual model detected. Attempting to enhance figure extraction" and found that it processes the 90 images in the document by calling the VL model one by one in a single queue, which leads to a long processing time.

You observed a similar processing time in your test, but that might be due to one or more of the following reasons:

  • Your document contains very few images.
  • You haven't configured a VL model.
  • Version 0.18.0 doesn't have this issue.
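To check whether a given setup is affected, a rough timing comparison like the one below can help. It is only a sketch under stated assumptions: `slow_describe` simulates the VL call with `time.sleep`, and `LATENCY_S` and `NUM_IMAGES` are made-up values scaled down from the 90-image case, so the numbers it prints say nothing about real model latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_S = 0.02   # assumed per-call latency; a real VL call is far slower
NUM_IMAGES = 20    # scaled down from the ~90 images in the report


def slow_describe(index: int) -> str:
    # Hypothetical stand-in for one blocking VL model request.
    time.sleep(LATENCY_S)
    return f"description {index}"


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# One call after another, as in the single-queue behaviour described above.
sequential, seq_secs = timed(
    lambda: [slow_describe(i) for i in range(NUM_IMAGES)]
)

# The same calls spread over a small thread pool.
with ThreadPoolExecutor(max_workers=10) as pool:
    concurrent, conc_secs = timed(
        lambda: list(pool.map(slow_describe, range(NUM_IMAGES)))
    )

print(f"sequential: {seq_secs:.2f}s, threaded: {conc_secs:.2f}s")
```

If the document has few images or no VL model is configured, the sequential path does almost no waiting, and the two timings end up close to each other.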

@asiroliu
Contributor

Got it, I'll verify this later per your suggestions.

3 participants