Optimize XLSX subtable detection memory usage #4357
Draft
CyMule wants to merge 4 commits into
Summary
Optimizes XLSX subtable detection by replacing dense worksheet-sized NetworkX graph construction with a sparse traversal over populated cells.
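The sparse traversal described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the function name `sparse_components` and the `(row, column)` cell representation are assumptions made for the example. It groups populated cells into 4-connected components with a breadth-first search, so memory scales with the number of populated cells rather than with the worksheet's full row-by-column extent.

```python
from collections import deque
from typing import Iterable

# Hypothetical representation: a populated cell is a (row, column) pair.
Cell = tuple[int, int]


def sparse_components(populated: Iterable[Cell]) -> list[list[Cell]]:
    """Group populated cells into 4-connected components (illustrative only)."""
    unvisited = set(populated)
    components: list[list[Cell]] = []
    while unvisited:
        # Start a new component from any cell that has not been visited yet.
        start = unvisited.pop()
        queue = deque([start])
        component = [start]
        while queue:
            row, col = queue.popleft()
            # Only the up, down, left, and right neighbors count as adjacent.
            for neighbor in (
                (row - 1, col),
                (row + 1, col),
                (row, col - 1),
                (row, col + 1),
            ):
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    queue.append(neighbor)
                    component.append(neighbor)
        components.append(sorted(component))
    # Sort components so the result is deterministic.
    return sorted(components)


if __name__ == "__main__":
    cells = {(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)}
    for block in sparse_components(cells):
        print(block)
```

Because empty cells are never stored or visited, a worksheet with a few values scattered across a very wide range costs about the same as a small one with the same number of populated cells.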
The public partition_xlsx() behavior is intended to remain unchanged.
Also removes networkx from the xlsx runtime extra, since it is now only needed by the local benchmark's dense reference implementation.
Benchmark
Fresh subprocess benchmark on generated .xlsx files:
dense_table
sparse_wide_edges
separated_blocks
Testing
uv run --extra xlsx --group test pytest -q test_unstructured/partition/test_xlsx.py
uv run ruff check unstructured/partition/xlsx.py test_unstructured/partition/test_xlsx.py scripts/performance/benchmark_xlsx_connected_components.py
uv run --extra xlsx --with networkx python scripts/performance/benchmark_xlsx_connected_components.py --repeat 3
Summary by cubic
Optimized XLSX subtable detection by replacing a dense worksheet-sized graph with a sparse traversal of populated cells, reducing peak memory and speeding up processing. Public partition_xlsx() output is unchanged.
Refactors
networkx grid with a sparse 4-neighbor traversal over non-empty cells; preserved row-overlap merge.
Dependencies
networkx from the xlsx extra; it’s now only used by the local benchmark; updated uv.lock.
0.22.32 and added a CHANGELOG entry.
Written for commit 40b5173. Summary will update on new commits.