Optimize XLSX subtable detection memory usage#4357

Draft
CyMule wants to merge 4 commits into main from xlsx-sparse-connected-components
@CyMule CyMule commented Jun 1, 2026

Contributor

Summary

Optimizes XLSX subtable detection by replacing dense worksheet-sized NetworkX graph construction with a sparse traversal over populated cells.

The public partition_xlsx() behavior is intended to remain unchanged:

  • populated cells are grouped by 4-neighbor connectivity
  • existing row-overlap merge behavior is preserved
  • emitted element text and table HTML metadata are covered by generated XLSX tests
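
For reviewers who want a concrete picture of the approach, here is a minimal sketch of sparse 4-neighbor grouping followed by a row-overlap merge. It is not the code in this PR: the names `sparse_components` and `merge_row_overlaps` are hypothetical, and the merge rule (join components whose row ranges intersect) is an assumption about what "row-overlap merge" means here.

```python
from collections import deque
from typing import Iterable

Cell = tuple[int, int]  # (row, col) of a populated cell


def sparse_components(populated: Iterable[Cell]) -> list[list[Cell]]:
    """Group populated cells by 4-neighbor connectivity (hypothetical helper).

    Only populated cells are stored, so memory scales with the number of
    non-empty cells rather than with the worksheet's bounding box.
    """
    remaining = set(populated)
    components: list[list[Cell]] = []
    while remaining:
        start = remaining.pop()
        queue = deque([start])
        component = [start]
        while queue:
            row, col = queue.popleft()
            # Up, down, left, right only; diagonals do not connect cells.
            for neighbor in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.append(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    # Order components by their top-left cell for deterministic output.
    return sorted(components, key=lambda cells: cells[0])


def merge_row_overlaps(components: list[list[Cell]]) -> list[list[Cell]]:
    """Merge components whose row ranges intersect (assumed merge rule)."""
    ordered = sorted(components, key=lambda cells: min(r for r, _ in cells))
    merged: list[list[Cell]] = []
    for cells in ordered:
        top = min(r for r, _ in cells)
        if merged and top <= max(r for r, _ in merged[-1]):
            merged[-1] = sorted(merged[-1] + cells)
        else:
            merged.append(list(cells))
    return merged
```

In this sketch a breadth-first search over a set of coordinates stands in for the graph library, so no node or edge objects are created for empty cells.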

Also removes networkx from the xlsx runtime extra, since it is now only needed by the local benchmark's dense reference implementation.

Benchmark

Fresh subprocess benchmark on generated .xlsx files:

| Case | Components | Sparse Time | Dense Time | Sparse Peak Delta | Dense Peak Delta |
| --- | --- | --- | --- | --- | --- |
| dense_table | 1 / 1 | 0.0103s | 0.0339s | 5.8 MB | 19.8 MB |
| sparse_wide_edges | 26 / 26 | 0.0039s | 6.5139s | 2.9 MB | 1578.1 MB |
| separated_blocks | 23 / 23 | 0.0032s | 0.7332s | 0.5 MB | 208.8 MB |
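
As a rough illustration of how a time and peak-memory pair can be collected for one call, the sketch below uses `time.perf_counter` and `tracemalloc` from the standard library. It is an assumption-laden stand-in, not the benchmark script in this PR: the helper name `measure` is hypothetical, it runs in-process rather than in a fresh subprocess, and `tracemalloc` counts only Python-level allocations, so its numbers are not directly comparable to the "Peak Delta" columns above.

```python
import time
import tracemalloc
from typing import Any, Callable


def measure(fn: Callable[..., Any], *args: Any) -> tuple[Any, float, int]:
    """Run fn(*args) once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = fn(*args)
        elapsed = time.perf_counter() - started
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak
```

Running each case in its own process, as the benchmark here does, keeps one case's allocations from inflating the next case's peak.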

Testing

  • uv run --extra xlsx --group test pytest -q test_unstructured/partition/test_xlsx.py
  • uv run ruff check unstructured/partition/xlsx.py test_unstructured/partition/test_xlsx.py scripts/performance/benchmark_xlsx_connected_components.py
  • uv run --extra xlsx --with networkx python scripts/performance/benchmark_xlsx_connected_components.py --repeat 3

Summary by cubic

Optimized XLSX subtable detection by replacing a dense worksheet-sized graph with a sparse traversal of populated cells, reducing peak memory and speeding up processing. Public partition_xlsx() output is unchanged.

  • Refactors

    • Replaced dense networkx grid with a sparse 4-neighbor traversal over non-empty cells; preserved row-overlap merge.
    • Added generated-XLSX tests to verify element text, HTML metadata, and page info.
    • Added a local benchmark utility.
  • Dependencies

    • Removed networkx from the xlsx extra, since only the local benchmark still uses it, and updated uv.lock.
    • Bumped version to 0.22.32 and added a CHANGELOG entry.

Written for commit 40b5173. Summary will update on new commits.
