alto xml parsing #262

rlskoeser · 2025-10-20T14:06:32Z

Associated Issue(s): #251 #252 #253 #254 #255

Changes in this PR

alto xmlobject classes for document, block, line
integration with alto xmlobject and alto input class
validation / warning / error reporting for non-xml, empty xml, invalid, etc
use natsort to sort files in natural order
add timing logs to report duration of processing (towards #255, "Test how long it takes to process a zipfile of ALTO XML for a full volume")
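For readers unfamiliar with ALTO, here is a minimal sketch of the document / block / line structure the classes above model. It is not the PR's code: the PR uses xmlobject classes (neuxml), while this sketch swaps in the standard library's ElementTree. The sample XML and the function names `line_text` and `document_text` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO sample (illustrative only, not from the test fixture).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine ID="l1"><String CONTENT="Erste"/><SP/><String CONTENT="Zeile"/></TextLine>
      <TextLine ID="l2"><String CONTENT="Zweite"/><SP/><String CONTENT="Zeile"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def line_text(line):
    """Join the CONTENT of each String child of a TextLine with spaces."""
    # "{*}" matches the tag in any namespace (Python 3.8+), so the ALTO
    # schema version does not matter here.
    return " ".join(s.get("CONTENT", "") for s in line.findall("{*}String"))


def document_text(xml_string):
    """Return the page text: one output line per TextLine, block by block."""
    root = ET.fromstring(xml_string)
    lines = []
    for block in root.findall(".//{*}TextBlock"):
        for line in block.findall("{*}TextLine"):
            lines.append(line_text(line))
    return "\n".join(lines)


print(document_text(SAMPLE_ALTO))
```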

Reviewer Checklist @tanhaow

Review the xmlobject code to make sure it makes sense to you
Test manually with the alto zip file to confirm you see plain text output
Review changes and adjustments to validation to confirm I haven't missed anything
Test manually with alto zip file and confirm that:
- xml file name appears in sentence output rather than zip file
- xml files are sorted naturally / logically
- logging includes timing for processing the zipfile
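One way the timing log in the last checklist item could be produced is sketched below. This is an assumption about the approach, not the PR's implementation; the helper name `log_duration` and the message format are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def log_duration(label, log=logger):
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log.info("%s in %.2f seconds", label, elapsed)


# Usage: wrap whatever processes the zipfile.
with log_duration("Processed zipfile"):
    sum(range(1000))  # stand-in for the real work
```

Using `time.perf_counter()` rather than `time.time()` avoids surprises from system clock adjustments during a long run.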

1. Perform a single zip pass. 2. Turn each parsed ALTO document into newline-joined text via _yield_text_for_document, so cached chunks hold real page content. 3. Add unit tests.

codecov · 2025-10-20T17:41:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.52%. Comparing base (3d2e79e) to head (b1f1f82).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #262      +/-   ##
===========================================
+ Coverage    99.29%   99.52%   +0.22%     
===========================================
  Files           26       26              
  Lines         1135     1260     +125     
  Branches        37       38       +1     
===========================================
+ Hits          1127     1254     +127     
+ Misses           4        2       -2     
  Partials         4        4

tanhaow · 2025-10-20T21:46:02Z

The following changes have been added to the PR to address more issues in milestone 0.3:

Introduced _chunk_cache and updated get_text() to yield from the cached chunks, so validation and parsing can happen in a single pass through the ALTO zip file. (add validation method to check ALTO XML zipfile #253 )
Enhanced zip file validation: add validation method to check ALTO XML zipfile #253
- skip-but-log: non‑XML, non‑ALTO, or malformed files
- warning: an ALTO page has no text content
Sort TextBlock and TextLine by horizontal position.
Extended alto_sample.zip fixture with empty_page.xml and unsorted_page.xml to test the above features.
Add a unit test to confirm that sentence index is sequential across all files within the zipfile #254
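The skip-but-log and warning rules listed above could look roughly like the following. This is a sketch under assumptions, using ElementTree from the standard library rather than the project's xmlobject classes; the function name `check_alto_member` and the status strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_alto_member(name, data):
    """Classify one zip member as ("ok" | "skip" | "warn", message)."""
    if not name.lower().endswith(".xml"):
        return "skip", f"{name}: not an XML file"
    try:
        root = ET.fromstring(data)
    except ET.ParseError as err:
        # Covers both empty and malformed XML.
        return "skip", f"{name}: malformed XML ({err})"
    # The root tag looks like "{namespace}alto"; compare the local name only.
    if root.tag.rsplit("}", 1)[-1] != "alto":
        return "skip", f"{name}: not an ALTO document"
    has_text = any(
        s.get("CONTENT", "").strip() for s in root.findall(".//{*}String")
    )
    if not has_text:
        return "warn", f"{name}: ALTO page has no text content"
    return "ok", ""
```

A caller would log the message at the appropriate level for "skip" and "warn" and carry on with the next file.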

rlskoeser · 2025-10-21T14:41:02Z

@tanhaow the full volume is ~ 800 pages, I don't think we should be caching content - I think we should do everything in one pass and validate as we go. I know when we did the whiteboarding, we imagined the validation as a pre-step, but now that we have the code nearly working, I don't think that's needed.
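To make the one-pass idea concrete: a generator can validate each file as it is read and yield its text immediately, so nothing is held in memory for the whole volume. The sketch below is an illustration under assumptions, not the project's code; `iter_page_text` is a hypothetical name and ElementTree stands in for the xmlobject classes.

```python
import io
import logging
import xml.etree.ElementTree as ET
import zipfile

logger = logging.getLogger(__name__)


def iter_page_text(zip_source):
    """Yield (filename, text) per ALTO file, validating as we go (no cache)."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                logger.warning("Skipping non-XML file %s", name)
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                logger.warning("Skipping malformed XML file %s", name)
                continue
            lines = [
                " ".join(s.get("CONTENT", "") for s in line.findall("{*}String"))
                for line in root.findall(".//{*}TextLine")
            ]
            yield name, "\n".join(lines)


# Usage with a small in-memory archive (made-up contents).
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr(
        "page_1.xml",
        '<alto><Layout><Page><TextLine><String CONTENT="Hallo"/>'
        '<String CONTENT="Welt"/></TextLine></Page></Layout></alto>',
    )
    zf.writestr("readme.txt", "not xml")
for name, text in iter_page_text(archive_bytes):
    print(name, repr(text))
```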

It looks like sorting the files is also a bad thing - alpha sort is not numeric sort in this case, but when I download a zipfile of all the contents in drive they are in logical order.

What you've implemented looks good, but I'm going to shift some of it around - I think it makes more sense for the alto check and the line sorting to be part of the xmlobject class logic. That should make it simpler and easier to test directly. I'll also add some unit tests for the xml objects, since right now they are not tested directly.
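As a rough picture of what "sorting as part of the class logic" might mean, the sketch below exposes sorted lines as a property on a small wrapper class. The class name `AltoBlock`, the property name `sorted_lines`, and the choice to treat a missing or non-numeric HPOS as 0 are all assumptions; the real classes are built on xmlobject, not ElementTree.

```python
import xml.etree.ElementTree as ET


def _hpos(element):
    """Horizontal position as a float; missing or invalid values sort first."""
    try:
        return float(element.get("HPOS", 0))
    except ValueError:
        return 0.0


class AltoBlock:
    """Thin wrapper around a TextBlock element (illustrative only)."""

    def __init__(self, element):
        self.element = element

    @property
    def sorted_lines(self):
        # sorted() is stable, so lines with equal HPOS keep document order.
        return sorted(self.element.findall("{*}TextLine"), key=_hpos)
```

Keeping the ordering on the class means a unit test only needs a small XML snippet, with no zipfile involved.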

rlskoeser · 2025-10-21T20:23:25Z

@tanhaow I've finished updating: I shifted your logic and validation to the xml objects where possible, simplified the validation, and removed the caching. I added natsort to sort the filenames in order, because zipfile order doesn't correspond to logical order! I also added timing logging because I wanted to check some things, but it will help us for #255.
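To illustrate why a plain alphabetical sort is not enough for page filenames, here is a standard-library approximation of natural sorting. The PR itself uses the natsort package; `natural_key` is a hypothetical helper and the filenames are made up.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so numbers compare as ints."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd indices always hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["page_10.xml", "page_2.xml", "page_1.xml"]
print(sorted(names))                   # alphabetical: page_10 sorts before page_2
print(sorted(names, key=natural_key))  # natural: page_1, page_2, page_10
```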

tanhaow · 2025-10-21T20:25:03Z

xml file name appears in sentence output rather than zip file

And thanks for making the xml filename appear in the sentence output rather than the zip file. Now we can see the page number indicated in the filename.

tanhaow

🚀

tanhaow · 2025-10-21T20:40:06Z

tests/test_sentence/test_corpus/test_alto_input.py

+        "langsamer und deshalb auch viel häßlicher und viel widerlicher. Und wie die"
+    )
+
+    processing_prefix = "Processing XML file "


I like how you refactored this!

tanhaow · 2025-10-21T20:41:48Z

tests/test_sentence/test_corpus/test_alto_input.py

    ]
    assert sorted(processed_files) == sorted(expected_files)

+    # last log entry should report time to process, # of files


rlskoeser added 2 commits October 20, 2025 10:01

Preliminary xmlobject code for alto xml #251

221af95

Fix class reference for xmlobject (eulxml -> neuxml)

a18fdf7

rlskoeser requested a review from tanhaow October 20, 2025 14:06

add unit tests and enhance logic

209b3c4

1. perform a single zip pass. 2. turn each parsed ALTO document into newline‑joined text via _yield_text_for_document, so cached chunks hold real page content. 3. add unit tests,

tanhaow added 2 commits October 20, 2025 14:29

add code coverage

89f4549

Update test_alto_input.py

2dbc235

Add HPOS sorting logic and test if sentence is sequential

6813ad8

rlskoeser added 11 commits October 21, 2025 11:08

Simplify test for xml is alto; simplify alto test fixture handling

377cb7c

Add tests for alto textline & textblock classes

1dfd64b

Move sorting and text chunk generator logic to xml classes

01b8ed9

Adapt validation tests to refactored code

5302251

Add time logging for time to process zipfile

f055c62

Restore warning for alto file with no text lines

de4b2f8

Make chunk-specific filename take precedence for sentence output

9b63f2b

Use natural sort to order files within zipfile

20e23c9

Add test init files so we can reuse simple segmenter mock

9399e19

Revise changelog description of alto input support

c181a90

Test alto sorted blocks property for doc with no textblocks

b1f1f82

rlskoeser marked this pull request as ready for review October 21, 2025 20:21

tanhaow approved these changes Oct 21, 2025

rlskoeser merged commit d945507 into develop Oct 21, 2025
8 checks passed

rlskoeser deleted the feature/alto-xml-parsing branch October 21, 2025 21:11
