Skip to content

Conversation

@rlskoeser
Copy link
Contributor

@rlskoeser rlskoeser commented Oct 20, 2025

Associated Issue(s): #251 #252 #253 #254 #255

Changes in this PR

Reviewer Checklist @tanhaow

  • Review the xmlobject code to make sure it make sense to you
  • Test manually with the alto zip file to confirm you see plain text output
  • Review changes and adjustments to validation to confirm I haven't missed anything
  • Test manually with alto zip file and confirm that:
    • xml file name appears in sentence output rather than zip file
    • xml files are sorted naturally / logically
    • logging includes timing for processing the zipfile

@rlskoeser rlskoeser requested a review from tanhaow October 20, 2025 14:06
1. perform a single zip pass.
2. turn each parsed ALTO document into newline‑joined text via _yield_text_for_document, so cached chunks hold real page content.
3. add unit tests,
@codecov
Copy link

codecov bot commented Oct 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.52%. Comparing base (3d2e79e) to head (b1f1f82).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #262      +/-   ##
===========================================
+ Coverage    99.29%   99.52%   +0.22%     
===========================================
  Files           26       26              
  Lines         1135     1260     +125     
  Branches        37       38       +1     
===========================================
+ Hits          1127     1254     +127     
+ Misses           4        2       -2     
  Partials         4        4              
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tanhaow
Copy link
Contributor

tanhaow commented Oct 20, 2025

The following changes have been added to the PR to address more issues in milestone 0.3:

@rlskoeser
Copy link
Contributor Author

@tanhaow the full volume is ~ 800 pages, I don't think we should be caching content - I think we should do everything in one pass and validate as we go. I know when we did the whiteboarding, we imagined the validation as a pre-step, but now that we have the code nearly working, I don't think that's needed.

It looks like sorting the files is also a bad thing - alpha sort is not numeric sort in this case, but when I download a zipfile of all the contents in drive they are in logical order.

What you've implemented looks good, but I'm going to shift some of it around - I think it makes more sense for the alto check and the line sorting to be part of the xmlobject class logic. That should make it simpler and easier to test directly. I'll also add some unit tests for the xml objects, since right now they are not tested directly.

@rlskoeser rlskoeser marked this pull request as ready for review October 21, 2025 20:21
@rlskoeser
Copy link
Contributor Author

@tanhaow I've finished updating - shifted your logic and validation to the xml objects as possible, simplified the validation, removed the caching. I added natsort to sort the filenames in order — zipfile order doesn't correspond to logical order! I also added timing logging because I wanted to check some things, but it will help us for #255 .

@tanhaow
Copy link
Contributor

tanhaow commented Oct 21, 2025

  • xml file name appears in sentence output rather than zip file

and thanks for making xml filename appears in sentence output rather than zip file. Now we can see the page number indicated in the filename

Copy link
Contributor

@tanhaow tanhaow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

"langsamer und deshalb auch viel häßlicher und viel widerlicher. Und wie die"
)

processing_prefix = "Processing XML file "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you refactored this!

]
assert sorted(processed_files) == sorted(expected_files)

# last log entry should report time to process, # of files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@rlskoeser rlskoeser merged commit d945507 into develop Oct 21, 2025
8 checks passed
@rlskoeser rlskoeser deleted the feature/alto-xml-parsing branch October 21, 2025 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants