alto xml parsing #262
1. Perform a single zip pass.
2. Turn each parsed ALTO document into newline-joined text via `_yield_text_for_document`, so cached chunks hold real page content.
3. Add unit tests.
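For readers following along, the first two steps can be sketched roughly as below. This is a minimal stdlib-only illustration, not the code in this PR: `local_name`, `alto_text`, and `iter_page_texts` are hypothetical names, and it assumes the page text lives in the `CONTENT` attribute of `String` elements inside each `TextLine`, as in standard ALTO.

```python
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace, e.g. '{ns}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def alto_text(xml_bytes: bytes) -> str:
    """Return one line of text per ALTO TextLine, joined with newlines."""
    root = ET.fromstring(xml_bytes)
    if local_name(root.tag) != "alto":
        raise ValueError("not an ALTO document")
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            # Collect the words of this line from its String children.
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)


def iter_page_texts(zip_source):
    """Yield (filename, text) per XML member, opening the zip only once."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            try:
                yield name, alto_text(archive.read(name))
            except (ET.ParseError, ValueError):
                # Skip members that are not valid ALTO rather than abort.
                continue
```

Because the function is a generator, each page is parsed, validated, and handed on before the next member is read, so nothing has to be held for the whole volume.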
Codecov Report

✅ All modified and coverable lines are covered by tests.

Coverage diff, develop vs. #262:

| Metric | develop | #262 | +/- |
|---|---|---|---|
| Coverage | 99.29% | 99.52% | +0.22% |
| Files | 26 | 26 | |
| Lines | 1135 | 1260 | +125 |
| Branches | 37 | 38 | +1 |
| Hits | 1127 | 1254 | +127 |
| Misses | 4 | 2 | -2 |
| Partials | 4 | 4 | |
The following changes have been added to the PR to address more issues in milestone 0.3:
@tanhaow the full volume is ~800 pages, so I don't think we should be caching content. I think we should do everything in one pass and validate as we go. I know that when we did the whiteboarding we imagined validation as a pre-step, but now that the code is nearly working, I don't think that's needed. Sorting the files also looks like a bad idea: alphabetical sort is not numeric sort in this case, whereas when I download a zipfile of all the contents in Drive, they are already in logical order.

What you've implemented looks good, but I'm going to shift some of it around. I think it makes more sense for the ALTO check and the line sorting to be part of the xmlobject class logic; that should make it simpler and easier to test directly. I'll also add some unit tests for the XML objects, since right now they are not tested directly.
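To make the sorting concern concrete, here is a small stdlib-only sketch. The filenames are made up for illustration, and `natural_key` is a hypothetical helper, not the `natsort` package's implementation or anything from this PR.

```python
import re


def natural_key(name: str):
    """Split a name into text and integer chunks so digit runs compare as numbers."""
    return [
        int(chunk) if chunk.isdigit() else chunk.lower()
        for chunk in re.split(r"(\d+)", name)
    ]


# Hypothetical page filenames, deliberately out of order.
filenames = ["page_10.xml", "page_2.xml", "page_1.xml"]

# Plain string sort compares character by character, so page 10 lands before page 2.
alphabetical = sorted(filenames)

# With the natural key, the digit runs are compared as integers.
natural = sorted(filenames, key=natural_key)
```

Here `alphabetical` is `['page_1.xml', 'page_10.xml', 'page_2.xml']`, while `natural` is `['page_1.xml', 'page_2.xml', 'page_10.xml']`.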
@tanhaow I've finished updating: shifted your logic and validation to the XML objects where possible, simplified the validation, and removed the caching. I added
and thanks for making the XML filename appear in the sentence output rather than the zip file. Now we can see the page number indicated in the filename.
🚀
"langsamer und deshalb auch viel häßlicher und viel widerlicher. Und wie die"
)

processing_prefix = "Processing XML file "
I like how you refactored this!
]
assert sorted(processed_files) == sorted(expected_files)

# last log entry should report time to process, # of files
👍
Associated Issue(s): #251 #252 #253 #254 #255
Changes in this PR
- `natsort` to sort files in natural order

Reviewer Checklist

@tanhaow