Conversation

@tanhaow
Contributor

@tanhaow tanhaow commented Oct 22, 2025

Associated Issue(s): #

Changes in this PR

Include all key changes in this pull request

  • Change 1
  • Change 2

Notes

Include any additional notes that will help in the reviewing of this pull request

  • Note 1
  • Note 2

Reviewer Checklist

Include discrete checks that should be done by the reviewer beyond looking through
code and/or file changes. Note that this check list will correspond to tasks within
the PR overview page.

  • Check 1
  • Check 2

tanhaow and others added 8 commits October 15, 2025 10:04
* add sample alto zipfile as fixture for testing

* Add alto file input

* Update __init__.py

* Update test_base_input.py

* Add unit tests

* add logging

* increase code coverage

* Update CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>

---------

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
* Preliminary xmlobject code for alto xml #251

* Fix class reference for xmlobject (eulxml -> neuxml)

* add unit tests and enhance logic

1. Perform a single zip pass.
2. Turn each parsed ALTO document into newline-joined text via _yield_text_for_document, so cached chunks hold real page content.
3. Add unit tests.

* add code coverage

* Update test_alto_input.py

* Add HPOS sorting logic and test if sentence is sequential

* Simplify test for xml is alto; simplify alto test fixture handling

* Add tests for alto textline & textblock classes

* Move sorting and text chunk generator logic to xml classes

* Adapt validation tests to refactored code

* Add time logging for time to process zipfile

* Restore warning for alto file with no text lines

* Make chunk-specific filename take precedence for sentence output

* Use natural sort to order files within zipfile

* Add test init files so we can reuse simple segmenter mock

* Revise changelog description of alto input support

* Test alto sorted blocks property for doc with no textblocks

---------

Co-authored-by: hao <97079365+tanhaow@users.noreply.github.com>
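For context on the "Use natural sort to order files within zipfile" commit above, here is a minimal sketch of what natural ordering of zip members can look like. It is not the code from this PR: the helper names `natural_sort_key` and `sorted_member_names` and the sample filenames are hypothetical, and the project may use a different approach or a third-party library.

```python
import io
import re
import zipfile


def natural_sort_key(name: str) -> list:
    """Split a filename into text and digit runs so numbers compare as integers."""
    # re.split with a capturing group alternates: text, digits, text, ...
    # so odd positions always hold the digit runs.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def sorted_member_names(archive: zipfile.ZipFile) -> list[str]:
    """Return the member filenames of a zip archive in natural order."""
    return sorted(archive.namelist(), key=natural_sort_key)


# Build a small in-memory zip to demonstrate; the filenames are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ["page_10.xml", "page_2.xml", "page_1.xml"]:
        archive.writestr(name, "<alto/>")

with zipfile.ZipFile(buffer) as archive:
    ordered = sorted_member_names(archive)
```

A plain lexicographic sort would put `page_10.xml` before `page_2.xml`; comparing the digit runs as integers keeps pages in reading order.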
* add line numbers to each sentence when building the corpus

* change to hook method

* refactor sentence line number

* use regex to clean text part

* Update tei_input.py

* Update CHANGELOG.md

* Update tei_input.py

* increase code coverage

* Update tei_input.py

* Update tei_input.py

* Update tei_input.py

* Update base_input.py

* Update tei_input.py

* Update src/remarx/sentence/corpus/tei_input.py

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>

* revise per suggestions

* Add comment noting why we need to call lstrip() first

* fix the multiple lb tags bug found by laure

* handle the case where text immediately after an <lb/> is nested inside inline markup

* handle inline-markup cases

* Update test_tei_input.py

* Revert "Update test_tei_input.py"

This reverts commit c6de485.

* add more tests to pass coverage check

* Refactor preceding lb method; remove unnecessary footnote method

* Add unit test for find preceding lb method

* Add time logging for TEI input handling

---------

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
Co-authored-by: rlskoeser <rebecca.s.koeser@princeton.edu>
@codecov

codecov bot commented Oct 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.71%. Comparing base (c16bffb) to head (e1fc298).
⚠️ Report is 19 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #265      +/-   ##
==========================================
+ Coverage   99.40%   99.71%   +0.31%     
==========================================
  Files          24       26       +2     
  Lines        1012     1427     +415     
  Branches       28       51      +23     
==========================================
+ Hits         1006     1423     +417     
+ Misses          2        1       -1     
+ Partials        4        3       -1     

tanhaow and others added 4 commits October 21, 2025 23:58
* Configure logging for corpus creation script; add verbose/debug mode

* Add page-level timing debug logging; use dictionary lookup for pages

* Test logging config & verbose option

* Add test for TEI pages by number dict property

* Document create script logging improvement in change log
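The logging commits above describe configuring logging for the corpus creation script and adding a verbose/debug mode. A minimal sketch of that pattern with the standard library follows. The function names, the `--verbose` flag spelling, and the log format are assumptions for illustration, not the script's actual interface.

```python
import argparse
import logging


def configure_logging(verbose: bool = False) -> None:
    """Set the root logger to DEBUG in verbose mode, INFO otherwise."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument parser with a single verbose flag."""
    parser = argparse.ArgumentParser(description="Create a sentence corpus (sketch).")
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="enable debug logging"
    )
    return parser


# Simulate running the script with the verbose flag.
args = build_parser().parse_args(["--verbose"])
configure_logging(args.verbose)
logging.getLogger("example").debug("debug output is now visible")
```

Keeping the level decision in one function makes it straightforward to test both modes, which is the kind of check the "Test logging config & verbose option" commit suggests.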
@tanhaow tanhaow merged commit a9ecafc into main Oct 27, 2025
8 checks passed
@tanhaow tanhaow deleted the release/0.3 branch October 27, 2025 15:50