Release/0.3 #265
Merged
+1,776
−47
remarx release 0.2
* add sample alto zipfile as fixture for testing
* Add alto file input
* Update __init__.py
* Update test_base_input.py
* Add unit tests
* add logging
* increase code coverage
* Update CHANGELOG.md
* Update CHANGELOG.md
  Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
---------
Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
* Preliminary xmlobject code for alto xml #251
* Fix class reference for xmlobject (eulxml -> neuxml)
* add unit tests and enhance logic
  1. perform a single zip pass.
  2. turn each parsed ALTO document into newline-joined text via _yield_text_for_document, so cached chunks hold real page content.
  3. add unit tests.
* add code coverage
* Update test_alto_input.py
* Add HPOS sorting logic and test if sentence is sequential
* Simplify test for xml is alto; simplify alto test fixture handling
* Add tests for alto textline & textblock classes
* Move sorting and text chunk generator logic to xml classes
* Adapt validation tests to refactored code
* Add time logging for time to process zipfile
* Restore warning for alto file with no text lines
* Make chunk-specific filename take precedence for sentence output
* Use natural sort to order files within zipfile
* Add test init files so we can reuse simple segmenter mock
* Revise changelog description of alto input support
* Test alto sorted blocks property for doc with no textblocks
---------
Co-authored-by: hao <97079365+tanhaow@users.noreply.github.com>
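The "natural sort" commit above refers to ordering zipfile members so that numbered pages sort numerically rather than lexically. A minimal sketch of the idea, assuming plain standard-library code; the helper name `natural_sort_key` and the sample filenames are hypothetical, and the PR may well use a different implementation or a third-party library:

```python
import re


def natural_sort_key(name: str) -> list:
    """Split a filename into text and integer chunks for natural ordering.

    Hypothetical helper for illustration; not taken from the remarx codebase.
    re.split with a capturing group always alternates text and digit chunks,
    starting with text, so strings and ints never meet at the same position.
    """
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


# Made-up filenames; a plain lexical sort would put page10 before page2.
filenames = ["page10.xml", "page2.xml", "page1.xml"]
ordered = sorted(filenames, key=natural_sort_key)
```

With a key like this, the names returned by `zipfile.ZipFile.namelist()` could be sorted before processing so that pages are read in reading order.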
* add line numbers to each sentence when build the corpus
* change to hook method
* refactor sentence line number
* use regex to clean text part
* Update tei_input.py
* Update CHANGELOG.md
* Update tei_input.py
* increase code coverage
* Update tei_input.py
* Update tei_input.py
* Update tei_input.py
* Update base_input.py
* Update tei_input.py
* Update src/remarx/sentence/corpus/tei_input.py
  Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
* revise per suggestions
* Add comment noting why we need to call lstrip() first
* fix the multiple lb tags bug found by laure
* handle the case where text immediately after an <lb/> nested inside inline markup
* handle inline-markup cases
* Update test_tei_input.py
* Revert "Update test_tei_input.py"
  This reverts commit c6de485.
* add more tests to pass coverage check
* Refactor preceding lb method; remove unnecessary footnote method
* Add unit test for find preceding lb method
* Add time logging for TEI input handling
---------
Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
Co-authored-by: rlskoeser <rebecca.s.koeser@princeton.edu>
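The commits above attach line numbers to text based on the preceding `<lb/>` tag, including the case where the `<lb/>` is nested inside inline markup. A rough sketch of that idea using only the standard library; the function name, the assumption that the line number lives in the `n` attribute, and the sample markup are all illustrative and not taken from `tei_input.py`:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def text_with_line_numbers(element):
    """Return (line_number, text) pairs for every text run under element.

    Hypothetical helper: line_number is the n attribute of the closest
    preceding <lb/> in document order, or None if no <lb/> has been seen.
    """
    current = None
    results = []

    def visit(el):
        nonlocal current
        if el.tag == f"{{{TEI_NS}}}lb":
            # A line break starts a new line; remember its number.
            current = el.get("n")
        elif el.text and el.text.strip():
            results.append((current, el.text.strip()))
        for child in el:
            visit(child)
            # Tail text follows the child, so it belongs to whatever
            # line is current after the child has been visited.
            if child.tail and child.tail.strip():
                results.append((current, child.tail.strip()))

    visit(element)
    return results


# Made-up sample: the second <lb/> sits inside inline <hi> markup.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<lb n="1"/>First line <hi><lb n="2"/>second</hi> line</p>'
)
pairs = text_with_line_numbers(ET.fromstring(sample))
```

Walking the tree in document order means text after a nested `<lb/>` picks up the new line number even after leaving the inline element.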
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #265      +/-   ##
==========================================
+ Coverage   99.40%   99.71%   +0.31%
==========================================
  Files          24       26       +2
  Lines        1012     1427     +415
  Branches       28       51      +23
==========================================
+ Hits         1006     1423     +417
+ Misses          2        1       -1
+ Partials        4        3       -1
* Configure logging for corpus creation script; add verbose/debug mode
* Add page-level timing debug logging; use dictionary lookup for pages
* Test logging config & verbose option
* Add test for TEI pages by number dict property
* Document create script logging improvement in change log
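The logging commits above add a verbose/debug mode to the corpus creation script. One common way to wire that up with the standard library is sketched below; the function names, the `-v/--verbose` flag spelling, and the log format are assumptions for illustration, not the script's actual interface:

```python
import argparse
import logging


def configure_logging(verbose: bool = False) -> int:
    """Configure the root logger; DEBUG when verbose, INFO otherwise.

    Hypothetical helper. force=True (Python 3.8+) replaces any handlers
    that were already installed, so repeated calls behave predictably.
    """
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,
    )
    return level


def parse_args(argv=None) -> argparse.Namespace:
    """Parse an assumed -v/--verbose flag for a corpus creation script."""
    parser = argparse.ArgumentParser(description="Create a sentence corpus")
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="enable debug logging"
    )
    return parser.parse_args(argv)


args = parse_args(["--verbose"])
chosen_level = configure_logging(args.verbose)
```

Page-level timing messages can then be emitted with `logger.debug(...)` so they only appear when the verbose flag is set.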
Associated Issue(s): #
Changes in this PR
Include all key changes in this pull request
Notes
Include any additional notes that will help in the reviewing of this pull request
Reviewer Checklist
Include discrete checks that should be done by the reviewer beyond looking through
code and/or file changes. Note that this check list will correspond to tasks within
the PR overview page.