-
Couldn't load subscription status.
- Fork 1
Feature/add line numbers to corpus #249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #249 +/- ##
===========================================
+ Coverage 99.52% 99.57% +0.04%
===========================================
Files 26 26
Lines 1260 1404 +144
Branches 38 52 +14
===========================================
+ Hits 1254 1398 +144
Misses 2 2
Partials 4 4 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, @tanhaow , this must have been an older issue that we failed to update with new, clearer guidelines on the work to be done.
The line number we need to include is the n attribute of the <lb> element in the TEI XML. I'm not sure the best way to integrate that with plain text for segmentation into sentences, but your approach here looks interesting; I'm thinking we may be able to generalize.
Let's discuss how to revise during our co-working session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some preliminary comments on the revised xml handling, haven't made my way through all the new logic or tests yet and haven't tested locally, but submitting so you can see these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't tested the code locally yet but wanted to give some feedback on simplifying the code.
Let's leave line numbers blank (rather than default 1) if we ever can't determine, so that it will be easier to track down and correct possible errors in the logic / output.
| # When there are no line breaks with line numbers, default to line 1 | ||
| if not offsets: | ||
| return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this happens, but let's not default to 1; let's leave it unset to indicate that a line number wasn't found / couldn't be determined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked up the alto file - it never happened. Originally I didn't include it but then our code coverage check in CI falls. I had to add these two lines just order to the code coverage pass (But they will never be reached in reality).
Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
|
Thank you for reviewing this PR @rlskoeser ! I've made the following changes:
|
…om/Princeton-CDH/remarx into feature/add-line-numbers-to-corpus
This reverts commit c6de485.
| lb_element = self.find_associated_lb(text) | ||
| if lb_element is not None and lb_element is not last_handled_lb: | ||
| if cleaned_text: | ||
| if body_text_parts and not body_text_parts[-1].endswith("\n"): | ||
| cleaned_text = f"\n{cleaned_text}" | ||
| elif char_offset > 0: | ||
| # Preserve explicit blank lines introduced by <lb/> elements. | ||
| cleaned_text = "\n" | ||
|
|
||
| line_attr = lb_element.get("n") | ||
| if line_attr is not None: | ||
| self.line_number_by_offset[char_offset] = int(line_attr) | ||
| last_handled_lb = lb_element |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems complicated — I think I'm missing something about why this is needed. If we have mixed content immediately after the <lb> tag is there a case where there's no tail text to ensure we can add a newline? Or do we not hit the <lb> tag as a parent in the right sequence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<lb n="33"/>
<hi rendition="i">Schiedensten Proportionen</hi> ...
Yes we have cases where the first characters live inside inline markup right after an , so there's no tail text on the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I found the right example in your unit tests - it looks indented the way you've formatted it here, but I see in the xml fixture that's exactly the case, no tail text. Thanks for such good unit tests to cover these odd cases! Thinking about whether there's any easy way to simplify this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to find these cases because you suggested yielding undecided line numbers None instead of 1. Otherwise, I would have missed them! They were indeed easy to overlook. But yeah, I also think the current logic looks complicated. Hope there's a way to simplify this
Associated Issue(s): #134 #248
Changes in this PR
line_numberfield to each sentence in the corpus (available for TEI XML documents only)get_extra_metadata()method inTEIinputclass to pass line number infoget_extra_metadata()hook method inFileInputbase class, allowingTEIinputclass to override and provide line number infoNotes
line_numberfieldReviewer Checklist
line_numberfield