Skip to content

Conversation

@tanhaow
Copy link
Contributor

@tanhaow tanhaow commented Oct 15, 2025

Associated Issue(s): #134 #248

Changes in this PR

  • Added line_number field to each sentence in the corpus (available for TEI XML documents only)
  • Added get_extra_metadata() method in TEIinput class to pass line number info
  • Added get_extra_metadata() hook method in FileInput base class, allowing TEIinput class to override and provide line number info

Notes

  • Line number is only available for TEI XML documents; plain text documents do not include the line_number field
  • Line numbers is captured from the n attribute of the element in the TEI XML.

Reviewer Checklist

  • Verify that corpus generated from TEI documents includes line_number field
  • Review the code

@codecov
Copy link

codecov bot commented Oct 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.57%. Comparing base (d945507) to head (d3d24b0).
⚠️ Report is 2 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #249      +/-   ##
===========================================
+ Coverage    99.52%   99.57%   +0.04%     
===========================================
  Files           26       26              
  Lines         1260     1404     +144     
  Branches        38       52      +14     
===========================================
+ Hits          1254     1398     +144     
  Misses           2        2              
  Partials         4        4              
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, @tanhaow , this must have been an older issue that we failed to update with new, clearer guidelines on the work to be done.

The line number we need to include is the n attribute of the <lb> element in the TEI XML. I'm not sure the best way to integrate that with plain text for segmentation into sentences, but your approach here looks interesting; I'm thinking we may be able to generalize.

Let's discuss how to revise during our co-working session.

@tanhaow tanhaow requested a review from rlskoeser October 17, 2025 13:52
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some preliminary comments on the revised xml handling, haven't made my way through all the new logic or tests yet and haven't tested locally, but submitting so you can see these.

@tanhaow tanhaow requested a review from rlskoeser October 20, 2025 13:24
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested the code locally yet but wanted to give some feedback on simplifying the code.

Let's leave line numbers blank (rather than default 1) if we ever can't determine, so that it will be easier to track down and correct possible errors in the logic / output.

Comment on lines 121 to 123
# When there are no line breaks with line numbers, default to line 1
if not offsets:
return 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this happens, but let's not default to 1; let's leave it unset to indicate that a line number wasn't found / couldn't be determined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked up the alto file - it never happened. Originally I didn't include it but then our code coverage check in CI falls. I had to add these two lines just order to the code coverage pass (But they will never be reached in reality).

tanhaow and others added 2 commits October 20, 2025 14:56
Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
@tanhaow
Copy link
Contributor Author

tanhaow commented Oct 20, 2025

Thank you for reviewing this PR @rlskoeser !

I've made the following changes:

  1. Simplified getting footnote functions by yielding both footnote text and line number together in the same chunk.
  2. Removed ordered dict since dict order by insertion order is guaranteed starting in Python 3.7.
  3. Added comment noting why we need to call lstrip() separately.
  4. Changed the line number variable from private to public (dropped the _ prefix in name).
  5. Renamed raw_fragment and cleaned_fragment to raw_text and cleaned_text.
  6. Changed the logic to leave undecided line numbers blank (rather than default 1)
    • I tested with Das Kapital, undecided line numbers seem to all come from pages which only have one line - their tag do not have line number attribute (n=).
  7. Added find_associated_lb function that helps find the closest preceding element for a given text node to handle two tricky newline cases:
    1. Back-to-back tags on the same line (the one found by Laure earlier on). Fix TEI-XML parsing so lines with multiple <lb> tags are handled correctly #248
    2. Text immediately after an nested inside inline markup tags like <hi>. Below is an example:
    <pb n="19"/>
    <p n="cE">
        <lb n="1"/>
        <hi rendition="i">schiedensten Proportionen</hi> mit andern Artikeln aus. Dennoch bleibt sein
         <lb n="2"/>Tauschwerth <hi rendition="i">unverändert</hi>, ob in x Stiefelwichse, y Seide, z Gold u. s. w.
         <lb n="3"/>ausgedrückt. Er muß also von diesen seinen verschiedenen <hi rendition="i">Ausdrucks-
         <lb n="4"/>weisen</hi> unterscheidbar sein.
     </p>

@tanhaow tanhaow requested a review from rlskoeser October 21, 2025 15:44
Comment on lines 190 to 202
lb_element = self.find_associated_lb(text)
if lb_element is not None and lb_element is not last_handled_lb:
if cleaned_text:
if body_text_parts and not body_text_parts[-1].endswith("\n"):
cleaned_text = f"\n{cleaned_text}"
elif char_offset > 0:
# Preserve explicit blank lines introduced by <lb/> elements.
cleaned_text = "\n"

line_attr = lb_element.get("n")
if line_attr is not None:
self.line_number_by_offset[char_offset] = int(line_attr)
last_handled_lb = lb_element
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems complicated — I think I'm missing something about why this is needed. If we have mixed content immediately after the <lb> tag is there a case where there's no tail text to ensure we can add a newline? Or do we not hit the <lb> tag as a parent in the right sequence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<lb n="33"/>
<hi rendition="i">Schiedensten Proportionen</hi> ...

Yes we have cases where the first characters live inside inline markup right after an , so there's no tail text on the

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I found the right example in your unit tests - it looks indented the way you've formatted it here, but I see in the xml fixture that's exactly the case, no tail text. Thanks for such good unit tests to cover these odd cases! Thinking about whether there's any easy way to simplify this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to find these cases because you suggested yielding undecided line numbers None instead of 1. Otherwise, I would have missed them! They were indeed easy to overlook. But yeah, I also think the current logic looks complicated. Hope there's a way to simplify this

@tanhaow tanhaow merged commit 7fb4da2 into develop Oct 22, 2025
8 checks passed
@tanhaow tanhaow deleted the feature/add-line-numbers-to-corpus branch October 22, 2025 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants