Feature/add line numbers to corpus #249

tanhaow · 2025-10-15T13:37:42Z

Associated Issue(s): #134 #248

Changes in this PR

Added line_number field to each sentence in the corpus (available for TEI XML documents only)
Added get_extra_metadata() method in TEIinput class to pass line number info
Added get_extra_metadata() hook method in FileInput base class, allowing TEIinput class to override and provide line number info

Notes

Line number is only available for TEI XML documents; plain text documents do not include the line_number field
Line numbers is captured from the n attribute of the element in the TEI XML.

Reviewer Checklist

Verify that corpus generated from TEI documents includes line_number field
Review the code

codecov · 2025-10-15T13:39:10Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.57%. Comparing base (d945507) to head (d3d24b0).
⚠️ Report is 2 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #249      +/-   ##
===========================================
+ Coverage    99.52%   99.57%   +0.04%     
===========================================
  Files           26       26              
  Lines         1260     1404     +144     
  Branches        38       52      +14     
===========================================
+ Hits          1254     1398     +144     
  Misses           2        2              
  Partials         4        4

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

rlskoeser

Sorry, @tanhaow , this must have been an older issue that we failed to update with new, clearer guidelines on the work to be done.

The line number we need to include is the n attribute of the <lb> element in the TEI XML. I'm not sure the best way to integrate that with plain text for segmentation into sentences, but your approach here looks interesting; I'm thinking we may be able to generalize.

Let's discuss how to revise during our co-working session.

src/remarx/sentence/corpus/base_input.py

src/remarx/sentence/corpus/tei_input.py

…umbers-to-corpus

rlskoeser

I have some preliminary comments on the revised xml handling, haven't made my way through all the new logic or tests yet and haven't tested locally, but submitting so you can see these.

src/remarx/sentence/corpus/tei_input.py

rlskoeser

Haven't tested the code locally yet but wanted to give some feedback on simplifying the code.

Let's leave line numbers blank (rather than default 1) if we ever can't determine, so that it will be easier to track down and correct possible errors in the logic / output.

src/remarx/sentence/corpus/tei_input.py

rlskoeser · 2025-10-20T18:45:11Z

src/remarx/sentence/corpus/tei_input.py

+        # When there are no line breaks with line numbers, default to line 1
+        if not offsets:
+            return 1


I'm not sure if this happens, but let's not default to 1; let's leave it unset to indicate that a line number wasn't found / couldn't be determined.

I looked up the alto file - it never happened. Originally I didn't include it but then our code coverage check in CI falls. I had to add these two lines just order to the code coverage pass (But they will never be reached in reality).

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>

tanhaow · 2025-10-20T21:22:49Z

Thank you for reviewing this PR @rlskoeser !

I've made the following changes:

Simplified getting footnote functions by yielding both footnote text and line number together in the same chunk.
Removed ordered dict since dict order by insertion order is guaranteed starting in Python 3.7.
Added comment noting why we need to call lstrip() separately.
Changed the line number variable from private to public (dropped the _ prefix in name).
Renamed raw_fragment and cleaned_fragment to raw_text and cleaned_text.
Changed the logic to leave undecided line numbers blank (rather than default 1)
- I tested with Das Kapital, undecided line numbers seem to all come from pages which only have one line - their tag do not have line number attribute (n=).
Added find_associated_lb function that helps find the closest preceding element for a given text node to handle two tricky newline cases:
1. Back-to-back tags on the same line (the one found by Laure earlier on). Fix TEI-XML parsing so lines with multiple <lb> tags are handled correctly #248
2. Text immediately after an nested inside inline markup tags like <hi>. Below is an example:

    <pb n="19"/>
    <p n="cE">
        <lb n="1"/>
        <hi rendition="i">schiedensten Proportionen</hi> mit andern Artikeln aus. Dennoch bleibt sein
         <lb n="2"/>Tauschwerth <hi rendition="i">unverändert</hi>, ob in x Stiefelwichse, y Seide, z Gold u. s. w.
         <lb n="3"/>ausgedrückt. Er muß also von diesen seinen verschiedenen <hi rendition="i">Ausdrucks-
         <lb n="4"/>weisen</hi> unterscheidbar sein.
     </p>

…nline markup

…om/Princeton-CDH/remarx into feature/add-line-numbers-to-corpus

This reverts commit c6de485.

rlskoeser · 2025-10-21T20:58:52Z

src/remarx/sentence/corpus/tei_input.py

+            lb_element = self.find_associated_lb(text)
+            if lb_element is not None and lb_element is not last_handled_lb:
+                if cleaned_text:
+                    if body_text_parts and not body_text_parts[-1].endswith("\n"):
+                        cleaned_text = f"\n{cleaned_text}"
+                elif char_offset > 0:
+                    # Preserve explicit blank lines introduced by <lb/> elements.
+                    cleaned_text = "\n"
+
+                line_attr = lb_element.get("n")
+                if line_attr is not None:
+                    self.line_number_by_offset[char_offset] = int(line_attr)
+                last_handled_lb = lb_element


This seems complicated — I think I'm missing something about why this is needed. If we have mixed content immediately after the <lb> tag is there a case where there's no tail text to ensure we can add a newline? Or do we not hit the <lb> tag as a parent in the right sequence?

<lb n="33"/> <hi rendition="i">Schiedensten Proportionen</hi> ...

Yes we have cases where the first characters live inside inline markup right after an , so there's no tail text on the

Yes, I found the right example in your unit tests - it looks indented the way you've formatted it here, but I see in the xml fixture that's exactly the case, no tail text. Thanks for such good unit tests to cover these odd cases! Thinking about whether there's any easy way to simplify this...

I was able to find these cases because you suggested yielding undecided line numbers None instead of 1. Otherwise, I would have missed them! They were indeed easy to overlook. But yeah, I also think the current logic looks complicated. Hope there's a way to simplify this

tanhaow added 2 commits October 15, 2025 09:27

add line numbers to each sentence when build the corpus

9ca2a24

change to hook method

3a51f08

rlskoeser reviewed Oct 15, 2025

View reviewed changes

src/remarx/sentence/corpus/base_input.py Outdated Show resolved Hide resolved

rlskoeser reviewed Oct 15, 2025

View reviewed changes

src/remarx/sentence/corpus/tei_input.py Outdated Show resolved Hide resolved

tanhaow added 11 commits October 16, 2025 13:05

refactor sentence line number

e20829c

use regex to clean text part

5036aab

Update tei_input.py

013e65d

Merge remote-tracking branch 'origin/develop' into feature/add-line-n…

c1652b3

…umbers-to-corpus

Update CHANGELOG.md

92494fc

Update tei_input.py

a561915

increase code coverage

60d6c5d

Update tei_input.py

8264a84

Update tei_input.py

61c4703

Update tei_input.py

8695dcf

Update base_input.py

183a8e3

tanhaow requested a review from rlskoeser October 17, 2025 13:52

rlskoeser reviewed Oct 17, 2025

View reviewed changes

src/remarx/sentence/corpus/tei_input.py Outdated Show resolved Hide resolved

src/remarx/sentence/corpus/tei_input.py Outdated Show resolved Hide resolved

src/remarx/sentence/corpus/tei_input.py Show resolved Hide resolved

Update tei_input.py

fada6de

tanhaow requested a review from rlskoeser October 20, 2025 13:24

rlskoeser requested changes Oct 20, 2025

View reviewed changes

tanhaow and others added 2 commits October 20, 2025 14:56

Update src/remarx/sentence/corpus/tei_input.py

aa9f3e3

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>

revise per suggesitons

dd4b0b7

tanhaow and others added 5 commits October 21, 2025 09:35

Add comment noting why we need to call lstrip() first

3e87114

fix the multiple lb tags bug found by laure

72885af

handle the case where text immediately after an <lb/> nested inside i…

0d5df55

…nline markup

handle inline-markup cases

83f84ed

Merge branch 'develop' into feature/add-line-numbers-to-corpus

5e2d218

tanhaow added 4 commits October 21, 2025 11:24

Update test_tei_input.py

c6de485

Merge branch 'feature/add-line-numbers-to-corpus' of https://github.c…

752fe06

…om/Princeton-CDH/remarx into feature/add-line-numbers-to-corpus

Revert "Update test_tei_input.py"

6f43d4e

This reverts commit c6de485.

add more tests to pass coverage check

7357091

tanhaow requested a review from rlskoeser October 21, 2025 15:44

rlskoeser reviewed Oct 21, 2025

View reviewed changes

rlskoeser added 4 commits October 21, 2025 18:02

Refactor preceding lb method; remove unnecessary footnote method

a21842f

Add unit test for find preceding lb method

1cab2c1

Merge branch 'develop' into feature/add-line-numbers-to-corpus

8b1c659

Add time logging for TEI input handling

d3d24b0

rlskoeser approved these changes Oct 21, 2025

View reviewed changes

tanhaow merged commit 7fb4da2 into develop Oct 22, 2025
8 checks passed

tanhaow deleted the feature/add-line-numbers-to-corpus branch October 22, 2025 01:57

Uh oh!

Feature/add line numbers to corpus #249

Feature/add line numbers to corpus #249

Uh oh!

Conversation

tanhaow commented Oct 15, 2025 • edited by rlskoeser Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes in this PR

Notes

Reviewer Checklist

Uh oh!

codecov bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rlskoeser left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rlskoeser left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rlskoeser left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rlskoeser Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

tanhaow Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

tanhaow commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rlskoeser Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

tanhaow Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

rlskoeser Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

tanhaow Oct 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tanhaow commented Oct 15, 2025 •

edited by rlskoeser

Loading

codecov bot commented Oct 15, 2025 •

edited

Loading

tanhaow commented Oct 20, 2025 •

edited

Loading