Merge data across parallel corpora by Enkidu93 · Pull Request #882 · sillsdev/serval

Enkidu93 · 2026-02-24T23:14:19Z

Changes include:

Update the parallel corpus preprocessing service to merge training data across parallel corpora. Pretranslation is still done per parallel corpus, since pretranslations are ultimately accessed per parallel corpus. Fixes #860 (Merge overlapping data across parallel corpora).
Extend the E2E tests to use a more SF-like setup and to exercise USFM options. I switched out NKJV for BSB (also free to use) because the NKJV does not have quotation marks. Fixes #866 (Expand Nmt_Paratext E2E test to cover more SF use cases).

codecov-commenter · 2026-02-24T23:23:15Z

Codecov Report

❌ Patch coverage is 98.82353% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 66.95%. Comparing base (04d4719) to head (15cd570).

Files with missing lines	Patch %	Lines
...kit/Services/ParallelCorpusPreprocessingService.cs	98.82%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #882      +/-   ##
==========================================
+ Coverage   66.93%   66.95%   +0.02%     
==========================================
  Files         384      384              
  Lines       20900    20917      +17     
  Branches     2709     2707       -2     
==========================================
+ Hits        13989    14006      +17     
  Misses       5947     5947              
  Partials      964      964

ddaspit

@ddaspit reviewed 13 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on Enkidu93).

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 104 at r1 (raw file):

    }

    public async Task PreprocessAsync(

I think this method is complex enough that it warrants some comments to explain the logic.

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 193 at r1 (raw file):

        foreach (ParallelCorpus corpus in corpora)
        {
            ITextCorpus sourcePretranslateCorpus = corpus

Would it be possible to refactor the code to reuse the text corpus objects instead of recreating them?
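
One way to read this suggestion is to build each text corpus object once and hand the same instance to every later step. The sketch below is only an illustration under that assumption: `DemoTextCorpus`, `DemoCorpusCache`, and `GetOrCreate` are hypothetical names, not types or methods from the PR or from any library it uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a text corpus built from a file; not a type from the PR.
public sealed class DemoTextCorpus
{
    public DemoTextCorpus(string fileId) => FileId = fileId;

    public string FileId { get; }
}

// Builds each corpus object once and returns the same instance on later requests.
public sealed class DemoCorpusCache
{
    private readonly Dictionary<string, DemoTextCorpus> _byFileId =
        new Dictionary<string, DemoTextCorpus>();

    // Number of corpus objects actually constructed.
    public int CreatedCount { get; private set; }

    public DemoTextCorpus GetOrCreate(string fileId)
    {
        if (!_byFileId.TryGetValue(fileId, out var corpus))
        {
            corpus = new DemoTextCorpus(fileId);
            _byFileId[fileId] = corpus;
            CreatedCount++;
        }
        return corpus;
    }
}

public static class Program
{
    public static void Main()
    {
        var cache = new DemoCorpusCache();

        // First pass (for example, collecting training data) creates the object.
        DemoTextCorpus forTraining = cache.GetOrCreate("file1");

        // Second pass (for example, pretranslation) reuses it.
        DemoTextCorpus forPretranslation = cache.GetOrCreate("file1");

        Console.WriteLine(ReferenceEquals(forTraining, forPretranslation)); // True
        Console.WriteLine(cache.CreatedCount); // 1
    }
}
```

Whether a cache like this or simply hoisting the created objects into local variables is the better fit depends on how the real method is structured, which this sketch does not try to reproduce.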

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 201 at r1 (raw file):

                .TargetCorpora.SelectMany(c => _textCorpusService.CreateTextCorpora(c.Files).Select(tc => (c, tc)))
                .Select(tc => FilterPretranslateCorpora(tc.c, tc.tc, ignoreUsfmMarkers))
                .ToArray()

I don't think you need the ToArray here.
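
As a small illustration of the general point (not the code from the PR): LINQ's `Select` is deferred, so a sequence that is only enumerated once does not need to be materialized with `ToArray()` first. The file names and the `DeferredExecutionDemo` class are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    public static void Main()
    {
        int projections = 0;
        string[] files = { "a.txt", "b.txt", "c.txt" };

        // Hypothetical stand-in for a Select(...) chain; the lambda counts its calls.
        IEnumerable<string> lazy = files.Select(f =>
        {
            projections++;
            return f.ToUpperInvariant();
        });

        // Select is deferred, so nothing has been projected yet.
        Console.WriteLine(projections); // 0

        // A single pass over the sequence works without ToArray().
        foreach (string name in lazy)
            Console.WriteLine(name);

        // Each element was projected exactly once during that pass.
        Console.WriteLine(projections); // 3
    }
}
```

`ToArray()` is still worth keeping when the same sequence is enumerated more than once, since each extra enumeration would otherwise rerun the projection.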

Enkidu93

@Enkidu93 made 3 comments.
Reviewable status: 11 of 13 files reviewed, 3 unresolved discussions (waiting on ddaspit).

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 104 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think this method is complex enough that it warrants some comments to explain the logic.

I added lots of comments. Let me know if I added too many 🤪.

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 193 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Would it be possible to refactor the code to reuse the text corpus objects instead of recreating them?

Done. It's a little messy, but yes 😁.

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 201 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think you need the ToArray here.

Yes, oops, done!

ddaspit

@ddaspit reviewed 3 files and all commit messages, made 1 comment, and resolved 3 discussions.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on Enkidu93).

Update parallel corpus preprocessing service to merge training data across parallel corpora; extend E2E tests to have a more SF-like set up and exercise USFM options

4ca7cd6

Enkidu93 requested a review from ddaspit February 24, 2026 23:14

ddaspit requested changes Feb 25, 2026

Address reviewer comments; fix small bug with pretranslating non-scripture content

fe4fbde

Enkidu93 commented Feb 25, 2026

Enkidu93 added 3 commits February 25, 2026 11:37

Fix comment

59550fa

Clean up comment

8f5589c

Update asserts given bug fix for non-scripture content

15cd570

ddaspit approved these changes Feb 25, 2026

Enkidu93 merged commit 6ea0484 into main Feb 25, 2026
2 checks passed

Enkidu93 deleted the merge_data_across_parallel_corpora branch February 25, 2026 18:23

Enkidu93 merged 5 commits into main from merge_data_across_parallel_corpora

Enkidu93 commented Feb 24, 2026 •

edited by ddaspit

Loading

Uh oh!

codecov-commenter commented Feb 24, 2026 •

edited

Loading

Uh oh!

ddaspit left a comment

Uh oh!

Enkidu93 left a comment

Uh oh!

ddaspit left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Enkidu93 commented Feb 24, 2026 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Enkidu93 commented Feb 24, 2026 •

edited by ddaspit

Loading

codecov-commenter commented Feb 24, 2026 •

edited

Loading