Skip to content

Merge data across parallel corpora#882

Merged
Enkidu93 merged 5 commits intomainfrom
merge_data_across_parallel_corpora
Feb 25, 2026
Merged

Merge data across parallel corpora#882
Enkidu93 merged 5 commits intomainfrom
merge_data_across_parallel_corpora

Conversation

@Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Feb 24, 2026

Changes include:


This change is Reviewable

…cross parallel corpora; extend E2E tests to have a more SF-like set up and exercise USFM options
@Enkidu93 Enkidu93 requested a review from ddaspit February 24, 2026 23:14
@codecov-commenter
Copy link

codecov-commenter commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 98.82353% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 66.95%. Comparing base (04d4719) to head (15cd570).

Files with missing lines Patch % Lines
...kit/Services/ParallelCorpusPreprocessingService.cs 98.82% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #882      +/-   ##
==========================================
+ Coverage   66.93%   66.95%   +0.02%     
==========================================
  Files         384      384              
  Lines       20900    20917      +17     
  Branches     2709     2707       -2     
==========================================
+ Hits        13989    14006      +17     
  Misses       5947     5947              
  Partials      964      964              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit reviewed 13 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on Enkidu93).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 104 at r1 (raw file):

    }

    public async Task PreprocessAsync(

I think this method is complex enough that it warrants some comments to explain the logic.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 193 at r1 (raw file):

        foreach (ParallelCorpus corpus in corpora)
        {
            ITextCorpus sourcePretranslateCorpus = corpus

Would it be possible to refactor the code to reuse the text corpus objects instead of recreating them?


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 201 at r1 (raw file):

                .TargetCorpora.SelectMany(c => _textCorpusService.CreateTextCorpora(c.Files).Select(tc => (c, tc)))
                .Select(tc => FilterPretranslateCorpora(tc.c, tc.tc, ignoreUsfmMarkers))
                .ToArray()

I don't think you need the ToArray here.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 3 comments.
Reviewable status: 11 of 13 files reviewed, 3 unresolved discussions (waiting on ddaspit).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 104 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think this method is complex enough that it warrants some comments to explain the logic.

I added lots of comments. Let me know if I added too many 🤪.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 193 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Would it be possible to refactor the code to reuse the text corpus objects instead of recreating them?

Done. It's a little messy, but yes 😁.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 201 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think you need the ToArray here.

Yes, oops, done!

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@ddaspit reviewed 3 files and all commit messages, made 1 comment, and resolved 3 discussions.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on Enkidu93).

@Enkidu93 Enkidu93 merged commit 6ea0484 into main Feb 25, 2026
2 checks passed
@Enkidu93 Enkidu93 deleted the merge_data_across_parallel_corpora branch February 25, 2026 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expand Nmt_Paratext E2E test to cover more SF use cases Merge overlapping data across parallel corpora

3 participants