Conversation
…cross parallel corpora; extend E2E tests to have a more SF-like set up and exercise USFM options
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #882 +/- ##
==========================================
+ Coverage 66.93% 66.95% +0.02%
==========================================
Files 384 384
Lines 20900 20917 +17
Branches 2709 2707 -2
==========================================
+ Hits 13989 14006 +17
Misses 5947 5947
Partials 964 964 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ddaspit
left a comment
There was a problem hiding this comment.
@ddaspit reviewed 13 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on Enkidu93).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 104 at r1 (raw file):
} public async Task PreprocessAsync(
I think this method is complex enough that it warrants some comments to explain the logic.
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 193 at r1 (raw file):
foreach (ParallelCorpus corpus in corpora) { ITextCorpus sourcePretranslateCorpus = corpus
Would it be possible to refactor the code to reuse the text corpus objects instead of recreating them?
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 201 at r1 (raw file):
.TargetCorpora.SelectMany(c => _textCorpusService.CreateTextCorpora(c.Files).Select(tc => (c, tc))) .Select(tc => FilterPretranslateCorpora(tc.c, tc.tc, ignoreUsfmMarkers)) .ToArray()
I don't think you need the ToArray here.
Enkidu93
left a comment
There was a problem hiding this comment.
@Enkidu93 made 3 comments.
Reviewable status: 11 of 13 files reviewed, 3 unresolved discussions (waiting on ddaspit).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 104 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I think this method is complex enough that it warrants some comments to explain the logic.
I added lots of comments. Let me know if I added too many 🤪.
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 193 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Would it be possible to refactor the code to reuse the text corpus objects instead of recreating them?
Done. It's a little messy, but yes 😁.
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 201 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I don't think you need the
ToArrayhere.
Yes, oops, done!
ddaspit
left a comment
There was a problem hiding this comment.
@ddaspit reviewed 3 files and all commit messages, made 1 comment, and resolved 3 discussions.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on Enkidu93).
Changes include:
Update parallel corpus preprocessing service to merge training data across parallel corpora. Pretranslation is still per parallel corpus since we ultimately access pretranslations per parallel corpus. Fixes Merge overlapping data across parallel corpora #860.
Extend E2E tests to have a more SF-like set up and exercise USFM options. I switched out NKJV for BSB (also free to use) because the NKJV does not have quotation marks. Fixes Expand
Nmt_ParatextE2E test to cover more SF use cases #866.This change is