Description
Judging by WMT24, the field is moving away from sentence-level translation. More document-level training data is now available (for example, HPLT), and WMT24 used document-level datasets for evaluation.
See the Findings of the WMT 2024 shared task:
> In a shift towards document-level evaluation, we no longer provide source texts segmented into individual sentences. Instead, we keep all paragraphs intact and evaluated together.
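Supporting this would require:
- adapting document-level datasets so that some paragraphs are kept intact for training instead of being split into sentences
- fixing the cleaning procedures so they work on paragraph-level text
- finding document-level evaluation datasets
- implementing inference support for paragraph-level input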
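
As a starting point for the first item, here is a minimal sketch of how a document-level corpus could be turned into a mix of paragraph-level and sentence-level training examples. Everything in it is an assumption made for illustration: the function names, the `keep_ratio` parameter, the regex-based sentence splitter, and the rule that a paragraph is only split when both sides have the same number of sentences (which treats them as aligned one to one). A real pipeline would need a proper segmenter and aligner.

```python
import random
import re
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]

# Naive splitter for illustration only: break after ., ! or ? followed by whitespace.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split a paragraph into sentences with a simple punctuation heuristic."""
    return [s for s in _SENT_END.split(text.strip()) if s]


def mix_paragraphs_and_sentences(
    paragraph_pairs: Iterable[Pair],
    keep_ratio: float = 0.5,  # hypothetical knob: share of paragraphs kept intact
    seed: int = 0,
) -> Iterator[Pair]:
    """Yield training pairs, keeping some paragraphs whole and splitting the rest."""
    rng = random.Random(seed)
    for src, trg in paragraph_pairs:
        src_sents = split_sentences(src)
        trg_sents = split_sentences(trg)
        # Assumption: equal sentence counts means the sentences align one to one.
        splittable = len(src_sents) == len(trg_sents) and len(src_sents) > 1
        if rng.random() < keep_ratio or not splittable:
            yield src, trg  # keep the paragraph intact
        else:
            yield from zip(src_sents, trg_sents)  # fall back to sentence pairs


pairs = [
    ("Hello there. How are you?", "Hallo. Wie geht es dir?"),
    ("One sentence only.", "Nur ein Satz."),
]
paragraph_level = list(mix_paragraphs_and_sentences(pairs, keep_ratio=1.0))
sentence_level = list(mix_paragraphs_and_sentences(pairs, keep_ratio=0.0))
```

With `keep_ratio=1.0` every paragraph is kept whole, and with `keep_ratio=0.0` every splittable paragraph is broken into sentence pairs, so the parameter can be used to control how much paragraph-level data the model sees.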