[meta] Translate paragraphs instead of sentences #993

Open
@eu9ene

Description

Judging by WMT24, the field is moving away from sentence-level translation. More document-level training data is now available (for example, HPLT), and WMT24 used document-level datasets for evaluation.

See the findings of the WMT 2024 shared task:

In a shift towards document-level evaluation, we no longer provide source texts segmented into individual sentences. Instead, we keep all paragraphs intact and evaluated together.

This would require:

  • adapting document-level datasets so that some paragraphs are kept intact for training instead of being split into sentences
  • fixing the cleaning procedures
  • finding evaluation datasets
  • implementing inference support
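
As a rough illustration of the first item, here is a minimal sketch of how paragraph-level and sentence-level examples might be mixed in one training stream. It assumes the document-level dataset already provides paragraphs as lists of aligned (source, target) sentence pairs. The function name `mix_paragraphs_and_sentences` and the `keep_paragraph_prob` parameter are hypothetical and not part of the existing pipeline.

```python
import random
from typing import Iterable, Iterator, List, Tuple

SentencePair = Tuple[str, str]   # (source sentence, target sentence)
Paragraph = List[SentencePair]   # aligned sentences of one paragraph


def mix_paragraphs_and_sentences(
    paragraphs: Iterable[Paragraph],
    keep_paragraph_prob: float = 0.5,  # hypothetical mixing ratio
    seed: int = 0,
) -> Iterator[SentencePair]:
    """Yield training pairs, keeping some paragraphs intact.

    With probability `keep_paragraph_prob` a paragraph is emitted as a
    single (source, target) pair; otherwise it is emitted sentence by
    sentence, as in sentence-level training.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    for paragraph in paragraphs:
        if not paragraph:
            continue  # skip empty paragraphs
        if rng.random() < keep_paragraph_prob:
            # Join the sentences with spaces (an assumption; some
            # languages or datasets may need a different separator).
            src = " ".join(s for s, _ in paragraph)
            tgt = " ".join(t for _, t in paragraph)
            yield src, tgt
        else:
            yield from paragraph


doc = [
    [("Hallo.", "Hello."), ("Wie geht's?", "How are you?")],
    [("Danke.", "Thanks.")],
]
for src, tgt in mix_paragraphs_and_sentences(doc, keep_paragraph_prob=1.0):
    print(src, "|||", tgt)
```

Keeping a share of sentence-level examples is one possible design choice, so that the model still handles short inputs. The right ratio would need to be determined experimentally.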


Labels: meta (a collection of sub-issues that uses a tasklist), quality (improving robustness and translation quality)
