Add WMT translate datasets for 2017-19, with trivial extensibility to additional years. #254

Merged: 1 commit, Mar 28, 2019
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
I am running (cc)

I am swimming (cc)
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,14 @@
# This is a fake perl script to mimic the one written to filter CzEng 1.6 to
# create CzEng 1.7. Our code just parses it to find the blocks that need to be
# filtered out.

use strict;

my %bad = map { ($_, 1) } qw{
2 3 5
9 10 16
};


print STDERR "Done.\n";
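
The header comment says the loader only parses this script to find the blocks to filter out. A minimal sketch of that idea in Python follows, assuming the block IDs are the whitespace-separated tokens inside the `qw{ ... }` list; the function name `parse_bad_blocks` and the regex are illustrative assumptions, not the PR's actual code.

```python
import re

# Copy of the fake Perl script from the fixture above.
FAKE_SCRIPT = '''\
use strict;

my %bad = map { ($_, 1) } qw{
2 3 5
9 10 16
};

print STDERR "Done.\\n";
'''


def parse_bad_blocks(script_text):
    """Return the set of block IDs listed inside the first qw{...} list.

    Hypothetical helper: IDs are kept as strings because the script does
    not say what type they should have.
    """
    match = re.search(r"qw\{(.*?)\}", script_text, flags=re.DOTALL)
    if match is None:
        return set()
    return set(match.group(1).split())


bad_blocks = parse_bad_blocks(FAKE_SCRIPT)
```

A loader could then skip any example whose block ID is in `bad_blocks`; how an example maps to a block ID is not shown in this diff.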

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,23 @@
<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<header creationtool="Moses-to-TMX-converer" creationtoolversion="1.0" o-tmf="Moses plain text files" datatype="plaintext" segtype="sentence" adminlang="EN-US" srclang="en" creationdate="20170426T083842Z">
</header>
<body>
<tu>
<tuv xml:lang="en">
<seg>I am running (tmx)</seg>
</tuv>
<tuv xml:lang="cs">
<seg>běžím</seg>
</tuv>
</tu>
<tu>
<tuv xml:lang="cs">
<seg>Plavu</seg>
</tuv>
<tuv xml:lang="en">
<seg>I am swimming (tmx)</seg>
</tuv>
</tu>
</body>
</tmx>
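
In the TMX fixture above, the two `<tuv>` elements appear in a different order in each `<tu>`, so a reader has to key on `xml:lang` rather than on position. A minimal sketch using the standard library's `xml.etree.ElementTree` follows; `parse_tmx` and the trimmed `SAMPLE_TMX` are illustrative assumptions, not the parser used in the PR.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the fixture (header attributes shortened for brevity).
SAMPLE_TMX = """<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<header srclang="en"></header>
<body>
<tu>
<tuv xml:lang="en"><seg>I am running (tmx)</seg></tuv>
<tuv xml:lang="cs"><seg>běžím</seg></tuv>
</tu>
<tu>
<tuv xml:lang="cs"><seg>Plavu</seg></tuv>
<tuv xml:lang="en"><seg>I am swimming (tmx)</seg></tuv>
</tu>
</body>
</tmx>
"""


def parse_tmx(tmx_text):
    """Return one {language: sentence} dict per <tu> element."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):
        # Key each segment by its language so <tuv> order does not matter.
        pairs.append(
            {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        )
    return pairs


pairs = parse_tmx(SAMPLE_TMX)
```

Loading the whole document with `ET.fromstring` is fine for a fixture this small; for large corpora an incremental parser such as `ET.iterparse` would likely be a better fit.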
@@ -0,0 +1,3 @@
běžím I am running (tsv)
zmizel Translation with tab (tsv)
Plavu I am swimming (tsv)
@@ -0,0 +1,23 @@
<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<header creationtool="Moses-to-TMX-converer" creationtoolversion="1.0" o-tmf="Moses plain text files" datatype="plaintext" segtype="sentence" adminlang="EN-US" srclang="en" creationdate="20170426T083842Z">
</header>
<body>
<tu>
<tuv xml:lang="en">
<seg>I am running</seg>
</tuv>
<tuv xml:lang="de">
<seg>ich renne</seg>
</tuv>
</tu>
<tu>
<tuv xml:lang="de">
<seg>ich swimme</seg>
</tuv>
<tuv xml:lang="en">
<seg>I am swimming</seg>
</tuv>
</tu>
</body>
</tmx>
@@ -0,0 +1,3 @@
ich renne I am running
es verschwand Translation with tab
ich schwimme I am swimming
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,3 @@
běžím
zmizel
Plav
@@ -0,0 +1,3 @@
ich renne
es verschwand
ich schwimme
@@ -0,0 +1,3 @@
I am running

I am swimming
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">běžím</seg>
<seg id="2">zmizel</seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">Plav</seg>
</p>
</doc>
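
The two SGM fixtures above line up segment by segment, and the English side contains an empty `<seg id="2"></seg>`. As displayed here, neither file closes its `<refset>` element, so a strict XML parser would reject them. A minimal line-oriented sketch with a regular expression follows; `parse_sgm`, `SEG_RE`, and the two sample strings are illustrative assumptions, not the PR's actual parser.

```python
import re

# Matches one <seg> element on a single line and captures its text.
SEG_RE = re.compile(r'<seg id="\d+">(.*?)</seg>')

# Abbreviated copies of the two fixtures above (doc/p wrappers omitted).
EN_SGM = """<refset setid="newstest2014" srclang="any" trglang="en">
<seg id="1">I am running</seg>
<seg id="2"></seg>
<seg id="1">I am swimming</seg>
"""

CS_SGM = """<refset setid="newstest2014" srclang="any">
<seg id="1">běžím</seg>
<seg id="2">zmizel</seg>
<seg id="1">Plav</seg>
"""


def parse_sgm(text):
    """Return the text of every <seg> element, in file order."""
    return SEG_RE.findall(text)


# Pair segments positionally; empty segments are kept as empty strings.
sgm_pairs = list(zip(parse_sgm(EN_SGM), parse_sgm(CS_SGM)))
```

Whether pairs with an empty side should be dropped or kept is a policy decision that this diff does not show.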
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">ich renne</seg>
<seg id="2">es verschwand</seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">ich schwimme</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">běžím</seg>
<seg id="2">zmizel</seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">Plav</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2015" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">ich renne</seg>
<seg id="2">es verschwand</seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">ich schwimme</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">běžím</seg>
<seg id="2">zmizel</seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">Plav</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2016" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2016" srclang="any">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">ich renne</seg>
<seg id="2">es verschwand</seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">ich schwimme</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2014" srclang="any">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">běžím</seg>
<seg id="2">zmizel</seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">Plav</seg>
</p>
</doc>
@@ -0,0 +1,12 @@
<refset setid="newstest2017" srclang="any" trglang="en">
<doc sysid="ref" docid="1" genre="news" origlang="en">
<p>
<seg id="1">I am running</seg>
<seg id="2"></seg>
</p>
</doc>
<doc sysid="ref" docid="2" genre="news" origlang="de">
<p>
<seg id="1">I am swimming</seg>
</p>
</doc>