instructions for preparing monolingual and parallel data for estimating multilingual embeddings.
Waleed Ammar committed Nov 28, 2015
1 parent 7d999e1 commit bda32f9
Showing 15 changed files with 1,636 additions and 0 deletions.
57 changes: 57 additions & 0 deletions europarl-tools/README
@@ -0,0 +1,57 @@
Europarl v6 Preprocessing Tools
===============================
written by Philipp Koehn and Josh Schroeder


Sentence Splitter
=================
Usage: ./split-sentences.perl -l [en|de|...] < textfile > splitfile

Uses punctuation and capitalization clues to split paragraphs of
sentences into files with one sentence per line. For example:

This is a paragraph. It contains several sentences. "But why," you ask?

goes to:

This is a paragraph.
It contains several sentences.
"But why," you ask?

See more information in the Nonbreaking Prefixes section.


Nonbreaking Prefixes Directory
==============================

Nonbreaking prefixes are loosely defined as any word ending in a
period that does NOT mark the end of a sentence. Basic examples
are Mr. and Ms. in English.

The sentence splitter and tokenizer included with this release
both use the nonbreaking prefix files included in this directory.

To add a file for other languages, follow the naming convention
nonbreaking_prefix.?? and use the two-letter language code you
intend to use when calling split-sentences.perl and tokenizer.perl.

Both split-sentences and tokenizer will first look for a file for the
language they are processing, and fall back to English if a file
for that language is not found. If the nonbreaking_prefixes directory does
not exist at the same location as the split-sentences.perl and tokenizer.perl
files, they will not run.
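
The lookup order described above can be sketched as follows. This is a
minimal illustration, not the code from the scripts themselves: the helper
name prefix_file_for and its error messages are hypothetical, and only the
nonbreaking_prefix.?? naming convention comes from this README.

```perl
use strict;
use warnings;

# Hypothetical helper: return the prefix file for $lang inside $dir,
# falling back to the English file. Dies if the directory is missing
# or if neither file exists.
sub prefix_file_for {
    my ($dir, $lang) = @_;
    die "Prefix directory $dir not found\n" unless -d $dir;
    for my $code ($lang, "en") {
        my $path = "$dir/nonbreaking_prefix.$code";
        return $path if -e $path;
    }
    die "No prefix file for '$lang' or 'en' in $dir\n";
}
```

Calling prefix_file_for("nonbreaking_prefixes", "de") would return the
German file when it exists and the English one otherwise.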

For the splitter, a period followed by an uppercase word normally
results in a sentence split. If the word preceding the period
is a nonbreaking prefix, this line break is not inserted.
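
As a rough sketch of that rule (not the actual logic of
split-sentences.perl, which handles many more cases), the following
splits on whitespace-separated words only. The prefix list and the
function name split_sentences are assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustrative prefix set; the real tools read theirs from the prefix files.
my %NONBREAKING = map { $_ => 1 } qw(Mr Ms Dr);

# Break after any word ending in a period when the next word starts with
# an uppercase letter, unless the word before the period is a prefix.
sub split_sentences {
    my ($text) = @_;
    my @words = split ' ', $text;
    my (@sentences, @current);
    for my $i (0 .. $#words) {
        push @current, $words[$i];
        next if $i == $#words;    # last word: nothing follows it
        if ($words[$i] =~ /^(.*)\.$/
            && !$NONBREAKING{$1}
            && $words[$i + 1] =~ /^[A-Z]/) {
            push @sentences, join(' ', @current);
            @current = ();
        }
    }
    push @sentences, join(' ', @current) if @current;
    return @sentences;
}

print "$_\n" for split_sentences("Mr. Smith arrived. He sat down.");
```

With the sample input, the period after Mr is kept inside the first
sentence, while the period after "arrived" produces a line break.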

For the tokenizer, a nonbreaking prefix is not separated from its
period with a space.

A special class of prefixes, NUMERIC_ONLY, covers words that act as
nonbreaking prefixes ONLY when they come before a number. For example,
in "Article No. 24 states this." the No. is a nonbreaking prefix.
However, in "No. It is not true." No is a word that ends a sentence.
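
One way to picture the NUMERIC_ONLY check is the sketch below. The
in-memory table, its 'always' / 'numeric_only' labels, and the function
name keeps_period_attached are all hypothetical; how the marker is
actually written in the prefix files is not shown in this README.

```perl
use strict;
use warnings;

# Hypothetical table: 'always' prefixes never end a sentence,
# 'numeric_only' prefixes do so unless a number follows.
my %PREFIX = (Mr => 'always', No => 'numeric_only');

# Return 1 if the period ending $word should NOT end the sentence,
# given the word $next that follows it; return 0 otherwise.
sub keeps_period_attached {
    my ($word, $next) = @_;
    (my $stem = $word) =~ s/\.$// or return 0;   # no trailing period
    my $kind = $PREFIX{$stem} or return 0;       # not a listed prefix
    return 1 if $kind eq 'always';
    return $next =~ /^[0-9]/ ? 1 : 0;            # numeric_only case
}
```

Under these assumptions, keeps_period_attached("No.", "24") is true and
keeps_period_attached("No.", "It") is false, matching the two example
sentences.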

See the example prefix files included here for more examples.

41 changes: 41 additions & 0 deletions europarl-tools/de-xml.perl
@@ -0,0 +1,41 @@
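#!/usr/bin/perl -w

use strict;

my $corpus = $ARGV[0];  # input path prefix: reads $corpus.$l1 and $corpus.$l2
my $l1 = $ARGV[1];      # first language code (used as file extension)
my $l2 = $ARGV[2];      # second language code (used as file extension)
my $out = $ARGV[3];     # output path prefix: writes $out.$l1 and $out.$l2

print STDERR "de-xml.perl: processing $corpus.$l1 & .$l2 to $out\n";

open(F, "<", "$corpus.$l1") or die "Cannot read $corpus.$l1: $!\n";
open(E, "<", "$corpus.$l2") or die "Cannot read $corpus.$l2: $!\n";
open(FO, ">", "$out.$l1") or die "Cannot write $out.$l1: $!\n";
open(EO, ">", "$out.$l2") or die "Cannot write $out.$l2: $!\n";

my $i=0;  # 1-based line counter used in diagnostics

while(my $f = <F>) {
  defined(my $e = <E>) or die "$corpus.$l2 is shorter than $corpus.$l1 (line ".($i+1).")\n";
  $i++;
  chomp($e); chomp($f);  # chomp removes only a trailing newline, unlike chop
  next if ($e =~ /^<.+>$/) && ($f =~ /^<.+>$/);  # markup-only on both sides: drop
  if (($e =~ /^<.+>$/) || ($f =~ /^<.+>$/)) {    # markup-only on one side: report and drop
    print STDERR "MISMATCH[$i]: $e <=> $f\n";
    next;
  }
  if (($e =~ /<.+>/) || ($f =~ /<.+>/)) {        # inline tags: strip them, keep the text
    print STDERR "TAGS IN TEXT, STRIPPING[$i]: $e <=> $f\n";
    $e =~ s/ *<[^>]+> */ /g;
    $e =~ s/^ +//;
    $e =~ s/ +$//;
    $f =~ s/ *<[^>]+> */ /g;
    $f =~ s/^ +//;
    $f =~ s/ +$//;
    print STDERR "TAGS STRIPPED: $e <=> $f\n";
  }

  print FO $f."\n";
  print EO $e."\n";
}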
1 change: 1 addition & 0 deletions europarl-tools/nonbreaking_prefixes/README
@@ -0,0 +1 @@
These nonbreaking prefix files were taken from http://dl.dropbox.com/u/23664530/nonbreaking_prefixes.zip