instructions for preparing monolingual and parallel data for estimating multilingual embeddings.
wammar · Nov 28, 2015 · bda32f9
1 parent 7d999e1
commit bda32f9
Showing 15 changed files with 1,636 additions and 0 deletions.
diff --git a/europarl-tools/README b/europarl-tools/README
@@ -0,0 +1,57 @@
+Europarl v6 Preprocessing Tools
+===============================
+written by Philipp Koehn and Josh Schroeder
+
+
+Sentence Splitter
+=================
+Usage: ./split-sentences.perl -l [en|de|...] < textfile > splitfile
+
+Uses punctuation and capitalization clues to split paragraphs of
+sentences into one sentence per line. For example:
+
+This is a paragraph. It contains several sentences. "But why," you ask?
+
+goes to:
+
+This is a paragraph.
+It contains several sentences.
+"But why," you ask?
+
+See more information in the Nonbreaking Prefixes section.
+
+
+Nonbreaking Prefixes Directory
+==============================
+
+Nonbreaking prefixes are loosely defined as any word ending in a
+period that does NOT mark the end of a sentence. Basic
+examples are Mr. and Ms. in English.
+
+The sentence splitter and tokenizer included with this release
+both use the nonbreaking prefix files included in this directory.
+
+To add a file for other languages, follow the naming convention
+nonbreaking_prefix.?? and use the two-letter language code you
+intend to use when calling split-sentences.perl and tokenizer.perl.
+
+Both split-sentences and tokenizer will first look for a file for the
+language they are processing, and fall back to English if a file
+for that language is not found. If the nonbreaking_prefixes directory does
+not exist at the same location as the split-sentences.perl and tokenizer.perl
+files, they will not run.
+
+For the splitter, normally a period followed by an uppercase word
+results in a sentence split. If the word preceding the period
+is a nonbreaking prefix, this line break is not inserted.
+
+For the tokenizer, a nonbreaking prefix is not separated from its 
+period with a space.
+
+A special class of prefixes, NUMERIC_ONLY, covers cases where the
+prefix is nonbreaking ONLY when it precedes a number. For example,
+in "Article No. 24 states this." the No. is a nonbreaking prefix.
+However, in "No. It is not true." No functions as a word.
+
+See the example prefix files included here for more examples.
+
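The splitting rule described in the README above can be sketched in a few lines of Perl. This is a simplified illustration only, not the code of split-sentences.perl: the `%prefix` table and the subs `is_sentence_break` and `split_sentences` are hypothetical names, the table is hard-coded instead of being read from a nonbreaking_prefix.?? file, and only periods and ASCII capitals are considered.

```perl
use strict;
use warnings;

# Hypothetical in-memory prefix table (an assumption for this sketch).
# 1 = always nonbreaking, 2 = NUMERIC_ONLY.
my %prefix = (Mr => 1, Ms => 1, No => 2);

# Decide whether the period ending $word, followed by $next, ends a sentence.
sub is_sentence_break {
    my ($word, $next) = @_;
    return 0 unless $word =~ /^(.*)\.$/;            # token must end in a period
    my $class = $prefix{$1} // 0;
    return 0 if $class == 1;                        # "Mr." never ends a sentence
    return 0 if $class == 2 && $next =~ /^[0-9]/;   # "No. 24" stays together
    return $next =~ /^[A-Z]/ ? 1 : 0;               # period + uppercase word => split
}

# Split whitespace-separated text into one sentence per list element.
sub split_sentences {
    my ($text) = @_;
    my @tokens = split ' ', $text;
    my (@sentences, @current);
    for my $i (0 .. $#tokens) {
        push @current, $tokens[$i];
        my $next = $i < $#tokens ? $tokens[$i + 1] : '';
        if ($i == $#tokens || is_sentence_break($tokens[$i], $next)) {
            push @sentences, join(' ', @current);
            @current = ();
        }
    }
    return @sentences;
}

print "$_\n" for split_sentences(
    'Article No. 24 states this. No. It is not true. Ask Mr. Smith.');
```

In this sketch, "No." stays attached to "24" in the first sentence but stands alone as its own sentence when followed by "It", which mirrors the NUMERIC_ONLY behaviour the README describes.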
diff --git a/europarl-tools/de-xml.perl b/europarl-tools/de-xml.perl
@@ -0,0 +1,41 @@
+#!/usr/bin/perl -w
+
+use strict;
+
+my $corpus = $ARGV[0];
+my $l1 = $ARGV[1];
+my $l2 = $ARGV[2];
+my $out = $ARGV[3];
+
+print STDERR "de-xml.perl: processing $corpus.$l1 & .$l2 to $out\n";
+
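+open(F,"<","$corpus.$l1") or die "de-xml.perl: cannot read $corpus.$l1: $!\n";
+open(E,"<","$corpus.$l2") or die "de-xml.perl: cannot read $corpus.$l2: $!\n";
+open(FO,">","$out.$l1") or die "de-xml.perl: cannot write $out.$l1: $!\n";
+open(EO,">","$out.$l2") or die "de-xml.perl: cannot write $out.$l2: $!\n";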
+
+my $i=0;
+
+while(my $f = <F>) {
+  my $e = <E>;
+  $i++;
+  chomp($e); chomp($f);  # strip only a trailing newline, never a real character
+  next if ($e =~ /^<.+>$/) && ($f =~ /^<.+>$/);
+  if (($e =~ /^<.+>$/) || ($f =~ /^<.+>$/)) {
+    print STDERR "MISMATCH[$i]: $e <=> $f\n";
+    next;
+  }
+  if (($e =~ /<.+>/) || ($f =~ /<.+>/)) {
+      print STDERR "TAGS IN TEXT, STRIPPING[$i]: $e <=> $f\n";
+      $e =~ s/ *<[^>]+> */ /g;
+      $e =~ s/^ +//;
+      $e =~ s/ +$//;
+      $f =~ s/ *<[^>]+> */ /g;
+      $f =~ s/^ +//;
+      $f =~ s/ +$//;
+      print STDERR "TAGS STRIPPED: $e <=> $f\n";
+  }
+
+  print FO $f."\n";
+  print EO $e."\n";
+}
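To make the three cases in the loop above easier to follow, here is a small standalone sketch that reuses the script's regular expressions on single line pairs. The sub names `strip_tags` and `classify_pair` are hypothetical and do not appear in de-xml.perl; the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Remove inline tags using the same substitutions as the script:
# each tag and its surrounding spaces collapses to one space, then trim.
sub strip_tags {
    my ($line) = @_;
    $line =~ s/ *<[^>]+> */ /g;
    $line =~ s/^ +//;
    $line =~ s/ +$//;
    return $line;
}

# Classify an aligned pair of lines:
#   'skip'     both lines consist entirely of markup
#   'mismatch' only one side is a markup line
#   'keep'     both sides are text (possibly with inline tags to strip)
sub classify_pair {
    my ($f, $e) = @_;
    my $f_markup = ($f =~ /^<.+>$/) ? 1 : 0;
    my $e_markup = ($e =~ /^<.+>$/) ? 1 : 0;
    return 'skip'     if $f_markup && $e_markup;
    return 'mismatch' if $f_markup || $e_markup;
    return 'keep';
}

print classify_pair('<P>', '<P>'), "\n";
print classify_pair('<P>', 'Hello'), "\n";
print strip_tags('Hello <i>world</i> !'), "\n";
```

One consequence of the `^<.+>$` test is worth keeping in mind: a line such as `<b>Hi</b>` begins with `<` and ends with `>`, so it is classified as a markup line even though it contains text.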
diff --git a/europarl-tools/nonbreaking_prefixes/README b/europarl-tools/nonbreaking_prefixes/README
@@ -0,0 +1 @@
+used the nonbreaking prefixes from http://dl.dropbox.com/u/23664530/nonbreaking_prefixes.zip