instructions for preparing monolingual and parallel data for estimating multilingual embeddings.
Waleed Ammar committed Nov 28, 2015
1 parent 7d999e1, commit bda32f9
Showing 15 changed files with 1,636 additions and 0 deletions.
@@ -0,0 +1,57 @@
Europarl v6 Preprocessing Tools
===============================
written by Philipp Koehn and Josh Schroeder


Sentence Splitter
=================
Usage: ./split-sentences.perl -l [en|de|...] < textfile > splitfile

Uses punctuation and capitalization clues to split paragraphs into
output with one sentence per line. For example:

This is a paragraph. It contains several sentences. "But why," you ask?

becomes:

This is a paragraph.
It contains several sentences.
"But why," you ask?

See the Nonbreaking Prefixes section for more information.


Nonbreaking Prefixes Directory
==============================

A nonbreaking prefix is loosely defined as any word ending in a
period where that period does NOT mark the end of a sentence. Basic
examples are Mr. and Ms. in English.

The sentence splitter and tokenizer included with this release
both use the nonbreaking prefix files included in this directory.

To add a file for another language, follow the naming convention
nonbreaking_prefix.?? and use the two-letter language code you
intend to pass to split-sentences.perl and tokenizer.perl.

Both split-sentences.perl and tokenizer.perl first look for a prefix
file for the language they are processing, and fall back to the
English file if none is found. If the nonbreaking_prefixes directory
does not exist in the same location as split-sentences.perl and
tokenizer.perl, the scripts will not run.

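The lookup order described above can be sketched in Python as follows.
This is only an illustration, not the code used by the Perl scripts:
the function name find_prefix_file and its parameters are hypothetical,
and the directory and file names are taken from the text above.

```python
from pathlib import Path


def find_prefix_file(script_dir, lang, fallback="en"):
    """Return the prefix file for `lang`, or the fallback (English) file.

    Hypothetical helper: mirrors the behaviour described in this README,
    not the actual Perl implementation.
    """
    prefix_dir = Path(script_dir) / "nonbreaking_prefixes"
    if not prefix_dir.is_dir():
        # The README states the scripts will not run without this directory.
        raise FileNotFoundError(f"missing directory: {prefix_dir}")
    for code in (lang, fallback):
        candidate = prefix_dir / f"nonbreaking_prefix.{code}"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no prefix file for '{lang}' or '{fallback}'")
```

For example, find_prefix_file(".", "de") would return the German file
if it exists and the English file otherwise.
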
For the splitter, a period followed by an uppercase word normally
results in a sentence split. If the word preceding the period
is a nonbreaking prefix, this line break is not inserted.

For the tokenizer, a nonbreaking prefix is not separated from its
period with a space.

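As a rough illustration of the tokenizer rule, the Python sketch below
separates a word-final period unless the word is a nonbreaking prefix.
It is a deliberately reduced model: the function names are hypothetical,
and tokenizer.perl handles many more cases than a trailing period.

```python
def split_final_period(word, prefixes):
    """Detach a trailing period unless the word is a nonbreaking prefix."""
    if len(word) > 1 and word.endswith(".") and word[:-1] not in prefixes:
        return [word[:-1], "."]
    return [word]


def tokenize(text, prefixes):
    """Whitespace-split `text` and apply the trailing-period rule."""
    tokens = []
    for word in text.split():
        tokens.extend(split_final_period(word, prefixes))
    return " ".join(tokens)
```

With prefixes {"Mr"}, the input "Mr. Smith left." keeps "Mr." intact
and separates only the final period of "left.".
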
A special class of prefixes, NUMERIC_ONLY, covers words that should be
treated as nonbreaking prefixes ONLY when they come before a number.
For example, in "Article No. 24 states this." the No. is a nonbreaking
prefix. However, in "No. It is not true." the No functions as a word
and the period ends the sentence.

See the example prefix files included here for more examples.
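
The splitting rules above, including NUMERIC_ONLY, can be sketched in
Python as follows. This is a simplified model under stated assumptions,
not the logic of split-sentences.perl: the function names are
hypothetical, prefixes are passed in as in-memory sets instead of being
read from the prefix files, and a break is also allowed before a word
that starts with a digit.

```python
def _starts_sentence(word):
    """True if the word, ignoring opening quotes/brackets, starts upper or digit."""
    core = word.lstrip("\"'([")
    return bool(core) and (core[0].isupper() or core[0].isdigit())


def is_break(word, nxt, prefixes, numeric_only):
    """Decide whether a sentence ends after `word`, given the next word."""
    if word.endswith(("?", "!")):
        return _starts_sentence(nxt)
    if not word.endswith("."):
        return False
    stem = word[:-1]
    if stem in prefixes:
        return False  # ordinary nonbreaking prefix, e.g. "Mr."
    if stem in numeric_only and nxt[0].isdigit():
        return False  # NUMERIC_ONLY prefix followed by a number
    return _starts_sentence(nxt)


def split_sentences(text, prefixes, numeric_only=frozenset()):
    """Split whitespace-separated text into a list of sentences."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        has_next = i + 1 < len(words)
        if has_next and is_break(word, words[i + 1], prefixes, numeric_only):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With prefixes {"Mr", "Ms"} and numeric_only {"No"}, the input
"Article No. 24 states this." stays one sentence, while
"No. It is not true." is split after "No.".
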
@@ -0,0 +1,41 @@
#!/usr/bin/perl -w

use strict;

# Arguments: corpus path prefix, first and second language
# extensions, and output path prefix.
die "usage: de-xml.perl corpus l1 l2 out\n" unless @ARGV == 4;
my ($corpus, $l1, $l2, $out) = @ARGV;

print STDERR "de-xml.perl: processing $corpus.$l1 & .$l2 to $out\n";

open(F, "<", "$corpus.$l1") or die "Cannot open $corpus.$l1: $!";
open(E, "<", "$corpus.$l2") or die "Cannot open $corpus.$l2: $!";
open(FO, ">", "$out.$l1") or die "Cannot write $out.$l1: $!";
open(EO, ">", "$out.$l2") or die "Cannot write $out.$l2: $!";

my $i = 0;

while (my $f = <F>) {
  my $e = <E>;
  $i++;
  chomp($e); chomp($f);  # chomp removes only a trailing newline
  next if ($e =~ /^<.+>$/) && ($f =~ /^<.+>$/);  # both lines are pure markup
  if (($e =~ /^<.+>$/) || ($f =~ /^<.+>$/)) {  # only one side is markup
    print STDERR "MISMATCH[$i]: $e <=> $f\n";
    next;
  }
  if (($e =~ /<.+>/) || ($f =~ /<.+>/)) {  # inline tags: strip and trim
    print STDERR "TAGS IN TEXT, STRIPPING[$i]: $e <=> $f\n";
    $e =~ s/ *<[^>]+> */ /g;
    $e =~ s/^ +//;
    $e =~ s/ +$//;
    $f =~ s/ *<[^>]+> */ /g;
    $f =~ s/^ +//;
    $f =~ s/ +$//;
    print STDERR "TAGS STRIPPED: $e <=> $f\n";
  }

  print FO $f."\n";
  print EO $e."\n";
}
@@ -0,0 +1 @@
used the nonbreaking prefixes from http://dl.dropbox.com/u/23664530/nonbreaking_prefixes.zip