A simple implementation of text hyphenation using the pattern-based approach described in Franklin Mark Liang's thesis "Word Hy-phen-a-tion by Com-put-er".
The Hyphenator preserves the case of the original text and hyphenates words that contain non-alphabetic characters, as long as those characters are not placed in the middle of a word.
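For readers unfamiliar with the pattern approach, the following is a minimal sketch of how Liang-style patterns are applied to a single word. It is not this library's implementation: the class `LiangSketch` and its methods are hypothetical names, and the small pattern set in the usage note is illustrative only (a real pattern list contains thousands of entries). Digits in a pattern are priorities for the gap they sit in, the highest priority found for each gap wins, and an odd result allows a hyphen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch of Liang-style pattern matching; not the library's code. */
final class LiangSketch {

    /** Maps the letters of each pattern (digits removed) to its per-gap priorities. */
    static Map<String, int[]> compile(List<String> patterns) {
        Map<String, int[]> compiled = new HashMap<>();
        for (String pattern : patterns) {
            StringBuilder letters = new StringBuilder();
            List<Integer> levels = new ArrayList<>();
            levels.add(0); // priority of the gap before the first letter
            for (char c : pattern.toCharArray()) {
                if (Character.isDigit(c)) {
                    levels.set(levels.size() - 1, c - '0');
                } else {
                    letters.append(c);
                    levels.add(0); // priority of the gap after this letter
                }
            }
            compiled.put(letters.toString(),
                    levels.stream().mapToInt(Integer::intValue).toArray());
        }
        return compiled;
    }

    /** Returns the positions (number of preceding characters) where a hyphen is allowed. */
    static List<Integer> hyphenIndexes(String word, Map<String, int[]> compiled) {
        // Dots mark the word boundaries, as in patterns such as ".ć8".
        String padded = "." + word.toLowerCase(Locale.ROOT) + ".";
        int[] scores = new int[padded.length() + 1]; // scores[i] = gap before padded[i]
        for (int start = 0; start < padded.length(); start++) {
            for (int end = start + 1; end <= padded.length(); end++) {
                int[] levels = compiled.get(padded.substring(start, end));
                if (levels == null) {
                    continue;
                }
                for (int k = 0; k < levels.length; k++) {
                    scores[start + k] = Math.max(scores[start + k], levels[k]);
                }
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int j = 1; j < word.length(); j++) {
            if (scores[j + 1] % 2 == 1) { // odd priority means a hyphen is allowed
                result.add(j);
            }
        }
        return result;
    }
}
```

With the patterns `hy3ph`, `he2n`, `hena4`, `hen5at`, `1na`, `n2at`, `1tio`, `2io` and `o2n`, the sketch yields the indexes 2 and 6 for "hyphenation", that is "hy-phen-ation". Because matching is done on a lower-cased copy, the original casing of the word can be kept when the hyphens are inserted.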
To use this tool you need a list of hyphenation patterns for the desired language. Pattern lists can be downloaded from the TeX hyphenation repository; choose the *.pat.txt file for your language.
Alternatively, you can download a dictionary file, e.g. from the LibreOffice repositories. In this case choose the file with the "hyph" prefix, e.g. hyph_pl_PL.dic for Polish. Make sure to remove the tags at the beginning of the file and pass only the patterns themselves to the Hyphenator. The beginning of such a file looks like this:
UTF-8 <---- encoding info, use it to load file
LEFTHYPHENMIN 2 <------ this value can be passed to the Hyphenator as minLeadingLength
RIGHTHYPHENMIN 2 <------ this value can be passed to the Hyphenator as minTrailingLength
.ć8 <--- pattern
.4ć3ć8
.ćł8
.2ć1ń8
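Stripping the header can be done by hand or in code. The sketch below is one possible way to do it, assuming the file starts with an encoding line followed by the two tags shown above; `DicFileSketch`, `extractPatterns` and `headerValue` are hypothetical helper names and are not part of the library. Dictionary files may contain other tags or comment lines that this sketch does not handle, so check the file you downloaded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for hyph_*.dic files; not part of the library. */
final class DicFileSketch {

    /** Returns only the pattern lines, dropping the encoding line and the header tags. */
    static List<String> extractPatterns(List<String> lines) {
        List<String> patterns = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            boolean encodingLine = i == 0; // assumption: the first line names the encoding
            boolean headerTag = line.startsWith("LEFTHYPHENMIN")
                    || line.startsWith("RIGHTHYPHENMIN");
            if (encodingLine || headerTag || line.isEmpty()) {
                continue;
            }
            patterns.add(line);
        }
        return patterns;
    }

    /** Reads a numeric header such as "LEFTHYPHENMIN 2"; returns the fallback if absent. */
    static int headerValue(List<String> lines, String tag, int fallback) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith(tag + " ")) {
                return Integer.parseInt(line.substring(tag.length()).trim());
            }
        }
        return fallback;
    }
}
```

The lines themselves can be read with `Files.readAllLines(Path.of("hyph_pl_PL.dic"), StandardCharsets.UTF_8)`, using the charset named on the first line of the file. The two header values can then be passed on as minLeadingLength and minTrailingLength.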
The library is published to Maven Central and can be added to your project in the standard way:
<dependencies>
...
<dependency>
<groupId>io.github.nianna</groupId>
<artifactId>hyphenator</artifactId>
<version>1.0.1</version>
</dependency>
</dependencies>
Input text is automatically split into tokens. By default, the first and last chunks of a hyphenated word must each be at least 2 characters long. A space is used as the word separator and a hyphen as the syllable separator.
List<String> patterns = ... // load the patterns from the patterns file
Hyphenator hyphenator = new Hyphenator(patterns);
HyphenatedText result = hyphenator.hyphenateText("Testing (automatic) HyPHeNAtioN by computer!");
System.out.println(result.read()); // prints "Test-ing (au-to-mat-ic) Hy-PHeN-AtioN by com-put-er!"
You can also hyphenate a single token:
HyphenatedToken result = hyphenator.hyphenateToken("Testing");
System.out.println(result.read("-")); // prints "Test-ing"
System.out.println(result.hyphenIndexes()); // prints [4]
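The indexes are useful when you want to render the hyphens yourself, for example as soft hyphens (U+00AD) in HTML. The sketch below assumes, based on the "Test-ing" and [4] output above, that each index is the number of characters preceding the hyphen and that the indexes are in ascending order. `JoinSketch.insertSeparators` is a hypothetical helper, not a library method.

```java
import java.util.List;

/** Hypothetical helper that rebuilds a hyphenated string from hyphen indexes. */
final class JoinSketch {

    /** Inserts the separator at each index; indexes are assumed to be ascending. */
    static String insertSeparators(String token, List<Integer> hyphenIndexes, String separator) {
        StringBuilder out = new StringBuilder();
        int previous = 0;
        for (int index : hyphenIndexes) {
            // copy the chunk between the previous hyphen position and this one
            out.append(token, previous, index).append(separator);
            previous = index;
        }
        return out.append(token.substring(previous)).toString();
    }
}
```

For example, `JoinSketch.insertSeparators("Testing", List.of(4), "-")` returns "Test-ing", and passing "\u00AD" as the separator produces a string with invisible soft hyphens.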
To suppress hyphens near the start or end of a word, you can specify the following properties when creating the Hyphenator instance:
- minLeadingLength (default: 2) - a hyphen can be placed only after the first minLeadingLength characters
- minTrailingLength (default: 2) - a hyphen can be placed only before the last minTrailingLength characters
List<String> patterns = ... // load the patterns from the patterns file
HyphenatorProperties properties = new HyphenatorProperties(3, 4);
Hyphenator hyphenator = new Hyphenator(patterns, properties);
HyphenatedText result = hyphenator.hyphenateText("Testing (automatic) HyPHeNAtioN by computer!");
System.out.println(result.read()); // prints "Testing (auto-matic) HyPHeN-AtioN by com-puter!"
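The effect of the two properties can be pictured as a filter over the candidate hyphen positions. The following sketch is an illustration of that rule and not the library's code; `MinLengthSketch.filter` is a hypothetical name, and the word length is assumed to count only the letters of the word, without surrounding punctuation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the minLeadingLength / minTrailingLength rule. */
final class MinLengthSketch {

    /** Keeps only hyphen positions that leave enough characters on both sides. */
    static List<Integer> filter(List<Integer> hyphenIndexes, int wordLength,
                                int minLeading, int minTrailing) {
        List<Integer> kept = new ArrayList<>();
        for (int index : hyphenIndexes) {
            boolean enoughBefore = index >= minLeading;
            boolean enoughAfter = wordLength - index >= minTrailing;
            if (enoughBefore && enoughAfter) {
                kept.add(index);
            }
        }
        return kept;
    }
}
```

Applied to "automatic" (9 letters, candidates 2, 4 and 7) with the values 3 and 4, only index 4 survives, which matches the "auto-matic" output above. For "Testing" the single candidate 4 leaves just 3 trailing characters, so the word stays unhyphenated.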
To customize the separator on which the text is split into tokens, pass the tokenSeparator argument to the Hyphenator.
List<String> patterns = ... // load the patterns from the patterns file
Hyphenator hyphenator = new Hyphenator(patterns, new HyphenatorProperties(), "|");
HyphenatedText result = hyphenator.hyphenateText("Testing|(automatic)|HyPHeNAtioN|by|computer!");
System.out.println(result.read()); // prints "Test-ing (au-to-mat-ic) Hy-PHeN-AtioN by com-put-er!"
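If you pre-process text yourself before handing it to the Hyphenator, note that a separator such as "|" is a regular-expression metacharacter in plain Java. The sketch below shows one way to split on a literal separator. It says nothing about how the library splits text internally, and `TokenSplitSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: splits text on a literal separator string. */
final class TokenSplitSketch {

    static List<String> split(String text, String tokenSeparator) {
        // Pattern.quote treats "|" as a literal instead of the regex alternation operator.
        return Arrays.asList(text.split(Pattern.quote(tokenSeparator)));
    }
}
```

Calling `TokenSplitSketch.split("Testing|(automatic)|by", "|")` returns the three tokens "Testing", "(automatic)" and "by".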
To customize the word or syllable separator used when building the hyphenated text, pass arguments to the HyphenatedText::read method. The first argument is the word separator and the second is the syllable separator.
List<String> patterns = ... // load the patterns from the patterns file
Hyphenator hyphenator = new Hyphenator(patterns);
HyphenatedText result = hyphenator.hyphenateText("Testing (automatic) HyPHeNAtioN by computer!");
String hyphenatedText = result.read("|", "_");
System.out.println(hyphenatedText); // prints "Test_ing|(au_to_mat_ic)|Hy_PHeN_AtioN|by|com_put_er!"