Turkish analysis components for Apache Lucene/Solr 5.3.0
The use of open source software is gaining momentum in Turkey, and the number of Turkish users on the Apache Lucene/Solr (and other Apache projects) mailing lists is increasing. This project uses publicly available Turkish NLP tools to create Apache Lucene/Solr plugins. I created this project in order to promote and support open source.

Stock Lucene/Solr has SnowballPorterFilter(Factory) for the Turkish language. However, this stemmer performs poorly and produces odd collisions. For example, altın, alim, alın, altan, and alıntı are all reduced to the same stem. In other words, they are treated as if they were the same word even though they have different meanings. I will post some other harmful collisions here.
Currently we have five TokenFilters. Detailed documentation is on the way.
TRMorphStemFilter(Factory)
Turkish Stemmer based on TRmorph
This one is not production ready yet. It requires an operating-system-specific foma executable.
I couldn't find an elegant way to convert foma to Java. I am using the "executing shell commands in Java to call flookup" workaround advised in the [FAQ](http://code.google.com/p/foma/wiki/FAQ). If you know something better, please let me know.
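The shell-command workaround can be sketched roughly as follows. This is a minimal illustration, not the code used by TRMorphStemFilter: the class name `FlookupSketch` and its methods are hypothetical, and it assumes that flookup reads one word per line from standard input and writes `input<TAB>output` lines to standard output, with `+?` marking a word it cannot analyze. The sample lines in `main` are illustrative, not real TRmorph output.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class FlookupSketch {

    // Keeps the output side of each "input<TAB>output" line.
    // Lines without a tab (e.g. blank separators) and "+?" results are skipped.
    static List<String> parseAnalyses(List<String> lines) {
        List<String> result = new ArrayList<String>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue;
            }
            String analysis = line.substring(tab + 1).trim();
            if (!analysis.isEmpty() && !"+?".equals(analysis)) {
                result.add(analysis);
            }
        }
        return result;
    }

    // Starts flookup with the given FST file, feeds it a single word, and
    // returns the parsed results. Assumes flookup exits once stdin is closed.
    static List<String> lookup(String flookupPath, String fstPath, String word)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(flookupPath, fstPath);
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // UTF-8 matters here because Turkish words contain non-ASCII letters.
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8))) {
            writer.write(word);
            writer.newLine();
        }

        List<String> lines = new ArrayList<String>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return parseAnalyses(lines);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 3) {
            // Usage: java FlookupSketch <path-to-flookup> <path-to-fst> <word>
            System.out.println(lookup(args[0], args[1], args[2]));
        } else {
            // Without arguments, only the parsing step runs on made-up sample lines.
            List<String> sample = Arrays.asList("kitaplar\tkitap", "", "xyzq\t+?");
            System.out.println(parseAnalyses(sample));
        }
    }
}
```

Starting a new process for every token is expensive, which is one reason this approach is hard to recommend for production use.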
<fieldType name="text_tr_morph" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="org.apache.lucene.analysis.tr.TRMorphStemFilterFactory" lookup="/Applications/foma/flookup" fst="/Volumes/datadisk/Desktop/TRmorph-master/stem.fst" />
</analyzer>
</fieldType>
Zemberek2StemFilter(Factory)
Turkish Stemmer based on Zemberek2
You need three jars inside the solr/collection1/lib directory: zemberek-cekirdek-2.1.3.jar, zemberek-tr-2.1.3.jar, and TurkishAnalysis-5.3.0.jar.
<fieldType name="text_tr_zemberek2" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="org.apache.lucene.analysis.tr.Zemberek2StemFilterFactory" strategy="minMorpheme"/>
</analyzer>
</fieldType>
Zemberek2DeASCIIfyFilter(Factory)
Turkish DeASCIIfier based on Zemberek2
You need three jars inside the solr/collection1/lib directory: zemberek-cekirdek-2.1.3.jar, zemberek-tr-2.1.3.jar, and TurkishAnalysis-5.3.0.jar.
Zemberek3StemFilter(Factory)
Turkish Stemmer based on Zemberek3
Download the tr folder, which contains the dictionary files, and put it under solr/collection1/conf. You need three jars inside the solr/collection1/lib directory: zemberek-morphology-0.9.1.jar, zemberek-core-0.9.1.jar, and TurkishAnalysis-5.3.0.jar. Please note that the zemberek-* jars need to be generated from my fork. Here is the difference over the original repository.
<fieldType name="text_tr_zemberek3" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
</analyzer>
</fieldType>
TurkishDeASCIIfyFilter(Factory)
Translation of Emacs Turkish mode from Lisp into Java.
This filter is intended to be used to allow diacritics-insensitive search for Turkish.
<fieldType name="text_tr_deascii" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="org.apache.lucene.analysis.tr.TurkishDeASCIIfyFilterFactory" preserveOriginal="false"/>
<filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
</analyzer>
</fieldType>
I will post benchmark results of different field types (different stemmers) designed for different use-cases.
##Dependencies
- JRE 1.7 or above
- Apache Maven 3.0.3 or above
- Apache Lucene (Solr) 5.3.0
##Author
Please feel free to contact Ahmet Arslan at iorixxx at yahoo dot com
if you have any questions, comments or contributions.