lucene-solr-analysis-turkish

Turkish analysis components for Apache Lucene/Solr 5.3.0

The use of open source software is gaining momentum in Turkey, and the number of Turkish users on the Apache Lucene/Solr (and other Apache projects) mailing lists is growing. This project uses publicly available Turkish NLP tools to create Apache Lucene/Solr plugins. I created it to promote and support open source.

Stock Lucene/Solr ships SnowballPorterFilter(Factory) for the Turkish language. However, this stemmer performs poorly and produces odd collisions. For example, altın, alim, alın, altan, and alıntı are all reduced to the same stem. In other words, they are treated as the same word even though they have different meanings. I will post other harmful collisions here.
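For comparison, the stock analysis chain that exhibits these collisions can be sketched as a field type. This is only a baseline sketch: the field type name `text_tr_snowball` is hypothetical, while the tokenizer and filter factories are the ones shipped with Solr.

```xml
<!-- Baseline sketch: stock Solr chain with the Snowball Turkish stemmer. -->
<!-- The name "text_tr_snowball" is a placeholder chosen for illustration. -->
<fieldType name="text_tr_snowball" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <!-- Stock stemmer that the filters in this project are meant to replace -->
    <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
  </analyzer>
</fieldType>
```

Running the example words through this field type on the Analysis screen of the Solr admin UI is one way to inspect the stems it produces.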

Currently we have five TokenFilters. Detailed documentation is on the way.

TRMorphStemFilter(Factory)

Turkish stemmer based on TRmorph. This one is not production ready yet. It requires an operating-system-specific foma executable. I couldn't find an elegant way to convert foma to Java, so I am using the "executing shell commands in Java to call flookup" workaround advised in the [FAQ](http://code.google.com/p/foma/wiki/FAQ). If you know something better, please let me know.

<fieldType name="text_tr_morph" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.TRMorphStemFilterFactory" lookup="/Applications/foma/flookup" fst="/Volumes/datadisk/Desktop/TRmorph-master/stem.fst" />
  </analyzer>
</fieldType>

Zemberek2StemFilter(Factory)

Turkish stemmer based on Zemberek2. You need the two Zemberek2 jars, zemberek-cekirdek-2.1.3.jar and zemberek-tr-2.1.3.jar, plus TurkishAnalysis-5.3.0.jar, inside the solr/collection1/lib directory.

<fieldType name="text_tr_zemberek2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek2StemFilterFactory" strategy="minMorpheme"/>
  </analyzer>
</fieldType>

Zemberek2DeASCIIfyFilter(Factory)

Turkish DeASCIIfier based on Zemberek2. You need the two Zemberek2 jars, zemberek-cekirdek-2.1.3.jar and zemberek-tr-2.1.3.jar, plus TurkishAnalysis-5.3.0.jar, inside the solr/collection1/lib directory.
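No field type is shown for this filter, so here is a minimal sketch modeled on the other examples. The factory class name is an assumption inferred from the naming pattern of the other filters (package `org.apache.lucene.analysis.tr`, suffix `Factory`), the field type name `text_tr_zemberek2_deascii` is hypothetical, and no attributes are set because none are documented here.

```xml
<!-- Sketch only: class name inferred from the project's naming pattern, not confirmed. -->
<!-- The name "text_tr_zemberek2_deascii" is a placeholder chosen for illustration. -->
<fieldType name="text_tr_zemberek2_deascii" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek2DeASCIIfyFilterFactory"/>
  </analyzer>
</fieldType>
```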

Zemberek3StemFilter(Factory)

Turkish stemmer based on Zemberek3. Download the tr folder, which contains the dictionary files, and put it under solr/collection1/conf. You need three jars, zemberek-morphology-0.9.1.jar, zemberek-core-0.9.1.jar, and TurkishAnalysis-5.3.0.jar, inside the solr/collection1/lib directory. Please note that the zemberek-* jars need to be generated from my fork. Here is the difference over the original repository.

<fieldType name="text_tr_zemberek3" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
  </analyzer>
</fieldType>

TurkishDeASCIIfyFilter(Factory)

Translation of the Emacs Turkish mode from Lisp into Java. This filter is intended to allow diacritics-insensitive search for Turkish.

<fieldType name="text_tr_deascii" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.TurkishDeASCIIfyFilterFactory" preserveOriginal="false"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
  </analyzer>
</fieldType>
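The field type above sets preserveOriginal="false". As a sketch of the other setting, the variant below turns it on and leaves out the stemmer. The behavior is an assumption: by analogy with the attribute of the same name on solr.ASCIIFoldingFilterFactory, it presumably keeps the original token alongside the converted one, but that is not confirmed here. The field type name `text_tr_deascii_preserve` is hypothetical.

```xml
<!-- Sketch only: the effect of preserveOriginal="true" is assumed, not verified. -->
<!-- The name "text_tr_deascii_preserve" is a placeholder chosen for illustration. -->
<fieldType name="text_tr_deascii_preserve" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.TurkishDeASCIIfyFilterFactory" preserveOriginal="true"/>
  </analyzer>
</fieldType>
```

Checking the emitted tokens on the Analysis screen of the Solr admin UI would show whether both forms are actually produced.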

I will post benchmark results of different field types (different stemmers) designed for different use-cases.

##Dependencies

JRE 1.7 or above
Apache Maven 3.0.3 or above
Apache Lucene (Solr) 5.3.0

##Author

Please feel free to contact Ahmet Arslan at iorixxx at yahoo dot com if you have any questions, comments, or contributions.
