
lulu-2021/content_indexer


ContentIndexer

ContentIndexer is a small GenServer-based indexing and searching service. I initially created it for my blog, which is based on Markdown. When the total amount of data to be indexed is not huge, this small service can handle it very quickly. It stores the index in a GenServer, so searching is very fast.
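To illustrate the idea of keeping an index in GenServer state, here is a minimal sketch. The module IndexSketch and its functions add/3 and docs_with/2 are hypothetical names made up for this example; they are not ContentIndexer's API, and the real index holds tf-idf weights rather than raw token lists.

```elixir
defmodule IndexSketch do
  # Hypothetical module, not part of ContentIndexer. It only illustrates
  # holding an index in GenServer state so lookups are served from memory.
  use GenServer

  # Client API
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

  def add(pid, doc_id, tokens), do: GenServer.call(pid, {:add, doc_id, tokens})

  def docs_with(pid, token), do: GenServer.call(pid, {:docs_with, token})

  # Server callbacks
  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:add, doc_id, tokens}, _from, index) do
    # The state is a plain map of document id => list of tokens.
    {:reply, :ok, Map.put(index, doc_id, tokens)}
  end

  def handle_call({:docs_with, token}, _from, index) do
    # Collect the ids of all documents containing the token.
    ids = for {id, tokens} <- index, token in tokens, do: id
    {:reply, Enum.sort(ids), index}
  end
end
```

Because every call goes through a single process, reads and writes are serialised, which is simple and fast enough while the data set stays small.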

It uses tf-idf weighting for the actual index. Searching works the same way: the query is weighted with tf-idf and compared against the index by similarity.
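The weighting and matching described above can be sketched roughly as follows. This is not ContentIndexer's code: the module TfIdfSketch and all of its function names are hypothetical, and the use of cosine similarity is an assumption, since the text only says "similarity".

```elixir
defmodule TfIdfSketch do
  # Hypothetical helper, not ContentIndexer's API.

  # Term frequency: count of each token divided by the document length.
  def tf(tokens) do
    total = length(tokens)

    tokens
    |> Enum.frequencies()
    |> Map.new(fn {term, count} -> {term, count / total} end)
  end

  # Inverse document frequency over a corpus given as a list of token lists.
  def idf(corpus) do
    n = length(corpus)

    corpus
    |> Enum.flat_map(&Enum.uniq/1)
    |> Enum.frequencies()
    |> Map.new(fn {term, df} -> {term, :math.log(n / df)} end)
  end

  # tf-idf weights for one token list; terms unknown to the corpus get 0.0.
  def weights(tokens, idf) do
    tokens
    |> tf()
    |> Map.new(fn {term, freq} -> {term, freq * Map.get(idf, term, 0.0)} end)
  end

  # Cosine similarity between two sparse weight maps (assumed measure).
  def cosine(a, b) do
    dot = Enum.reduce(a, 0.0, fn {term, w}, acc -> acc + w * Map.get(b, term, 0.0) end)
    norm = fn m -> m |> Map.values() |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt() end
    denom = norm.(a) * norm.(b)
    if denom == 0.0, do: 0.0, else: dot / denom
  end
end

# Usage: weight the query like a document, then rank documents by similarity.
docs = %{"a.md" => ~w(elixir genserver index), "b.md" => ~w(cooking pasta recipe)}
idf = TfIdfSketch.idf(Map.values(docs))
query = TfIdfSketch.weights(~w(genserver index), idf)

ranked =
  docs
  |> Enum.map(fn {id, tokens} ->
    {id, TfIdfSketch.cosine(query, TfIdfSketch.weights(tokens, idf))}
  end)
  |> Enum.sort_by(fn {_id, score} -> score end, :desc)

IO.inspect(ranked)
```

In this toy corpus "a.md" ranks first because it shares both query terms, while "b.md" shares none and scores zero.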

What is tf-idf?

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
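tf-idf has several variants, and this README does not say which one ContentIndexer uses. One common textbook form, given here only as an illustration, is shown below. The symbols are assumptions for this example: f_{t,d} is the number of times term t occurs in document d, D is the corpus, and N is the number of documents in D.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

A term that appears in every document has idf equal to log 1 = 0, so it contributes nothing to the weight. A term that is frequent in one document but rare across the corpus gets a high weight.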

Helpful blog post

tf-idf background info

Installation

The library is available on Hex. Install the package by adding content_indexer to your list of dependencies in mix.exs:

def deps do
  [{:content_indexer, "~> 0.2.0"}]
end

Usage

The easiest way to see how to use this in your project is to read the test ContentIndexer.TfIdf.IndexProcessTest. The module ContentIndexer.Services.PreProcess has several functions that pre-process both the content and the queries. Since these are passed in as functions, you can write your own versions and pass them into the content tokenisation and query building process.
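The pattern of passing pre-processing functions can be sketched as below. PreProcessSketch, default/1 and tokenise/2 are hypothetical names invented for this example. The actual functions in ContentIndexer.Services.PreProcess and the way they are passed in may differ, so check the test mentioned above and the hex documentation for the real signatures.

```elixir
defmodule PreProcessSketch do
  # Hypothetical module; ContentIndexer.Services.PreProcess may look different.

  # Default pre-processor: lowercase, then split on anything that is not
  # a letter or a digit.
  def default(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end

  # Tokenise `text` with whichever pre-processing function the caller passes in.
  def tokenise(text, pre_process \\ &default/1) when is_function(pre_process, 1) do
    pre_process.(text)
  end
end

# A custom pre-processor that additionally drops a few stop words.
stop_words = MapSet.new(~w(a an the of))

custom = fn text ->
  text
  |> PreProcessSketch.default()
  |> Enum.reject(&MapSet.member?(stop_words, &1))
end

# Default function, then the caller-supplied one.
IO.inspect(PreProcessSketch.tokenise("The Index of a Blog"))
IO.inspect(PreProcessSketch.tokenise("The Index of a Blog", custom))
```

Taking the function as an argument keeps the tokenisation step unaware of how the text is cleaned, so content and queries can share one pre-processor or use different ones.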

I currently use this to process Markdown files for my blog, but it can be useful for any other text-based content.

The hex documentation is here https://hexdocs.pm/content_indexer.

Running tests

Clone the repo and fetch its dependencies:

$ git clone https://github.com/netflakes/content_indexer.git
$ cd content_indexer
$ mix deps.get
$ mix test

License

The source code is licensed under the MIT license.

About

index content using stemming & tf_idf matching
