A text search engine built using Go
- Title
- URL
- Abstract
- strings.Contains
- regexp
- MatchString
The problem is that these approaches don't scale. A search takes up to 2 seconds for 600k docs, but what if we have 10M docs? The time keeps growing with the number of docs.
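
To make the scaling problem concrete, here is a minimal sketch of the scan-everything approach. The `Document` struct simply mirrors the three fields listed above. The `ID` field, the `scanSearch` helper, the sample data, and the choice to search only the abstract are assumptions made for illustration, not code from this project.

```go
package main

import (
	"fmt"
	"strings"
)

// Document mirrors the Title, URL and Abstract fields listed above.
// The ID field is an assumption added for illustration.
type Document struct {
	ID       int
	Title    string
	URL      string
	Abstract string
}

// scanSearch is a hypothetical helper. It checks every abstract with
// strings.Contains, so the work grows linearly with the number of docs.
func scanSearch(docs []Document, term string) []Document {
	var hits []Document
	for _, d := range docs {
		if strings.Contains(d.Abstract, term) {
			hits = append(hits, d)
		}
	}
	return hits
}

func main() {
	// Made-up sample documents.
	docs := []Document{
		{ID: 0, Title: "Cat", URL: "https://example.com/cat", Abstract: "A small domesticated cat."},
		{ID: 1, Title: "Dog", URL: "https://example.com/dog", Abstract: "A loyal domesticated dog."},
	}
	for _, d := range scanSearch(docs, "cat") {
		fmt.Println(d.ID, d.Title)
	}
}
```

Every query touches every document, which is why the time keeps growing. Note also that `strings.Contains` matches substrings, so the term `cat` matches inside `domesticated` too, and both sample documents are returned.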
We will use an approach called an inverted index.
- We will pre-process the data and build an inverted index from the text.
- We will keep track of each word and the documents it appears in.
- We do tokenization.
- Then we do filtering (lowercasing, dropping common words, stemming).
- Lastly, we do searching.
- We don't go through the docs when searching; we simply look up the index.
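
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not this project's actual code: the names `index`, `tokenize`, `analyze`, `add`, `search` and `intersect` are hypothetical, the stopword set is a tiny sample, and stemming is left out because the Go standard library does not ship a stemmer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// index maps each token to the IDs of the documents that contain it.
type index map[string][]int

// stopwords is a tiny illustrative set, not a complete list.
var stopwords = map[string]struct{}{
	"a": {}, "the": {}, "and": {}, "of": {}, "in": {},
}

// tokenize splits text on any character that is not a letter or a number.
func tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// analyze tokenizes, lowercases and drops stopwords.
// Stemming is omitted in this sketch.
func analyze(text string) []string {
	var out []string
	for _, tok := range tokenize(text) {
		tok = strings.ToLower(tok)
		if _, skip := stopwords[tok]; skip {
			continue
		}
		out = append(out, tok)
	}
	return out
}

// add records docID under every token of text. It assumes documents are
// added in increasing ID order, so each list of IDs stays sorted.
func (idx index) add(docID int, text string) {
	for _, tok := range analyze(text) {
		ids := idx[tok]
		if len(ids) > 0 && ids[len(ids)-1] == docID {
			continue // token already recorded for this document
		}
		idx[tok] = append(ids, docID)
	}
}

// intersect returns the IDs present in both sorted slices.
func intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// search returns the IDs of documents that contain every query token.
// It only reads the index; the documents themselves are never scanned.
func (idx index) search(query string) []int {
	var result []int
	for i, tok := range analyze(query) {
		if i == 0 {
			result = idx[tok]
			continue
		}
		result = intersect(result, idx[tok])
	}
	return result
}

func main() {
	idx := make(index)
	idx.add(0, "A donut on a glass plate")
	idx.add(1, "The glass of water")

	fmt.Println(idx.search("glass"))       // prints [0 1]
	fmt.Println(idx.search("Glass plate")) // prints [0]
}
```

The query goes through the same `analyze` step as the documents, so `Glass` and `glass` resolve to the same index entry. The cost of a search depends on the length of the matching ID lists rather than on the total number of docs.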