Skip to content

ritsource/fast-substring-search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fast Substring Search (Algorithm)

This repository provides a fast and scalable solution for substring search across a collection of text documents. It utilizes a suffix tree data structure to efficiently index and query documents stored on the local file system.

Features

High-Performance Indexing: Efficient substring search with scalable indexing, optimized for large text collections. Document Ingestion & Search: Upload multiple text documents and search through them using an interactive CLI.

Use Case 1: Basic

A basic usage example of the data store (DocStore) is provided in main.py. It demonstrates how to upload documents and perform substring searches. The usage looks like this:

def get_doc_store():
    return DocStore(volume_path="./.data", to_index=True)

if __name__ == "__main__":
    doc_store = get_doc_store()
    
    docs = [
        {"id": str(uuid.uuid4())[:8], "content": "Hello world, this is a test"},
        {"id": str(uuid.uuid4())[:8], "content": "Hewlett Packard, this is a test"},
        {"id": str(uuid.uuid4())[:8], "content": "Google, this is a test"}
    ]
    
    for doc in docs:
        doc_store.add(doc["id"], doc["content"], content_type="text")
        print(f"Added {doc['id']}, content: \"{doc['content']}\"")

    print("")
    
    keywords = ["Hello", "He", "he", "hewlett", "NotThere", "Pack"]
    
    for kw in keywords:
        d = doc_store.search(kw)
        print(f"Search by: \"{kw}\", Matches ({len(d)}): {d}")

You can run main.py directly to try it out.

Use Case 2: Search Over a Collection of Documents

Step 1: Prepare Documents

We start with a CSV file located at fix/df_file.csv, which contains example text data. We'll generate a set of documents from this file to upload into DocStore for searching.

To generate documents, run the following command:

python3 prep/gen-docs.py 10

This will create the specified number of .txt files under the fix/generated directory, which can then be used to populate the DocStore.

Step 2: Upload Documents and Search

To upload and index the generated documents, run:

python3 search.py fix/generated

This will load all .txt files from the fix/generated directory into the DocStore, build the index, and launch an interactive prompt where you can perform substring searches with custom inputs.

Project Structure

[fast-substring-search]/
├── fix/                   # Fixtures/data
├── idx/                 # Indexing implementation using a Suffix Tree
├── prep/                  # Scripts and other tools
├── store/                 # DocStore interface
├── README.md
├── main.py
└── search.py

idx/tree.py contains the core indexing algorithm, based on the Suffix Tree data structure. It implements Ukkonen's algorithm, which is very well known for building suffix trees efficiently.

Appendix

Disable Indexing (for Debugging)

To run the search without using the tree-based indexing (i.e., perform a naive linear scan), use the following command:

INDEX=0 python3 search.py fix/generated

This disables the suffix tree index and performs a direct search over the documents. Running this on a large set of documents allows you to compare performance between the naive and indexed approaches.

Print Tree (for Debugging)

To print the structure of the suffix tree along with the data it contains (useful for debugging), we can run print_nodes() method on doc_store.idx.tree. Following is an example

def get_doc_store():
    return DocStore(volume_path="./.data", to_index=True)

if __name__ == "__main__":
    doc_store = get_doc_store()
    
    docs = [
        {"id": str(uuid.uuid4())[:8], "content": "Hello world"},
        {"id": str(uuid.uuid4())[:8], "content": "Hewlett Packard"},
        {"id": str(uuid.uuid4())[:8], "content": "World"}
    ]
    
    for doc in docs:
        doc_store.add(doc["id"], doc["content"], content_type="text")
    
    doc_store.idx.tree.print_nodes()

About

Fast and scalable substring search over text files using Suffix Trees (Ukkonen’s Algorithm).

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages