Fast Substring Search (Algorithm)

This repository provides a fast and scalable solution for substring search across a collection of text documents. It utilizes a suffix tree data structure to efficiently index and query documents stored on the local file system.

Features

High-Performance Indexing: Efficient substring search with scalable indexing, optimized for large text collections. Document Ingestion & Search: Upload multiple text documents and search through them using an interactive CLI.

Use Case 1: Basic

A basic usage example of the data store (DocStore) is provided in main.py. It demonstrates how to upload documents and perform substring searches. The usage looks like this:

def get_doc_store():
    return DocStore(volume_path="./.data", to_index=True)

if __name__ == "__main__":
    doc_store = get_doc_store()
    
    docs = [
        {"id": str(uuid.uuid4())[:8], "content": "Hello world, this is a test"},
        {"id": str(uuid.uuid4())[:8], "content": "Hewlett Packard, this is a test"},
        {"id": str(uuid.uuid4())[:8], "content": "Google, this is a test"}
    ]
    
    for doc in docs:
        doc_store.add(doc["id"], doc["content"], content_type="text")
        print(f"Added {doc['id']}, content: \"{doc['content']}\"")

    print("")
    
    keywords = ["Hello", "He", "he", "hewlett", "NotThere", "Pack"]
    
    for kw in keywords:
        d = doc_store.search(kw)
        print(f"Search by: \"{kw}\", Matches ({len(d)}): {d}")

You can run main.py directly to try it out.

Use Case 2: Search Over a Collection of Documents

Step 1: Prepare Documents

We start with a CSV file located at fix/df_file.csv, which contains example text data. We'll generate a set of documents from this file to upload into DocStore for searching.

To generate documents, run the following command:

python3 prep/gen-docs.py 10

This will create the specified number of .txt files under the fix/generated directory, which can then be used to populate the DocStore.

Step 2: Upload Documents and Search

To upload and index the generated documents, run:

python3 search.py fix/generated

This will load all .txt files from the fix/generated directory into the DocStore, build the index, and launch an interactive prompt where you can perform substring searches with custom inputs.

Project Structure

[fast-substring-search]/
├── fix/                   # Fixtures/data
├── idx/                 # Indexing implementation using a Suffix Tree
├── prep/                  # Scripts and other tools
├── store/                 # DocStore interface
├── README.md
├── main.py
└── search.py

idx/tree.py contains the core indexing algorithm, based on the Suffix Tree data structure. It implements Ukkonen's algorithm, which is very well known for building suffix trees efficiently.

Appendix

Disable Indexing (for Debugging)

To run the search without using the tree-based indexing (i.e., perform a naive linear scan), use the following command:

INDEX=0 python3 search.py fix/generated

This disables the suffix tree index and performs a direct search over the documents. Running this on a large set of documents allows you to compare performance between the naive and indexed approaches.

Print Tree (for Debugging)

To print the structure of the suffix tree along with the data it contains (useful for debugging), we can run print_nodes() method on doc_store.idx.tree. Following is an example

def get_doc_store():
    return DocStore(volume_path="./.data", to_index=True)

if __name__ == "__main__":
    doc_store = get_doc_store()
    
    docs = [
        {"id": str(uuid.uuid4())[:8], "content": "Hello world"},
        {"id": str(uuid.uuid4())[:8], "content": "Hewlett Packard"},
        {"id": str(uuid.uuid4())[:8], "content": "World"}
    ]
    
    for doc in docs:
        doc_store.add(doc["id"], doc["content"], content_type="text")
    
    doc_store.idx.tree.print_nodes()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Fast Substring Search (Algorithm)

Features

Use Case 1: Basic

Use Case 2: Search Over a Collection of Documents

Step 1: Prepare Documents

Step 2: Upload Documents and Search

Project Structure

Appendix

Disable Indexing (for Debugging)

Print Tree (for Debugging)

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
fix		fix
idx		idx
prep		prep
store		store
.gitignore		.gitignore
README.md		README.md
fzf.py		fzf.py
main.py		main.py
search.py		search.py

ritsource/fast-substring-search

Folders and files

Latest commit

History

Repository files navigation

Fast Substring Search (Algorithm)

Features

Use Case 1: Basic

Use Case 2: Search Over a Collection of Documents

Step 1: Prepare Documents

Step 2: Upload Documents and Search

Project Structure

Appendix

Disable Indexing (for Debugging)

Print Tree (for Debugging)

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages