This repository provides a fast and scalable solution for substring search across a collection of text documents. It utilizes a suffix tree data structure to efficiently index and query documents stored on the local file system.
High-Performance Indexing: Efficient substring search with scalable indexing, optimized for large text collections. Document Ingestion & Search: Upload multiple text documents and search through them using an interactive CLI.
A basic usage example of the data store (DocStore) is provided in main.py. It demonstrates how to upload documents and perform substring searches. The usage looks like this:
def get_doc_store():
return DocStore(volume_path="./.data", to_index=True)
if __name__ == "__main__":
doc_store = get_doc_store()
docs = [
{"id": str(uuid.uuid4())[:8], "content": "Hello world, this is a test"},
{"id": str(uuid.uuid4())[:8], "content": "Hewlett Packard, this is a test"},
{"id": str(uuid.uuid4())[:8], "content": "Google, this is a test"}
]
for doc in docs:
doc_store.add(doc["id"], doc["content"], content_type="text")
print(f"Added {doc['id']}, content: \"{doc['content']}\"")
print("")
keywords = ["Hello", "He", "he", "hewlett", "NotThere", "Pack"]
for kw in keywords:
d = doc_store.search(kw)
print(f"Search by: \"{kw}\", Matches ({len(d)}): {d}")You can run main.py directly to try it out.
We start with a CSV file located at fix/df_file.csv, which contains example text data. We'll generate a set of documents from this file to upload into DocStore for searching.
To generate documents, run the following command:
python3 prep/gen-docs.py 10This will create the specified number of .txt files under the fix/generated directory, which can then be used to populate the DocStore.
To upload and index the generated documents, run:
python3 search.py fix/generatedThis will load all .txt files from the fix/generated directory into the DocStore, build the index, and launch an interactive prompt where you can perform substring searches with custom inputs.
[fast-substring-search]/
├── fix/ # Fixtures/data
├── idx/ # Indexing implementation using a Suffix Tree
├── prep/ # Scripts and other tools
├── store/ # DocStore interface
├── README.md
├── main.py
└── search.py
idx/tree.py contains the core indexing algorithm, based on the Suffix Tree data structure. It implements Ukkonen's algorithm, which is very well known for building suffix trees efficiently.
To run the search without using the tree-based indexing (i.e., perform a naive linear scan), use the following command:
INDEX=0 python3 search.py fix/generatedThis disables the suffix tree index and performs a direct search over the documents. Running this on a large set of documents allows you to compare performance between the naive and indexed approaches.
To print the structure of the suffix tree along with the data it contains (useful for debugging), we can run print_nodes() method on doc_store.idx.tree. Following is an example
def get_doc_store():
return DocStore(volume_path="./.data", to_index=True)
if __name__ == "__main__":
doc_store = get_doc_store()
docs = [
{"id": str(uuid.uuid4())[:8], "content": "Hello world"},
{"id": str(uuid.uuid4())[:8], "content": "Hewlett Packard"},
{"id": str(uuid.uuid4())[:8], "content": "World"}
]
for doc in docs:
doc_store.add(doc["id"], doc["content"], content_type="text")
doc_store.idx.tree.print_nodes()