High-performance address search engine designed for fuzzy matching and partial queries. Built with Rust for speed and efficiency, with Python bindings for easy integration.
LFAS implements a two-stage retrieval architecture inspired by "Efficient Query Evaluation using a Two-Level Retrieval Process" (Crane et al., 2017). The system uses distinctive tokens for candidate selection and comprehensive token matching for ranking.
For a detailed explanation of the implementation and design decisions (in Portuguese), see: "Como tornar buscas de endereços rápidas e precisas".
- Fuzzy Search — Handles typos and partial address descriptions
- Field-Aware Indexing — Optimized for structured address data (street, city, state, ZIP, etc.)
- BM25F Scoring — Advanced relevance ranking with field-specific weights
- Persistent Storage — LMDB-backed index for fast disk I/O
- High Throughput — Batch indexing at 100K+ docs/sec
- Low Latency — Sub-200ms query response times
- Token Strategies — Distinctive tokens for candidate filtering, weak tokens (n-grams) for scoring
```
          +------------------+
          |   Streamlit UI   |
          +--------+---------+
                   |
              +----+-----+
              |  Python  |
              | Bindings |
              +----+-----+
                   |
+------------------+--------------------+
|            Rust Core Engine           |
+---------------------------------------+
|  Tokenizer --> Index --> Scorer       |
|   - N-grams & Phrases                 |
|   - Inverted Index (LMDB)             |
|   - BM25F Ranking                     |
+---------------------------------------+
```
- Rust 1.93+
- Python 3.13+
- Make
```shell
git clone <repository-url>
cd lfas
make develop
```

Sync Python dependencies and activate the virtual environment:

```shell
uv sync
source .venv/bin/activate
```

Build options:

```shell
make develop   # Development mode
make release   # Optimized release
```

Launch the web interface:

```shell
streamlit run app.py
```

Access the web interface at http://localhost:8501.
Upload a CSV file with address data. You can obtain address data from the National Address Registry for Statistical Purposes, a comprehensive database of georeferenced residential addresses across Brazil. Required columns:
| Column | Description |
|---|---|
| `id` | Address ID |
| `rua` | Street name |
| `municipio` | City |
| `estado` | State |
| `cep` | ZIP code |
| `bairro` | Neighborhood |
| `tipo_logradouro` | Street type |
| `numero` | House number |
| `complemento` | Address complement |
| `nome` | Name / identifier |
```python
import time
from pathlib import Path

import pandas as pd

from lfas import SearchEngine

index_folder = Path("./my_index")
index_folder.mkdir(exist_ok=True)
metadata_file = index_folder / "metadata.bin"

engine = SearchEngine(db_path=index_folder)
df = pd.read_csv("addresses.csv")

chunk_size = 500_000
total_rows = len(df)

for i in range(0, total_rows, chunk_size):
    batch_start_time = time.time()
    chunk = df.iloc[i : i + chunk_size]

    batch_data = []
    for row_idx, row in chunk.iterrows():
        doc_id = int(row["id"])
        # Exclude 'id' from indexed fields
        doc_dict = {k: str(v) for k, v in row.items() if pd.notna(v) and k != "id"}
        batch_data.append((doc_id, doc_dict))

    engine.index_batch(batch_data)
    elapsed = time.time() - batch_start_time
    print(f"Indexed {i + len(chunk):,}/{total_rows:,} docs in {elapsed:.2f}s")

engine.flush()
engine.save_metadata(metadata_file)
print(f"Indexed {total_rows:,} documents")
```

Searching a saved index:

```python
from pathlib import Path

import pandas as pd

from lfas import SearchEngine

index_folder = Path("./my_index")
index_folder.mkdir(exist_ok=True)
metadata_file = index_folder / "metadata.bin"

engine = SearchEngine(db_path=index_folder)
engine.load_metadata(metadata_file)

df = pd.read_csv("addresses.csv")

results = engine.search(
    query={"rua": "Mauriti", "municipio": "Belem", "numero": "31"},
    top_k=10,
    blocking_k=1000,
)

for doc_id, score in results:
    row = df.query(f"id=={doc_id}").to_dict(orient="records")[0]
    print(
        f"""
        ID={row['id']}
        Similarity={score:.2f}
        State={row['estado']}
        Municipality={row['municipio']}
        Neighborhood={row['bairro']}
        ZipCode={row['cep']}
        Street={row['rua']}
        Number={row['numero']}
        """
    )
```

Distinctive tokens are used in Round 1 to narrow the candidate set via union operations:
- CEP patterns: `66095-000`
- House numbers: `31`, `500`
- State abbreviations: `PA`, `MA`
- N-grams with address types: `rua 123`, `br 010`
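The patterns above can be sketched with a few regular expressions. This is a hypothetical illustration only; the real rules live in `src/tokenizer.rs` and are more elaborate:

```python
import re

# Hypothetical sketch of distinctive-token extraction; the actual
# tokenizer in src/tokenizer.rs handles many more cases.
CEP_RE = re.compile(r"\b\d{5}-?\d{3}\b")      # CEP patterns, e.g. 66095-000
NUMBER_RE = re.compile(r"\b\d{1,5}\b")        # house numbers, e.g. 31
STATE_RE = re.compile(r"\b(pa|ma|sp|rj)\b")   # state abbreviations (subset)

def distinctive_tokens(text: str) -> set[str]:
    text = text.lower()
    tokens = set(CEP_RE.findall(text))
    tokens |= set(NUMBER_RE.findall(text))
    tokens |= set(STATE_RE.findall(text))
    return tokens

print(distinctive_tokens("Travessa Mauriti 31 Belem PA"))  # {'31', 'pa'}
```

Because these tokens are rare, their posting lists are short, which keeps the Round 1 union cheap.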
Weak tokens are 3-character n-grams generated from all tokens. They improve recall for partial matches and feed the BM25F scorer in Round 2.
Input: `"Travessa Mauriti 31 Belem PA"`

- Distinctive: `["31", "pa", "travessa 31"]`
- All: `["travessa", "mauriti", "31", "belem", "pa", "mau", "uri", "iti", ...]`
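The n-gram expansion above can be reproduced with a simple sliding window (an illustrative sketch; the real generation happens in `src/tokenizer.rs`):

```python
def char_ngrams(token: str, n: int = 3) -> list[str]:
    # Sliding window of n characters; tokens shorter than n are kept whole.
    if len(token) <= n:
        return [token]
    return [token[i : i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("mauriti"))  # ['mau', 'aur', 'uri', 'rit', 'iti']
```

A query n-gram only needs to match somewhere in the field, which is what makes partial and misspelled inputs recoverable.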
Indexing throughput: ~9,000+ docs/sec at ~500 bytes/doc (compressed). Search latency: under 200ms for most queries.
```shell
cargo bench --bench search_benchmark
```

```
single_field_rare_term     time: [~145 us]
multi_field_common_terms   time: [~295 us]
```
- Round 1 — Candidate Retrieval: Union of posting lists for distinctive tokens produces the candidate set.
- Round 2 — Ranking: BM25F scores all candidates using the full token set, including weak n-grams.
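The two rounds can be sketched over a toy in-memory inverted index. Plain term overlap stands in for BM25F here, and the data is made up; the point is the candidate union followed by full-token ranking:

```python
from collections import defaultdict

# Toy two-round retrieval over an in-memory inverted index.
postings: dict[str, set[int]] = defaultdict(set)
docs = {
    1: ["travessa", "mauriti", "31", "belem", "pa"],
    2: ["rua", "mauriti", "500", "belem", "pa"],
    3: ["avenida", "nazare", "31", "belem", "pa"],
}
for doc_id, tokens in docs.items():
    for tok in tokens:
        postings[tok].add(doc_id)

def search(distinctive: list[str], all_tokens: list[str], top_k: int = 2):
    # Round 1: union of posting lists for distinctive tokens.
    candidates: set[int] = set()
    for tok in distinctive:
        candidates |= postings.get(tok, set())
    # Round 2: rank all candidates with the full token set
    # (overlap count here; BM25F in the real engine).
    scored = [(d, sum(t in docs[d] for t in all_tokens)) for d in candidates]
    return sorted(scored, key=lambda pair: -pair[1])[:top_k]

print(search(["31", "pa"], ["travessa", "mauriti", "31", "belem", "pa"]))
```

Only documents sharing at least one distinctive token ever reach the scorer, which is what keeps Round 2 cheap even with large n-gram vocabularies.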
Default weights (configurable in `src/engine.rs`):

| Field | Weight |
|---|---|
| `numero` | 10.0 |
| `cep` | 8.0 |
| `rua` | 5.0 |
| `municipio` | 3.0 |
| `bairro` | 2.0 |
| `complemento` | 1.5 |
| `estado` | 1.0 |
| `nome` | 1.0 |
| `tipo_logradouro` | 0.5 |
Fixed-length identifier fields (`cep`, `estado`, `numero`) default to `b = 0.0`. Free-text fields use `b = 0.75`. Configurable in `src/engine.rs`.
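The interaction between field weights and per-field `b` can be illustrated with the standard BM25F term score. The constants and field statistics below are made up for the example and are not LFAS's actual values:

```python
# Illustrative BM25F term score: field-weighted, length-normalized
# term frequency folded into a BM25-style saturation curve.
K1 = 1.2
WEIGHTS = {"numero": 10.0, "cep": 8.0, "rua": 5.0}
B = {"numero": 0.0, "cep": 0.0, "rua": 0.75}  # identifiers skip length norm

def bm25f_term(tf_by_field, len_by_field, avg_len_by_field, idf):
    tf = 0.0
    for field, raw_tf in tf_by_field.items():
        # b = 0 makes norm = 1, so field length never penalizes identifiers
        norm = 1 - B[field] + B[field] * len_by_field[field] / avg_len_by_field[field]
        tf += WEIGHTS[field] * raw_tf / norm
    return idf * tf / (K1 + tf)

score = bm25f_term(
    tf_by_field={"rua": 1, "numero": 1},
    len_by_field={"rua": 4, "numero": 1},
    avg_len_by_field={"rua": 3.0, "numero": 1.0},
    idf=2.0,
)
print(round(score, 3))  # 1.842
```

Setting `b = 0.0` on identifier fields makes sense because a CEP or house number always has the same length; normalizing by field length would only add noise.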
Configurable in `src/storage/lmdb.rs`:

```rust
pub const BATCH_SIZE: usize = 100_000;               // Write buffer size
pub const MAP_SIZE: usize = 10 * 1024 * 1024 * 1024; // 10 GB
```

```shell
make test    # Run all unit and integration tests
make check   # Lint with Clippy and verify formatting
make bench   # Run default benchmark
make doc     # Generate and open documentation
make clean   # Remove build artifacts
```

```shell
cargo bench --bench index_benchmark        # Indexing performance
cargo bench --bench search_benchmark       # Search performance
cargo bench --bench tokenizer_benchmark    # Tokenizer performance
cargo bench --bench persistance_benchmark  # Storage I/O
cargo bench --bench concurrency_benchmark  # Concurrent reads
```

```
lfas/
├── src/
│   ├── engine.rs        Search engine core logic
│   ├── index.rs         Inverted index implementation
│   ├── lib.rs           Library root and type definitions
│   ├── metadata.rs      Document statistics and field lengths
│   ├── postings.rs      Posting lists (bitmaps + frequencies)
│   ├── python.rs        PyO3 Python bindings
│   ├── scorer.rs        BM25F ranking algorithm
│   ├── timing.rs        Performance instrumentation
│   ├── tokenizer.rs     Text processing and n-gram generation
│   ├── storage/
│   │   ├── lmdb.rs      LMDB persistent storage backend
│   │   ├── memory.rs    In-memory storage backend (tests)
│   │   └── mod.rs       Storage trait definition
│   └── bin/
│       └── stub_gen.rs  Python stub generator
├── benches/             Criterion benchmark suites
├── tests/               Integration tests
├── app/                 Streamlit web interface
├── python/lfas/         Python package and SearchEngine wrapper
├── Cargo.toml
└── pyproject.toml
```
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-improvement`
- Commit your changes: `git commit -am 'Add my improvement'`
- Push: `git push origin feature/my-improvement`
- Open a Pull Request
MIT License — see the LICENSE file for details.
- Built with Rust and PyO3
- Persistent storage via LMDB through the heed crate
- Web UI powered by Streamlit
- Benchmarking with Criterion.rs
Crane, M., Trotman, A., & O'Keefe, R. (2017). Efficient Query Evaluation using a Two-Level Retrieval Process. arXiv:1708.01402. https://arxiv.org/abs/1708.01402
Reis, M. (2025). Como tornar buscas de endereços rápidas e precisas. Substack. https://maikerar.substack.com/p/como-tornar-buscas-de-enderecos-rapidas
This project is optimized for Brazilian address data. To adapt it for other address formats, modify the tokenizer rules in `src/tokenizer.rs`.