maikereis/lfas

LFAS — Lightning Fast Address Search


High-performance address search engine designed for fuzzy matching and partial queries. Built with Rust for speed and efficiency, with Python bindings for easy integration.

Overview

LFAS implements a two-stage retrieval architecture inspired by "Efficient Query Evaluation using a Two-Level Retrieval Process" (Broder et al., 2003). The system uses distinctive tokens for candidate selection and comprehensive token matching for ranking.

For a detailed explanation of the implementation and design decisions (in Portuguese), see: "Como tornar buscas de endereços rápidas e precisas".

Features

  • Fuzzy Search — Handles typos and partial address descriptions
  • Field-Aware Indexing — Optimized for structured address data (street, city, state, ZIP, etc.)
  • BM25F Scoring — Advanced relevance ranking with field-specific weights
  • Persistent Storage — LMDB-backed index for fast disk I/O
  • High Throughput — Batch indexing at ~9,000 docs/sec (see Performance)
  • Low Latency — Sub-200ms query response times
  • Token Strategies — Distinctive tokens for candidate filtering, weak tokens (n-grams) for scoring

Architecture

                +------------------+
                |   Streamlit UI   |
                +--------+---------+
                         |
                    +----+-----+
                    |  Python  |
                    | Bindings |
                    +----+-----+
                         |
    +--------------------+------------------+
    |         Rust Core Engine              |
    +---------------------------------------+
    |  Tokenizer --> Index --> Scorer       |
    |  - N-grams & Phrases                  |
    |  - Inverted Index (LMDB)              |
    |  - BM25F Ranking                      |
    +---------------------------------------+

Prerequisites

  • Rust 1.93+
  • Python 3.13+
  • Make

Installation

git clone <repository-url>
cd lfas
uv sync
make develop

Building

source .venv/bin/activate
make develop   # Development mode
make release   # Optimized release

Running the Application

streamlit run app.py

Access the web interface at http://localhost:8501.

Usage

Indexing Documents

Upload a CSV file with address data. The required columns are listed in the table below.

You can obtain address data from the National Address Registry for Statistical Purposes (CNEFE), a comprehensive database of georeferenced residential addresses across Brazil.

Column           Description
id               Address ID
rua              Street name
municipio        City
estado           State
cep              ZIP code
bairro           Neighborhood
tipo_logradouro  Street type
numero           House number
complemento      Address complement
nome             Name / identifier
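Before indexing, it is worth verifying that an export actually carries all of these columns. A minimal stdlib-only sketch (the inline sample row is illustrative, not real CNEFE data):

```python
import csv
import io

# Columns LFAS expects in the uploaded CSV (per the table above)
REQUIRED = {"id", "rua", "municipio", "estado", "cep", "bairro",
            "tipo_logradouro", "numero", "complemento", "nome"}

# Tiny inline sample standing in for a real export file
sample = io.StringIO(
    "id,rua,municipio,estado,cep,bairro,tipo_logradouro,numero,complemento,nome\n"
    "1,Mauriti,Belem,PA,66095-000,Pedreira,Travessa,31,,\n"
)

# Read only the header row and compare against the required set
header = set(next(csv.reader(sample)))
missing = REQUIRED - header
if missing:
    raise ValueError(f"missing required columns: {sorted(missing)}")
print("CSV schema OK")
```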

Creating the index

import time
import pandas as pd
from pathlib import Path
from lfas import SearchEngine

index_folder = Path("./my_index")
index_folder.mkdir(exist_ok=True)

metadata_file = index_folder / "metadata.bin"

engine = SearchEngine(db_path=index_folder)

df = pd.read_csv("addresses.csv")

chunk_size = 500_000
total_rows = len(df)

for i in range(0, total_rows, chunk_size):
    batch_start_time = time.time()

    chunk = df.iloc[i : i + chunk_size]

    batch_data = []
    for _, row in chunk.iterrows():

        doc_id = int(row["id"])
        # Exclude 'id' from indexed fields
        doc_dict = {k: str(v) for k, v in row.items() if pd.notna(v) and k != "id"} 
        batch_data.append((doc_id, doc_dict))

    engine.index_batch(batch_data)

    elapsed = time.time() - batch_start_time
    print(f"Indexed {i + len(chunk):,}/{total_rows:,} docs in {elapsed:.2f}s")

engine.flush()
engine.save_metadata(metadata_file)

print(f"Indexed {total_rows:,} documents")

Searching Addresses

import pandas as pd
from pathlib import Path
from lfas import SearchEngine

index_folder = Path("./my_index")
index_folder.mkdir(exist_ok=True)

metadata_file = index_folder / "metadata.bin"

engine = SearchEngine(db_path=index_folder)
engine.load_metadata(metadata_file)

df = pd.read_csv("addresses.csv")

results = engine.search(
    query={"rua": "Mauriti", "municipio": "Belem", "numero": "31"},
    top_k=10,
    blocking_k=1000,
)

for doc_id, score in results:
    row = df.loc[df["id"] == doc_id].iloc[0]
    print(
        f"""
    ID={row['id']}
    Similarity={score:.2f}
    State={row['estado']}
    Municipality={row['municipio']}
    Neighborhood={row['bairro']}
    ZipCode={row['cep']}
    Street={row['rua']}
    Number={row['numero']}
    """
    )

Tokenization Strategy

Distinctive Tokens (Candidate Filtering)

Used in Round 1 to narrow the candidate set via union operations:

  • CEP patterns: 66095-000
  • House numbers: 31, 500
  • State abbreviations: PA, MA
  • N-grams with address types: rua 123, br 010
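These rules can be sketched in Python. This is an illustrative approximation only — the actual patterns live in src/tokenizer.rs, and the state and street-type sets below are small hypothetical subsets:

```python
import re

# Illustrative patterns -- the real rules live in src/tokenizer.rs
CEP_RE = re.compile(r"^\d{5}-?\d{3}$")   # CEP patterns like 66095-000
NUMBER_RE = re.compile(r"^\d{1,5}$")     # house numbers like 31, 500
STATES = {"pa", "ma", "sp", "rj"}        # illustrative subset of abbreviations
STREET_TYPES = {"rua", "travessa", "avenida", "br"}

def distinctive_tokens(text: str) -> list[str]:
    tokens = text.lower().split()
    numbers = [t for t in tokens if NUMBER_RE.match(t)]
    out = []
    for tok in tokens:
        # Pair each address-type token with the numeric tokens in the query
        if tok in STREET_TYPES:
            out.extend(f"{tok} {n}" for n in numbers)
        if CEP_RE.match(tok) or NUMBER_RE.match(tok) or tok in STATES:
            out.append(tok)
    return out

print(distinctive_tokens("Travessa Mauriti 31 Belem PA"))
```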

Weak Tokens (Scoring Only)

3-character n-grams from all tokens. Improves recall for partial matches and feeds the BM25F scorer in Round 2.

Example

Input: "Travessa Mauriti 31 Belem PA"

  • Distinctive: ["31", "pa", "travessa 31"]
  • All: ["travessa", "mauriti", "31", "belem", "pa", "mau", "uri", "iti", ...]

Performance

Indexing throughput: roughly 9,000 docs/sec at ~500 bytes/doc (compressed). Search latency: under 200 ms for most queries.

Benchmark Results

cargo bench --bench search_benchmark

single_field_rare_term     time: [~145 us]
multi_field_common_terms   time: [~295 us]

Two-Stage Search

  1. Round 1 — Candidate Retrieval: Union of posting lists for distinctive tokens produces the candidate set.
  2. Round 2 — Ranking: BM25F scores all candidates using the full token set, including weak n-grams.
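The two rounds can be sketched as a toy in-memory model in Python. This is not the Rust implementation: the real engine works over LMDB-backed posting lists and scores with BM25F, whereas the stand-in scorer here just counts token overlap:

```python
from collections import defaultdict

# Toy in-memory index: token -> set of doc ids, plus full token lists per doc
postings: dict[str, set[int]] = defaultdict(set)
doc_tokens: dict[int, list[str]] = {}

def index(doc_id: int, distinctive: list[str], all_tokens: list[str]) -> None:
    for tok in distinctive:
        postings[tok].add(doc_id)
    doc_tokens[doc_id] = all_tokens

def search(distinctive: list[str], all_tokens: list[str], top_k: int = 10):
    # Round 1: union of posting lists for distinctive tokens -> candidate set
    candidates: set[int] = set()
    for tok in distinctive:
        candidates |= postings.get(tok, set())
    # Round 2: rank candidates against the full token set
    # (overlap count is a stand-in for BM25F scoring)
    query = set(all_tokens)
    scored = [(doc, len(query & set(doc_tokens[doc]))) for doc in candidates]
    return sorted(scored, key=lambda x: -x[1])[:top_k]

index(1, ["31", "pa"], ["travessa", "mauriti", "31", "belem", "pa"])
index(2, ["500"], ["rua", "nazare", "500", "belem", "pa"])
print(search(["31", "pa"], ["mauriti", "31", "pa"]))  # doc 1 is the only candidate
```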

Configuration

BM25F Field Weights

Default weights (configurable in src/engine.rs):

Field            Weight
numero           10.0
cep              8.0
rua              5.0
municipio        3.0
bairro           2.0
complemento      1.5
estado           1.0
nome             1.0
tipo_logradouro  0.5

Length Normalization (b values)

Fixed-length identifier fields (cep, estado, numero) default to b=0.0. Free-text fields use b=0.75. Configurable in src/engine.rs.
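As a worked sketch of how b interacts with field length, here is a standard BM25F scoring form in Python. The structure (weighted, length-normalized term frequency folded into a single saturation step) is the textbook formulation; the exact constants and details in src/engine.rs may differ:

```python
import math

def bm25f_score(tf_by_field, len_by_field, avg_len, weights, b_by_field,
                n_docs, doc_freq, k1=1.2):
    # Accumulate weighted, length-normalized term frequency across fields
    tf = 0.0
    for field, raw_tf in tf_by_field.items():
        b = b_by_field[field]
        # b = 0.0 disables length normalization entirely for that field
        norm = 1.0 - b + b * (len_by_field[field] / avg_len[field])
        tf += weights[field] * raw_tf / norm
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * tf / (k1 + tf)

score = bm25f_score(
    tf_by_field={"cep": 1, "rua": 1},
    len_by_field={"cep": 1, "rua": 6},
    avg_len={"cep": 1, "rua": 3},
    weights={"cep": 8.0, "rua": 5.0},
    b_by_field={"cep": 0.0, "rua": 0.75},
    n_docs=1_000_000, doc_freq=120,
)
print(f"{score:.3f}")
```

With b=0.75 on rua, the term's contribution shrinks as the street field grows longer than average; with b=0.0 on cep, field length never affects the score, which is the right behavior for fixed-length identifiers.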

LMDB Settings

Configurable in src/storage/lmdb.rs:

pub const BATCH_SIZE: usize = 100_000;                  // Write buffer size
pub const MAP_SIZE: usize = 10 * 1024 * 1024 * 1024;   // 10 GB

Development

make test       # Run all unit and integration tests
make check      # Lint with Clippy and verify formatting
make bench      # Run default benchmark
make doc        # Generate and open documentation
make clean      # Remove build artifacts

Benchmark Suite

cargo bench --bench index_benchmark        # Indexing performance
cargo bench --bench search_benchmark       # Search performance
cargo bench --bench tokenizer_benchmark    # Tokenizer performance
cargo bench --bench persistance_benchmark  # Storage I/O
cargo bench --bench concurrency_benchmark  # Concurrent reads

Project Structure

lfas/
├── src/
|   ├── engine.rs           Search engine core logic
|   ├── index.rs            Inverted index implementation
|   ├── lib.rs              Library root and type definitions
|   ├── metadata.rs         Document statistics and field lengths
|   ├── postings.rs         Posting lists (bitmaps + frequencies)
|   ├── python.rs           PyO3 Python bindings
|   ├── scorer.rs           BM25F ranking algorithm
|   ├── timing.rs           Performance instrumentation
|   ├── tokenizer.rs        Text processing and n-gram generation
|   ├── storage/
|   |   ├── lmdb.rs         LMDB persistent storage backend
|   |   ├── memory.rs       In-memory storage backend (tests)
|   |   └── mod.rs          Storage trait definition
|   └── bin/
|       └── stub_gen.rs     Python stub generator
├── benches/                Criterion benchmark suites
├── tests/                  Integration tests
├── app/                    Streamlit web interface
├── python/lfas/            Python package and SearchEngine wrapper
├── Cargo.toml
└── pyproject.toml

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-improvement
  3. Commit your changes: git commit -am 'Add my improvement'
  4. Push: git push origin feature/my-improvement
  5. Open a Pull Request

License

MIT License — see the LICENSE file for details.

References

Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., & Zien, J. (2003). Efficient Query Evaluation using a Two-Level Retrieval Process. In Proceedings of CIKM '03.

Reis, M. (2025). Como tornar buscas de endereços rápidas e precisas. Substack. https://maikerar.substack.com/p/como-tornar-buscas-de-enderecos-rapidas


This project is optimized for Brazilian address data. To adapt it for other address formats, modify the tokenizer rules in src/tokenizer.rs.
