High-performance address search engine designed for fuzzy matching and partial queries. Built with Rust for speed and efficiency, with Python bindings for easy integration.
LFAS implements a two-stage retrieval architecture inspired by "Efficient Query Evaluation using a Two-Level Retrieval Process" (Crane et al., 2017). The system uses distinctive tokens for candidate selection and comprehensive token matching for ranking.
For a detailed explanation of the implementation and design decisions (in Portuguese), see: "Como tornar buscas de endereços rápidas e precisas".
- Fuzzy Search — Handles typos and partial address descriptions
- Field-Aware Indexing — Optimized for structured address data (street, city, state, ZIP, etc.)
- BM25F Scoring — Advanced relevance ranking with field-specific weights
- Persistent Storage — LMDB-backed index for fast disk I/O
- High Throughput — Batch indexing at 100K+ docs/sec
- Low Latency — Sub-200ms query response times
- Token Strategies — Distinctive tokens for candidate filtering, weak tokens (n-grams) for scoring
```
          +------------------+
          |   Streamlit UI   |
          +--------+---------+
                   |
              +----+-----+
              |  Python  |
              | Bindings |
              +----+-----+
                   |
+------------------+--------------------+
|            Rust Core Engine           |
+---------------------------------------+
|  Tokenizer --> Index --> Scorer       |
|   - N-grams & Phrases                 |
|   - Inverted Index (LMDB)             |
|   - BM25F Ranking                     |
+---------------------------------------+
```
- Rust 1.93+
- Python 3.13+
- Make
```shell
git clone <repository-url>
cd lfas
make develop
```

Sync Python dependencies and activate the virtual environment:

```shell
uv sync
source .venv/bin/activate
```

Build options:

```shell
make develop   # Development mode
make release   # Optimized release
```

Launch the web interface:

```shell
streamlit run app.py
```

Access the web interface at http://localhost:8501.
Upload a CSV file with address data. You can obtain address data from the National Address Registry for Statistical Purposes, a comprehensive database of georeferenced residential addresses across Brazil. Required columns:
| Column | Description |
|---|---|
| `id` | Address ID |
| `rua` | Street name |
| `municipio` | City |
| `estado` | State |
| `cep` | ZIP code |
| `bairro` | Neighborhood |
| `tipo_logradouro` | Street type |
| `numero` | House number |
| `complemento` | Address complement |
| `nome` | Name / identifier |
```python
import time
from pathlib import Path

import pandas as pd

from lfas import SearchEngine

index_folder = Path("./my_index")
index_folder.mkdir(exist_ok=True)
metadata_file = index_folder / "metadata.bin"

engine = SearchEngine(db_path=index_folder)
df = pd.read_csv("addresses.csv")

chunk_size = 500_000
total_rows = len(df)

for i in range(0, total_rows, chunk_size):
    batch_start_time = time.time()
    chunk = df.iloc[i : i + chunk_size]

    batch_data = []
    for row_idx, row in chunk.iterrows():
        doc_id = int(row["id"])
        # Exclude 'id' from indexed fields
        doc_dict = {k: str(v) for k, v in row.items() if pd.notna(v) and k != "id"}
        batch_data.append((doc_id, doc_dict))

    engine.index_batch(batch_data)
    elapsed = time.time() - batch_start_time
    print(f"Indexed {i + len(chunk):,}/{total_rows:,} docs in {elapsed:.2f}s")

engine.flush()
engine.save_metadata(metadata_file)
print(f"Indexed {total_rows:,} documents")
```

Searching a saved index:

```python
from pathlib import Path

import pandas as pd

from lfas import SearchEngine

index_folder = Path("./my_index")
index_folder.mkdir(exist_ok=True)
metadata_file = index_folder / "metadata.bin"

engine = SearchEngine(db_path=index_folder)
engine.load_metadata(metadata_file)

df = pd.read_csv("addresses.csv")

results = engine.search(
    query={"rua": "Mauriti", "municipio": "Belem", "numero": "31"},
    top_k=10,
    blocking_k=1000,
)

for doc_id, score in results:
    row = df.query(f"id=={doc_id}").to_dict(orient="records")[0]
    print(
        f"""
        ID={row['id']}
        Similarity={score:.2f}
        State={row['estado']}
        Municipality={row['municipio']}
        Neighborhood={row['bairro']}
        ZipCode={row['cep']}
        Street={row['rua']}
        Number={row['numero']}
        """
    )
```

Distinctive tokens are used in Round 1 to narrow the candidate set via union operations:
- CEP patterns: `66095-000`
- House numbers: `31`, `500`
- State abbreviations: `PA`, `MA`
- N-grams with address types: `rua 123`, `br 010`
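The patterns above can be sketched with a few regular expressions. This is a hypothetical illustration only; the real rules live in `src/tokenizer.rs` and are more elaborate:

```python
import re

# Hypothetical sketch of distinctive-token extraction; the actual
# tokenizer in src/tokenizer.rs handles many more cases.
CEP_RE = re.compile(r"\b\d{5}-?\d{3}\b")      # CEP patterns, e.g. 66095-000
NUMBER_RE = re.compile(r"\b\d{1,5}\b")        # house numbers, e.g. 31
STATE_RE = re.compile(r"\b(pa|ma|sp|rj)\b")   # state abbreviations (subset)

def distinctive_tokens(text: str) -> set[str]:
    text = text.lower()
    tokens = set(CEP_RE.findall(text))
    tokens |= set(NUMBER_RE.findall(text))
    tokens |= set(STATE_RE.findall(text))
    return tokens

print(distinctive_tokens("Travessa Mauriti 31 Belem PA"))  # {'31', 'pa'}
```

Because these tokens are rare, their posting lists are short, which keeps the Round 1 union cheap.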
Weak tokens are 3-character n-grams generated from all tokens. They improve recall for partial matches and feed the BM25F scorer in Round 2.
Input: `"Travessa Mauriti 31 Belem PA"`

- Distinctive: `["31", "pa", "travessa 31"]`
- All: `["travessa", "mauriti", "31", "belem", "pa", "mau", "uri", "iti", ...]`
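The n-gram expansion above can be reproduced with a simple sliding window (an illustrative sketch; the real generation happens in `src/tokenizer.rs`):

```python
def char_ngrams(token: str, n: int = 3) -> list[str]:
    # Sliding window of n characters; tokens shorter than n are kept whole.
    if len(token) <= n:
        return [token]
    return [token[i : i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("mauriti"))  # ['mau', 'aur', 'uri', 'rit', 'iti']
```

A query n-gram only needs to match somewhere in the field, which is what makes partial and misspelled inputs recoverable.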
Indexing throughput: ~9,000+ docs/sec at ~500 bytes/doc (compressed). Search latency: under 200ms for most queries.
```shell
cargo bench --bench search_benchmark
```

```
single_field_rare_term     time: [~145 us]
multi_field_common_terms   time: [~295 us]
```
- Round 1 — Candidate Retrieval: Union of posting lists for distinctive tokens produces the candidate set.
- Round 2 — Ranking: BM25F scores all candidates using the full token set, including weak n-grams.
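The two rounds can be sketched over a toy in-memory inverted index. Plain term overlap stands in for BM25F here, and the data is made up; the point is the candidate union followed by full-token ranking:

```python
from collections import defaultdict

# Toy two-round retrieval over an in-memory inverted index.
postings: dict[str, set[int]] = defaultdict(set)
docs = {
    1: ["travessa", "mauriti", "31", "belem", "pa"],
    2: ["rua", "mauriti", "500", "belem", "pa"],
    3: ["avenida", "nazare", "31", "belem", "pa"],
}
for doc_id, tokens in docs.items():
    for tok in tokens:
        postings[tok].add(doc_id)

def search(distinctive: list[str], all_tokens: list[str], top_k: int = 2):
    # Round 1: union of posting lists for distinctive tokens.
    candidates: set[int] = set()
    for tok in distinctive:
        candidates |= postings.get(tok, set())
    # Round 2: rank all candidates with the full token set
    # (overlap count here; BM25F in the real engine).
    scored = [(d, sum(t in docs[d] for t in all_tokens)) for d in candidates]
    return sorted(scored, key=lambda pair: -pair[1])[:top_k]

print(search(["31", "pa"], ["travessa", "mauriti", "31", "belem", "pa"]))
```

Only documents sharing at least one distinctive token ever reach the scorer, which is what keeps Round 2 cheap even with large n-gram vocabularies.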
Default weights (configurable in `src/engine.rs`):

| Field | Weight |
|---|---|
| `numero` | 10.0 |
| `cep` | 8.0 |
| `rua` | 5.0 |
| `municipio` | 3.0 |
| `bairro` | 2.0 |
| `complemento` | 1.5 |
| `estado` | 1.0 |
| `nome` | 1.0 |
| `tipo_logradouro` | 0.5 |
Fixed-length identifier fields (`cep`, `estado`, `numero`) default to `b = 0.0`. Free-text fields use `b = 0.75`. Configurable in `src/engine.rs`.
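The interaction between field weights and per-field `b` can be illustrated with the standard BM25F term score. The constants and field statistics below are made up for the example and are not LFAS's actual values:

```python
# Illustrative BM25F term score: field-weighted, length-normalized
# term frequency folded into a BM25-style saturation curve.
K1 = 1.2
WEIGHTS = {"numero": 10.0, "cep": 8.0, "rua": 5.0}
B = {"numero": 0.0, "cep": 0.0, "rua": 0.75}  # identifiers skip length norm

def bm25f_term(tf_by_field, len_by_field, avg_len_by_field, idf):
    tf = 0.0
    for field, raw_tf in tf_by_field.items():
        # b = 0 makes norm = 1, so field length never penalizes identifiers
        norm = 1 - B[field] + B[field] * len_by_field[field] / avg_len_by_field[field]
        tf += WEIGHTS[field] * raw_tf / norm
    return idf * tf / (K1 + tf)

score = bm25f_term(
    tf_by_field={"rua": 1, "numero": 1},
    len_by_field={"rua": 4, "numero": 1},
    avg_len_by_field={"rua": 3.0, "numero": 1.0},
    idf=2.0,
)
print(round(score, 3))  # 1.842
```

Setting `b = 0.0` on identifier fields makes sense because a CEP or house number always has the same length; normalizing by field length would only add noise.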
Configurable in `src/storage/lmdb.rs`:

```rust
pub const BATCH_SIZE: usize = 100_000;               // Write buffer size
pub const MAP_SIZE: usize = 10 * 1024 * 1024 * 1024; // 10 GB
```

```shell
make test    # Run all unit and integration tests
make check   # Lint with Clippy and verify formatting
make bench   # Run default benchmark
make doc     # Generate and open documentation
make clean   # Remove build artifacts
```

```shell
cargo bench --bench index_benchmark        # Indexing performance
cargo bench --bench search_benchmark       # Search performance
cargo bench --bench tokenizer_benchmark    # Tokenizer performance
cargo bench --bench persistance_benchmark  # Storage I/O
cargo bench --bench concurrency_benchmark  # Concurrent reads
```

```
lfas/
├── src/
│   ├── engine.rs        Search engine core logic
│   ├── index.rs         Inverted index implementation
│   ├── lib.rs           Library root and type definitions
│   ├── metadata.rs      Document statistics and field lengths
│   ├── postings.rs      Posting lists (bitmaps + frequencies)
│   ├── python.rs        PyO3 Python bindings
│   ├── scorer.rs        BM25F ranking algorithm
│   ├── timing.rs        Performance instrumentation
│   ├── tokenizer.rs     Text processing and n-gram generation
│   ├── storage/
│   │   ├── lmdb.rs      LMDB persistent storage backend
│   │   ├── memory.rs    In-memory storage backend (tests)
│   │   └── mod.rs       Storage trait definition
│   └── bin/
│       └── stub_gen.rs  Python stub generator
├── benches/             Criterion benchmark suites
├── tests/               Integration tests
├── app/                 Streamlit web interface
├── python/lfas/         Python package and SearchEngine wrapper
├── Cargo.toml
└── pyproject.toml
```
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-improvement`
- Commit your changes: `git commit -am 'Add my improvement'`
- Push: `git push origin feature/my-improvement`
- Open a Pull Request
MIT License — see the LICENSE file for details.
- Built with Rust and PyO3
- Persistent storage via LMDB through the heed crate
- Web UI powered by Streamlit
- Benchmarking with Criterion.rs
Crane, M., Trotman, A., & O'Keefe, R. (2017). Efficient Query Evaluation using a Two-Level Retrieval Process. arXiv:1708.01402. https://arxiv.org/abs/1708.01402
Reis, M. (2025). Como tornar buscas de endereços rápidas e precisas. Substack. https://maikerar.substack.com/p/como-tornar-buscas-de-enderecos-rapidas
This project is optimized for Brazilian address data. To adapt it for other address formats, modify the tokenizer rules in `src/tokenizer.rs`.