This project implements a name matching ML model for entity resolution and transaction monitoring. A LightGBM classifier determines whether two names (person or organization) refer to the same entity, using multiple string similarity features including edit distance, Jaccard similarity, TF-IDF cosine similarity, and sentence embeddings.
The codebase implements a three-stage ML pipeline:
- Data Generation (`name_matching/data/`)
  - `generate_names.py`: Creates synthetic names using the `Faker` library for Western/Asian persons and organizations
  - `make_dataset.py`: Generates training pairs (positive and negative examples), using Azure OpenAI to create aliases
- Feature Engineering (`name_matching/features/`)
  - `build_features.py`: Computes 8 similarity features between name pairs (a minimal sketch of a few of them follows the pipeline overview below):
    - Jaccard similarity (token intersection over union)
    - TF-IDF cosine similarity (requires pre-fitted vectorizer)
    - Ratio feature (normalized edit distance)
    - Sorted token ratio (edit distance on sorted tokens)
    - Token set ratio (edit distance on unique sorted tokens)
    - Partial ratio (fuzzy matching)
    - Embedding distance (sentence-transformers: `all-MiniLM-L6-v2`)
    - String length absolute difference
- Model Training (`name_matching/models/`)
  - `train_model.py`: Trains the LightGBM classifier, generates performance plots, and saves the model and TF-IDF vectorizer as pickle files
- Configuration: All paths and column names are centralized in `name_matching/config/Config.ini`
- Logging: Uses structlog throughout, configured via CLI flags (`--silent`, `--human-readable`)
- Entity Types: Handles "PERS" (person) and "ORGA" (organization) entities differently
Raw names → Alias generation (Azure OpenAI) → Positive/Negative pairs →
Feature engineering (TF-IDF + embeddings) → LightGBM training →
Saved models (model_lgb_name_matching.pkl, name_matching_tfidf_ngrams.pkl)
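As a rough illustration of the string-based features listed above, the sketch below computes a few of them for a single name pair. It assumes a `rapidfuzz`-style implementation for the edit-distance ratios; the actual code in `build_features.py` may use different libraries or scaling, and the TF-IDF and embedding features are omitted here.

```python
# Minimal sketch of a few similarity features (assumes rapidfuzz is installed;
# the repository's build_features.py implementation may differ).
from rapidfuzz import fuzz


def jaccard_similarity(a: str, b: str) -> float:
    """Token intersection over union."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def example_features(a: str, b: str) -> dict:
    return {
        "jaccard": jaccard_similarity(a, b),
        "ratio": fuzz.ratio(a, b) / 100,                      # normalized edit distance
        "sorted_token_ratio": fuzz.token_sort_ratio(a, b) / 100,
        "token_set_ratio": fuzz.token_set_ratio(a, b) / 100,
        "partial_ratio": fuzz.partial_ratio(a, b) / 100,
        "len_diff": abs(len(a) - len(b)),
    }


print(example_features("JOHN SMITH", "SMITH JOHN"))
```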
```bash
python -m name_matching.data.generate_names --n_persons 700 --n_orgas 300
python -m name_matching.data.make_dataset --n_neg 10
```

- Requires Azure OpenAI credentials in a `.env` file
- Generates positive pairs via LLM-generated aliases and typos
- Generates negative pairs using edit distance to find hard negatives
- Output: `data/processed/name_matching_pos_pairs.csv` and `name_matching_neg_pairs.csv`
```bash
python -m name_matching.models.train_model --test-size 0.2 --thresh 0.85 --human-readable
```

- Loads positive/negative pairs, builds features, trains the LightGBM classifier
- Output: Model pickles in `models/` and performance plots in `reports/figures/`

Flags:
- `--thresh`: Classification threshold for positive-class prediction
- `-s, --silent`: Enable INFO logging (default behavior, less verbose)
- `-hr, --human-readable`: Pretty-print the classification report
- `-dt, --disable-tqdm`: Disable progress bars
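For orientation, here is a minimal sketch of the core training and thresholding step. The feature matrix, labels, and hyperparameters are placeholders; the real `train_model.py` builds features via the feature pipeline, produces plots, and saves the model and TF-IDF vectorizer as pickles.

```python
# Minimal LightGBM training sketch with a classification threshold (--thresh).
# X and y below are synthetic placeholders, not the repository's real features.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 8))            # 8 similarity features per name pair (placeholder)
y = (X[:, 0] > 0.5).astype(int)     # placeholder labels: 1 = same entity, 0 = different

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

# Apply the classification threshold to the positive-class probability
thresh = 0.85
probs = model.predict_proba(X_test)[:, 1]
preds = (probs >= thresh).astype(int)
print(preds[:10])
```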
Install all dependencies:
```bash
pip install -r requirements.txt
```

Download required NLTK data (for stopwords):

```bash
python -c "import nltk; nltk.download('stopwords')"
```

Required environment variables in `.env`:

```
AZURE_OPENAI_API_VERSION=<version>
AZURE_OPENAI_DEPLOYMENT=<model-name>
AZURE_OPENAI_API_KEY=<key>
AZURE_OPENAI_ENDPOINT=<endpoint-url>
```
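The sketch below only illustrates how these variables would be loaded and passed to the Azure OpenAI client; the prompt and the alias-generation logic in `make_dataset.py` are not reproduced here and the prompt text shown is purely hypothetical.

```python
# Illustrative only: loading .env variables and calling Azure OpenAI.
# The actual alias-generation prompt/logic lives in make_dataset.py.
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    messages=[{"role": "user", "content": "Generate 3 plausible aliases for the name 'John Smith'."}],
)
print(response.choices[0].message.content)
```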
Feature generation requires a pre-fitted TF-IDF vectorizer. During training, `FeatureGenerator.create_tfidf_vectorizer()` creates and saves it. For inference, the saved vectorizer must be loaded first.
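Since the `FeatureGenerator` API is not documented here, the following inference-side sketch works directly with the saved pickles (file names as listed in the pipeline overview) and only computes the TF-IDF cosine feature; the remaining features would be built the same way as during training before calling `model.predict_proba()`.

```python
# Minimal inference sketch: load the saved TF-IDF vectorizer and LightGBM model.
# Assumes training has already produced the pickle files under models/.
import pickle
from sklearn.metrics.pairwise import cosine_similarity

with open("models/name_matching_tfidf_ngrams.pkl", "rb") as f:
    tfidf = pickle.load(f)
with open("models/model_lgb_name_matching.pkl", "rb") as f:
    model = pickle.load(f)

name_a, name_b = "JOHN SMITH", "J SMITH"

# TF-IDF cosine similarity using the pre-fitted vectorizer
vec_a, vec_b = tfidf.transform([name_a]), tfidf.transform([name_b])
tfidf_cosine = cosine_similarity(vec_a, vec_b)[0, 0]
print(f"TF-IDF cosine similarity: {tfidf_cosine:.3f}")
# The full 8-feature vector must be assembled before calling model.predict_proba().
```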
Names are normalized in `make_dataset.py` via `process_text_standard()` (a minimal sketch follows the list below):
- Convert to uppercase
- Remove special characters and punctuation
- Optionally remove stopwords (disabled for names)
- Numeric token removal configurable
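The following sketch approximates that normalization; the exact regexes, option names, and defaults in `process_text_standard()` may differ. It uses the NLTK stopword list mentioned in the setup instructions.

```python
# Illustrative normalization, approximating process_text_standard();
# the repository's exact regexes and option names may differ.
import re
from nltk.corpus import stopwords

STOPWORDS = set(w.upper() for w in stopwords.words("english"))


def normalize_name(text: str, remove_stopwords: bool = False, remove_numbers: bool = False) -> str:
    text = text.upper()                              # 1. uppercase
    text = re.sub(r"[^A-Z0-9\s]", " ", text)         # 2. strip punctuation/special characters
    tokens = text.split()
    if remove_stopwords:                             # 3. optional stopword removal (off for names)
        tokens = [t for t in tokens if t not in STOPWORDS]
    if remove_numbers:                               # 4. configurable numeric-token removal
        tokens = [t for t in tokens if not t.isdigit()]
    return " ".join(tokens)


print(normalize_name("Dr. John   O'Neil-Smith, Jr."))  # -> "DR JOHN O NEIL SMITH JR"
```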
Hard negative mining in `TrainingDataGenerator.generate_neg_mappings()` (see the sketch after this list):
- Sample candidates with same first/last name (confusable pairs)
- Sort remaining by edit distance
- Select the top `n` closest non-matches per positive example
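A simplified sketch of this hard-negative selection is shown below, using Levenshtein distance from `rapidfuzz`; the real `generate_neg_mappings()` additionally samples confusable candidates sharing a first or last name.

```python
# Simplified hard-negative mining: for one anchor name, pick the n closest
# non-matching candidates by edit distance. Illustrative, not the repo's exact logic.
from rapidfuzz.distance import Levenshtein


def hard_negatives(anchor: str, candidates: list[str], n: int = 10) -> list[str]:
    # Exclude exact matches, then sort remaining candidates by edit distance to the anchor
    scored = [(Levenshtein.distance(anchor, c), c) for c in candidates if c != anchor]
    scored.sort(key=lambda pair: pair[0])
    return [name for _, name in scored[:n]]


names = ["JOHN SMITH", "JON SMITH", "JOHN SMYTHE", "MARIA GARCIA", "JOHAN SCHMIDT"]
print(hard_negatives("JOHN SMITH", names, n=2))  # -> ['JON SMITH', 'JOHN SMYTHE']
```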
All models and data are stored in directories specified by `Config.ini`:
- `models/`: Trained model and vectorizer pickles
- `data/raw/`: Synthetic or uploaded raw name lists
- `data/processed/`: Generated positive/negative pairs
- `reports/figures/`: ROC-AUC, feature importance, and PR curves
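The snippet below only illustrates how a `configparser`-style `Config.ini` centralizes such paths; the section and key names are hypothetical and may not match `name_matching/config/Config.ini`.

```python
# Hypothetical illustration of reading path settings from Config.ini with configparser.
# Actual section/key names in name_matching/config/Config.ini may differ.
from configparser import ConfigParser

config = ConfigParser()
config.read("name_matching/config/Config.ini")

# e.g. a [PATHS] section holding the directories listed above (names are assumptions)
models_dir = config.get("PATHS", "models_dir", fallback="models/")
processed_dir = config.get("PATHS", "processed_dir", fallback="data/processed/")
print(models_dir, processed_dir)
```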
Use the provided bash script to run the entire pipeline automatically:
```bash
# Run full pipeline with defaults:
./train_pipeline.sh

# Customize parameters:
./train_pipeline.sh --n-persons 1000 --n-orgas 500 --thresh 0.9 --human-readable

# Enable hyperparameter tuning:
./train_pipeline.sh --tune --n-trials 50

# Skip certain steps (if already completed):
./train_pipeline.sh --skip-generate --skip-dataset

# See all options:
./train_pipeline.sh --help
```

The script provides:
- Color-coded output for easy progress tracking
- Prerequisite checks (Python, .env file, directories)
- Error handling and validation
- Progress summaries and file counts
- Total runtime reporting
- Ability to skip completed steps
Execute steps individually in order:
```bash
# 1. Generate synthetic names (optional if you have real data)
python -m name_matching.data.generate_names --n_persons 700 --n_orgas 300

# 2. Generate training pairs (requires Azure OpenAI)
python -m name_matching.data.make_dataset --n_neg 10

# 3. Train the model
python -m name_matching.models.train_model --test-size 0.2 --thresh 0.85 --human-readable
```

Run the tests:

```bash
pytest tests/ -v
pytest tests/unit_tests/ -v
pytest tests/integration_tests/test_api.py -v
pytest tests/unit_tests/test_predict_model.py -v
```

Start the development server:

```bash
python app.py
```

The API will run on http://localhost:5001
For production deployment with Gunicorn:
```bash
gunicorn -w 4 -b 0.0.0.0:5001 app:app
```

Endpoints:

- `GET /health` - Health check
- `GET /info` - Model information
- `POST /predict` - Single name pair prediction
- `POST /predict/batch` - Batch predictions
See `example_api_usage.py` for comprehensive examples. Basic usage:

```bash
# In terminal 1: Start the API
python app.py

# In terminal 2: Run examples
python example_api_usage.py
```

Or make direct requests:

```bash
curl -X POST http://localhost:5001/predict \
-H "Content-Type: application/json" \
-d '{
"CUST_NAME": "John Smith",
"COUNTERPART_NAME": "J. Smith",
"FT_NO": "FT12345",
"threshold": 0.85
  }'
```
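The same call can be made from Python with `requests`; the payload fields mirror the curl example above, and since the response schema is not documented here the raw JSON is simply printed.

```python
# Python equivalent of the curl call above (response fields are not documented here).
import requests

payload = {
    "CUST_NAME": "John Smith",
    "COUNTERPART_NAME": "J. Smith",
    "FT_NO": "FT12345",
    "threshold": 0.85,
}
resp = requests.post("http://localhost:5001/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```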
The `entity_resolution.py` script implements a complete entity resolution pipeline using graph-based community detection:

```bash
python entity_resolution.py
```

- Load & Preprocess: Loads transaction data, normalizes names using `process_text_standard()`
- Deduplication: Removes duplicate name pairs
- Pairwise Comparison: Generates all unique pairwise combinations (n choose 2 pairs)
- Batch Prediction: Uses the trained model to predict matches for all pairs
- Graph Construction: Creates NetworkX graph where edges represent matched pairs
- Community Detection: Applies Louvain algorithm to find entity clusters
- Entity Assignment: Maps original transaction names to resolved entity IDs
- Visualization: Generates before/after graph visualizations
Input: CSV file with `Cust_Name` and `Counterpart_Name` columns

Output:
- `data/processed/resolved_txns.csv` with `ENTITY_X`, `ENTITY_Y`, `RESOLVED_NAME_X`, `RESOLVED_NAME_Y` columns
- Graph visualizations in `reports/figures/`
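A condensed sketch of the graph-construction and community-detection steps is shown below, assuming name pairs already scored by the model. It uses NetworkX's built-in Louvain implementation; `entity_resolution.py` may use a different library or parameters.

```python
# Condensed sketch of the graph construction, Louvain community detection, and
# entity-id assignment steps. The matched pairs and scores are placeholders.
import networkx as nx

# (name_x, name_y, model_score) for pairs predicted above the threshold
matched_pairs = [
    ("JOHN SMITH", "J SMITH", 0.97),
    ("J SMITH", "JOHN SMYTHE", 0.91),
    ("ACME CORP", "ACME CORPORATION", 0.95),
]

G = nx.Graph()
for name_x, name_y, score in matched_pairs:
    G.add_edge(name_x, name_y, weight=score)   # edges represent predicted matches

# Louvain community detection: each community becomes one resolved entity
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
entity_of = {name: i for i, members in enumerate(communities) for name in members}
print(entity_of)  # maps each original name to its resolved entity id
```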
- Model Training (`train_model.py`): Trains the LightGBM classifier on labeled pairs
- Entity Resolution (`entity_resolution.py`): Uses the trained model to resolve entities in unlabeled transaction data via graph algorithms