BNPL Production Pipeline Deployment v0.1.0 #9
Conversation
Set up Phase 2 production pipeline development for BNPL ML models.
- Production branch established with proper naming
- All 6 model artifacts validated and ready
- Target: <100ms single-transaction inference (research batch baseline: ~2ms per transaction)
- Architecture: Single-transaction → Multi-model → Shadow logging
- Add engineer_single_transaction() method for real-time inference
- Validates against actual BigQuery json_body structure
- Handles unknown categories with consistent 36-feature output
- 13-16ms processing time, <100ms SLA requirement
- Compatible with fitted preprocessor artifacts
🏗️ Architecture Decision: Single-Transaction Feature Engineering

Problem Context

The existing batch feature engineering pipeline processes 84K transactions from BigQuery in ~2ms per transaction (optimized for throughput). Production API inference requires processing individual JSON transactions in <100ms (optimized for latency).

Design Decision: Separate Methods vs. Combined Approach

Chosen Approach: Separate Methods

```python
# Batch processing (unchanged)
def engineer_features(self, sample_size=None) -> Tuple[DataFrame, Dict]:
    # BigQuery → 84K transactions → batch feature engineering
    ...

# Real-time processing (new)
def engineer_single_transaction(self, transaction_data: dict) -> DataFrame:
    # API JSON → 1 transaction → real-time feature engineering
    ...
```

Alternative Rejected: Combined Method

```python
def engineer_features(self, data=None, single_transaction=False):
    if single_transaction:
        ...  # Handle single transaction
    else:
        ...  # Handle batch processing
```

Rationale for Separation
Technical Implementation Details

One-Hot Encoding Consistency Challenge
Performance Optimization
Production Validation
Impact on Development Velocity

This architecture enables parallel development:
Next: Multi-model predictor with shadow/champion/A-B testing modes
- Log warnings when unknown categories are encountered in one-hot encoding
- Monitor payment_provider, device_type, product_category, purchase_context, risk_scenario
- Includes 5% frequency threshold guidance for model retraining
- Maintains 36-feature consistency while alerting on data drift
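A minimal sketch of the pattern these commits describe: encode against a fixed category vocabulary so unknown values still produce the same feature layout, and log a warning for drift monitoring. The field names come from the list above; the category values and helper name are illustrative assumptions, not the fitted preprocessor's actual mappings.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Illustrative vocabulary only — in the real pipeline these mappings come from the fitted preprocessor artifacts.
KNOWN_CATEGORIES = {
    "payment_provider": ["provider_a", "provider_b", "provider_c"],
    "device_type": ["mobile", "desktop", "tablet"],
}

def one_hot_with_unknown_logging(transaction: dict) -> pd.DataFrame:
    """Encode categorical fields against a fixed vocabulary so the output width never changes."""
    encoded = {}
    for field, categories in KNOWN_CATEGORIES.items():
        value = transaction.get(field)
        if value not in categories:
            # Unknown category: warn for drift monitoring, encode as all zeros for this field.
            logger.warning("Unknown %s=%r encountered; encoding as all-zero one-hot", field, value)
        for category in categories:
            encoded[f"{field}_{category}"] = 1 if value == category else 0
    return pd.DataFrame([encoded])
```

Because the vocabulary is fixed, the column count stays constant regardless of input, which is what keeps the 36-feature contract intact while the warnings feed the 5% retraining threshold described above.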
🎯 Multi-Model Predictor Architecture & Calibration Decisions

Deployment Mode Abstraction Strategy

Production ML systems must navigate the tension between operational flexibility and performance optimization. We designed a deployment mode abstraction that addresses this through runtime configuration rather than code branching:

```python
# Single interface, multiple behaviors
predictor = BNPLPredictor(mode="shadow")    # All 4 models for comparison
predictor = BNPLPredictor(mode="champion")  # Ridge only for production
predictor = BNPLPredictor(mode="logistic")  # Specific model for experiments
```

This pattern emerged from recognizing that model deployment follows a lifecycle: shadow deployment for validation → champion selection based on performance → ongoing experiments for optimization. Each phase requires different computational resources and prediction outputs, but maintaining separate codebases creates deployment complexity and testing overhead.

The abstraction enables seamless transitions between deployment phases without code changes. Shadow mode loads all models for comprehensive comparison, while champion mode optimizes for latency by loading only the selected model. This operational flexibility proved critical during testing, where switching between modes revealed performance characteristics that wouldn't be apparent in single-model implementations.

Probability Calibration Challenge & Technical Resolution

The ensemble model presented a fundamental challenge in probabilistic machine learning: combining predictions from estimators with different output semantics. Our VotingClassifier contains heterogeneous estimators:

```python
ensemble.estimators_ = [
    RidgeClassifier(alpha=1000.0),            # decision_function() → unbounded scores
    LogisticRegression(penalty='l1'),         # predict_proba() → calibrated probabilities
    LogisticRegression(penalty='elasticnet')  # predict_proba() → calibrated probabilities
]
```

The naive approach is to push sklearn's built-in decision_function() output through a sigmoid:

```python
# Statistically incorrect approach
decision_score = ridge.decision_function(X)[0]
pseudo_prob = 1 / (1 + np.exp(-decision_score))  # Not a calibrated probability
```

While this produces values in [0,1], it assumes the decision function is already calibrated to the logistic scale. In reality, RidgeClassifier's decision function represents the distance from the separating hyperplane, not log-odds ratios. The sigmoid transformation creates probability-like values that lack a meaningful relationship to true class probabilities.

Production-Ready Solution: Selective Averaging

The implemented solution recognizes that probabilistic consistency trumps model completeness in ensemble predictions:

```python
def _predict_ensemble(self, ensemble_model, features_processed):
    calibrated_predictions = []
    for estimator in ensemble_model.estimators_:
        if hasattr(estimator, 'predict_proba'):
            pred_proba = estimator.predict_proba(features_processed)[0]
            calibrated_predictions.append(pred_proba[1])  # P(default=True)
        # Exclude estimators without calibrated outputs
    return np.mean(calibrated_predictions)
```

This approach maintains probabilistic integrity by averaging only estimators with proper calibration. The ensemble now represents the mean of two LogisticRegression models (L1 and ElasticNet penalties), which both output true probabilities derived from sigmoid-transformed linear combinations.

The trade-off is explicit: we sacrifice one model's contribution to ensemble averaging while preserving individual Ridge predictions for comparison. This maintains the ability to evaluate all models while ensuring ensemble predictions have a valid probabilistic interpretation.

Long-term Calibration Strategy

The proper solution involves wrapping RidgeClassifier with calibration during training:

```python
from sklearn.calibration import CalibratedClassifierCV

ridge_calibrated = CalibratedClassifierCV(
    RidgeClassifier(alpha=1000.0),
    cv=3,
    method='isotonic'  # or 'sigmoid'
)
ridge_calibrated.fit(X_train, y_train)
# Now provides predict_proba() with properly calibrated outputs
```

CalibratedClassifierCV learns a post-hoc calibration mapping from the base classifier's decision function to true probabilities using validation data. This preserves the Ridge model's decision boundary while enabling probabilistic interpretation. We documented this approach in known issues for the next training cycle rather than attempting runtime calibration, which would require access to training data and violate our stateless inference requirement.

A/B Testing Architecture Decision

We initially considered implementing A/B testing capabilities within the predictor class—traffic splitting, experiment tracking, and performance comparison. However, this conflates prediction generation with experiment management, violating separation of concerns. A/B testing requires several orthogonal capabilities:

These responsibilities naturally belong in the Shadow Mode Controller (Step 4), which orchestrates the interaction between prediction generation and business decision-making. The predictor remains focused on efficient model inference, while the controller handles experiment design and evaluation.

This separation enables independent scaling: predictors can be optimized for latency while controllers manage complex experimental logic. It also simplifies testing, as predictor behavior remains deterministic regardless of experimental configuration.

Performance & Operational Characteristics

The architecture achieves sub-2ms inference across all deployment modes while maintaining 36-feature consistency. Stateless design eliminates external dependencies during prediction, enabling horizontal scaling without coordination overhead.

Model loading is optimized per deployment context: shadow mode's 2.17ms includes all four models, while champion mode's 0.78ms reflects single-model efficiency. This performance differential validates the deployment mode abstraction—different operational needs require different computational trade-offs.

Next: Basic API endpoints will integrate these predictors with HTTP interfaces, while A/B testing functionality will be implemented in Step 4 (Shadow Mode Controller) to maintain architectural separation between prediction generation and experiment management.
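As a complement to the calibration roadmap above, a reliability curve is one way to confirm that calibrated outputs actually track observed default rates. The sketch below uses synthetic stand-in data purely for illustration; the real check would run against the existing BNPL validation split.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (36 features to mirror the production contract).
X, y = make_classification(n_samples=5_000, n_features=36, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

ridge_calibrated = CalibratedClassifierCV(RidgeClassifier(alpha=1000.0), cv=3, method="isotonic")
ridge_calibrated.fit(X_train, y_train)

# Bin predicted probabilities and compare each bin's mean prediction to the observed default rate.
prob_default = ridge_calibrated.predict_proba(X_val)[:, 1]
observed_rate, mean_predicted = calibration_curve(y_val, prob_default, n_bins=10)
for predicted, observed in zip(mean_predicted, observed_rate):
    print(f"mean predicted={predicted:.2f}  observed default rate={observed:.2f}")
```

A well-calibrated model keeps the two columns close; large gaps would indicate the isotonic/sigmoid mapping needs revisiting in the next training cycle.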
- Support shadow/champion/specific deployment modes
- Resolve ensemble probability calibration for mixed estimator types
- Exclude uncalibrated RidgeClassifier from ensemble averaging
- Achieve <2ms inference across all deployment modes
- Document calibration solution roadmap for next training cycle
- Establish Sr Principal Engineer level technical documentation requirements
- Define educational-first approach for PR comments and code documentation
- Set standards for technical depth, trade-off analysis, and prose style
- Include repository-specific commands and development workflows
- FastAPI integration with comprehensive input validation
- Shadow mode deployment supporting all 4 models
- 17ms end-to-end latency with <100ms SLA compliance
- Health monitoring and model status endpoints
- Graceful error handling with descriptive responses
🌐 Production API Architecture & Integration Strategy

HTTP Interface Design Philosophy

Building production ML systems requires bridging the gap between sophisticated machine learning pipelines and simple HTTP interfaces that frontend applications and external services can consume. Our API design prioritizes developer experience while maintaining the computational efficiency achieved in the underlying ML components.

The endpoint structure follows RESTful conventions with a clear separation of concerns: transaction processing, health monitoring, and system introspection occupy distinct routes with appropriate HTTP semantics. This design anticipates integration patterns where external systems need both synchronous prediction capabilities and asynchronous monitoring of system health.

Request Processing Pipeline Architecture

The API implements a layered processing model that transforms external HTTP requests through the complete ML pipeline:

HTTP Request → Input Validation → Feature Engineering → Multi-Model Prediction → Response Formatting

Each layer maintains clear interfaces and error boundaries. Pydantic models enforce input validation at the HTTP boundary, preventing malformed data from propagating into the ML pipeline. This validation occurs before any computational resources are consumed on feature engineering or model inference.

The feature engineering integration demonstrates how stateless processing principles enable scalable API design. Each request creates a self-contained processing context without external dependencies, allowing horizontal scaling without coordination overhead between API instances.

Performance Optimization Through Dependency Injection

FastAPI's dependency injection system enables sophisticated performance optimizations through singleton management of expensive resources:

```python
_feature_engineer: Optional[BNPLFeatureEngineer] = None
_predictor: Optional[BNPLPredictor] = None

async def get_feature_engineer() -> BNPLFeatureEngineer:
    global _feature_engineer
    if _feature_engineer is None:
        _feature_engineer = BNPLFeatureEngineer(client=None, verbose=False)
    return _feature_engineer
```

This pattern ensures that model loading—the most expensive initialization operation—occurs only once per API instance. Subsequent requests reuse loaded models and fitted preprocessors, dramatically reducing response latency. The singleton pattern works because our ML components are stateless: they maintain no request-specific state that would create concurrency issues.

The dependency injection also facilitates testing by allowing mock implementations during test execution while maintaining production behavior in deployed environments.

Shadow Mode API Integration Strategy

The API serves as the primary interface for shadow mode deployment, where all four models generate predictions for every request while only the champion model's prediction influences the response structure:

```python
# Step 2: Multi-Model Prediction
predictions = predictor.predict(features)

# Step 3: Risk Classification
champion_model = predictions.get("champion", "ridge")
default_prob = predictions.get(champion_model, predictions.get("prediction", 0.0))
```

This design captures comprehensive model comparison data—essential for A/B testing and model performance evaluation—while presenting a consistent interface to consuming applications. External systems receive a single risk assessment based on the champion model, but internal monitoring systems can access all model predictions for analysis.

The shadow mode implementation at the API layer rather than the predictor layer reflects our architectural principle of separating prediction generation from deployment strategy. The predictor focuses on efficient model inference, while the API orchestrates deployment-specific logic like champion selection and response formatting.

Error Handling & Operational Resilience

Production APIs must gracefully handle the spectrum of possible failures: malformed inputs, model loading errors, prediction failures, and infrastructure issues. Our error handling strategy implements defense in depth:

Input Validation: Pydantic models catch type errors, range violations, and missing fields before processing begins. This prevents resource waste on obviously invalid requests and provides clear feedback to client applications about data requirements.

Pipeline Error Isolation: Each processing stage (feature engineering, prediction, response formatting) implements try-catch boundaries with stage-specific error messages. This granular error reporting accelerates debugging in production environments.

Graceful Degradation: Health check endpoints operate independently of the main prediction pipeline, ensuring monitoring systems can assess API health even when ML components experience issues.

Response Design for Machine Learning Systems

ML API responses must balance information richness with interface simplicity. Our response model addresses multiple stakeholder needs:

```python
class RiskAssessmentResponse(BaseModel):
    # Business layer: Simple risk classification
    risk_level: str             # "LOW"|"MEDIUM"|"HIGH"
    default_probability: float  # [0,1] probability

    # ML layer: Comprehensive model information
    predictions: Dict           # All model outputs
    champion_model: str         # Current best performer

    # Operations layer: Performance monitoring
    processing_time_ms: float       # End-to-end latency
    model_inference_time_ms: float  # Pure ML computation time
```

Business applications can consume the simplified risk_level classification while ML operations teams access detailed prediction breakdowns and performance metrics. This layered information architecture prevents the need for separate API endpoints serving different stakeholder needs.

Latency Optimization & Performance Characteristics

Achieving sub-100ms response times for complex ML pipelines requires careful attention to computational bottlenecks. Our optimization strategy targets the most expensive operations:

Model Loading: Singleton dependency injection eliminates repeated model deserialization (typically 200-500ms per model load).

Feature Engineering: Stateless processing with hardcoded categorical mappings avoids database lookups during inference (saves 10-50ms per request).

Prediction Batching: While processing single transactions, the ML pipeline maintains batch-friendly interfaces to leverage vectorized operations in NumPy and scikit-learn.

Current performance profile demonstrates successful optimization:
The breakdown reveals that ML computation dominates request processing time, indicating efficient HTTP handling and successful elimination of I/O bottlenecks.

Health Monitoring & Observability Strategy

Production ML systems require comprehensive health monitoring that extends beyond simple HTTP availability. Our health check design validates the entire ML pipeline:

```python
@router.get("/health")
async def health_check() -> HealthResponse:
    model_status = {
        "feature_engineer": "healthy",
        "predictor": "healthy",
        "models_loaded": str(model_info["models_loaded"]),
        "champion": model_info["champion"]
    }
```

This approach enables monitoring systems to detect failures in model loading, feature engineering initialization, or prediction generation before these failures impact customer-facing requests. The health check validates component initialization without performing full prediction processing, providing fast feedback for load balancer health checks while ensuring actual ML capability verification.

Integration Readiness & Future Extensibility

The API design anticipates evolution toward more sophisticated deployment patterns. The current shadow mode implementation provides a foundation for A/B testing frameworks, gradual model rollouts, and multi-tenant prediction serving. Route structure enables version management through URL prefixing (the endpoints already live under /v1/).

Next: Shadow Mode Controller will orchestrate experiment management, prediction logging, and business decision integration while leveraging these API endpoints as the prediction interface.
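To make the stage-isolated error handling described above concrete, here is a minimal sketch. The route shape mirrors the risk-assessment endpoint, but the handler body, status codes, and the get_predictor dependency are illustrative assumptions rather than the actual implementation.

```python
from fastapi import APIRouter, Depends, HTTPException

router = APIRouter()

@router.post("/risk-assessment")
async def assess_risk_sketch(
    transaction: dict,
    feature_engineer=Depends(get_feature_engineer),  # singleton dependency shown earlier
    predictor=Depends(get_predictor),                # assumed analogous predictor dependency
):
    # Stage 1: feature engineering — failures here point at input or preprocessing issues.
    try:
        features = feature_engineer.engineer_single_transaction(transaction)
    except Exception as exc:
        raise HTTPException(status_code=422, detail=f"Feature engineering failed: {exc}")

    # Stage 2: model inference — failures here point at model loading or prediction issues.
    try:
        predictions = predictor.predict(features)
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"Prediction failed: {exc}")

    # Stage 3: response formatting — kept separate so serialization bugs are reported as such.
    try:
        return {"default_probability": predictions.get("prediction", 0.0)}
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"Response formatting failed: {exc}")
```

The point is that each stage converts its own failures into a distinct, descriptive error, which is what makes production debugging fast.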
- Move tests from root directory to organized tests/ structure
- Unit tests: tests/unit/{features,models,api}/ for component isolation
- Integration tests: tests/integration/ for multi-component testing
- Add centralized test runner with categorized execution
- Update CLAUDE.md with proper testing commands and structure
Step 4 Implementation Complete: Shadow Mode Controller with Redis Integration

Problem Context

Production ML systems require sophisticated experiment management beyond simple model deployment. The challenge lies in conducting statistically valid A/B tests while maintaining business continuity and gathering comprehensive performance data for model optimization.

Technical Implementation

Shadow Controller Architecture

The Shadow Controller implements a three-layer separation of concerns: Prediction Generation: Pure ML inference handled by the existing BNPLPredictor.

This separation enables independent optimization of each layer. ML teams can focus on model accuracy while deployment teams manage experiment methodologies without touching inference code.

Storage Abstraction for Operational Evolution

Production systems evolve through distinct phases requiring different storage strategies. The storage abstraction pattern enables seamless transitions:

Development: In-memory storage for rapid iteration

```python
# Environment-based storage selection
def create_production_storage():
    if os.getenv("REDIS_URL"):  # Railway automatically sets this
        return RedisPredictionStorage(...)
    return InMemoryPredictionStorage()
```

Railway Deployment Optimization

Since the ML API and Redis deploy in the same Railway project, the implementation leverages internal network optimization:
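Returning to the storage abstraction for a moment: both backends can share one small interface, which is what lets create_production_storage() swap them freely. The Protocol below is a sketch — only store_prediction() is referenced elsewhere in this PR, so any additional methods would be assumptions.

```python
from typing import Any, Dict, Protocol

class PredictionStorage(Protocol):
    """Assumed shared interface for InMemoryPredictionStorage and RedisPredictionStorage."""

    def store_prediction(self, prediction_log: Dict[str, Any]) -> None:
        """Persist one prediction record for later experiment analysis."""
        ...
```

Because the Shadow Controller depends only on this interface, moving from in-memory storage to Redis (or later to a BigQuery-backed store) requires no controller changes.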
Advanced Experiment Management

Statistical A/B Testing Implementation

The experiment manager implements deterministic traffic assignment using customer ID hashing, ensuring consistent model exposure across multiple requests while maintaining statistical randomness at the population level.

```python
# Deterministic assignment prevents customer experience inconsistency
traffic_segment = "champion" if hash(customer_id) % 100 < champion_traffic else "challenger"
```

Business Decision Policies

Decision policies separate ML predictions from business requirements. Risk thresholds adapt to market conditions without code changes:
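For illustration only, a policy of this shape might look like the following; the thresholds, labels, and class name are hypothetical, not the configured production values.

```python
from dataclasses import dataclass

@dataclass
class DecisionPolicy:
    """Hypothetical policy object: thresholds live in configuration, not in inference code."""
    approve_below: float = 0.30
    review_below: float = 0.70

    def decide(self, default_probability: float) -> str:
        if default_probability < self.approve_below:
            return "APPROVE"
        if default_probability < self.review_below:
            return "MANUAL_REVIEW"
        return "DECLINE"
```

Tightening risk appetite then becomes a configuration change — e.g. DecisionPolicy(approve_below=0.20, review_below=0.60) — with no redeployment of the models themselves.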
Production Data Flow
The async logging pattern ensures sub-100ms API response times while capturing comprehensive experiment data for post-hoc analysis.

Key Technical Decisions

Dependency Injection Pattern

The API uses FastAPI's dependency injection for Shadow Controller management:

```python
@router.post("/risk-assessment")
async def assess_risk(
    transaction: TransactionInput,
    shadow_controller: ShadowController = Depends(get_shadow_controller)
):
    ...
```

This pattern enables testing with mock objects while providing singleton caching in production.

Environment Configuration Strategy

Simplified environment detection focuses on Railway deployment reality:
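For instance, detection can reduce to checking for the REDIS_URL variable that Railway injects (the same check create_production_storage() already uses); the helper name here is assumed.

```python
import os

def detect_environment() -> str:
    # Railway injects REDIS_URL for the linked Redis service; locally it is unset.
    return "production" if os.getenv("REDIS_URL") else "development"
```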
Performance Characteristics
Implementation Files
Validation Results
This implementation provides the foundation for sophisticated A/B testing while maintaining the simplicity and performance requirements for Railway deployment.
Step 5 Complete: Basic Deployment Configuration

Docker Containerization

The Railway-optimized Dockerfile implements production best practices for ML API deployment:

Security Hardening: Non-root user execution prevents privilege escalation attacks in containerized environments.

Dependency Management: Poetry-based dependency resolution with cache optimization reduces build times while ensuring reproducible environments.

Health Monitoring: Integrated health checks enable Railway's load balancer to detect container health and route traffic appropriately.

```dockerfile
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:$PORT/v1/bnpl/health || exit 1
```

Environment Integration: Uses Railway's injected PORT environment variable so the health check targets whatever port Railway assigns.

MLflow Experiment Tracking Integration

Implements comprehensive experiment tracking without impacting API performance through async logging patterns:

Prediction Logging: Every risk assessment generates an MLflow run with model predictions, business decisions, and performance metrics.

Experiment Parameters: Captures decision policies, risk thresholds, traffic segments, and model selections for statistical analysis.

Performance Metrics: Tracks processing times, default probabilities, and model-specific predictions for optimization insights.

```python
# Async MLflow logging prevents API blocking
async def _log_prediction_async(self, prediction_log, experiment_info):
    # Store in Redis first (fast)
    self.storage.store_prediction(prediction_log)
    # MLflow logging in background thread
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.submit(self._log_to_mlflow, prediction_log, experiment_info)
```

Development vs Production: Local development uses a SQLite backend for immediate experiment visualization. Production deployment uses ephemeral container storage with a future migration path to a hosted MLflow server.

Production Configuration Management

Environment-based configuration enables seamless transitions across deployment stages:

Railway Integration: Automatic detection of Railway environment variables (such as REDIS_URL and PORT).

Security Patterns: Template-based environment files prevent credential exposure while documenting required variables.

Deployment Flexibility: Single codebase supports local development, staging, and production environments through configuration rather than code changes.

Technical Architecture Impact

The deployment configuration completes the production-ready architecture:
Performance Characteristics: Container startup time <30 seconds, health check response <100ms, full request processing <20ms including async logging.

Scalability Foundation: Stateless container design enables horizontal scaling. Redis provides shared state across multiple container instances.

Monitoring Integration: MLflow experiment tracking provides operational visibility into model performance trends and business impact metrics.

Deployment Readiness Validation

All Phase 2 objectives achieved:

Performance Targets Met: <100ms transaction processing, Redis caching, graceful error handling, comprehensive experiment tracking.

The system is now production deployment ready for Railway with sophisticated experiment management capabilities.
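One side note on the async MLflow logging snippet above: because ThreadPoolExecutor is used as a context manager, the with block waits for the submitted job to finish before returning, so the coroutine still blocks for the duration of the MLflow call. A sketch of a variant that schedules the call without waiting (assumes Python 3.9+ and the same _log_to_mlflow helper; this is a suggestion, not the current implementation):

```python
import asyncio

async def _log_prediction_async(self, prediction_log, experiment_info):
    # Redis write stays on the request path because it is fast.
    self.storage.store_prediction(prediction_log)
    # Hand MLflow logging to a worker thread and return immediately.
    # In production, keep a reference to the task so it isn't garbage-collected mid-flight.
    asyncio.create_task(
        asyncio.to_thread(self._log_to_mlflow, prediction_log, experiment_info)
    )
```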
Railway Deployment Complete: BNPL ML Shadow Mode Controller Live

Production Deployment Architecture

The BNPL ML system now operates in Railway's production environment with Redis co-located in the same project. This architecture leverages Railway's internal networking to achieve sub-millisecond Redis operations, eliminating network latency that would occur with external Redis providers. Railway automatically injects the REDIS_URL environment variable for the linked Redis service.

Poetry Dependency Resolution Challenge

Railway's containerized build environment encountered connection pool exhaustion during Poetry's dependency installation phase. Poetry 2.0's resolver attempts to parallelize package downloads, creating multiple concurrent connections to PyPI. Railway's network infrastructure limits concurrent connections per container, causing timeouts during the 180+ package resolution process.

The solution preserves Poetry for local development while using pip + requirements.txt for deployment. This hybrid approach maintains the benefits of Poetry's sophisticated dependency resolution locally while leveraging pip's sequential installation pattern, which works within Railway's connection constraints.

```dockerfile
# Poetry preserved for future Railway optimization
# RUN pip install poetry==1.6.1

# pip workflow for current Railway compatibility
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```

This pattern ensures reproducible builds across environments while accommodating infrastructure limitations.

MLflow Integration Reality

MLflow operates with a SQLite backend in the container filesystem, providing immediate experiment tracking capabilities. Each prediction generates MLflow runs asynchronously to prevent API performance impact. The async logging pattern uses ThreadPoolExecutor to isolate MLflow operations from the main request thread.

Container restarts result in MLflow data loss, as expected with ephemeral storage. This trade-off prioritizes deployment simplicity and cost optimization for the initial production deployment. Future iterations will implement persistent MLflow storage using Railway's PostgreSQL service or external MLflow servers.

Performance Validation Results

API response times consistently measure under 20ms, well below the 100ms target. This performance stems from several optimization decisions: Redis operations complete in sub-millisecond timeframes due to internal Railway networking. The Shadow Controller's async logging pattern ensures prediction storage never blocks API responses. Model loading occurs at container startup rather than per-request, amortizing initialization costs across request volume.

Container health checks respond within 100ms, enabling Railway's load balancer to accurately route traffic and detect unhealthy instances. The health endpoint validates both API responsiveness and Redis connectivity, providing comprehensive system status.

Shadow Mode Controller Production Capabilities

The production deployment enables sophisticated A/B testing through deterministic traffic assignment. Hash-based customer segmentation ensures consistent model exposure across sessions while maintaining statistical randomness across the population. Business decision policies operate independently from model predictions, allowing risk threshold adjustments without model redeployment.

Experiment data flows through Redis to enable real-time performance monitoring. The storage abstraction supports future migration to BigQuery for long-term analytics without Shadow Controller modification.

Next Phase Architecture Evolution

Phase 3 will address MLflow persistence through dedicated Railway service deployment. A PostgreSQL backend will enable team collaboration and persistent experiment history. This architecture requires careful consideration of MLflow server scaling and authentication patterns.

BigQuery integration will provide comprehensive prediction analytics and model performance trending. The current Redis caching layer positions the system for efficient batch uploads to BigQuery, maintaining real-time performance while enabling analytical capabilities.

Load testing validation becomes critical as transaction volume increases. The current architecture supports horizontal scaling through Railway's container replication, but performance characteristics under sustained load require empirical validation.

The production deployment successfully demonstrates the Shadow Controller's capability to manage complex ML deployment scenarios while maintaining business continuity and comprehensive experiment tracking.
Summary
Implementing the complete BNPL ML model production pipeline to deploy 4 validated ML models (Ridge, Logistic, Elastic Net, Voting Ensemble) to production with shadow mode capabilities.
Phase 1 Context ✅ COMPLETE (ML Model Research)
- Models saved to models/production/ with metadata
- Integration tests (poetry run pytest tests/integration/)
Phase 2 Implementation Plan (Production Deployment)
This PR covers the core production pipeline development:
✅ Completed Tasks
🔄 Current Tasks
Step 1: Single-transaction feature engineering
Step 2: Flexible multi-model predictor
Step 3: Basic API endpoint
Step 4: Shadow mode controller
Step 5: Basic deployment configuration
Phase 3 Will Cover (Subsequent PRs)
Production Infrastructure
Monitoring & Optimization
Technical Architecture
Key Success Criteria
Test Plan
Note: This is Phase 2 of the BNPL ML deployment. Phase 3 will add advanced monitoring, Airflow integration, and production optimization.