Data Scientist | Production ML & Computer Vision | Ph.D., Engineering
I am a Data Scientist with a Ph.D. in Engineering and 5+ years of experience designing and deploying production ML systems across computer vision, time-series forecasting, and causal inference. I work end-to-end, from feature engineering and model validation through cloud deployment and stakeholder communication.
My background bridges rigorous statistical research and applied data science, with peer-reviewed publications and production systems running across federally funded and commercial analytics projects.
🔭 Currently working on: gwnbr, an open-source Python package for Geographically Weighted Negative Binomial Regression, alongside a JOSS submission and an applied case study.
- Machine Learning & AI Solutions: Build and deploy ML pipelines for classification, prediction, and anomaly detection using Python, PyTorch, scikit-learn, and Spark.
- Time-Series & Forecasting: Develop predictive models for demand, system performance, and rare-event risks using ARIMA, SARIMAX, LSTM, and Extreme Value Theory.
- Computer Vision at Production Scale: Design and deploy CNN and HRNet pipelines on cloud infrastructure for feature extraction from high-resolution imagery.
- Causal Inference & Statistical Modeling: Apply causal inference, hypothesis testing, and experimental design to quantify drivers of behavioral and operational outcomes.
- Data-Driven Decision Support: Translate analytical results into clear strategies with dashboards, SHAP-based explainability, and storytelling that serves technical and non-technical stakeholders alike.
- End-to-End Data Science Lifecycle: Experience across data acquisition, ETL, feature engineering, model validation, deployment, and monitoring in both cloud and on-prem environments.
gwnbr is a modular Python package that implements Geographically Weighted Negative Binomial Regression (GWNBR), translating a SAS macro by Silva & Rodrigues (2014) into an open-source, peer-reviewable tool.
- Three model classes (GWNBR, GWNBRg, GWPR), multiple kernel functions, Golden Section Search bandwidth selection, Newton-Raphson and IRLS solvers, and a 28-test pytest suite
- Validated on a 1,460-unit spatial study; GWNBR outperformed GWPR by a wide margin (AICc 14,147 vs. 38,664) on data with ~42x overdispersion
- MIT licensed, archived on Zenodo (DOI: 10.5281/zenodo.21041972), JOSS submission in progress
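
For readers unfamiliar with Golden Section Search, the bandwidth-selection idea mentioned above can be sketched generically as follows. This is a minimal illustration and not the gwnbr API: `golden_section_search` and `toy_score` are hypothetical names, and the toy objective stands in for a real fit criterion (such as AICc evaluated at a given bandwidth), which is assumed to be unimodal over the search interval.

```python
import math


def golden_section_search(objective, lower, upper, tol=1e-6, max_iter=200):
    """Minimise a unimodal 1-D function on the interval [lower, upper]."""
    inv_phi = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618
    a, b = lower, upper
    # Two interior probe points that split the interval in golden proportion
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    for _ in range(max_iter):
        if abs(b - a) < tol:
            break
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return (a + b) / 2


def toy_score(bandwidth):
    # Hypothetical stand-in for a model-fit criterion evaluated at a bandwidth
    return (bandwidth - 30.0) ** 2 + 5.0


best = golden_section_search(toy_score, 1.0, 100.0)
print(f"selected bandwidth: {best:.4f}")
```

Each iteration reuses one of the two previous evaluations, so only one new model fit is needed per step. That matters when every evaluation means refitting a geographically weighted model.
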
- Computer Vision at Production Scale: Built and deployed an HRNet inference pipeline with a custom sliding 3×3 tile radial weighting strategy on Azure, cutting data collection costs by 80% ($100K+ to under $20K) and turnaround time from months to two weeks.
- Multi-Modal CNN Pipeline: Designed an end-to-end CNN pipeline (PyTorch, Azure ML) integrating imagery, geospatial, and text data, achieving 92% feature extraction accuracy across 20+ jurisdictions.
- Anomaly Detection & Rare-Event Forecasting: Developed LSTM/ARIMA anomaly detection and Extreme Value Theory frameworks on large continuous sensor streams, cutting calibration costs by 65% and reducing risk exposure by 70%.
- Causal Inference for Behavioral Modeling: Applied causal inference methods to quantify marginal contributions of exogenous signals on behavioral outcomes, generating feature importance insights for decision support systems.
- Binary Classification on Behavioral Data: Built logistic regression and XGBoost models on multi-signal behavioral and sensor datasets, validated with ROC/AUC and confusion matrix analysis, improving predictive accuracy by 40%.
- Behavioral Modeling for Policy Decisions: Designed and validated regression-based models of driver-pedestrian interactions to guide infrastructure investment and safety improvements.
- Freight & Logistics Analytics: Built demand models for truck parking, freight generators, and corridor bottlenecks using spatial-temporal analysis and big data sources (WIM, INRIX, Replica).
- Languages: Python, SQL, R, MATLAB
- Machine Learning: PyTorch, scikit-learn, XGBoost, CatBoost, LSTM, CNNs, HRNet, anomaly detection
- Statistical Methods: Causal inference, hypothesis testing, experimental design, Extreme Value Theory, GLMs, backtesting
- Data Engineering: PySpark, Parquet, ELT pipelines, Medallion architecture (Bronze/Silver/Gold), multi-source integration
- Visualization & Explainability: SHAP, Streamlit, Tableau, Plotly, matplotlib, seaborn
- Cloud & MLOps: AWS, Azure ML, Weights & Biases, Git, GitHub Actions
I am both a Data Scientist and an Engineer, which means I approach problems with a structured, systems-oriented mindset while staying focused on delivering data solutions with measurable outcomes. Whether it's optimizing infrastructure, detecting anomalies, or designing predictive models, my goal is to turn complex data into insights that create value for people, organizations, and communities.
- 📄 Google Scholar (search Ananta Sinha)
- 📫 Email: anantasinha60@gmail.com
🧠 Curious about how data, ML, and rigorous methodology intersect to build systems people can actually rely on.

