kalyan-venk/README.md
ML Engineer  ·  LLM Evaluation  ·  AI Systems

Kalyan Venkatesh

Most LLM reliability work assumes the problem is the model. I think the problem is the system around the model: the evaluation loop, the pipeline architecture, and the assumptions baked into how we measure failure.

LinkedIn Email Portfolio Live Demo


MS Computer Science: DePaul University · Jun 2026 · GPA 3.84
Production ML: 3+ years · sensen.ai · 26 global deployments
Research: LLM inference reliability · 2 systems under faculty supervision
Availability: OPT eligible Jun 2026 · Open to DS / MLE / AI Engineer roles

Research

Multi-Agent Inference Reliability Framework   LangGraph Ollama HumanEval

DePaul University · Supervised by Prof. Vahid Alizadeh · Jan 2026 – Present

Can a lightweight 3-agent runtime (Planner → Critic → Fixer) fix LLM failures without touching model weights? Yes. But not the way you'd expect.

Built across 10 phases · 5 model families · hundreds of experimental conditions.

| Finding | Result |
| --- | --- |
| Upgrading the Critic (3B → 8B) made 18 conditions net-negative | Stronger critics are confidently wrong in new ways |
| Inverse Capability Hypothesis -- interventions help weak models, hurt strong ones | Validated across all 4 families. Crossover at ~65% pass@1 |
| Selective Reversion Gate reverts 73.4% of degraded fixer outputs | +4.85 pp pass@1 (95% CI [+3.36, +6.34], p = 0.010) |
| Threshold sweep across 6 values, τ = 0.70 optimal | Latency: 7.89s → 4.79s · Trigger rate down 78% |

Statistically validated across 3 independent trials. Targeting ICSE 2027.
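A minimal sketch of how a Planner → Critic → Fixer loop with a selective reversion gate might be wired up. The README does not spell out the gate's criteria, so everything here is an assumption for illustration: the critic is assumed to return a confidence score in [0, 1], τ is assumed to gate whether the fixer runs at all, and reversion is assumed to compare critic scores before and after the fix. The names `run_with_gate`, `Result`, and the callable signatures are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Value reported as optimal in the threshold sweep; its exact role is assumed here.
TAU = 0.70


@dataclass
class Result:
    code: str
    fixer_ran: bool
    reverted: bool


def run_with_gate(
    task: str,
    planner: Callable[[str], str],
    critic: Callable[[str, str], float],
    fixer: Callable[[str, str], str],
    tau: float = TAU,
) -> Result:
    """Planner -> Critic -> Fixer with a reversion gate (hypothetical interfaces)."""
    draft = planner(task)
    # Assumed: critic returns a confidence in [0, 1] that the code is correct.
    draft_score = critic(task, draft)

    # Trigger gate: skip the fixer when the critic already trusts the draft.
    if draft_score >= tau:
        return Result(draft, fixer_ran=False, reverted=False)

    fixed = fixer(task, draft)
    fixed_score = critic(task, fixed)

    # Reversion gate: discard the fix if the critic rates it below the draft.
    if fixed_score < draft_score:
        return Result(draft, fixer_ran=True, reverted=True)
    return Result(fixed, fixer_ran=True, reverted=False)
```

With stub agents in place of real models, a fixer that degrades the draft gets reverted, while a draft scoring at or above `tau` never reaches the fixer. That second path is one plausible way a threshold could reduce both trigger rate and latency.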


Inference-Lens   DeBERTa-v3 XGBoost MLflow LLM-Bar

DePaul University · Supervised by Prof. Bamshad Mobasher · May 2026 – Present

LLM-as-judge is the default eval paradigm. Almost nobody is asking how easily the judge can be deceived.

Try the live scorer and watch an automated judge pick the worse response in real time.

Built a system to stress-test evaluator reliability under systematic adversarial pressure.

| What I Built | Scale |
| --- | --- |
| Benchmarked LR · XGBoost · DeBERTa-v3 against adversarial inputs | 419 LLM-Bar pairs across 4 perturbation categories |
| Response archetype clustering to map structural vulnerability | 170K+ Anthropic HH-RLHF preference annotations |
| 5-fold CV supervised pipeline + MLflow artifact versioning | AUC-ROC target > 0.82 |
| Real-time Streamlit evaluation interface | Per-feature verdict breakdowns |

The goal isn't just "can we fool the judge." It's finding which classes of outputs are vulnerable -- so you can build evaluators that aren't.
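One way to picture "which classes of outputs are vulnerable" is a per-category error rate for a pairwise judge. The sketch below is illustrative only and is not the project's code: `Pair`, `error_rate_by_category`, the judge signature, and the category labels are hypothetical, and the judge is assumed to return 0 when it prefers the first response and 1 when it prefers the second.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class Pair(NamedTuple):
    prompt: str
    better: str    # response labeled as preferred
    worse: str     # adversarial or perturbed response
    category: str  # perturbation category label (hypothetical)


def error_rate_by_category(
    pairs: Iterable[Pair],
    judge: Callable[[str, str, str], int],
) -> dict[str, float]:
    """Fraction of verdicts per category where the judge picks the worse response.

    Each pair is scored in both orders so that position bias is not
    mistaken for robustness.
    """
    wrong = defaultdict(int)
    total = defaultdict(int)
    for p in pairs:
        # Better response shown first: a verdict of 1 is an error.
        wrong[p.category] += judge(p.prompt, p.better, p.worse) == 1
        # Worse response shown first: a verdict of 0 is an error.
        wrong[p.category] += judge(p.prompt, p.worse, p.better) == 0
        total[p.category] += 2
    return {cat: wrong[cat] / total[cat] for cat in total}


def length_judge(prompt: str, a: str, b: str) -> int:
    """Toy judge that always prefers the longer response."""
    return 0 if len(a) >= len(b) else 1
```

Running `error_rate_by_category` with `length_judge` on pairs where the worse response is padded with confident filler shows the idea: the toy judge fails every pair in that category while staying correct on pairs where the better answer happens to be longer.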


Industry

Data Scientist · sensen.ai 2022 – 2024

  • ANPR model evaluation pipelines across 26 global deployments -- adopted as the standard validation workflow
  • Led R&D on industrial pollution enforcement: drone-based effluent sampling, 20-25x increase in regulatory coverage
  • ETL automation cutting consolidation from 10+ hrs to 2 hrs/week across ~30K weekly sightings

Data Engineer · AECOM & Siri 2021 – 2022

  • Forecasting and data integration pipelines for asset lifecycle management across 25+ locations

Stack

Python, SQL, PyTorch, HuggingFace Transformers, LangChain, LangGraph, Scikit-learn, XGBoost, DeBERTa, LoRA/PEFT, MLflow, FastAPI, Streamlit, Docker, Kubernetes, AWS (S3 · EC2 · SageMaker), GitHub Actions, LLM-as-Judge, FAISS, Ollama, Adversarial ML, Prompt Engineering, McNemar's test, SHAP


Open to Data Scientist · ML Engineer · AI Engineer roles

Pinned

  1. agentic-llmops (Public)

    Runtime hallucination monitoring for multi-agent code generation pipelines. 6-phase study across 5 model architectures on HumanEval. Built with LangGraph, Ollama and MLflow. Paper in progress.

    TeX 1

  2. Inference-Lens (Public)

    End-to-end LLM output quality scoring system with evaluator reliability stress-testing under adversarial conditions.

    Jupyter Notebook 1