This repo contains a set of practice inference graphs implemented with Seldon Core. The inference graphs in the `seldon` folder are implemented using Seldon's first-generation custom Python wrapper, while the pipelines in the `mlserver` folder are implemented as Custom Models on MLServer, Seldon's newer serving platform, combined with the Seldon Inference Graph.
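For reference, the steps in the `mlserver` folder follow MLServer's custom-runtime interface. Below is a minimal, illustrative sketch of such a runtime; the class name, tensor names, and the trivial "model" are hypothetical, not taken from this repo:

```python
# Minimal MLServer custom-runtime sketch (illustrative; not a pipeline from this repo).
from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class EchoModel(MLModel):
    async def load(self) -> bool:
        # A real pipeline step would load weights/tokenizers here.
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the first input tensor as a list of strings.
        texts = StringCodec.decode_input(payload.inputs[0])
        outputs = [t.upper() for t in texts]  # placeholder for actual model logic
        return InferenceResponse(
            model_name=self.name,
            outputs=[StringCodec.encode_output("echo", outputs)],
        )
```

MLServer loads such a class through a `model-settings.json` file that points at it; the Seldon inference graph then chains several of these steps together.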
NOTE: This repo is shared for learning purposes; some of the pipelines implemented here might not have real-world use cases, and they are not fully tested.
Pull requests, suggestions, and additions to the list of pipelines for future implementation are highly appreciated.
Pipelines from InferLine: latency-aware provisioning and scaling for prediction serving pipelines:
- Cascade
- Ensemble
- Preprocess
- Video Monitoring
and the following pipelines:
- audio-qa: Audio to text -> Question Answering
- audio-sent: Audio to text -> Sentiment Analysis
- nlp: Language Identification -> French-to-English Translation -> Summarization
- sum-qa: Summarization -> Question Answering
- video: Object Detection -> Object Classification
Pre-built container images are also available here. Therefore, if you are just trying things out, you can deploy the YAML files on your K8S cluster as they are.
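For orientation, a two-step chained pipeline such as audio-qa is expressed as a SeldonDeployment graph roughly like the sketch below; the step names and container images here are placeholders, not the repo's actual manifests:

```yaml
# Hedged sketch of a two-step Seldon inference graph (placeholder names/images).
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: audio-qa
spec:
  protocol: v2
  predictors:
    - name: default
      graph:
        name: audio-to-text          # first step; its output feeds the child below
        type: MODEL
        children:
          - name: question-answering
            type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: audio-to-text
                image: example.registry/audio-to-text:latest        # placeholder
              - name: question-answering
                image: example.registry/question-answering:latest   # placeholder
```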
Some academically and industrially relevant projects that could be used as a source of inference pipelines for future implementations:
- InferLine: latency-aware provisioning and scaling for prediction serving pipelines
- GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks
- FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees
- Rim: Offloading Inference to the Edge
- Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
- Scrooge: A Cost-Effective Deep Learning Inference System
- Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
- VideoEdge: Processing Camera Streams using Hierarchical Clusters
- Live Video Analytics at Scale with Approximation and Delay-Tolerance
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse
- On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems
- Fixes That Fail: Self-Defeating Improvements in Machine-Learning Systems
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- PaLM: Scaling Language Modeling with Pathways
- Language Model Cascades
- Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems
- PromptChainer: Chaining Large Language Model Prompts through Visual Programming
- 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows
- Feature Interactions on Steroids: On the Composition of ML Models
This repo also includes a small async load tester for sending workloads to the models/pipelines. You can find it under the async load tester folder.
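The sketch below shows the general shape of such an async load generator, assuming an MLServer-style v2 REST endpoint; the URL, model name, and payload are hypothetical, and the repo's actual tester likely differs:

```python
# Illustrative async load generator (not the repo's actual tester).
import asyncio

import aiohttp

URL = "http://localhost:8080/v2/models/echo/infer"  # hypothetical endpoint
PAYLOAD = {
    "inputs": [
        {"name": "text", "shape": [1], "datatype": "BYTES", "data": ["hello"]}
    ]
}


async def one_request(session: aiohttp.ClientSession) -> int:
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()
        return resp.status


async def main(total: int = 100, concurrency: int = 10) -> None:
    # Bound in-flight requests with a semaphore to model a fixed client pool.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(session: aiohttp.ClientSession) -> int:
        async with sem:
            return await one_request(session)

    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*(bounded(session) for _ in range(total)))
    print(f"{statuses.count(200)}/{total} requests succeeded")


if __name__ == "__main__":
    asyncio.run(main())
```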
Please give a star if this repo helped you learn something new :)
Planned future work:
- Re-implement pipelines in Seldon V2
- Add an example of using shared models in pipelines using V2
- Example of multi-model request propagation
- Example implementation using Nvidia Triton Server as the base containers instead of MLServer
- Examples of model load/unload in Triton and MLServer
- GPU examples with fractional GPUs
- Send image/audio/text in a compressed format
- Add performance evaluation scripts and a load tester
- Complete unfinished pipelines
- Examples of using the Triton client to interact with the MLServer examples (see the sketch after this list)
- Examples of using Triton Inference Server as the serving backend
- Pipeline implementations in the upcoming Seldon Core V2
- Examples of integration with autoscalers (built-in autoscaler, VPA, and event-driven autoscalers like KEDA)
- Implement a GPT-2 -> DALL-E pipeline inspired by dalle-runtime
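On the Triton-client item above: since MLServer implements the KServe v2 inference protocol, `tritonclient` can in principle talk to an MLServer endpoint directly. A hedged sketch, assuming a model named `echo` served at `localhost:8080` (both assumptions, not repo facts):

```python
# Illustrative: tritonclient against an assumed MLServer v2 HTTP endpoint.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8080")

# BYTES tensors carry strings in the v2 protocol.
inp = httpclient.InferInput("text", [1], "BYTES")
inp.set_data_from_numpy(np.array(["hello world"], dtype=object))

result = client.infer(model_name="echo", inputs=[inp])
print(result.as_numpy("echo"))  # assumes the model returns an output named "echo"
```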