This repo covers the Kubeflow environment with hands-on LABs: Kubeflow, Pipelines, Experiments, Runs, MinIO, etc. Possible usage scenarios are intended to be added over time.
Kubeflow provides Machine Learning (ML) pipelines that run on a Kubernetes (K8s) cluster, using the power of K8s (clustering and autoscaling). Each step of the ML pipeline is a container; hence, steps are isolated and can run in parallel (if they are not in sequence).
- Have knowledge of:
  - Container technology (Docker). You can learn it from here => Fast-Docker
  - Container orchestration technology (Kubernetes). You can learn it from here => Fast-Kubernetes
Keywords: Kubeflow, ML Pipeline, MLOps, AIOps
- LAB: Creating LAB Environment (WSL2), Installing Kubeflow with MicroK8s, Juju on Ubuntu 20.04
- LAB: Creating LAB Environment, Installing MiniKF with Vagrant
- LAB/Project: Kubeflow Pipeline (From Scratch) with Kubeflow SDK (DSL Compiler) and Custom Docker Images (Decision Tree, Logistic Regression, SVM, Naive Bayes, XGBoost)
- LAB/Project: KALE (Kubeflow Automated PipeLines Engine) and KATIB (AutoML: Finding Best Hyperparameter Values)
- LAB/Project: KALE (Kubeflow Automated PipeLines Engine) and KServe (Model Serving) for Model Prediction
- Motivation
- What is Kubeflow?
- How Does Kubeflow Work?
- What is Container (Docker)?
- What is Kubernetes?
- Installing Kubeflow
- Kubeflow Basics
- Kubeflow Jupyter Notebook
- Kubeflow Pipeline
- KALE (Kubeflow Automated PipeLines Engine)
- KATIB (AutoML: Finding Best Hyperparameter Values)
- KServe (Model Serving)
- Training-Operators (Distributed Training)
- Minio and ROK (Object Storages)
- Project 1: Creating ML Pipeline with Custom Docker Images (Decision Tree, Logistic Regression, SVM, Naive Bayes, XGBoost)
- Project 2: KALE (Kubeflow Automated PipeLines Engine) and KATIB (AutoML: Finding Best Hyperparameter Values)
- Project 3: KALE (Kubeflow Automated PipeLines Engine) and KServe (Model Serving) for Model Prediction
- Project 4: Distributed Training with Training Operator
- Other Useful Resources Related to Kubeflow
- References
Why should we use / learn Kubeflow?
- Kubeflow uses containers to run the steps of ML algorithms on a computer cluster.
- Kubeflow supports parallel/distributed training (e.g. with TensorFlow).
- Kubeflow provides a Machine Learning (ML) data pipeline.
- It saves pipelines, experiments, and runs (experiment tracking on Kubeflow).
- It provides easy, repeatable, portable deployments on a diverse infrastructure (for example, experimenting on a laptop, then moving to an on-premises cluster or to the cloud).
- Kubeflow provides deployment and management of loosely-coupled microservices, with scaling based on demand.
- Kubeflow is a free, open-source platform that runs on-premises or on any cloud (AWS, Google Cloud, Azure).
- It includes Jupyter Notebook servers to develop ML algorithms and a user interface to show pipelines.
- "Kubeflow started as an open sourcing of the way Google ran TensorFlow internally, based on a pipeline called TensorFlow Extended. It began as just a simpler way to run TensorFlow jobs on Kubernetes, but has since expanded to be a multi-architecture, multi-cloud framework for running entire machine learning pipelines." (ref: kubeflow.org)
- Kubeflow applied to become a CNCF incubating project; this was announced on 24 October 2022 (ref: opensource.googleblog.com).
- "The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable." (ref: kubeflow.org)
- "Kubeflow has developed into an end-to-end, extendable ML platform, with multiple distinct components to address specific stages of the ML lifecycle: model development (Kubeflow Notebooks), model training (Kubeflow Pipelines and Kubeflow Training Operator), model serving (KServe), and automated machine learning (Katib)" (ref: opensource.googleblog.com).
- Kubeflow is an ML pipeline application that lets you create ML pipelines (saving models and artifacts, running them multiple times), similar to Airflow.
- Kubeflow works on the Kubernetes platform with Docker containers.
- Kubernetes creates node clusters from many servers and PCs. Kubeflow is a distributed application (~35 pods) running on the Kubernetes platform. Kubeflow pods run on different nodes if several nodes are connected to the Kubernetes cluster.
- Containers include the Python Machine Learning (ML) code for each step of the ML pipeline (e.g. a data-download function, decision tree classifier, logistic regression classifier, evaluation step, etc.)
- Containers' outputs can be connected to other containers' inputs. With this feature, it is possible to create a DAG (Directed Acyclic Graph) with containers. Each function can run in a separate container.
- If you want to learn the details of how Kubeflow works, you should learn:
  - Docker Containers
  - Kubernetes
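The container-DAG idea above can be sketched in plain Python (no Kubeflow required): each function stands in for one containerized pipeline step, outputs feed into inputs, and independent steps can run concurrently. All names here are illustrative, not part of any Kubeflow API.

```python
from concurrent.futures import ThreadPoolExecutor

# Each function stands in for one containerized pipeline step.
def download_data():
    return [0, 1, 2, 3, 4]  # pretend dataset

def train_model_a(data):
    return {"model": "A", "score": sum(data) / len(data)}

def train_model_b(data):
    return {"model": "B", "score": max(data)}

def evaluate(results):
    # final step: pick the best-scoring model
    return max(results, key=lambda r: r["score"])["model"]

data = download_data()
# The two training steps do not depend on each other, so they can
# run in parallel (in Kubeflow, they would be separate pods).
with ThreadPoolExecutor() as pool:
    a = pool.submit(train_model_a, data)
    b = pool.submit(train_model_b, data)
    best = evaluate([a.result(), b.result()])
print(best)  # "B" (max=4 beats mean=2)
```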
- Docker is a tool that reduces the gap between the development and deployment phases of the software development cycle.
- Docker containers are like VMs, but lighter weight (no separate kernel, only a small app and file system, portable).
- In the 2000s, two features were added to the Linux kernel that enable Docker:
  - Namespaces: isolate processes.
  - Control Groups (cgroups): isolate and limit resource usage (CPU, memory) for each process.
- Without Docker containers, each VM consumes a significant share of resources (roughly 30% of memory and CPU goes to guest-OS overhead).
- To learn about Docker and containers, please go to this repo: https://github.com/omerbsezer/Fast-Docker
- "Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available." (ref: Kubernetes.io)
- To learn about Kubernetes, please go to this repo: https://github.com/omerbsezer/Fast-Kubernetes
- How to install Kubeflow on WSL2 with Juju:
- To get more features like KALE, and for an easier installation, use Kubeflow with MiniKF below (preferred).
- Kubeflow with MiniKF: how to install MiniKF with Vagrant and VirtualBox:
- Kubeflow is an ML distributed application that contains following parts:
- Kubeflow Jupyter Notebook (creating multiple notebook pods)
- Kubeflow Pipelines
- KALE (Kubeflow Automated PipeLines Engine)
- Kubeflow Runs and Experiments (which store all runs and experiments)
- KATIB (AutoML: Finding Best Hyperparameter Values)
- KServe (Model Serving)
- Training-Operators (Distributed Training)
- Kubeflow creates notebooks using containers and K8s pods.
- When a user wants to run a new notebook, the user can configure:
- which image should be the base image for the notebook pod,
- how many CPU cores and how much RAM the notebook pod should use,
- if there is a GPU in the K8s cluster, whether it should be used for the notebook pod,
- how much volume space (workspace volume) should be used for this notebook pod,
- whether the existing volume space should be shared with other notebook pods,
- whether a persistent volume should be used (PV, PVC with an NFS volume),
- which environment variables or secrets should be reachable from the notebook pod,
- on which server in the cluster this notebook pod should run, and alongside which pods (K8s affinity, tolerations).
- After launching, Kubeflow creates the notebook pod, and we can connect to it to open the notebook.
- After the notebook pod is created in MiniKF, a volume is created automatically (with the ROK storage class); the user can reach the files and even download them.
- Kubeflow Pipelines is based on Argo Workflows, a container-native workflow engine for Kubernetes.
- Kubeflow Pipelines consists of (ref: Kubeflow-Book):
- Python SDK: allows you to create and manipulate pipelines and their components using the Kubeflow Pipelines domain-specific language (DSL).
- DSL compiler: transforms your pipeline defined in Python code into a static configuration reflected in a YAML file.
- Pipeline Service: creates a pipeline run from the static configuration or YAML file.
- Kubernetes Resources: the Pipeline Service connects to the Kubernetes API in order to define the resources needed to run the pipeline defined in the YAML file.
- Artifact Storage: Kubeflow Pipelines stores metadata and artifacts. Metadata such as experiments, jobs, runs, and metrics are stored in a MySQL database. Artifacts such as pipeline packages, large-scale metrics, and views are stored in an artifact store such as a MinIO server.
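The SDK-plus-compiler flow can be illustrated conceptually in plain Python, without the `kfp` package: a pipeline is declared as steps with dependencies, then "compiled" into a static, serializable spec (the role the compiled YAML file plays). This is a hypothetical sketch of the idea, not the real Kubeflow Pipelines API; all function and field names here are made up.

```python
import json

# Hypothetical, Kubeflow-free sketch of what the DSL compiler does:
# turn a pipeline described in Python into a static, serializable spec.
def compile_pipeline(name, steps):
    """steps: {step_name: {"image": ..., "depends_on": [...]}}"""
    return {
        "name": name,
        "tasks": [
            {
                "name": step_name,
                "container": cfg["image"],
                "dependencies": cfg.get("depends_on", []),
            }
            for step_name, cfg in steps.items()
        ],
    }

spec = compile_pipeline("demo-pipeline", {
    "download-data": {"image": "python:3.9"},
    "train-model":   {"image": "my-trainer:latest", "depends_on": ["download-data"]},
    "evaluate":      {"image": "my-eval:latest",    "depends_on": ["train-model"]},
})
# The resulting dict is static configuration, analogous to the compiled YAML
# that the Pipeline Service turns into Kubernetes resources.
print(json.dumps(spec, indent=2))
```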
- Have a look at the Kubeflow Pipeline Project with the SDK:
- KALE (Kubeflow Automated PipeLines Engine) is a project that aims at simplifying the Data Science experience of deploying Kubeflow Pipelines workflows.
- KALE bridges this gap by providing a simple UI to define Kubeflow Pipelines workflows directly from your JupyterLab interface, without the need to change a single line of code (ref: https://github.com/kubeflow-kale/kale).
- With KALE, each cell is tagged, and a workflow can be created by connecting cells; after compiling, the Kubeflow Pipeline is created and run.
- This KALE feature helps data scientists run on Kubeflow quickly, without creating any container manually.
- Have a look at the KALE and KATIB Project:
- Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports Hyperparameter Tuning, Early Stopping, and Neural Architecture Search.
- Katib supports the following search methods (ref: https://github.com/kubeflow/katib):
- Hyperparameter Tuning: Random Search, Grid Search, Bayesian Optimization, TPE, Multivariate TPE, CMA-ES, Sobol's Quasirandom Sequence, HyperBand, Population Based Training.
- Neural Architecture Search: ENAS, DARTS
- Early Stopping: Median Stop
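As a toy illustration of what hyperparameter tuning does (independent of Katib's own API), random search can be sketched in a few lines of Python. The objective function and search space below are made up for the example; in Katib, each sampled parameter set would be one "trial".

```python
import random

# Made-up objective: pretend validation accuracy, best near lr=0.1, depth=5.
def objective(lr, depth):
    return 1.0 - abs(lr - 0.1) - 0.05 * abs(depth - 5)

random.seed(42)
best, best_score = None, float("-inf")
for _ in range(50):  # 50 random trials, like Katib trials
    params = {"lr": random.uniform(0.001, 1.0), "depth": random.randint(1, 10)}
    score = objective(**params)
    if score > best_score:
        best, best_score = params, score
print(best, round(best_score, 3))
```

Grid search, Bayesian optimization, TPE, etc. differ only in *how* the next parameter set is chosen; the trial/evaluate loop stays the same.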
- Have a look at the KALE and KATIB Project:
- KServe enables serverless inferencing on Kubernetes and provides performant, high abstraction interfaces for common machine learning (ML) frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases (ref: https://github.com/kserve/kserve).
- Have a look at the KALE and KServe Project:
- It is a great advantage to run distributed and parallel jobs (training) on Kubernetes with Training Operators. The user can determine the number of worker trainer pods.
- The Training Operator provides Kubernetes custom resources that make it easy to run distributed or non-distributed TensorFlow / PyTorch / Apache MXNet / XGBoost / MPI jobs on Kubernetes (ref: https://github.com/kubeflow/training-operator).
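As a rough sketch of what such a custom resource looks like, here is a TFJob manifest expressed as a Python dict. The field names follow the TFJob CRD shape, but the job name, image, and replica count are placeholders; check the training-operator documentation for the authoritative schema before using this.

```python
import json

# Sketch of a TFJob custom resource; image and replica count are
# placeholders, not a tested configuration.
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-distributed"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,  # number of worker trainer pods
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "tensorflow",
                     "image": "my-registry/mnist-train:latest"}
                ]}},
            }
        }
    },
}
# Serialized, this is what you would apply to the cluster (usually as YAML).
print(json.dumps(tfjob, indent=2)[:80])
```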
- Distributed training becomes more important day by day, because the number of parameters is increasing (especially in deep learning / deep neural networks). More parameters can give better results, but they also cause longer training and require more computing power.
- How is the number of parameters calculated? => https://stackoverflow.com/questions/28232235/how-to-calculate-the-number-of-parameters-of-convolutional-neural-networks
- Common DL model parameter counts: VGG-16 => 138 million, AlexNet => 62 million, ResNet-152 => 60.3 million.
- OpenAI ChatGPT (GPT 3.5) has 175 billion parameters (ref: https://www.sciencefocus.com/future-technology/gpt-3/).
- The Chinese tech giant Huawei built a 200-Billion-parameter language model called PanGu (ref: https://www.technologyreview.com/2021/12/21/1042835/2021-was-the-year-of-monster-ai-models/).
- Inspur, another Chinese firm, built Yuan 1.0, a 245-billion-parameter model.
- Baidu and Peng Cheng Laboratory, a research institute in Shenzhen, announced PCL-BAIDU Wenxin, a model with 280 billion parameters.
- The Beijing Academy of AI announced Wu Dao 2.0, which has 1.75 trillion parameters.
- South Korean internet search firm Naver announced a model called HyperCLOVA, with 204 billion parameters.
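Following the Stack Overflow link above, parameter counts come from a simple formula: a convolutional layer has (kh·kw·c_in + 1)·c_out parameters (the +1 is the bias per filter), and a fully-connected layer has (n_in + 1)·n_out. A short sketch with illustrative layer sizes:

```python
def conv_params(kh, kw, c_in, c_out):
    # each of the c_out filters has a kh*kw*c_in kernel plus one bias
    return (kh * kw * c_in + 1) * c_out

def dense_params(n_in, n_out):
    # each output neuron has n_in weights plus one bias
    return (n_in + 1) * n_out

# e.g. first conv layer of a VGG-style net: 3x3 kernels, RGB input, 64 filters
print(conv_params(3, 3, 3, 64))  # 1792
# e.g. a 4096 -> 4096 fully-connected layer
print(dense_params(4096, 4096))  # 16781312
```

Summing such terms over all layers is how the 138-million figure for VGG-16 is reached; most of it comes from the fully-connected layers.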
- CERN uses Kubeflow and Training Operators to speed up training (3D-GAN) on multiple parallel GPUs, reducing a single training run from 2.5 days (60 hours) to 30 minutes:
Project 1: Creating ML Pipeline with Custom Docker Images (Decision Tree, Logistic Regression, SVM, Naive Bayes, XGBoost)
- Have a look at the Kubeflow Pipeline Project:
Project 2: KALE (Kubeflow Automated PipeLines Engine) and KATIB (AutoML: Finding Best Hyperparameter Values)
- Have a look at the KALE and KATIB Project:
Project 3: KALE (Kubeflow Automated PipeLines Engine) and KServe (Model Serving) for Model Prediction
- Have a look at the KALE and KServe Project:
- kubeflow.org: (kubeflow documentation) https://v0-7.kubeflow.org/docs/
- opensource.googleblog.com: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html
- kubeflow-pipelines towardsdatascience: https://towardsdatascience.com/kubeflow-pipelines-how-to-build-your-first-kubeflow-pipeline-from-scratch-2424227f7e5
- Kubernetes.io: https://kubernetes.io/docs/concepts/overview/
- docs.docker.com: https://docs.docker.com/get-started/overview/
- Argo Workflows: https://github.com/argoproj/argo-workflows
- Kubeflow-Book: https://www.amazon.com.mx/Kubeflow-Machine-Learning-Lab-Production/dp/1492050121
- KALE: https://github.com/kubeflow-kale/kale
- KATIB: https://github.com/kubeflow/katib
- KALE Tags: https://medium.com/kubeflow/automating-jupyter-notebook-deployments-to-kubeflow-pipelines-with-kale-a4ede38bea1f
- KServe: https://github.com/kserve/kserve
- technologyreview.com: https://www.technologyreview.com/2021/12/21/1042835/2021-was-the-year-of-monster-ai-models/