Run MLflow projects on Kubernetes (mlflow#1181)
* initial setup and job yaml file creation

* first tests on kubernetes

* refactoring and tests

* adds kubernetes mode to cli

* adds Azure vars to env vars if they exist on the local host

* fix tag setup and parameters in kubernetes call

* encapsulates run command logic in function

* adds Job monitoring to the command line

* lint

* remove kube context from MLproject file

* refactor MLproject and new tests

* validates docker_auth_config and kube_context in cli

* Add docs

* add example

* lint and tests for kubernetes config

* lint on tests

* lint tests

* add kubernetes in requirements.txt and lint tests

* lint and tests env

* lint and requirements

* merge, logging and lint

* remove unused import

* change image_namespace in MLproject in docker example

* add resources attributes in MLproject
ability to pass k8s job template

* improve docs on k8s resources in MLprojects
Ability to pass relative job-template file

* lint on tests

* remove docker auth parameter from CLI

* refactors image tag

* lint

* change mode to backend

* change mode to backend

* update to use backend_config

* update to backend_config

* change username in image_url

* lint and add base-image to kubernetes_config.json

* lint

* absolute_import to handle kubernetes module import conflict

* lint

* docs

* remove kubernetes_env in project_spec and var names in cli

* fix container command format

* return docker_env usage in kubernetes backend

* lint

* lint _logger.info

* update kubernetes docs and fix docker docs

* docker project validation

* docker image tagging scheme

* tests for docker_env validation, backend tag and other issues

* handles errors pushing docker images

* docs typo, remove kubernetes from setup

* remove code to add azure blob storage keys in job definition file

* lint

* load kube-job-template-path

* refactor to add KubernetesSubmittedRun

* lint

* add KubernetesSubmittedRun and tests

* lint tests

* lint

* Some changes

* lint and fixes

* fix tests validating docker env

* remove the streaming of the logs

* Updated logic for checking Kubernetes Job status.

* Fixed tests.

* Added a test for state transitions of kubernetes run.

* Minor update.

* docs

* backend tag rename

* remove rstcheck from lint-requirements

* correct project env tag

* rstcheck in lint-requirements.txt

* handles environment variables in job template when it already has variables specified

* Simplified state monitoring of Kubernetes Submitted Run.
marcusrehm authored and tomasatdatabricks committed Jun 26, 2019
1 parent b0836f8 commit 5dcb226
Showing 17 changed files with 648 additions and 41 deletions.
3 changes: 3 additions & 0 deletions dev-requirements.txt
@@ -7,3 +7,6 @@ nose
 codecov
 coverage
 pypi-publisher
+scikit-learn
+scipy
+kubernetes
88 changes: 86 additions & 2 deletions docs/source/projects.rst
@@ -212,7 +212,8 @@ Docker container environment

.. code-block:: yaml

-    docker_env: mlflow-docker-example-environment
+    docker_env:
+      image: mlflow-docker-example-environment

In this example, ``docker_env`` refers to the Docker image with name
``mlflow-docker-example-environment`` and default tag ``latest``. Because no registry path is
@@ -223,7 +224,8 @@

.. code-block:: yaml

-    docker_env: 012345678910.dkr.ecr.us-west-2.amazonaws.com/mlflow-docker-example-environment:7.0
+    docker_env:
+      image: 012345678910.dkr.ecr.us-west-2.amazonaws.com/mlflow-docker-example-environment:7.0

In this example, ``docker_env`` refers to the Docker image with name
``mlflow-docker-example-environment`` and tag ``7.0`` in the Docker registry with path
@@ -324,6 +326,9 @@ Deployment Mode
  includes setting cluster parameters such as a VM type. Of course, you can also run projects on
  any other computing infrastructure of your choice using the local version of the ``mlflow run``
  command (for example, submit a script that does ``mlflow run`` to a standard job queueing system).

  It is also possible to launch projects remotely on `Kubernetes <https://kubernetes.io/>`_ clusters
  from the command line (see :ref:`Run a project on Kubernetes <kubernetes_execution>`).

Environment
  By default, MLflow Projects are run in the environment specified by the project directory
@@ -363,6 +368,85 @@ for your run. Then, run your project using the command
where ``<uri>`` is a Git repository URI or a folder.

.. _kubernetes_execution:

Run a project on Kubernetes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

MLflow projects can be executed on Kubernetes clusters. This feature reuses the image built for
running projects in the :ref:`Docker environment <project-docker-container-environments>`, pushing
it to an image registry, so your MLproject file must include a ``docker_env`` section. MLflow then
creates a Kubernetes Job that pulls the published image and runs the MLflow project on the
cluster. A brief overview of how to configure and use this feature follows.

In the project folder, create a ``backend_config.json`` file with the following attributes:

.. code-block:: json

    {
      "kube-context": "docker-for-desktop",
      "image-uri": "username/mlflow-kubernetes-example",
      "kube-job-template-path": "kubernetes_job_template.yaml"
    }

The ``kube-context`` attribute is the Kubernetes context in which MLflow runs the Job, and
``image-uri`` is the registry/repository/image location to which the image is pushed so that
Kubernetes can pull and run it. Remember that MLflow expects login credentials to already be
stored for both the Kubernetes context and the Docker registry used for pushing images.
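
For instance, one quick way to confirm that the context named in ``backend_config.json`` is
visible on the local machine is through the ``kubernetes`` Python client (added to
``dev-requirements.txt`` in this change). This is a minimal sketch, not part of MLflow itself:

.. code-block:: python

    from kubernetes import config

    # list_kube_config_contexts() returns (all_contexts, active_context);
    # the context named in backend_config.json should appear in the first list.
    contexts, active = config.list_kube_config_contexts()
    print([c["name"] for c in contexts])
    print("active:", active["name"])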

The ``kube-job-template-path`` attribute points to a YAML file with the Kubernetes Job
specification used to run the training on Kubernetes; the version shipped with the Docker example
project is shown below. For more information about the specification options, see the Kubernetes
docs: `Jobs - Run to Completion <https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/>`_.

Keep in mind that MLflow overwrites the following attributes in the job YAML file so that it can
manage Job creation and monitor Job status (a sketch of this substitution appears after the
template below):

- ``metadata.name``
- ``spec.template.spec.container[0].name``
- ``spec.template.spec.container[0].image``
- ``spec.template.spec.container[0].command``

.. code-block:: yaml

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ""
      namespace: mlflow
    spec:
      ttlSecondsAfterFinished: 100
      backoffLimit: 0
      template:
        spec:
          containers:
          - name: "name"
            image: "image"
            command: ["cmd"]
            resources:
              limits:
                memory: 512Mi
              requests:
                memory: 256Mi
          restartPolicy: Never
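
As a rough illustration of that substitution (the actual logic lives in
``mlflow/projects/kubernetes.py``, which is among the files not shown in this diff; the helper
below is hypothetical):

.. code-block:: python

    import yaml

    def substitute_job_spec(template_path, job_name, image_tag, command):
        """Hypothetical sketch: fill in the four fields that MLflow overwrites."""
        with open(template_path) as f:
            job = yaml.safe_load(f)
        job["metadata"]["name"] = job_name
        container = job["spec"]["template"]["spec"]["containers"][0]
        container["name"] = job_name
        container["image"] = image_tag
        container["command"] = command  # the project's entry point command, as a list
        return job
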
To run your project, use the command:

.. code-block:: bash

    mlflow run <uri> --backend kubernetes --backend-config examples/docker/kubernetes_config.json

where ``<uri>`` is a Git repository URI or a folder.

To see it in action, you can use the `Docker example <https://github.com/mlflow/mlflow/tree/master/examples/docker>`_
with its ``kubernetes_config.json`` and ``kubernetes_job_template.yaml`` files.
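
The same flow can also be driven from the Python API. A sketch, assuming ``backend_config`` is
passed as the parsed configuration dictionary (parameter names follow the ``projects.run`` call
in ``mlflow/cli.py`` below; adjust the paths to your setup):

.. code-block:: python

    import mlflow.projects

    backend_config = {
        "kube-context": "docker-for-desktop",
        "kube-job-template-path": "examples/docker/kubernetes_job_template.yaml",
        "image-uri": "username/mlflow-kubernetes-example",
    }
    submitted_run = mlflow.projects.run(
        "examples/docker",  # a Git repository URI or a local project folder
        backend="kubernetes",
        backend_config=backend_config,
        synchronous=True,  # block until the Kubernetes Job finishes
    )
    print(submitted_run.run_id)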


Iterating Quickly
-----------------

5 changes: 5 additions & 0 deletions examples/docker/kubernetes_config.json
@@ -0,0 +1,5 @@
{
  "kube-context": "docker-for-desktop",
  "kube-job-template-path": "examples/docker/kubernetes_job_template.yaml",
  "image-uri": "username/mlflow-kubernetes-example"
}
20 changes: 20 additions & 0 deletions examples/docker/kubernetes_job_template.yaml
@@ -0,0 +1,20 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: ""
  namespace: mlflow
spec:
  ttlSecondsAfterFinished: 100
  backoffLimit: 0
  template:
    spec:
      containers:
      - name: "name"
        image: "image"
        command: ["cmd"]
        resources:
          limits:
            memory: 512Mi
          requests:
            memory: 256Mi
      restartPolicy: Never
15 changes: 9 additions & 6 deletions mlflow/cli.py
@@ -114,13 +114,16 @@ def run(uri, entry_point, version, param_list, experiment_name, experiment_id, b
eprint("Repeated parameter: '%s'" % name)
sys.exit(1)
param_dict[name] = value
cluster_spec_arg = backend_config
if backend_config is not None and os.path.splitext(backend_config)[-1] != ".json":
try:
cluster_spec_arg = json.loads(backend_config)
backend_config = json.loads(backend_config)
except ValueError as e:
eprint("Invalid cluster spec JSON. Parse error: %s" % e)
eprint("Invalid backend config JSON. Parse error: %s" % e)
raise
if backend == "kubernetes":
if backend_config is None:
eprint("Specify 'backend_config' when using kubernetes mode.")
sys.exit(1)
try:
projects.run(
uri,
@@ -130,11 +133,11 @@
             experiment_id=experiment_id,
             parameters=param_dict,
             backend=backend,
-            backend_config=cluster_spec_arg,
+            backend_config=backend_config,
             use_conda=(not no_conda),
             storage_dir=storage_dir,
-            synchronous=backend == "local" or backend is None,
-            run_id=run_id,
+            synchronous=backend in ("local", "kubernetes") or backend is None,
+            run_id=run_id
         )
     except projects.ExecutionException as e:
         _logger.error("=== %s ===", e)
115 changes: 87 additions & 28 deletions mlflow/projects/__init__.py
@@ -7,6 +7,7 @@
 from distutils import dir_util
 import hashlib
 import json
+import yaml
 import os
 import sys
 import re
@@ -33,7 +34,8 @@
 from mlflow.utils.mlflow_tags import MLFLOW_PROJECT_ENV, MLFLOW_DOCKER_IMAGE_NAME, \
     MLFLOW_DOCKER_IMAGE_ID, MLFLOW_USER, MLFLOW_SOURCE_NAME, MLFLOW_SOURCE_TYPE, \
     MLFLOW_GIT_COMMIT, MLFLOW_GIT_REPO_URL, MLFLOW_GIT_BRANCH, LEGACY_MLFLOW_GIT_REPO_URL, \
-    LEGACY_MLFLOW_GIT_BRANCH_NAME, MLFLOW_PROJECT_ENTRY_POINT, MLFLOW_PARENT_RUN_ID
+    LEGACY_MLFLOW_GIT_BRANCH_NAME, MLFLOW_PROJECT_ENTRY_POINT, MLFLOW_PARENT_RUN_ID, \
+    MLFLOW_PROJECT_BACKEND
 from mlflow.utils import databricks_utils, file_utils
 
 # TODO: this should be restricted to just Git repos and not S3 and stuff like that
@@ -108,6 +110,8 @@ def _run(uri, experiment_id, entry_point="main", version=None, parameters=None,
         tracking.MlflowClient().set_tag(active_run.info.run_id, tag, version)
 
     if backend == "databricks":
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND,
+                                        "databricks")
         from mlflow.projects.databricks import run_databricks
         return run_databricks(
             remote_run=active_run,
@@ -120,17 +124,22 @@
     # If a docker_env attribute is defined in MLproject then it takes precedence over conda yaml
     # environments, so the project will be executed inside a docker container.
     if project.docker_env:
-        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV, "docker")
-        _validate_docker_env(project.docker_env)
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV,
+                                        "docker")
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND,
+                                        "local")
+        _validate_docker_env(project)
         _validate_docker_installation()
         image = _build_docker_image(work_dir=work_dir,
-                                    project=project,
-                                    active_run=active_run)
+                                    image_uri=project.name,
+                                    base_image=project.docker_env.get('image'),
+                                    run_id=active_run.info.run_id)
         command += _get_docker_command(image=image, active_run=active_run)
     # Synchronously create a conda environment (even though this may take some time)
     # to avoid failures due to multiple concurrent attempts to create the same conda env.
     elif use_conda:
         tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV, "conda")
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND, "local")
         command_separator = " && "
         conda_env_name = _get_or_create_conda_env(project.conda_env_path)
         command += _get_conda_command(conda_env_name)
@@ -147,7 +156,33 @@
             work_dir=work_dir, entry_point=entry_point, parameters=parameters,
             experiment_id=experiment_id,
             use_conda=use_conda, storage_dir=storage_dir, run_id=active_run.info.run_id)
-    supported_backends = ["local", "databricks"]
+    elif backend == "kubernetes":
+        from mlflow.projects import kubernetes as kb
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV, "docker")
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND,
+                                        "kubernetes")
+        _validate_docker_env(project)
+        _validate_docker_installation()
+        kube_config = _parse_kubernetes_config(backend_config)
+        image = _build_docker_image(work_dir=work_dir,
+                                    image_uri=kube_config["image-uri"],
+                                    base_image=project.docker_env.get('image'),
+                                    run_id=active_run.info.run_id)
+        image_digest = kb.push_image_to_registry(image.tags[0])
+        submitted_run = kb.run_kubernetes_job(project.name,
+                                              active_run,
+                                              image.tags[0],
+                                              image_digest,
+                                              _get_entry_point_command(project, entry_point,
+                                                                       parameters, storage_dir),
+                                              _get_run_env_vars(
+                                                  run_id=active_run.info.run_uuid,
+                                                  experiment_id=active_run.info.experiment_id),
+                                              kube_config['kube-context'],
+                                              kube_config['kube-job-template'])
+        return submitted_run
+
+    supported_backends = ["local", "databricks", "kubernetes"]
     raise ExecutionException("Got unsupported execution mode %s. Supported "
                              "values: %s" % (backend, supported_backends))

@@ -670,7 +705,7 @@ def _get_docker_command(image, active_run):

     for key, value in env_vars.items():
         cmd += ["-e", "{key}={value}".format(key=key, value=value)]
-    cmd += [image]
+    cmd += [image.tags[0]]
     return cmd


@@ -687,10 +722,39 @@
"at https://docs.docker.com/install/overview/.")


def _validate_docker_env(docker_env):
if not docker_env.get('image'):
def _validate_docker_env(project):
if not project.name:
raise ExecutionException("Project name in MLProject must be specified when using docker "
"for image tagging.")
if not project.docker_env.get('image'):
raise ExecutionException("Project with docker environment must specify the docker image "
"to use via an 'image' field under the 'docker_env' field")
"to use via an 'image' field under the 'docker_env' field.")


def _parse_kubernetes_config(backend_config):
"""
Creates build context tarfile containing Dockerfile and project code, returning path to tarfile
"""
if not backend_config:
raise ExecutionException("Backend_config file not found.")
kube_config = backend_config.copy()
if 'kube-job-template-path' not in backend_config.keys():
raise ExecutionException("'kube-job-template-path' attribute must be specified in "
"backend_config.")
kube_job_template = backend_config['kube-job-template-path']
if os.path.exists(kube_job_template):
with open(kube_job_template, 'r') as job_template:
yaml_obj = yaml.safe_load(job_template.read())
kube_job_template = yaml_obj
kube_config['kube-job-template'] = kube_job_template
else:
raise ExecutionException("Could not find 'kube-job-template-path': {}".format(
kube_job_template))
if 'kube-context' not in backend_config.keys():
raise ExecutionException("Could not find kube-context in backend_config.")
if 'image-uri' not in backend_config.keys():
raise ExecutionException("Could not find 'image-uri' in backend_config.")
return kube_config


def _create_docker_build_ctx(work_dir, dockerfile_contents):
@@ -712,51 +776,46 @@ def _create_docker_build_ctx(work_dir, dockerfile_contents):
     return result_path
 
 
-def _build_docker_image(work_dir, project, active_run):
+def _build_docker_image(work_dir, image_uri, base_image, run_id):
     """
-    Build a docker image containing the project in `work_dir`, using the base image and tagging the
-    built image with the project name specified by `project`.
+    Build a docker image containing the project in `work_dir`, using the base image.
     """
-    if not project.name:
-        raise ExecutionException("Project name in MLproject must be specified when using docker "
-                                 "for image tagging.")
-    tag_name = _get_docker_tag_name(project.name, work_dir)
+    tag_name = _get_docker_tag_name(image_uri, work_dir)
     dockerfile = (
         "FROM {imagename}\n"
         "LABEL Name={tag_name}\n"
         "COPY {build_context_path}/ /mlflow/projects/code/\n"
         "WORKDIR /mlflow/projects/code/\n"
-    ).format(imagename=project.docker_env.get('image'), tag_name=tag_name,
+    ).format(imagename=base_image, tag_name=tag_name,
              build_context_path=_PROJECT_TAR_ARCHIVE_NAME)
     build_ctx_path = _create_docker_build_ctx(work_dir, dockerfile)
     with open(build_ctx_path, 'rb') as docker_build_ctx:
         _logger.info("=== Building docker image %s ===", tag_name)
         client = docker.from_env()
-        image = client.images.build(
+        image, _ = client.images.build(
             tag=tag_name, forcerm=True,
             dockerfile=posixpath.join(_PROJECT_TAR_ARCHIVE_NAME, _GENERATED_DOCKERFILE_NAME),
             fileobj=docker_build_ctx, custom_context=True, encoding="gzip")
     try:
         os.remove(build_ctx_path)
     except Exception:  # pylint: disable=broad-except
         _logger.info("Temporary docker context file %s was not deleted.", build_ctx_path)
-    tracking.MlflowClient().set_tag(active_run.info.run_id,
+
+    tracking.MlflowClient().set_tag(run_id,
                                     MLFLOW_DOCKER_IMAGE_NAME,
                                     tag_name)
-    tracking.MlflowClient().set_tag(active_run.info.run_id,
+    tracking.MlflowClient().set_tag(run_id,
                                     MLFLOW_DOCKER_IMAGE_ID,
-                                    image[0].id)
-    return tag_name
+                                    image.id)
+    return image
 
 
-def _get_docker_tag_name(project_name, work_dir):
+def _get_docker_tag_name(imagename, work_dir):
     """Returns an appropriate Docker tag for a project based on name and git hash."""
-    project_name = project_name if project_name else "docker-project"
+    imagename = imagename if imagename else "docker-project"
     # Optionally include first 7 digits of git SHA in tag name, if available.
     git_commit = _get_git_commit(work_dir)
-    version_string = "-" + git_commit[:7] if git_commit else ""
-    return "mlflow-" + project_name + version_string
+    version_string = ":" + git_commit[:7] if git_commit else ""
+    return imagename + version_string
 
 
 __all__ = [
(Diffs for the remaining 11 changed files are not shown.)
