Run MLflow projects on Kubernetes (mlflow#1181)
* initial setup and job yaml file creation

* first tests on kubernetes

* refactoring and tests

* adds kubernetes mode to cli

* adds Azure vars to env vars if they exist on the local host

* fix tag setup and parameters in kubernetes call

* encapsulates run command logic in function

* adds Job monitoring to the command line

* lint

* remove kube context from MLproject file

* refactor MLproject and new tests

* validates docker_auth_config and kube_context in cli

* Add docs

* add example

* lint and tests for kubernetes config

* lint on tests

* lint tests

* add kubernetes in requirements.txt and lint tests

* lint and tests env

* lint and requirements

* merge, logging and lint

* remove unused import

* change image_namespace in MLproject in docker example

* add resources attributes in MLproject
ability to pass k8s job template

* improve docs on k8s resources in MLprojects
Ability to pass relative job-template file

* lint on tests

* remove docker auth parameter from CLI

* refactors image tag

* lint

* change mode to backend

* change mode to backend

* update to use backend_config

* update to backend_config

* change username in image_url

* lint and add base-image to kubernetes_config.json

* lint

* absolute_import to handle kubernetes module import conflict

* lint

* docs

* remove kubernetes_env in project_spec and var names in cli

* fix container command format

* return docker_env usage in kubernetes backend

* lint

* lint _logger.info

* update kubernetes docs and fix docker docs

* docker project validation

* docker image tagging scheme

* tests for docker_env validation, backend tag and other issues

* handles errors pushing docker images

* docs typo, remove kubernetes from setup

* remove code to add azure blob storage keys in job definition file

* lint

* load kube-job-template-path

* refactor to add KubernetesSubmittedRun

* lint

* add KubernetesSubmittedRun and tests

* lint tests

* lint

* Some changes

* lint and fixes

* fix tests validating docker env

* remove the streaming of the logs

* Updated logic for checking Kubernetes Job status.

* Fixed tests.

* Added a test for state transitions of kubernetes run.

* Minor update.

* docs

* backend tag rename

* remove rstcheck from lint-requirements

* correct project env tag

* rstcheck in lint-requirements.txt

* handles environment variables in job template when it already has variables specified

* Simplified state monitoring of Kubernetes Submitted Run.
marcusrehm authored and tomasatdatabricks committed Jun 26, 2019
1 parent b0836f8 commit 5dcb226
Showing 17 changed files with 648 additions and 41 deletions.
3 changes: 3 additions & 0 deletions dev-requirements.txt
@@ -7,3 +7,6 @@ nose
 codecov
 coverage
 pypi-publisher
+scikit-learn
+scipy
+kubernetes
88 changes: 86 additions & 2 deletions docs/source/projects.rst
@@ -212,7 +212,8 @@ Docker container environment

.. code-block:: yaml

-    docker_env: mlflow-docker-example-environment
+    docker_env:
+      image: mlflow-docker-example-environment

In this example, ``docker_env`` refers to the Docker image with name
``mlflow-docker-example-environment`` and default tag ``latest``. Because no registry path is
@@ -223,7 +224,8 @@

.. code-block:: yaml

-    docker_env: 012345678910.dkr.ecr.us-west-2.amazonaws.com/mlflow-docker-example-environment:7.0
+    docker_env:
+      image: 012345678910.dkr.ecr.us-west-2.amazonaws.com/mlflow-docker-example-environment:7.0

In this example, ``docker_env`` refers to the Docker image with name
``mlflow-docker-example-environment`` and tag ``7.0`` in the Docker registry with path
@@ -324,6 +326,9 @@ Deployment Mode
  includes setting cluster parameters such as a VM type. Of course, you can also run projects on
  any other computing infrastructure of your choice using the local version of the ``mlflow run``
  command (for example, submit a script that does ``mlflow run`` to a standard job queueing system).

  It is also possible to launch projects remotely on `Kubernetes <https://kubernetes.io/>`_ clusters
  from the command line (see :ref:`Run a project on Kubernetes <kubernetes_execution>`).

Environment
  By default, MLflow Projects are run in the environment specified by the project directory
@@ -363,6 +368,85 @@ for your run. Then, run your project using the command
where ``<uri>`` is a Git repository URI or a folder.

.. _kubernetes_execution:

Run a project on Kubernetes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

MLflow projects can be executed on Kubernetes clusters. This feature reuses the image built for
running projects in the :ref:`Docker environment <project-docker-container-environments>`, pushing
it to an image registry, so your MLproject file must include a ``docker_env`` section. MLflow then
creates a Kubernetes Job that pulls the published image and runs the MLflow project on the
cluster. A brief overview of how to configure and use this feature follows.

In the project folder, create a ``backend_config.json`` file with the following attributes:

.. code-block:: json

    {
      "kube-context": "docker-for-desktop",
      "image-uri": "username/mlflow-kubernetes-example",
      "kube-job-template-path": "kubernetes_job_template.yaml"
    }

The ``kube-context`` attribute is the Kubernetes context in which MLflow runs the Job, and
``image-uri`` is the registry/repository/image location to which the image is pushed so that
Kubernetes can pull and run it. Remember that MLflow expects login credentials to already be
stored for both the Kubernetes context and the Docker registry used for pushing images.
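
For instance, one quick way to confirm that the context named in ``backend_config.json`` is
visible on the local machine is through the ``kubernetes`` Python client (added to
``dev-requirements.txt`` in this change). This is a minimal sketch, not part of MLflow itself:

.. code-block:: python

    from kubernetes import config

    # list_kube_config_contexts() returns (all_contexts, active_context);
    # the context named in backend_config.json should appear in the first list.
    contexts, active = config.list_kube_config_contexts()
    print([c["name"] for c in contexts])
    print("active:", active["name"])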

The ``kube-job-template-path`` attribute points to a YAML file with the Kubernetes Job
specification used to run the training on Kubernetes; the version shipped with the Docker example
project is shown below. For more information about the specification options, see the Kubernetes
docs: `Jobs - Run to Completion <https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/>`_.

Keep in mind that MLflow overwrites the following attributes in the job YAML file so that it can
manage Job creation and monitor Job status (a sketch of this substitution appears after the
template below):

- ``metadata.name``
- ``spec.template.spec.container[0].name``
- ``spec.template.spec.container[0].image``
- ``spec.template.spec.container[0].command``

.. code-block:: yaml

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ""
      namespace: mlflow
    spec:
      ttlSecondsAfterFinished: 100
      backoffLimit: 0
      template:
        spec:
          containers:
          - name: "name"
            image: "image"
            command: ["cmd"]
            resources:
              limits:
                memory: 512Mi
              requests:
                memory: 256Mi
          restartPolicy: Never
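
As a rough illustration of that substitution (the actual logic lives in
``mlflow/projects/kubernetes.py``, which is among the files not shown in this diff; the helper
below is hypothetical):

.. code-block:: python

    import yaml

    def substitute_job_spec(template_path, job_name, image_tag, command):
        """Hypothetical sketch: fill in the four fields that MLflow overwrites."""
        with open(template_path) as f:
            job = yaml.safe_load(f)
        job["metadata"]["name"] = job_name
        container = job["spec"]["template"]["spec"]["containers"][0]
        container["name"] = job_name
        container["image"] = image_tag
        container["command"] = command  # the project's entry point command, as a list
        return job
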
To run your project, use the command:

.. code-block:: bash

    mlflow run <uri> --backend kubernetes --backend-config examples/docker/kubernetes_config.json

where ``<uri>`` is a Git repository URI or a folder.

To see it in action, you can use the `Docker example <https://github.com/mlflow/mlflow/tree/master/examples/docker>`_
with its ``kubernetes_config.json`` and ``kubernetes_job_template.yaml`` files.
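
The same flow can also be driven from the Python API. A sketch, assuming ``backend_config`` is
passed as the parsed configuration dictionary (parameter names follow the ``projects.run`` call
in ``mlflow/cli.py`` below; adjust the paths to your setup):

.. code-block:: python

    import mlflow.projects

    backend_config = {
        "kube-context": "docker-for-desktop",
        "kube-job-template-path": "examples/docker/kubernetes_job_template.yaml",
        "image-uri": "username/mlflow-kubernetes-example",
    }
    submitted_run = mlflow.projects.run(
        "examples/docker",  # a Git repository URI or a local project folder
        backend="kubernetes",
        backend_config=backend_config,
        synchronous=True,  # block until the Kubernetes Job finishes
    )
    print(submitted_run.run_id)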


Iterating Quickly
-----------------

5 changes: 5 additions & 0 deletions examples/docker/kubernetes_config.json
@@ -0,0 +1,5 @@
{
  "kube-context": "docker-for-desktop",
  "kube-job-template-path": "examples/docker/kubernetes_job_template.yaml",
  "image-uri": "username/mlflow-kubernetes-example"
}
20 changes: 20 additions & 0 deletions examples/docker/kubernetes_job_template.yaml
@@ -0,0 +1,20 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: ""
  namespace: mlflow
spec:
  ttlSecondsAfterFinished: 100
  backoffLimit: 0
  template:
    spec:
      containers:
      - name: "name"
        image: "image"
        command: ["cmd"]
        resources:
          limits:
            memory: 512Mi
          requests:
            memory: 256Mi
      restartPolicy: Never
15 changes: 9 additions & 6 deletions mlflow/cli.py
@@ -114,13 +114,16 @@ def run(uri, entry_point, version, param_list, experiment_name, experiment_id, b
eprint("Repeated parameter: '%s'" % name)
sys.exit(1)
param_dict[name] = value
cluster_spec_arg = backend_config
if backend_config is not None and os.path.splitext(backend_config)[-1] != ".json":
try:
cluster_spec_arg = json.loads(backend_config)
backend_config = json.loads(backend_config)
except ValueError as e:
eprint("Invalid cluster spec JSON. Parse error: %s" % e)
eprint("Invalid backend config JSON. Parse error: %s" % e)
raise
if backend == "kubernetes":
if backend_config is None:
eprint("Specify 'backend_config' when using kubernetes mode.")
sys.exit(1)
try:
projects.run(
uri,
@@ -130,11 +133,11 @@
             experiment_id=experiment_id,
             parameters=param_dict,
             backend=backend,
-            backend_config=cluster_spec_arg,
+            backend_config=backend_config,
             use_conda=(not no_conda),
             storage_dir=storage_dir,
-            synchronous=backend == "local" or backend is None,
-            run_id=run_id,
+            synchronous=backend in ("local", "kubernetes") or backend is None,
+            run_id=run_id
         )
     except projects.ExecutionException as e:
         _logger.error("=== %s ===", e)
115 changes: 87 additions & 28 deletions mlflow/projects/__init__.py
@@ -7,6 +7,7 @@
 from distutils import dir_util
 import hashlib
 import json
+import yaml
 import os
 import sys
 import re
@@ -33,7 +34,8 @@
 from mlflow.utils.mlflow_tags import MLFLOW_PROJECT_ENV, MLFLOW_DOCKER_IMAGE_NAME, \
     MLFLOW_DOCKER_IMAGE_ID, MLFLOW_USER, MLFLOW_SOURCE_NAME, MLFLOW_SOURCE_TYPE, \
     MLFLOW_GIT_COMMIT, MLFLOW_GIT_REPO_URL, MLFLOW_GIT_BRANCH, LEGACY_MLFLOW_GIT_REPO_URL, \
-    LEGACY_MLFLOW_GIT_BRANCH_NAME, MLFLOW_PROJECT_ENTRY_POINT, MLFLOW_PARENT_RUN_ID
+    LEGACY_MLFLOW_GIT_BRANCH_NAME, MLFLOW_PROJECT_ENTRY_POINT, MLFLOW_PARENT_RUN_ID, \
+    MLFLOW_PROJECT_BACKEND
 from mlflow.utils import databricks_utils, file_utils
 
 # TODO: this should be restricted to just Git repos and not S3 and stuff like that
@@ -108,6 +110,8 @@ def _run(uri, experiment_id, entry_point="main", version=None, parameters=None,
         tracking.MlflowClient().set_tag(active_run.info.run_id, tag, version)
 
     if backend == "databricks":
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND,
+                                        "databricks")
         from mlflow.projects.databricks import run_databricks
         return run_databricks(
             remote_run=active_run,
@@ -120,17 +124,22 @@
     # If a docker_env attribute is defined in MLproject then it takes precedence over conda yaml
     # environments, so the project will be executed inside a docker container.
     if project.docker_env:
-        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV, "docker")
-        _validate_docker_env(project.docker_env)
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV,
+                                        "docker")
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND,
+                                        "local")
+        _validate_docker_env(project)
         _validate_docker_installation()
         image = _build_docker_image(work_dir=work_dir,
-                                    project=project,
-                                    active_run=active_run)
+                                    image_uri=project.name,
+                                    base_image=project.docker_env.get('image'),
+                                    run_id=active_run.info.run_id)
         command += _get_docker_command(image=image, active_run=active_run)
     # Synchronously create a conda environment (even though this may take some time)
     # to avoid failures due to multiple concurrent attempts to create the same conda env.
     elif use_conda:
         tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV, "conda")
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND, "local")
         command_separator = " && "
         conda_env_name = _get_or_create_conda_env(project.conda_env_path)
         command += _get_conda_command(conda_env_name)
@@ -147,7 +156,33 @@
             work_dir=work_dir, entry_point=entry_point, parameters=parameters,
             experiment_id=experiment_id,
             use_conda=use_conda, storage_dir=storage_dir, run_id=active_run.info.run_id)
-    supported_backends = ["local", "databricks"]
+    elif backend == "kubernetes":
+        from mlflow.projects import kubernetes as kb
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_ENV, "docker")
+        tracking.MlflowClient().set_tag(active_run.info.run_id, MLFLOW_PROJECT_BACKEND,
+                                        "kubernetes")
+        _validate_docker_env(project)
+        _validate_docker_installation()
+        kube_config = _parse_kubernetes_config(backend_config)
+        image = _build_docker_image(work_dir=work_dir,
+                                    image_uri=kube_config["image-uri"],
+                                    base_image=project.docker_env.get('image'),
+                                    run_id=active_run.info.run_id)
+        image_digest = kb.push_image_to_registry(image.tags[0])
+        submitted_run = kb.run_kubernetes_job(project.name,
+                                              active_run,
+                                              image.tags[0],
+                                              image_digest,
+                                              _get_entry_point_command(project, entry_point,
+                                                                       parameters, storage_dir),
+                                              _get_run_env_vars(
+                                                  run_id=active_run.info.run_uuid,
+                                                  experiment_id=active_run.info.experiment_id),
+                                              kube_config['kube-context'],
+                                              kube_config['kube-job-template'])
+        return submitted_run
+
+    supported_backends = ["local", "databricks", "kubernetes"]
     raise ExecutionException("Got unsupported execution mode %s. Supported "
                              "values: %s" % (backend, supported_backends))

@@ -670,7 +705,7 @@ def _get_docker_command(image, active_run):

     for key, value in env_vars.items():
         cmd += ["-e", "{key}={value}".format(key=key, value=value)]
-    cmd += [image]
+    cmd += [image.tags[0]]
     return cmd


@@ -687,10 +722,39 @@
"at https://docs.docker.com/install/overview/.")


def _validate_docker_env(docker_env):
if not docker_env.get('image'):
def _validate_docker_env(project):
if not project.name:
raise ExecutionException("Project name in MLProject must be specified when using docker "
"for image tagging.")
if not project.docker_env.get('image'):
raise ExecutionException("Project with docker environment must specify the docker image "
"to use via an 'image' field under the 'docker_env' field")
"to use via an 'image' field under the 'docker_env' field.")


def _parse_kubernetes_config(backend_config):
"""
Creates build context tarfile containing Dockerfile and project code, returning path to tarfile
"""
if not backend_config:
raise ExecutionException("Backend_config file not found.")
kube_config = backend_config.copy()
if 'kube-job-template-path' not in backend_config.keys():
raise ExecutionException("'kube-job-template-path' attribute must be specified in "
"backend_config.")
kube_job_template = backend_config['kube-job-template-path']
if os.path.exists(kube_job_template):
with open(kube_job_template, 'r') as job_template:
yaml_obj = yaml.safe_load(job_template.read())
kube_job_template = yaml_obj
kube_config['kube-job-template'] = kube_job_template
else:
raise ExecutionException("Could not find 'kube-job-template-path': {}".format(
kube_job_template))
if 'kube-context' not in backend_config.keys():
raise ExecutionException("Could not find kube-context in backend_config.")
if 'image-uri' not in backend_config.keys():
raise ExecutionException("Could not find 'image-uri' in backend_config.")
return kube_config


def _create_docker_build_ctx(work_dir, dockerfile_contents):
@@ -712,51 +776,46 @@ def _create_docker_build_ctx(work_dir, dockerfile_contents):
     return result_path
 
 
-def _build_docker_image(work_dir, project, active_run):
+def _build_docker_image(work_dir, image_uri, base_image, run_id):
     """
-    Build a docker image containing the project in `work_dir`, using the base image and tagging the
-    built image with the project name specified by `project`.
+    Build a docker image containing the project in `work_dir`, using the base image.
     """
-    if not project.name:
-        raise ExecutionException("Project name in MLproject must be specified when using docker "
-                                 "for image tagging.")
-    tag_name = _get_docker_tag_name(project.name, work_dir)
+    tag_name = _get_docker_tag_name(image_uri, work_dir)
     dockerfile = (
         "FROM {imagename}\n"
         "LABEL Name={tag_name}\n"
         "COPY {build_context_path}/ /mlflow/projects/code/\n"
         "WORKDIR /mlflow/projects/code/\n"
-    ).format(imagename=project.docker_env.get('image'), tag_name=tag_name,
+    ).format(imagename=base_image, tag_name=tag_name,
              build_context_path=_PROJECT_TAR_ARCHIVE_NAME)
     build_ctx_path = _create_docker_build_ctx(work_dir, dockerfile)
     with open(build_ctx_path, 'rb') as docker_build_ctx:
         _logger.info("=== Building docker image %s ===", tag_name)
         client = docker.from_env()
-        image = client.images.build(
+        image, _ = client.images.build(
             tag=tag_name, forcerm=True,
             dockerfile=posixpath.join(_PROJECT_TAR_ARCHIVE_NAME, _GENERATED_DOCKERFILE_NAME),
             fileobj=docker_build_ctx, custom_context=True, encoding="gzip")
     try:
         os.remove(build_ctx_path)
     except Exception:  # pylint: disable=broad-except
         _logger.info("Temporary docker context file %s was not deleted.", build_ctx_path)
-    tracking.MlflowClient().set_tag(active_run.info.run_id,
+
+    tracking.MlflowClient().set_tag(run_id,
                                     MLFLOW_DOCKER_IMAGE_NAME,
                                     tag_name)
-    tracking.MlflowClient().set_tag(active_run.info.run_id,
+    tracking.MlflowClient().set_tag(run_id,
                                     MLFLOW_DOCKER_IMAGE_ID,
-                                    image[0].id)
-    return tag_name
+                                    image.id)
+    return image
 
 
-def _get_docker_tag_name(project_name, work_dir):
+def _get_docker_tag_name(imagename, work_dir):
     """Returns an appropriate Docker tag for a project based on name and git hash."""
-    project_name = project_name if project_name else "docker-project"
+    imagename = imagename if imagename else "docker-project"
     # Optionally include first 7 digits of git SHA in tag name, if available.
     git_commit = _get_git_commit(work_dir)
-    version_string = "-" + git_commit[:7] if git_commit else ""
-    return "mlflow-" + project_name + version_string
+    version_string = ":" + git_commit[:7] if git_commit else ""
+    return imagename + version_string
 
 
 __all__ = [
(Diffs for the remaining 11 changed files are not shown.)
