
Commit

Testing the merge of test cancel fix from ijrsvt/fix-test-stress-cancel
Testing whether the fix to the test stress in test_cancel.py works for the PR
Gabriele Oliaro authored Jan 12, 2021
2 parents 0d9bb0d + 86f0f32 commit 3e8e28f
Showing 79 changed files with 2,197 additions and 889 deletions.
2 changes: 2 additions & 0 deletions .flake8
@@ -20,4 +20,6 @@ ignore =
W503
W504
W605
I
N
avoid-escape = no
12 changes: 7 additions & 5 deletions .github/dependabot.yml
@@ -6,16 +6,18 @@ updates:
# If we want to add more requirements here (Core, RLlib, etc.), then we should make subdirectories for each one.
directory: "/python/requirements"
schedule:
# TODO(amogkam) change this to weekly after some initial validation.
interval: "daily"
# 8 PM
time: "20:00"
# Automatic upgrade checks Saturday at 12 AM.
# Dependabot updates can still be manually triggered via Github at any time.
interval: "weekly"
day: "saturday"
# 12 AM
time: "00:00"
# Use Pacific Standard Time
timezone: "America/Los_Angeles"
commit-message:
prefix: "[tune]"
include: "scope"
# Only 3 upgrade PRs at a time.
# Only 3 upgrade PRs open at a time.
open-pull-requests-limit: 3
reviewers:
- "ray-project/ray-tune"
7 changes: 6 additions & 1 deletion .github/workflows/main.yml
@@ -1,6 +1,11 @@
name: CI

on: [push, pull_request]
on:
push:
branches-ignore:
# Don't run CI for Dependabot branch pushes.
- "dependabot/**"
pull_request:

env:
# Git GITHUB_... variables are useful for translating Travis environment variables
4 changes: 4 additions & 0 deletions .travis.yml
@@ -7,6 +7,10 @@ git:
depth: false # Shallow clones can prevent diff against base branch
quiet: true

branches:
except:
- /dependabot.*/

before_install:
- unset -f cd # Travis defines this on Mac for RVM, but it breaks the Mac build
- |
1 change: 1 addition & 0 deletions ci/travis/determine_tests_to_run.py
@@ -88,6 +88,7 @@ def list_changed_files(commit_range):
RAY_CI_LINUX_WHEELS_AFFECTED = 1
RAY_CI_MACOS_WHEELS_AFFECTED = 1
elif changed_file.startswith("python/ray/serve"):
RAY_CI_DOC_AFFECTED = 1
RAY_CI_SERVE_AFFECTED = 1
RAY_CI_LINUX_WHEELS_AFFECTED = 1
RAY_CI_MACOS_WHEELS_AFFECTED = 1
90 changes: 90 additions & 0 deletions doc/source/cluster/kubernetes.rst
@@ -7,14 +7,19 @@ Deploying on Kubernetes

This document is mainly for advanced Kubernetes usage. The easiest way to run a Ray cluster on Kubernetes is by using the built-in Cluster Launcher. Please see the :ref:`Cluster Launcher documentation <ray-launch-k8s>` for details.



This document assumes that you have access to a Kubernetes cluster and have
``kubectl`` installed locally and configured to access the cluster. It will
first walk you through how to deploy a Ray cluster on your existing Kubernetes
cluster, then explore a few different ways to run programs on the Ray cluster.


To learn about deploying an autoscaling Ray cluster, see the documentation for :ref:`Ray's Kubernetes operator<k8s-operator>`.

For information on using GPUs with Ray on Kubernetes, see :ref:`here<k8s-gpus>`.

The configuration ``yaml`` files used here are provided in the `Ray repository`_
as examples to get you started. When deploying real applications, you will probably
want to build and use your own container images, add more worker nodes to the
@@ -292,6 +297,80 @@ To delete a running Ray cluster, you can run the following command:
kubectl delete -f ray/doc/kubernetes/ray-cluster.yaml
.. _k8s-gpus:

Using GPUs
----------

To use GPUs on Kubernetes, you will need to both configure your Kubernetes setup and add additional values to your Ray cluster configuration.

For documentation on GPU usage with the major cloud providers, see the instructions for `GKE`_, `EKS`_, and `AKS`_.

The `Ray Docker Hub <https://hub.docker.com/r/rayproject/>`_ hosts CUDA-based images packaged with Ray for use in Kubernetes pods.
For example, the image ``rayproject/ray-ml:nightly-gpu`` is ideal for running GPU-based ML workloads with the most recent nightly build of Ray.
Read :ref:`here<docker-images>` for further details on Ray images.

Using Nvidia GPUs requires specifying the relevant resource `limits` in the container fields of your Kubernetes configurations.
(Kubernetes `sets <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins>`_
the GPU request equal to the limit.) The configuration for a pod running a Ray GPU image and
using one Nvidia GPU looks like this:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      containers:
        - name: ray-node
          image: rayproject/ray:nightly-gpu
          ...
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              memory: 512Mi
              nvidia.com/gpu: 1
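
If you want to sanity-check the GPU setup, a small Ray task that requests a GPU can confirm that scheduling works. The snippet below is only a rough sketch; it assumes you run it from inside the cluster (for example, via ``kubectl exec`` into the head node pod) against a Ray cluster that is already up.

.. code-block:: python

    # Minimal sketch: verify that Ray can schedule a task on the GPU
    # requested in the pod spec above. Assumes an already-running Ray
    # cluster reachable from this pod.
    import ray

    ray.init(address="auto")  # connect to the existing Ray cluster

    @ray.remote(num_gpus=1)
    def gpu_check():
        # Returns the GPU ids Ray assigned to this task (e.g. [0]).
        return ray.get_gpu_ids()

    print(ray.get(gpu_check.remote()))
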
GPU taints and tolerations
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::

Users of a managed Kubernetes service probably don't need to worry about this section.

The `Nvidia gpu plugin`_ for Kubernetes applies `taints`_ to GPU nodes; these taints prevent non-GPU pods from being scheduled on GPU nodes.
Managed Kubernetes services like GKE, EKS, and AKS automatically apply matching `tolerations`_
to pods requesting GPU resources. Tolerations are applied by means of Kubernetes's `ExtendedResourceToleration`_ `admission controller`_.
If this admission controller is not enabled for your Kubernetes cluster, you may need to manually add a GPU toleration to each of your GPU pod configurations. For example:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      ...
      containers:
      - name: ray-node
        image: rayproject/ray:nightly-gpu
        ...

Further reference and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Read about Kubernetes device plugins `here <https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/>`__,
about Kubernetes GPU plugins `here <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus>`__,
and about Nvidia's GPU plugin for Kubernetes `here <https://github.com/NVIDIA/k8s-device-plugin>`__.

If you run into problems setting up GPUs for your Ray cluster on Kubernetes, please reach out to us at `<https://discuss.ray.io>`_.

Questions or Issues?
--------------------

@@ -303,3 +382,14 @@ Questions or Issues?
.. _`Kubernetes Service`: https://kubernetes.io/docs/concepts/services-networking/service/
.. _`Kubernetes Deployment`: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
.. _`Kubernetes Job`: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/

.. _`Discussion Board`: https://discuss.ray.io/
.. _`GKE`: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
.. _`EKS`: https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html
.. _`AKS`: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster

.. _`tolerations`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _`taints`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _`Nvidia gpu plugin`: https://github.com/NVIDIA/k8s-device-plugin
.. _`admission controller`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
.. _`ExtendedResourceToleration`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#extendedresourcetoleration
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -304,6 +304,7 @@ Papers
xgboost-ray.rst
dask-on-ray.rst
mars-on-ray.rst
ray-client.rst
.. toctree::
:hidden:
1 change: 1 addition & 0 deletions doc/source/installation.rst
@@ -229,6 +229,7 @@ Installing from ``pip`` should be sufficient for most Ray users.
However, should you need to build from source, follow :ref:`these instructions for building <building-ray>` Ray.


.. _docker-images:

Docker Source Images
--------------------
69 changes: 69 additions & 0 deletions doc/source/ray-client.rst
@@ -0,0 +1,69 @@
**********
Ray Client
**********

.. note::

This feature is still in beta and subject to changes.

===========
Basic usage
===========

While in beta, the server is available as an executable module. To start the server, run:

``python -m ray.util.client.server [--host host_ip] [--port port] [--redis-address address] [--redis-password password]``

This runs ``ray.init()`` with default options and exposes the client gRPC port at ``host_ip:port`` (by default, ``0.0.0.0:50051``). If ``redis-address`` and ``redis-password`` are provided, they are passed to ``ray.init()`` when the server starts, allowing it to connect to an existing Ray cluster, as per the `cluster setup <cluster/index.html>`_ instructions.

From here, another Ray script can access that server from a networked machine with ``ray.util.connect()``:

.. code-block:: python

    import ray
    import ray.util

    ray.util.connect("0.0.0.0:50051")  # replace with the appropriate host and port

    # Normal Ray code follows
    @ray.remote
    def f(x):
        return x ** x

    f.remote(2)
    # ...

When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if the client had directly disconnected from the cluster.


===================
``RAY_CLIENT_MODE``
===================

Because Ray client mode affects the behavior of the Ray API, larger scripts or libraries imported before ``ray.util.connect()`` may not realize they're in client mode. This limitation is tracked in `issue #13272 <https://github.com/ray-project/ray/issues/13272>`_, but a workaround is provided here for beta users.

One option is to defer the imports in a ``main`` script that calls ``ray.util.connect()`` first, as in the sketch below. However, some older scripts or libraries might not support that.
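
As a minimal sketch of that deferred-import pattern: the module ``my_ray_library`` and its ``run()`` entry point below are hypothetical placeholders for whatever Ray-using code your program imports.

.. code-block:: python

    # Deferred-import sketch: connect to the Ray client server first, then
    # import the code that uses the Ray API so it observes client mode.
    import ray
    import ray.util


    def main():
        # Hypothetical module that defines remote functions/actors.
        import my_ray_library
        my_ray_library.run()


    if __name__ == "__main__":
        ray.util.connect("0.0.0.0:50051")  # replace with your server's host and port
        main()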

Therefore, an environment variable is also available to force a Ray program into client mode: ``RAY_CLIENT_MODE``. An example usage:

.. code-block:: bash

    RAY_CLIENT_MODE=1 python my_ray_program.py

========================================
Programmatically creating the server
========================================

For larger use cases, it may be desirable to connect remote Ray clients to an existing Ray environment. The server can be started programmatically:

.. code-block:: python

    from ray.util.client.server import serve

    server = serve("0.0.0.0:50051")
    # Server does some work
    # ...
    # Time to clean up
    server.stop(0)

31 changes: 25 additions & 6 deletions doc/source/rllib-algorithms.rst
@@ -11,7 +11,7 @@ Available Algorithms - Overview
=================== ========== ======================= ================== =========== =============================================================
Algorithm Frameworks Discrete Actions Continuous Actions Multi-Agent Model Support
=================== ========== ======================= ================== =========== =============================================================
`A2C, A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`A2C, A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`ARS`_ tf + torch **Yes** **Yes** No
`BC`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_
`ES`_ tf + torch **Yes** **Yes** No
@@ -20,13 +20,14 @@ Algorithm Frameworks Discrete Actions Continuous Actions Multi-
`Dreamer`_ torch No **Yes** No `+RNN`_
`DQN`_, `Rainbow`_ tf + torch **Yes** `+parametric`_ No **Yes**
`APEX-DQN`_ tf + torch **Yes** `+parametric`_ No **Yes**
`IMPALA`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`IMPALA`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`MAML`_ tf + torch No **Yes** No
`MARWIL`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_
`MBMPO`_ torch No **Yes** No
`PG`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`PPO`_, `APPO`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`PG`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`PPO`_, `APPO`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`SAC`_ tf + torch **Yes** **Yes** **Yes**
`SlateQ`_ torch **Yes** No No
`LinUCB`_, `LinTS`_ torch **Yes** `+parametric`_ No **Yes**
`AlphaZero`_ torch **Yes** `+parametric`_ No No
=================== ========== ======================= ================== =========== =============================================================
@@ -60,9 +61,9 @@ Algorithm Frameworks Discrete Actions Continuous Acti
.. _`+LSTM auto-wrapping`: rllib-models.html#built-in-models
.. _`+parametric`: rllib-models.html#variable-length-parametric-action-spaces
.. _`Rainbow`: rllib-algorithms.html#dqn
.. _`+RNN`: rllib-models.html#recurrent-models
.. _`+RNN`: rllib-models.html#rnns
.. _`TD3`: rllib-algorithms.html#ddpg
.. _`+Transformer`: rllib-models.html#attention-networks
.. _`+Attention`: rllib-models.html#attention

High-throughput architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -523,6 +524,24 @@ Cheetah-Run 640 ~800
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__

.. _slateq:

SlateQ
-------
|pytorch|
`[paper] <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/9f91de1fa0ac351ecb12e4062a37afb896aa1463.pdf>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/slateq/slateq.py>`__

SlateQ is a model-free RL method that builds on top of DQN and generates recommendation slates for recommender-system environments. These environments come with large combinatorial action spaces; SlateQ mitigates this by decomposing the Q-value into single-item Q-values and solving the decomposed objective with a mix of integer programming and deep learning optimization. SlateQ can be evaluated on Google's RecSim `environment <https://github.com/google-research/recsim>`__. An RLlib wrapper for RecSim can be found `here <https://github.com/ray-project/ray/blob/master/rllib/env/wrappers/recsim_wrapper.py>`__.

RecSim environment wrapper: `Google RecSim <https://github.com/ray-project/ray/blob/master/rllib/env/wrappers/recsim_wrapper.py>`__
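
As a rough, hedged illustration, the sketch below launches the SlateQ trainer through Tune against a RecSim-based environment. The environment id ``"RecSim-v1"`` and the stopping criterion are assumptions made for illustration; check the RecSim wrapper module linked above for the actual registered name.

.. code-block:: python

    # Hypothetical sketch of training SlateQ on a RecSim-based environment.
    import ray
    from ray import tune

    # Importing the wrapper module is assumed to register the RecSim env.
    import ray.rllib.env.wrappers.recsim_wrapper  # noqa: F401

    ray.init()
    tune.run(
        "SlateQ",                  # RLlib's SlateQ trainer
        config={
            "env": "RecSim-v1",    # assumed env id registered by the wrapper
            "framework": "torch",  # SlateQ currently supports torch only
        },
        stop={"timesteps_total": 100000},
    )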

**SlateQ-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. literalinclude:: ../../rllib/agents/slateq/slateq.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__

Derivative-free
~~~~~~~~~~~~~~~
