
Commit

Testing the merge of test cancel fix from ijrsvt/fix-test-stress-cancel
Testing whether the fix to the test stress in test_cancel.py works for the PR
Gabriele Oliaro authored Jan 12, 2021
2 parents 0d9bb0d + 86f0f32 commit 3e8e28f
Showing 79 changed files with 2,197 additions and 889 deletions.
2 changes: 2 additions & 0 deletions .flake8
@@ -20,4 +20,6 @@ ignore =
W503
W504
W605
I
N
avoid-escape = no
12 changes: 7 additions & 5 deletions .github/dependabot.yml
@@ -6,16 +6,18 @@ updates:
# If we want to add more requirements here (Core, RLlib, etc.), then we should make subdirectories for each one.
directory: "/python/requirements"
schedule:
# TODO(amogkam) change this to weekly after some initial validation.
interval: "daily"
# 8 PM
time: "20:00"
# Automatic upgrade checks Saturday at 12 AM.
# Dependabot updates can still be manually triggered via Github at any time.
interval: "weekly"
day: "saturday"
# 12 AM
time: "00:00"
# Use Pacific Standard Time
timezone: "America/Los_Angeles"
commit-message:
prefix: "[tune]"
include: "scope"
# Only 3 upgrade PRs at a time.
# Only 3 upgrade PRs open at a time.
open-pull-requests-limit: 3
reviewers:
- "ray-project/ray-tune"
7 changes: 6 additions & 1 deletion .github/workflows/main.yml
@@ -1,6 +1,11 @@
name: CI

on: [push, pull_request]
on:
push:
branches-ignore:
# Don't run CI for Dependabot branch pushes.
- "dependabot/**"
pull_request:

env:
# Git GITHUB_... variables are useful for translating Travis environment variables
4 changes: 4 additions & 0 deletions .travis.yml
@@ -7,6 +7,10 @@ git:
depth: false # Shallow clones can prevent diff against base branch
quiet: true

branches:
except:
- /dependabot.*/

before_install:
- unset -f cd # Travis defines this on Mac for RVM, but it breaks the Mac build
- |
1 change: 1 addition & 0 deletions ci/travis/determine_tests_to_run.py
@@ -88,6 +88,7 @@ def list_changed_files(commit_range):
RAY_CI_LINUX_WHEELS_AFFECTED = 1
RAY_CI_MACOS_WHEELS_AFFECTED = 1
elif changed_file.startswith("python/ray/serve"):
RAY_CI_DOC_AFFECTED = 1
RAY_CI_SERVE_AFFECTED = 1
RAY_CI_LINUX_WHEELS_AFFECTED = 1
RAY_CI_MACOS_WHEELS_AFFECTED = 1
90 changes: 90 additions & 0 deletions doc/source/cluster/kubernetes.rst
@@ -7,14 +7,19 @@ Deploying on Kubernetes

This document is mainly for advanced Kubernetes usage. The easiest way to run a Ray cluster on Kubernetes is by using the built-in Cluster Launcher. Please see the :ref:`Cluster Launcher documentation <ray-launch-k8s>` for details.



This document assumes that you have access to a Kubernetes cluster and have
``kubectl`` installed locally and configured to access the cluster. It will
first walk you through how to deploy a Ray cluster on your existing Kubernetes
cluster, then explore a few different ways to run programs on the Ray cluster.


To learn about deploying an autoscaling Ray cluster, see the documentation for :ref:`Ray's Kubernetes operator<k8s-operator>`.

For information on using GPUs with Ray on Kubernetes, see :ref:`here<k8s-gpus>`.

The configuration ``yaml`` files used here are provided in the `Ray repository`_
as examples to get you started. When deploying real applications, you will probably
want to build and use your own container images, add more worker nodes to the
@@ -292,6 +297,80 @@ To delete a running Ray cluster, you can run the following command:
kubectl delete -f ray/doc/kubernetes/ray-cluster.yaml
.. _k8s-gpus:

Using GPUs
----------

To use GPUs on Kubernetes, you will need to both configure your Kubernetes setup and add additional values to your Ray cluster configuration.

For documentation on GPU usage with the major cloud providers, see the instructions for `GKE`_, `EKS`_, and `AKS`_.

The `Ray Docker Hub <https://hub.docker.com/r/rayproject/>`_ hosts CUDA-based images packaged with Ray for use in Kubernetes pods.
For example, the image ``rayproject/ray-ml:nightly-gpu`` is ideal for running GPU-based ML workloads with the most recent nightly build of Ray.
Read :ref:`here<docker-images>` for further details on Ray images.

Using Nvidia GPUs requires specifying the relevant resource `limits` in the container fields of your Kubernetes configurations.
(Kubernetes `sets <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins>`_
the GPU request equal to the limit.) The configuration for a pod running a Ray GPU image and
using one Nvidia GPU looks like this:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      containers:
        - name: ray-node
          image: rayproject/ray:nightly-gpu
          ...
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              memory: 512Mi
              nvidia.com/gpu: 1
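
If you want to sanity-check the GPU setup, a small Ray task that requests a GPU can confirm that scheduling works. The snippet below is only a rough sketch; it assumes you run it from inside the cluster (for example, via ``kubectl exec`` into the head node pod) against a Ray cluster that is already up.

.. code-block:: python

    # Minimal sketch: verify that Ray can schedule a task on the GPU
    # requested in the pod spec above. Assumes an already-running Ray
    # cluster reachable from this pod.
    import ray

    ray.init(address="auto")  # connect to the existing Ray cluster

    @ray.remote(num_gpus=1)
    def gpu_check():
        # Returns the GPU ids Ray assigned to this task (e.g. [0]).
        return ray.get_gpu_ids()

    print(ray.get(gpu_check.remote()))
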
GPU taints and tolerations
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::

Users of a managed Kubernetes service probably don't need to worry about this section.

The `Nvidia gpu plugin`_ for Kubernetes applies `taints`_ to GPU nodes; these taints prevent non-GPU pods from being scheduled on GPU nodes.
Managed Kubernetes services like GKE, EKS, and AKS automatically apply matching `tolerations`_
to pods requesting GPU resources. Tolerations are applied by means of Kubernetes's `ExtendedResourceToleration`_ `admission controller`_.
If this admission controller is not enabled for your Kubernetes cluster, you may need to manually add a GPU toleration to each of your GPU pod configurations. For example:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      ...
      containers:
      - name: ray-node
        image: rayproject/ray:nightly-gpu
        ...

Further reference and discussion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Read about Kubernetes device plugins `here <https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/>`__,
about Kubernetes GPU plugins `here <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus>`__,
and about Nvidia's GPU plugin for Kubernetes `here <https://github.com/NVIDIA/k8s-device-plugin>`__.

If you run into problems setting up GPUs for your Ray cluster on Kubernetes, please reach out to us at `<https://discuss.ray.io>`_.

Questions or Issues?
--------------------

@@ -303,3 +382,14 @@ Questions or Issues?
.. _`Kubernetes Service`: https://kubernetes.io/docs/concepts/services-networking/service/
.. _`Kubernetes Deployment`: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
.. _`Kubernetes Job`: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/

.. _`Discussion Board`: https://discuss.ray.io/
.. _`GKE`: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
.. _`EKS`: https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html
.. _`AKS`: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster

.. _`tolerations`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _`taints`: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _`Nvidia gpu plugin`: https://github.com/NVIDIA/k8s-device-plugin
.. _`admission controller`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
.. _`ExtendedResourceToleration`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#extendedresourcetoleration
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -304,6 +304,7 @@ Papers
xgboost-ray.rst
dask-on-ray.rst
mars-on-ray.rst
ray-client.rst
.. toctree::
:hidden:
1 change: 1 addition & 0 deletions doc/source/installation.rst
@@ -229,6 +229,7 @@ Installing from ``pip`` should be sufficient for most Ray users.
However, should you need to build from source, follow :ref:`these instructions for building <building-ray>` Ray.


.. _docker-images:

Docker Source Images
--------------------
69 changes: 69 additions & 0 deletions doc/source/ray-client.rst
@@ -0,0 +1,69 @@
**********
Ray Client
**********

.. note::

This feature is still in beta and subject to changes.

===========
Basic usage
===========

While in beta, the server is available as an executable module. To start the server, run:

``python -m ray.util.client.server [--host host_ip] [--port port] [--redis-address address] [--redis-password password]``

This runs ``ray.init()`` with default options and exposes the client gRPC port at ``host_ip:port`` (by default, ``0.0.0.0:50051``). If ``redis-address`` and ``redis-password`` are provided, they are passed to ``ray.init()`` when the server starts, allowing it to connect to an existing Ray cluster, as per the `cluster setup <cluster/index.html>`_ instructions.

From here, another Ray script can access that server from a networked machine with ``ray.util.connect()``:

.. code-block:: python

    import ray
    import ray.util

    ray.util.connect("0.0.0.0:50051")  # replace with the appropriate host and port

    # Normal Ray code follows
    @ray.remote
    def f(x):
        return x ** x

    f.remote(2)
    # ...

When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if the client had directly disconnected from the cluster.


===================
``RAY_CLIENT_MODE``
===================

Because Ray client mode affects the behavior of the Ray API, larger scripts or libraries imported before ``ray.util.connect()`` may not realize they're in client mode. This limitation is tracked in `issue #13272 <https://github.com/ray-project/ray/issues/13272>`_, but a workaround is provided here for beta users.

One option is to defer the imports in a ``main`` script that calls ``ray.util.connect()`` first, as in the sketch below. However, some older scripts or libraries might not support that.
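
As a minimal sketch of that deferred-import pattern: the module ``my_ray_library`` and its ``run()`` entry point below are hypothetical placeholders for whatever Ray-using code your program imports.

.. code-block:: python

    # Deferred-import sketch: connect to the Ray client server first, then
    # import the code that uses the Ray API so it observes client mode.
    import ray
    import ray.util


    def main():
        # Hypothetical module that defines remote functions/actors.
        import my_ray_library
        my_ray_library.run()


    if __name__ == "__main__":
        ray.util.connect("0.0.0.0:50051")  # replace with your server's host and port
        main()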

Therefore, an environment variable is also available to force a Ray program into client mode: ``RAY_CLIENT_MODE``. An example usage:

.. code-block:: bash

    RAY_CLIENT_MODE=1 python my_ray_program.py

========================================
Programmatically creating the server
========================================

For larger use cases, it may be desirable to connect remote Ray clients to an existing Ray environment. The server can be started programmatically:

.. code-block:: python

    from ray.util.client.server import serve

    server = serve("0.0.0.0:50051")
    # Server does some work
    # ...
    # Time to clean up
    server.stop(0)

31 changes: 25 additions & 6 deletions doc/source/rllib-algorithms.rst
@@ -11,7 +11,7 @@ Available Algorithms - Overview
=================== ========== ======================= ================== =========== =============================================================
Algorithm Frameworks Discrete Actions Continuous Actions Multi-Agent Model Support
=================== ========== ======================= ================== =========== =============================================================
`A2C, A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`A2C, A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`ARS`_ tf + torch **Yes** **Yes** No
`BC`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_
`ES`_ tf + torch **Yes** **Yes** No
@@ -20,13 +20,14 @@ Algorithm Frameworks Discrete Actions Continuous Actions Multi-
`Dreamer`_ torch No **Yes** No `+RNN`_
`DQN`_, `Rainbow`_ tf + torch **Yes** `+parametric`_ No **Yes**
`APEX-DQN`_ tf + torch **Yes** `+parametric`_ No **Yes**
`IMPALA`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`IMPALA`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`MAML`_ tf + torch No **Yes** No
`MARWIL`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_
`MBMPO`_ torch No **Yes** No
`PG`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`PPO`_, `APPO`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Transformer`_, `+autoreg`_
`PG`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`PPO`_, `APPO`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_
`SAC`_ tf + torch **Yes** **Yes** **Yes**
`SlateQ`_ torch **Yes** No No
`LinUCB`_, `LinTS`_ torch **Yes** `+parametric`_ No **Yes**
`AlphaZero`_ torch **Yes** `+parametric`_ No No
=================== ========== ======================= ================== =========== =============================================================
@@ -60,9 +61,9 @@ Algorithm Frameworks Discrete Actions Continuous Acti
.. _`+LSTM auto-wrapping`: rllib-models.html#built-in-models
.. _`+parametric`: rllib-models.html#variable-length-parametric-action-spaces
.. _`Rainbow`: rllib-algorithms.html#dqn
.. _`+RNN`: rllib-models.html#recurrent-models
.. _`+RNN`: rllib-models.html#rnns
.. _`TD3`: rllib-algorithms.html#ddpg
.. _`+Transformer`: rllib-models.html#attention-networks
.. _`+Attention`: rllib-models.html#attention

High-throughput architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -523,6 +524,24 @@ Cheetah-Run 640 ~800
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__

.. _slateq:

SlateQ
-------
|pytorch|
`[paper] <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/9f91de1fa0ac351ecb12e4062a37afb896aa1463.pdf>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/slateq/slateq.py>`__

SlateQ is a model-free RL method that builds on top of DQN and generates recommendation slates for recommender-system environments. These environments come with large combinatorial action spaces; SlateQ mitigates this by decomposing the Q-value into single-item Q-values and solving the decomposed objective with a mix of integer programming and deep learning optimization. SlateQ can be evaluated on Google's RecSim `environment <https://github.com/google-research/recsim>`__. An RLlib wrapper for RecSim can be found `here <https://github.com/ray-project/ray/blob/master/rllib/env/wrappers/recsim_wrapper.py>`__.

RecSim environment wrapper: `Google RecSim <https://github.com/ray-project/ray/blob/master/rllib/env/wrappers/recsim_wrapper.py>`__
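
As a rough, hedged illustration, the sketch below launches the SlateQ trainer through Tune against a RecSim-based environment. The environment id ``"RecSim-v1"`` and the stopping criterion are assumptions made for illustration; check the RecSim wrapper module linked above for the actual registered name.

.. code-block:: python

    # Hypothetical sketch of training SlateQ on a RecSim-based environment.
    import ray
    from ray import tune

    # Importing the wrapper module is assumed to register the RecSim env.
    import ray.rllib.env.wrappers.recsim_wrapper  # noqa: F401

    ray.init()
    tune.run(
        "SlateQ",                  # RLlib's SlateQ trainer
        config={
            "env": "RecSim-v1",    # assumed env id registered by the wrapper
            "framework": "torch",  # SlateQ currently supports torch only
        },
        stop={"timesteps_total": 100000},
    )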

**SlateQ-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. literalinclude:: ../../rllib/agents/slateq/slateq.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__

Derivative-free
~~~~~~~~~~~~~~~
