[core] Enable object reconstruction for retryable actor tasks #9557
Can one of the admins verify this patch?
Test PASSed.
Test FAILed.
Objects
-------

Task outputs over a configurable threshold (default 100KB) may be stored in
Suggested change:
- Task outputs over a configurable threshold (default 100KB) may be stored in
+ Task outputs over a configurable threshold (default 100KB) will be stored in
When there are no copies of an object left, Ray also provides an option to
automatically recover the value by re-executing the task that created the
Suggested change:
- When there are no copies of an object left, Ray also provides an option to
- automatically recover the value by re-executing the task that created the
+ Instead of raising an ``UnreconstructableError``, Ray also provides an option to
+ automatically recover the value by re-executing the task that created the
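For readers following the doc wording under review, here is a minimal sketch of the idea of recovering a lost value by re-executing the task that created it. It is a toy model in plain Python, not Ray's implementation: `ToyObjectStore`, `submit`, `lose`, and the local `UnreconstructableError` class are hypothetical names made up for illustration, and the retry budget only loosely mirrors a per-task retry option.

```python
from typing import Any, Callable, Dict, List


class UnreconstructableError(Exception):
    """Toy stand-in: raised when a lost value cannot be recomputed."""


class ToyObjectStore:
    """Hypothetical single-process model of lineage-based recovery."""

    def __init__(self) -> None:
        # object id -> stored value (the only "copy" in this toy model)
        self._values: Dict[str, Any] = {}
        # object id -> [function, args, retries left]
        self._lineage: Dict[str, List[Any]] = {}

    def submit(self, object_id: str, func: Callable[..., Any],
               *args: Any, max_retries: int = 0) -> str:
        """Run the task once and remember how its output was produced."""
        self._lineage[object_id] = [func, args, max_retries]
        self._values[object_id] = func(*args)
        return object_id

    def lose(self, object_id: str) -> None:
        """Simulate losing the last copy of an object."""
        self._values.pop(object_id, None)

    def get(self, object_id: str) -> Any:
        """Return the value, re-executing the creating task if it was lost."""
        if object_id in self._values:
            return self._values[object_id]
        entry = self._lineage.get(object_id)
        if entry is None or entry[2] <= 0:
            # No lineage or no retries left: surface the error instead.
            raise UnreconstructableError(object_id)
        entry[2] -= 1
        func, args = entry[0], entry[1]
        self._values[object_id] = func(*args)
        return self._values[object_id]


if __name__ == "__main__":
    store = ToyObjectStore()
    ref = store.submit("obj-1", lambda n: list(range(n)), 3, max_retries=1)
    store.lose(ref)
    print(store.get(ref))  # value is recomputed from the recorded task
```

The sketch assumes the creating task is deterministic and side-effect free, so re-running it yields the same value. Once the retry budget is spent, a further loss raises the error rather than recomputing.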
Test FAILed.
Test FAILed.
* [Core] Enhance common client connection (ray-project#9367) * enhance client connection * add write buffer async * read message * add test * Bazel move more shell to native rules (ray-project#9314) Co-authored-by: Mehrdad <noreply@github.com> * [tune] Fix github readme (ray-project#9365) Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> * Combine different severities into the same log files (ray-project#9230) * Combine different severities into the same log files Co-authored-by: Mehrdad <noreply@github.com> * [core] Pass owner address from the workers to the raylet (ray-project#9299) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args * merge * Fix tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (ray-project#9063)" This reverts commit 275da2e. * Fix free * fix tests * Fix tests * build * build * fix * Change assertion to warning to fix java * [Core] Add placement group scheduler and some api in resource scheduler (ray-project#9039) * Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (ray-project#8984). * change the bundle id and delete unit count in bundle change vector<bundle_spec> to vector<shared_ptr<bundle_spec>> Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (ray-project#8984). 
change the bundle id and delete unit count in bundle remove CheckIfSchedulable() add comments and fix the bug in resource * fix placement group schedule * add placement group scheduler and change some api in resource scheduler * fix by the comments * fix conflict * fix lint * fix lint * fix bug in merge * fix lint Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> * [Core] New scheduler fixes (ray-project#9186) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * Fixed scheduling tests * . * . * [Core] put small objects in memory store (ray-project#8972) * remove the put in memory store * put small objects directly in memory store * cast data type * fix another place that uses Put to spill to plasma store * fix multiple tests related to memory limits * partially fix test_metrics * remove not functioning codes * fix core_worker_test * refactor put to plasma codes * add a flag for the new feature * add flag to more places * do a warmup round for the plasma store * lint * lint again * fix warmup store * Update _raylet.pyx Co-authored-by: Eric Liang <ekhliang@gmail.com> * [autoscaler] Move command runners into separate file and clean up interface. (ray-project#9340) * cleanup * wip * fix imports * fix lint * [docs][rllib] Recommended workflow for training, saving, and testing (ray-project#9319) * [autoscaler] Allow users to disable the cluster config cache (ray-project#8117) * [autoscaler] Remove autoscaler config cache. * [autoscaler] Add flag allowing users to explicitly disable the config cache. 
* Update hiredis and remove Windows patches (ray-project#9289) Co-authored-by: Mehrdad <noreply@github.com> * Fix flaky test_dynres.py (ray-project#9310) * Fix gcs_table_storage testcase bug (ray-project#9393) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [HOTFIX] Fix compile direct_actor_transport_test on mac (ray-project#9403) * Change Python's `ObjectID` to `ObjectRef` (ray-project#9353) * [Java] Improve JNI performance when submitting and executing tasks (ray-project#9032) * Remove the RAY_CHECK in Worker::Port() (ray-project#9348) * [RLlib] Issue ray-project#9366 (DQN w/o dueling produces invalid actions). (ray-project#9386) * Fix macos compliation bug (ray-project#9391) * Fix. * [Core] Plasma RAII support (ray-project#9370) * [Serve] Merge router with HTTPProxy (ray-project#9225) * Pass run args to DockerCommandRunner (ray-project#9411) * Fix copy to workspace (ray-project#9400) * [RLlib] Tf2.x native. (ray-project#8752) * Update conda and ray wheel on GCP images (ray-project#9388) * [Core] Simplify Raylet Client (ray-project#9420) * Masking error. With t*valid_mask, we get the error np.inf*0 = np.inf (ray-project#9407) * [RLLib] WindowStat bug fix (ray-project#9213) * WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue ray-project#7910. 
ray-project#7910 * [tune] handling nan values (ray-project#9381) * TRAVIS_PULL_REQUEST is false for non-PRs, not empty (ray-project#9439) Co-authored-by: Mehrdad <noreply@github.com> * [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (ray-project#9422) * [Tune] Trainable documentation fix (ray-project#9448) * Allow --lru-evict to be passed into `ray start` (ray-project#8959) * GCP authentication using oauth tokens (ray-project#9279) * Bazel selects compiler flags based on compiler (ray-project#9313) Co-authored-by: Mehrdad <noreply@github.com> * [Core] Build raylet client as an independent component (ray-project#9434) * [tune] sklearn comment out (ray-project#9454) * Add ability to specify SOCKS proxy for SSH connections (ray-project#8833) * [docs] Render ActorPool documentation, etc (ray-project#9433) * [tune] Put examples under proper version control (ray-project#9427) Co-authored-by: krfricke <krfricke@users.noreply.github.com> * Fix test-multi-node (ray-project#9453) * Machine View Sorting / Grouping (ray-project#9214) * Convert NodeInfo.tsx to a functional component * Update NodeRowGroup to be a functional component * lint * Convert TotalRow to functional component. * lint * move node info over to using the sortable table head component. spacing is still a little wonky. * Factor a NoewWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping * Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer * Add sort accessors for CPU * Add sort accessors for Disk * Add sort accessors for RAM * add a table sort util for function based accessors (rather than flat attribute-based accessor) * wip refactor node info features * wip * Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. 
Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic * wip * wip * wip * Finish adding sorting and grouping of machine view * lint * fix bug in filtration of logs and errors by worker from recent refactor. * Add export of Cluster Disk feature * fix some merge issues Co-authored-by: Max Fitton <max@semprehealth.com> * [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (ray-project#9269) * [RLlib] Issue 9402 MARWIL producing nan rewards. (ray-project#9429) * Fix gcs_pubsub_test bug(ray-project#9438) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * change error code name of boost timer (ray-project#9417) * [tune] PyTorch CIFAR10 example (ray-project#9338) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com> * Remove legacy C++ code (ray-project#9459) * Fix ObjectRef and ActorHandle serialization (ray-project#9462) * [Stats] metrics agent exporter (ray-project#9361) * [Core] Support GCS server port assignment. 
(ray-project#8962) * Add scripts symlink back (ray-project#9219) (ray-project#9475) (cherry picked from commit 77933c9) Co-authored-by: Simon Mo <xmo@berkeley.edu> * [tune] Issue 8821: ExperimentAnalysis doesn't expand user (ray-project#9461) * [docker] Include base-deps image in rayproject Docker Hub (ray-project#9458) * [Core] remove create_and_seal and create_and_seal_batch (ray-project#9457) * Speedups for GitHub Actions (ray-project#9343) Co-authored-by: Mehrdad <noreply@github.com> * Fix flaky test_object_manager.py (ray-project#9472) * [Java] fix redis-server binary path (ray-project#9398) * [core] Handle out-of-order actor table notifications (ray-project#9449) * Drop stale actor table notifications * build * Add num_restarts to disconnect handler * Unit test and increment num_restarts on ALIVE, not RESTARTING * Wait for pid to exit * Fix name clash on Windows (ray-project#9412) Co-authored-by: Mehrdad <noreply@github.com> * Add job configs to gcs (ray-project#9374) * Make pip install verbose (ray-project#9496) Co-authored-by: Mehrdad <noreply@github.com> * Make more tests compatible with Windows (ray-project#9303) * [tune] extend PTL template (GPU, typing fixes, tensorboard) (ray-project#9451) Co-authored-by: Kai Fricke <kai@anyscale.com> * [core] Replace task resubmission in raylet with ownership protocol (ray-project#9394) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args * merge * Fix tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (ray-project#9063)" This reverts commit 275da2e. 
* Fix free * Regression tests - shorten timeouts in reconstruction unit tests * Remove timeout for non-actor tasks * Modify tests using ray.internal.free * Clean up future resolution code * Raylet polls the owner * todo * comment * Update src/ray/core_worker/core_worker.cc Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> * Drop stale actor table notifications * Fix bug where actor restart hangs * Revert buggy code for duplicate tasks * build * Fix errors for lru_evict and internal.free * Revert "Drop stale actor table notifications" This reverts commit 193c5d2. * Revert "build" This reverts commit 5644edb. * Fix free test * Fixes for freed objects Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> * release gil in global state accessor (ray-project#9357) * [Java] Named java actor (ray-project#9037) * Fix clang-cl build (ray-project#9494) Co-authored-by: Mehrdad <noreply@github.com> * [GCS Actor Management] Gcs actor management broken detached actor (ray-project#9473) * [RLlib] Issue ray-project#9437 (PyTorch converts to CPU tensor, even if on GPU). (ray-project#9497) * Get rid of build shell scripts and move them to Python (ray-project#6082) * Fix broken test_raylet_info_endpoint (ray-project#9511) * Fix. 
(ray-project#9464) * [Autoscaler] Making bootstrap config part of the node provider interface (ray-project#9443) * supporting custom bootstrap config for external node providers * bootstrap config * renamed config to cluster_config * lint * remove 2 args from importer * complete move of bootstrap to node_provider * renamed provider_cls * move imports outside functions * lint * Update python/ray/autoscaler/node_provider.py Co-authored-by: Eric Liang <ekhliang@gmail.com> * final fixes * keeping lines to reduce diff * lint * lamba config * filling in -> adding for lint Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: Eric Liang <ekhliang@gmail.com> * Fix flaky test_actor_failures::test_actor_restart (ray-project#9509) * Fix flaky test * os exit * [rllib] MAML Transform (ray-project#9463) * MAML Transform * Moved Inner Adapt to Method in Execution Plan * Cleanup Plasma Store (hash utilities) (ray-project#9524) * [Serve] Improve buffering for simple cases (ray-project#9485) * [Serve] Use pickle instead of clouldpickle (ray-project#9479) * Fix pip and Bazel interaction messing up CI (ray-project#9506) Co-authored-by: Mehrdad <noreply@github.com> * [Core] Fix Java detached error (ray-project#9526) * fix java createActor NPE bug (ray-project#9532) * [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (ray-project#9516) * [Stats] Fix metric exporter test (ray-project#9376) * Hotfix Lint for Serve (ray-project#9535) * Windows cleanup (ray-project#9508) * Remove unneeded code for Windows * Get rid of usleep() * Make platform_shims includes non-transitive Co-authored-by: Mehrdad <noreply@github.com> * [RLlib] Issue 8384: QMIX doesn't learn anything. 
(ray-project#9527) * Add placement group manager and some code in core_worker (ray-project#9120) Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> * [core] Add flag to enable object reconstruction during ray start (ray-project#9488) * Add flag * doc * Fix tests * Pipelining task submission to workers (ray-project#9363) * first step of pipelining * pipelining tests & default configs - added pipelining unit tests in direct_task_transport_test.cc - added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in fligh to each worker - consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_ * post-review revisions * linting, following naming/style convention * linting * [New scheduler] Queueing refactor (ray-project#9491) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * . * . * . * . * . * . * . * cleanup * address reviews * address reviews * more refactor * :) * travis pls * . * travis pls * . * [Serve] Add internal instruction for running benchmarks (ray-project#9531) * MADDPG learning confirmation test. 
(ray-project#9538) * Fix Bazel in Docker (ray-project#9530) Co-authored-by: Mehrdad <noreply@github.com> * Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (ray-project#9539) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [tune] Unflattened lookup for ProgressReporter (ray-project#9525) Co-authored-by: Kai Fricke <kai@anyscale.com> * Add plasma store benchmark for small objects (ray-project#9549) * [Tune] Copy default_columns in new ProgressReporter instances (ray-project#9537) * quickfix (ray-project#9552) * [tune] pin tune-sklearn (ray-project#9498) * [cli] ray memory: added redis_password (ray-project#9492) * [GCS]Fix lease worker leak bug when gcs server restarts (ray-project#9315) * add part code * fix compile bug * fix review comments * fix review comments * fix review comments * fix review comments * fix review comment * fix ut bug * fix lint error * fix review comment * fix review comments * add testcase * add testcase * fix bug * fix review comments * fix review comment * fix review comment * refine comments Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> * [tune] fix pbt checkpoint_freq (ray-project#9517) * Only delete old checkpoint if it is not the same as the new one * Return early if old checkpoint value coincides with new checkpoint value Co-authored-by: Kai Fricke <kai@anyscale.com> * [Core] Remove socket pair exchange in Plasma Store (ray-project#9565) * try use boost::asio for notification processing * [Metric] new cython interface for python worker metric (ray-project#9469) * Bazel fixes (ray-project#9519) * GCS client add fetch operation before subscribe (ray-project#9564) * [RLlib] Fix combination of lockstep and multiple agnts controlled by the same policy. (ray-project#9521) * Change aggregation when lockstep is activated. 
Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy. fix ray-project#9295 * Line too long. * [Core] Replace the Plasma eventloop with boost::asio (ray-project#9431) * Fix Java named actor bug (ray-project#9580) * Fix setup.py bug (ray-project#9581) Co-authored-by: Mehrdad <noreply@github.com> * [Serve] Serialize Query object directly (ray-project#9490) * Add dashboard dependencies to default ray installation (ray-project#9447) * Dashboard next-version API support in backend (ray-project#9345) * Fix log losses (ray-project#9559) * Close log on shutdown * Disable log buffering Co-authored-by: Mehrdad <noreply@github.com> * [docker] run Ubuntu 20.04 as base image (ray-project#9556) * Add PTL to README.rst (ray-project#9594) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Skip uneeded steps on CI (ray-project#9582) Co-authored-by: Mehrdad <noreply@github.com> * Fix Windows CI (ray-project#9588) Co-authored-by: Mehrdad <noreply@github.com> * [serve] Rename to `Controller` (ray-project#9566) * Handle warnings in core (ray-project#9575) * [New scheduler] Fix new scheduler bug (ray-project#9467) * fix new scheduler bug * add testcase for soft resource allocation * modify RemoveNode * Ensure unique log file names across same-node raylets. (ray-project#9561) * fix tag key typo (ray-project#9606) * Rename path variable due to zsh conflict (ray-project#9610) * [doc] [minor] Make API docs easier to find. (ray-project#9604) * Issue 9568: `rllib train` framework in config gets overridden with tf. 
(ray-project#9572) * Use UTF-8 for encoding of python code for collision hashing (ray-project#9586) Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de> Co-authored-by: simon-mo <simon.mo@hey.com> * Add bazel to the PATH in setup.py (ray-project#9590) Co-authored-by: Mehrdad <noreply@github.com> * Fix Lint in setup.py (ray-project#9618) Co-authored-by: Mehrdad <noreply@github.com> * Shellcheck comments (ray-project#9595) * [Serve] Document Metric Infrastructure (ray-project#9389) * [CI] Do not run jenkins test on GHA (ray-project#9621) * Support ray task type checking (ray-project#9574) * [Metrics] Java metric API (ray-project#9377) * [GCS] fix the fault tolerance about gcs node manager (ray-project#9380) * Shellcheck quoting (ray-project#9596) * Fix SC2006: Use $(...) notation instead of legacy backticked `...`. * Fix SC2016: Expressions don't expand in single quotes, use double quotes for that. * Fix SC2046: Quote this to prevent word splitting. * Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching. * Fix SC2068: Double quote array expansions to avoid re-splitting elements. * Fix SC2086: Double quote to prevent globbing and word splitting. * Fix SC2102: Ranges can only match single chars (mentioned due to duplicates). * Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"? * Fix SC2145: Argument mixes string and array. Use * or separate argument. * Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string). Co-authored-by: Mehrdad <noreply@github.com> * Fix bug in Bazel version check (ray-project#9626) Co-authored-by: Mehrdad <noreply@github.com> * [Java] Avoid data copy from C++ to Java for ByteBuffer type (ray-project#9033) * Revert "Dashboard next-version API support in backend (ray-project#9345)" (ray-project#9639) This reverts commit fca1fb1. 
* [Autoscaler] Command Line Interface improvements (ray-project#9322) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [Core] GCS Actor management on by default. (ray-project#8845) * GCS Actor management on by default. * Fix travis config. * Change condition. * Remove unnecessary CI. * [Core] Fix concurrency issues in plasma store runner (ray-project#9642) * fix window jni unhappy compiler (ray-project#9635) * Fix TestObjectTableResubscribe testcase bug (ray-project#9650) * fix named actor single process mode bug (ray-project#9652) * [core] Fix Ray service startup when logging redirection is disabled. (ray-project#9547) * Fix TorchDeterministic (ray-project#9241) * [RaySGD] revised existing transformer example to work with transformers>=3.0 (ray-project#9661) Co-authored-by: Kai Fricke <kai@anyscale.com> * [rllib] Fix torch TD error, IMPALA LR updates (ray-project#9477) * update * add test * lint * fix super call * speed es test up * Auto-cancel build when a new commit is pushed (ray-project#8043) Co-authored-by: Mehrdad <noreply@github.com> * Fix lint in remote-watch.py (ray-project#9668) * [Core] Remove unnecessary windows syscall in plasma store (ray-project#9602) * Remove unused windows shims (ray-project#9583) * Temporarily disable remote watcher (ray-project#9669) * Drop support for Python 3.5. (ray-project#9622) * Drop support for Python 3.5. * Update setup.py * [Core] WorkerInterface refactor (ray-project#9655) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * . * . * . * Fixed tests * Fixed tests * . * [core] Enable object reconstruction for retryable actor tasks (ray-project#9557) * Test actor plasma reconstruction * Allow resubmission of actor tasks * doc * Test for actor constructor * Kill PID before removing node * Kill pid before node * fix java coreworker crash (ray-project#9674) * use help proto-init-macro for streaming config (ray-project#9272) * Update release information from 0.8.6. 
(ray-project#9124) * [BRING BACK TO MASTER] Update release information. * [MERGE TO MASTER] Add microbenchmark result. * Update asan tests to the doc. * Refinements to the Serve documentation (ray-project#9587) Co-authored-by: Dean Wampler <dean@concurrentthought.com> * [tune] survey (ray-project#9670) * Fix ERROR logging not being printed to standard error (ray-project#9633) Co-authored-by: Mehrdad <noreply@github.com> * [Tune Docs] Logging doc fix (ray-project#9691) * [rllib] Type annotations for model classes (ray-project#9646) * [Serve] Allow multiple HTTP servers. (ray-project#9523) * Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. (ray-project#9681) * [Serve] Fix Formatting, stale docs (ray-project#9617) * fixed simplex initialisation seeding bug (ray-project#9660) Co-authored-by: Petros Christodoulou <petrochr@amazon.com> * Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (ray-project#9697) Co-authored-by: Mehrdad <noreply@github.com> * Add Ray Serve to README.rst (ray-project#9688) * Shellcheck rewrites (ray-project#9597) * Fix SC2001: See if you can use ${variable//search/replace} instead. * Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames. * Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames. * Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true. * Fix SC2028: echo may not expand escape sequences. Use printf. * Fix SC2034: variable appears unused. Verify use (or export if used externally). * Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options. * Fix SC2071: > is for string comparisons. Use -gt instead. * Fix SC2154: variable is referenced but not assigned * Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. * Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op). * Fix SC2236: Use -n instead of ! -z. 
* Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr. * Fix SC2086: Double quote to prevent globbing and word splitting. Co-authored-by: Mehrdad <noreply@github.com> * [Autoscaler] CLI Logger docs (ray-project#9690) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update rllib-algorithms.rst (ray-project#9640) * [tune] move jenkins tests to travis (ray-project#9609) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com> * [RLlib] Implement DQN PyTorch distributional head. (ray-project#9589) * Add placement group java api (ray-project#9611) * add part code * add part code * add part code * fix code style * fix review comment * fix review comment * add part code * add part code * add part code * add part code * fix review comment * fix review comment * fix code style * fix review comment * fix lint error * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [Stats] Improve Stats::Init & Add it to GCS server (ray-project#9563) * [Core] Try remove all windows compat shims (ray-project#9671) * try remove compat for arrow * remove unistd.h * remove socket compat * delete arrow windows patch * Fix a few flaky tests (ray-project#9709) Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency * [GCS]Open test_gcs_fault_tolerance testcase (ray-project#9677) * enable test_gcs_fault_tolerance * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [Tests]lock vector to avoid potential flaky test (ray-project#9656) * [tune] distributed torch wrapper (ray-project#9550) * changes * add-working * checkpoint * ccleanu * fix * ok * formatting * ok * tests * some-good-stuff * fix-torch * ddp-torch * torch-test * sessions * add-small-test * fix * remove * gpu-working * update-tests * ok * try-test * formgat * ok * ok * [GCS] Fix actor task hang when its owner 
exits before local dependencies resolved (ray-project#8045) * Only update raylet map when autoscaler configured (ray-project#9435) * [Dashboard] New dashboard skeleton (ray-project#9099) * Fixing multiple building issues * Make wait_for_condition raise exception when timing out. (ray-project#9710) * [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (ray-project#9718) * Package and upload ray cross-platform jar (ray-project#9540) * Revert "Package and upload ray cross-platform jar (ray-project#9540)" (ray-project#9730) This reverts commit 8810325. * Only build docker wheels in LINUX_WHEELS env (ray-project#9729) * Keep build-autoscaler-images.sh alive in CI (ray-project#9720) * [core] Removes Error when Internal Config is not set (ray-project#9700) * [Cluster Launcher] Re Org the cluster launcher pages. (ray-project#9687) * [RLlib] Offline Type Annotations (ray-project#9676) * Offline Annotations * Modifications * Fixed circular dependencies * Linter fix * Python api of placement group (ray-project#9243) * Include open-ssh-client for transparency (ray-project#9693) * Fix remote-watch.py (ray-project#9625) Co-authored-by: Mehrdad <noreply@github.com> * [docker] Uses Latest Conda & Py 3.7 (ray-project#9732) * Fix broken actor failure tests. 
(ray-project#9737) * [Stats] fix stats shutdown crash if opencensus exporter not initialized (ray-project#9727) * Fix package and upload ray jar (ray-project#9742) * Introduce file_mounts_sync_continuously cluster option (ray-project#9544) * Separate out file_mounts contents hashing into its own separate hash Add an option to continuously sync file_mounts from head node to worker nodes: monitor.py will re-sync file mounts whenver contents change but will only run setup_commands if the config also changes * add test and default value for file_mounts_sync_continuously * format code * Update comments * Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick Fixed so setup commands run when ray up is run and file_mounts content changes * Refactor so that runtime_hash retains previous behavior runtime_hash is almost identical as before this PR. It is used to determine if setup_commands need to run file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur. Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization * fix issue with hashing a hash * fix bug where trying to set contents hash when it wasn't generated * Fix lint error Fix bug in command_runner where check_output was no longer returning the output of the command * clear out provider between tests to get rid of flakyness * reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call * [dist] swap mac/linux wheel build order (ray-project#9746) * [RLlib] Enhance reward clipping test; add action_clipping tests. (ray-project#9684) * [RLlib] Issue 9667 DDPG Torch bugs and enhancements. 
(ray-project#9680) * [Metrics]Ray java worker metric registry (ray-project#9636) * ray worker metrics gauge init * ray java metric mapping * add jni source files for gauge and tagkey * mapping all metric classes to stats object * check non-null for tags and name * lint * add symbol for native metric JNI * extern c for symbol * add tests for all metrics * Update Metric.java use metricNativePointer instead. * unify metric native stuff to one class * fix jni file * add comments for metric transform function in jni utils * move metric function to native metric file * remove unused disconnect jni * Add a metric registry for java metircs * Restore install-bazel.sh * Add some comments for metric registry * Fix thread safe problem of metrics * Fix metric tests and remove sleep code from tests * Fix comments of metrics Co-authored-by: lingxuan.zlx <skyzlxuan@gmail.com> * fix windows compile bug (ray-project#9741) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * Run _with_interactive in Docker (ray-project#9747) * [New scheduler] First unit test for task manager (ray-project#9696) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * bad git >:-( * small clean up * CR * . * . * One more fixture * One more fixture * . * . * bazel-format * . * [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (ray-project#9607) * [Release] Fix release tests (ray-project#9733) * Register function race (ray-project#9346) * Revert "[dist] swap mac/linux wheel build order (ray-project#9746)" and "Fix package and upload ray jar (ray-project#9742)" (ray-project#9758) * Revert "[dist] swap mac/linux wheel build order (ray-project#9746)" This reverts commit a934056. * Revert "Fix package and upload ray jar (ray-project#9742)" This reverts commit c290c30. 
* Fix some Windows CI issues (ray-project#9708) Co-authored-by: Mehrdad <noreply@github.com> * Pin pytest version (ray-project#9767) * [Java] Use test groups to filter tests of different run modes (ray-project#9703) * [Java] Fix MetricTest.java due to incomplete changes from ray-project#9703 (ray-project#9770) * Fix leased worker leak bug if lease worker requests that are still waiting to be scheduled when GCS restarts (ray-project#9719) * [Stats] enable core worker stats (ray-project#9355) * [GCS]Use a separate thread in node failure detector to handle heartbeat (ray-project#9416) * use a sole thread to handle heartbeat * separate signal thread * use work to avoid exiting when task is underway * protect shared data structure to avoid deadlock * add comments * decrease io service num * minor changes * fix test * per stephanie's comments * use single io service instead of 1-size io service pool * typo * [GCS Actor Management] Fix flaky test_dead_actors. (ray-project#9715) * Fix. * Add logs. * Add an unit test. * [TUNE] Tune Docs re-organization (ray-project#9600) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [RLlib] Trajectory View API (preparatory cleanup and enhancements). (ray-project#9678) * [Core] Socket creation race condition bug fixes (ray-project#9764) * fix issues * hot fixes * test * test * Always info log * Fixed stderr logging (9765) * [Core] Custom socket name (ray-project#9766) * fix issues * hot fixes * test * test * socket name change only * Fix src/ray/core_worker/common.h deleted constructor (ray-project#9785) Co-authored-by: Mehrdad <noreply@github.com> * [Stats] Fix harvestor threads + Fix flaky stats shutdown. 
(ray-project#9745) * More fixes * Applying latest changes in travis.yml * Fixing fixture data exclusions * Disable some java tests * Fix some CI errors * Update hash * Fixing more build issues * Fixing more build issues * Fix pipeline cache path * More fixes * Fix bazel test command * Fix bazel test * Fix general info steps * Custom env var for docker build * Trying a different way to install bazel * Bazel fix * Updating hash Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com> Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com> Co-authored-by: Mehrdad <noreply@github.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Alisa <wuminyan0607@gmail.com> Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@vip.qq.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Stefan Schneider <stefan.schneider@upb.de> Co-authored-by: Patrick Ames <pdames@amazon.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> Co-authored-by: fangfengbin <869218239a@zju.edu.cn> Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> Co-authored-by: Tao Wang <dooku.wt@antfin.com> Co-authored-by: Kai Yang <kfstorm@outlook.com> Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Ian Rodney <ian.rodney@gmail.com> Co-authored-by: Henk Tillman <henktillman@gmail.com> Co-authored-by: Tanay Wakhare <twakhare@gmail.com> Co-authored-by: Nicolaus93 <nicolo.campolongo@unimi.it> Co-authored-by: Vasily Litvinov <45396231+vnlitvinov@users.noreply.github.com> Co-authored-by: krfricke <krfricke@users.noreply.github.com> Co-authored-by: Max Fitton <maxfitton@gmail.com> Co-authored-by: Max Fitton <max@semprehealth.com> Co-authored-by: kisuke95 <2522134184@qq.com> 
Co-authored-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Simon Mo <xmo@berkeley.edu> Co-authored-by: Michael Mui <68102089+heyitsmui@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: Michael Luo <michael.luo123456789@gmail.com> Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu> Co-authored-by: Tom <veniat.tom@gmail.com> Co-authored-by: jerrylee.io <JerryDeKo@gmail.com> Co-authored-by: Raphael Avalos <raphael@avalos.fr> Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Arne Sachtler <arne.sachtler@gmail.com> Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: ZhuSenlin <wumuzi520@126.com> Co-authored-by: Max Fitton <mfitton@berkeley.edu> Co-authored-by: Maksim Smolin <maximsmol@gmail.com> Co-authored-by: Dean Wampler <dean@polyglotprogramming.com> Co-authored-by: Dean Wampler <dean@concurrentthought.com> Co-authored-by: Bill Chambers <bill@anyscale.com> Co-authored-by: Petros Christodoulou <p.christodoulou2@gmail.com> Co-authored-by: Petros Christodoulou <petrochr@amazon.com> Co-authored-by: Justin Terry <justinkterry@gmail.com> Co-authored-by: Tao Wang <wangtaothetonic@163.com> Co-authored-by: fyrestone <fyrestone@outlook.com> Co-authored-by: Alan Guo <aguo@anyscale.com> Co-authored-by: bermaker <495571751@qq.com>
* Set up CI with Azure Pipelines Specifically, we are setting a travis like ADO pipeline following what is already present in the .travis.yml file in the root of the repo. * Separating travis like pipeline from main pipeline * Adding Jenkins jobs equivalent * Making some improvements * Adding validation of the upstream CI * Disabling Tune and large memory tests * Changing threshold for simple reservoir sampling test * Addressing comments * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with more travis updates * Updating CI with new cpp worker tests * Setting code owners * Fixing the version number generation * Making main pipeline also our release pipeline * Updating Azure Pipelines with travis updates * Fixing wheels test * Fixing codeowners * Updating Azure Pipelines with travis updates * Bumping up MACOSX_DEPLOYMENT_TARGET * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Updating Azure Pipelines with travis updates * Disabling Serve tests * Making explicit which branches GitHubActions workflows should watch * Disabling Ray serve tests * Installing numpy explicitly * consolidating Ray test steps in one yml * Syncing with upstream master 2020-07-30 (#21) * [Core] Enhance common client connection (#9367) * enhance client connection * add write buffer async * read message * add test * Bazel move more shell to native rules (#9314) Co-authored-by: Mehrdad <noreply@github.com> * [tune] Fix github readme (#9365) Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> * Combine different severities into the same log files (#9230) * Combine different severities into the same log files Co-authored-by: Mehrdad <noreply@github.com> * [core] Pass owner address from the workers to the raylet (#9299) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args
* merge * Fix tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (#9063)" This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1. * Fix free * fix tests * Fix tests * build * build * fix * Change assertion to warning to fix java * [Core] Add placement group scheduler and some api in resource scheduler (#9039) * Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (#8984). * change the bundle id and delete unit count in bundle change vector<bundle_spec> to vector<shared_ptr<bundle_spec>> Add placement group scheduler and some api of resource scheduler. Merge fix cv hang in multithread variables race (#8984). change the bundle id and delete unit count in bundle remove CheckIfSchedulable() add comments and fix the bug in resource * fix placement group schedule * add placement group scheduler and change some api in resource scheduler * fix by the comments * fix conflict * fix lint * fix lint * fix bug in merge * fix lint Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> * [Core] New scheduler fixes (#9186) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * Fixed scheduling tests * . * . 
* [Core] put small objects in memory store (#8972) * remove the put in memory store * put small objects directly in memory store * cast data type * fix another place that uses Put to spill to plasma store * fix multiple tests related to memory limits * partially fix test_metrics * remove not functioning codes * fix core_worker_test * refactor put to plasma codes * add a flag for the new feature * add flag to more places * do a warmup round for the plasma store * lint * lint again * fix warmup store * Update _raylet.pyx Co-authored-by: Eric Liang <ekhliang@gmail.com> * [autoscaler] Move command runners into separate file and clean up interface. (#9340) * cleanup * wip * fix imports * fix lint * [docs][rllib] Recommended workflow for training, saving, and testing (#9319) * [autoscaler] Allow users to disable the cluster config cache (#8117) * [autoscaler] Remove autoscaler config cache. * [autoscaler] Add flag allowing users to explicitly disable the config cache. * Update hiredis and remove Windows patches (#9289) Co-authored-by: Mehrdad <noreply@github.com> * Fix flaky test_dynres.py (#9310) * Fix gcs_table_storage testcase bug (#9393) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [HOTFIX] Fix compile direct_actor_transport_test on mac (#9403) * Change Python's `ObjectID` to `ObjectRef` (#9353) * [Java] Improve JNI performance when submitting and executing tasks (#9032) * Remove the RAY_CHECK in Worker::Port() (#9348) * [RLlib] Issue #9366 (DQN w/o dueling produces invalid actions). (#9386) * Fix macos compliation bug (#9391) * Fix. * [Core] Plasma RAII support (#9370) * [Serve] Merge router with HTTPProxy (#9225) * Pass run args to DockerCommandRunner (#9411) * Fix copy to workspace (#9400) * [RLlib] Tf2.x native. (#8752) * Update conda and ray wheel on GCP images (#9388) * [Core] Simplify Raylet Client (#9420) * Masking error. 
With t*valid_mask, we get the error np.inf*0 = np.inf (#9407) * [RLLib] WindowStat bug fix (#9213) * WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue #7910. https://github.com/ray-project/ray/issues/7910 * [tune] handling nan values (#9381) * TRAVIS_PULL_REQUEST is false for non-PRs, not empty (#9439) Co-authored-by: Mehrdad <noreply@github.com> * [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (#9422) * [Tune] Trainable documentation fix (#9448) * Allow --lru-evict to be passed into `ray start` (#8959) * GCP authentication using oauth tokens (#9279) * Bazel selects compiler flags based on compiler (#9313) Co-authored-by: Mehrdad <noreply@github.com> * [Core] Build raylet client as an independent component (#9434) * [tune] sklearn comment out (#9454) * Add ability to specify SOCKS proxy for SSH connections (#8833) * [docs] Render ActorPool documentation, etc (#9433) * [tune] Put examples under proper version control (#9427) Co-authored-by: krfricke <krfricke@users.noreply.github.com> * Fix test-multi-node (#9453) * Machine View Sorting / Grouping (#9214) * Convert NodeInfo.tsx to a functional component * Update NodeRowGroup to be a functional component * lint * Convert TotalRow to functional component. * lint * move node info over to using the sortable table head component. spacing is still a little wonky. * Factor a NoewWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping * Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer * Add sort accessors for CPU * Add sort accessors for Disk * Add sort accessors for RAM * add a table sort util for function based accessors (rather than flat attribute-based accessor) * wip refactor node info features * wip * Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. 
Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic * wip * wip * wip * Finish adding sorting and grouping of machine view * lint * fix bug in filtration of logs and errors by worker from recent refactor. * Add export of Cluster Disk feature * fix some merge issues Co-authored-by: Max Fitton <max@semprehealth.com> * [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (#9269) * [RLlib] Issue 9402 MARWIL producing nan rewards. (#9429) * Fix gcs_pubsub_test bug(#9438) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * change error code name of boost timer (#9417) * [tune] PyTorch CIFAR10 example (#9338) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com> * Remove legacy C++ code (#9459) * Fix ObjectRef and ActorHandle serialization (#9462) * [Stats] metrics agent exporter (#9361) * [Core] Support GCS server port assignment. (#8962) * Add scripts symlink back (#9219) (#9475) (cherry picked from commit 77933c922d5136c5c2e2f0ac2edb4da67111d690) Co-authored-by: Simon Mo <xmo@berkeley.edu> * [tune] Issue 8821: ExperimentAnalysis doesn't expand user (#9461) * [docker] Include base-deps image in rayproject Docker Hub (#9458) * [Core] remove create_and_seal and create_and_seal_batch (#9457) * Speedups for GitHub Actions (#9343) Co-authored-by: Mehrdad <noreply@github.com> * Fix flaky test_object_manager.py (#9472) * [Java] fix redis-server binary path (#9398) * [core] Handle out-of-order actor table notifications (#9449) * Drop stale actor table notifications * build * Add num_restarts to disconnect handler * Unit test and increment num_restarts on ALIVE, not RESTARTING * Wait for pid to exit * Fix name clash on Windows (#9412) Co-authored-by: Mehrdad <noreply@github.com> * Add job configs to gcs (#9374) * Make pip install verbose (#9496) Co-authored-by: Mehrdad <noreply@github.com> * Make more tests compatible with Windows (#9303) * [tune] extend PTL template 
(GPU, typing fixes, tensorboard) (#9451) Co-authored-by: Kai Fricke <kai@anyscale.com> * [core] Replace task resubmission in raylet with ownership protocol (#9394) * Add intended worker ID to GetObjectStatus, tests * Remove TaskID owner_id * lint * Add owner address to task args * Make TaskArg a virtual class, remove multi args * Set owner address for task args * merge * Fix tests * Add ObjectRefs to task dependency manager, pass from task spec args * tmp * tmp * Fix * Add ownership info for task arguments * Convert WaitForDirectActorCallArgs * lint * build * update * build * java * Move code * build * Revert "Fix Google log directory again (#9063)" This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1. * Fix free * Regression tests - shorten timeouts in reconstruction unit tests * Remove timeout for non-actor tasks * Modify tests using ray.internal.free * Clean up future resolution code * Raylet polls the owner * todo * comment * Update src/ray/core_worker/core_worker.cc Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> * Drop stale actor table notifications * Fix bug where actor restart hangs * Revert buggy code for duplicate tasks * build * Fix errors for lru_evict and internal.free * Revert "Drop stale actor table notifications" This reverts commit 193c5d20e5577befd43f166e16c972e2f9247c91. * Revert "build" This reverts commit 5644edbac906ff6ef98feb40b6f62c9e63698c29. * Fix free test * Fixes for freed objects Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> * release gil in global state accessor (#9357) * [Java] Named java actor (#9037) * Fix clang-cl build (#9494) Co-authored-by: Mehrdad <noreply@github.com> * [GCS Actor Management] Gcs actor management broken detached actor (#9473) * [RLlib] Issue #9437 (PyTorch converts to CPU tensor, even if on GPU). (#9497) * Get rid of build shell scripts and move them to Python (#6082) * Fix broken test_raylet_info_endpoint (#9511) * Fix. 
(#9464) * [Autoscaler] Making bootstrap config part of the node provider interface (#9443) * supporting custom bootstrap config for external node providers * bootstrap config * renamed config to cluster_config * lint * remove 2 args from importer * complete move of bootstrap to node_provider * renamed provider_cls * move imports outside functions * lint * Update python/ray/autoscaler/node_provider.py Co-authored-by: Eric Liang <ekhliang@gmail.com> * final fixes * keeping lines to reduce diff * lint * lamba config * filling in -> adding for lint Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: Eric Liang <ekhliang@gmail.com> * Fix flaky test_actor_failures::test_actor_restart (#9509) * Fix flaky test * os exit * [rllib] MAML Transform (#9463) * MAML Transform * Moved Inner Adapt to Method in Execution Plan * Cleanup Plasma Store (hash utilities) (#9524) * [Serve] Improve buffering for simple cases (#9485) * [Serve] Use pickle instead of clouldpickle (#9479) * Fix pip and Bazel interaction messing up CI (#9506) Co-authored-by: Mehrdad <noreply@github.com> * [Core] Fix Java detached error (#9526) * fix java createActor NPE bug (#9532) * [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (#9516) * [Stats] Fix metric exporter test (#9376) * Hotfix Lint for Serve (#9535) * Windows cleanup (#9508) * Remove unneeded code for Windows * Get rid of usleep() * Make platform_shims includes non-transitive Co-authored-by: Mehrdad <noreply@github.com> * [RLlib] Issue 8384: QMIX doesn't learn anything. 
(#9527) * Add placement group manager and some code in core_worker (#9120) Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> * [core] Add flag to enable object reconstruction during ray start (#9488) * Add flag * doc * Fix tests * Pipelining task submission to workers (#9363) * first step of pipelining * pipelining tests & default configs - added pipelining unit tests in direct_task_transport_test.cc - added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in fligh to each worker - consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_ * post-review revisions * linting, following naming/style convention * linting * [New scheduler] Queueing refactor (#9491) * . * test_args passes * . * test_basic.py::test_many_fractional_resources causes ray to hang * test_basic.py::test_many_fractional_resources causes ray to hang * . * . * useful * test_many_fractional_resources fails instead of hanging now :) * Passes test_fractional_resources * . * . * Some cleanup * git is hard * cleanup * . * . * . * . * . * . * . * cleanup * address reviews * address reviews * more refactor * :) * travis pls * . * travis pls * . * [Serve] Add internal instruction for running benchmarks (#9531) * MADDPG learning confirmation test. 
(#9538) * Fix Bazel in Docker (#9530) Co-authored-by: Mehrdad <noreply@github.com> * Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (#9539) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [tune] Unflattened lookup for ProgressReporter (#9525) Co-authored-by: Kai Fricke <kai@anyscale.com> * Add plasma store benchmark for small objects (#9549) * [Tune] Copy default_columns in new ProgressReporter instances (#9537) * quickfix (#9552) * [tune] pin tune-sklearn (#9498) * [cli] ray memory: added redis_password (#9492) * [GCS]Fix lease worker leak bug when gcs server restarts (#9315) * add part code * fix compile bug * fix review comments * fix review comments * fix review comments * fix review comments * fix review comment * fix ut bug * fix lint error * fix review comment * fix review comments * add testcase * add testcase * fix bug * fix review comments * fix review comment * fix review comment * refine comments Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> * [tune] fix pbt checkpoint_freq (#9517) * Only delete old checkpoint if it is not the same as the new one * Return early if old checkpoint value coincides with new checkpoint value Co-authored-by: Kai Fricke <kai@anyscale.com> * [Core] Remove socket pair exchange in Plasma Store (#9565) * try use boost::asio for notification processing * [Metric] new cython interface for python worker metric (#9469) * Bazel fixes (#9519) * GCS client add fetch operation before subscribe (#9564) * [RLlib] Fix combination of lockstep and multiple agnts controlled by the same policy. (#9521) * Change aggregation when lockstep is activated. Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy. fix ray-project/ray#9295 * Line too long. 
* [Core] Replace the Plasma eventloop with boost::asio (#9431) * Fix Java named actor bug (#9580) * Fix setup.py bug (#9581) Co-authored-by: Mehrdad <noreply@github.com> * [Serve] Serialize Query object directly (#9490) * Add dashboard dependencies to default ray installation (#9447) * Dashboard next-version API support in backend (#9345) * Fix log losses (#9559) * Close log on shutdown * Disable log buffering Co-authored-by: Mehrdad <noreply@github.com> * [docker] run Ubuntu 20.04 as base image (#9556) * Add PTL to README.rst (#9594) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Skip uneeded steps on CI (#9582) Co-authored-by: Mehrdad <noreply@github.com> * Fix Windows CI (#9588) Co-authored-by: Mehrdad <noreply@github.com> * [serve] Rename to `Controller` (#9566) * Handle warnings in core (#9575) * [New scheduler] Fix new scheduler bug (#9467) * fix new scheduler bug * add testcase for soft resource allocation * modify RemoveNode * Ensure unique log file names across same-node raylets. (#9561) * fix tag key typo (#9606) * Rename path variable due to zsh conflict (#9610) * [doc] [minor] Make API docs easier to find. (#9604) * Issue 9568: `rllib train` framework in config gets overridden with tf. (#9572) * Use UTF-8 for encoding of python code for collision hashing (#9586) Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de> Co-authored-by: simon-mo <simon.mo@hey.com> * Add bazel to the PATH in setup.py (#9590) Co-authored-by: Mehrdad <noreply@github.com> * Fix Lint in setup.py (#9618) Co-authored-by: Mehrdad <noreply@github.com> * Shellcheck comments (#9595) * [Serve] Document Metric Infrastructure (#9389) * [CI] Do not run jenkins test on GHA (#9621) * Support ray task type checking (#9574) * [Metrics] Java metric API (#9377) * [GCS] fix the fault tolerance about gcs node manager (#9380) * Shellcheck quoting (#9596) * Fix SC2006: Use $(...) notation instead of legacy backticked `...`. 
* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that. * Fix SC2046: Quote this to prevent word splitting. * Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching. * Fix SC2068: Double quote array expansions to avoid re-splitting elements. * Fix SC2086: Double quote to prevent globbing and word splitting. * Fix SC2102: Ranges can only match single chars (mentioned due to duplicates). * Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"? * Fix SC2145: Argument mixes string and array. Use * or separate argument. * Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string). Co-authored-by: Mehrdad <noreply@github.com> * Fix bug in Bazel version check (#9626) Co-authored-by: Mehrdad <noreply@github.com> * [Java] Avoid data copy from C++ to Java for ByteBuffer type (#9033) * Revert "Dashboard next-version API support in backend (#9345)" (#9639) This reverts commit fca1fb18f366ebff6016978cb6440dd1ed8637fe. * [Autoscaler] Command Line Interface improvements (#9322) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [Core] GCS Actor management on by default. (#8845) * GCS Actor management on by default. * Fix travis config. * Change condition. * Remove unnecessary CI. * [Core] Fix concurrency issues in plasma store runner (#9642) * fix window jni unhappy compiler (#9635) * Fix TestObjectTableResubscribe testcase bug (#9650) * fix named actor single process mode bug (#9652) * [core] Fix Ray service startup when logging redirection is disabled. 
(#9547) * Fix TorchDeterministic (#9241) * [RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661) Co-authored-by: Kai Fricke <kai@anyscale.com> * [rllib] Fix torch TD error, IMPALA LR updates (#9477) * update * add test * lint * fix super call * speed es test up * Auto-cancel build when a new commit is pushed (#8043) Co-authored-by: Mehrdad <noreply@github.com> * Fix lint in remote-watch.py (#9668) * [Core] Remove unnecessary windows syscall in plasma store (#9602) * Remove unused windows shims (#9583) * Temporarily disable remote watcher (#9669) * Drop support for Python 3.5. (#9622) * Drop support for Python 3.5. * Update setup.py * [Core] WorkerInterface refactor (#9655) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * . * . * . * Fixed tests * Fixed tests * . * [core] Enable object reconstruction for retryable actor tasks (#9557) * Test actor plasma reconstruction * Allow resubmission of actor tasks * doc * Test for actor constructor * Kill PID before removing node * Kill pid before node * fix java coreworker crash (#9674) * use help proto-init-macro for streaming config (#9272) * Update release information from 0.8.6. (#9124) * [BRING BACK TO MASTER] Update release information. * [MERGE TO MASTER] Add microbenchmark result. * Update asan tests to the doc. * Refinements to the Serve documentation (#9587) Co-authored-by: Dean Wampler <dean@concurrentthought.com> * [tune] survey (#9670) * Fix ERROR logging not being printed to standard error (#9633) Co-authored-by: Mehrdad <noreply@github.com> * [Tune Docs] Logging doc fix (#9691) * [rllib] Type annotations for model classes (#9646) * [Serve] Allow multiple HTTP servers. (#9523) * Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. 
(#9681) * [Serve] Fix Formatting, stale docs (#9617) * fixed simplex initialisation seeding bug (#9660) Co-authored-by: Petros Christodoulou <petrochr@amazon.com> * Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (#9697) Co-authored-by: Mehrdad <noreply@github.com> * Add Ray Serve to README.rst (#9688) * Shellcheck rewrites (#9597) * Fix SC2001: See if you can use ${variable//search/replace} instead. * Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames. * Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames. * Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true. * Fix SC2028: echo may not expand escape sequences. Use printf. * Fix SC2034: variable appears unused. Verify use (or export if used externally). * Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options. * Fix SC2071: > is for string comparisons. Use -gt instead. * Fix SC2154: variable is referenced but not assigned * Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. * Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op). * Fix SC2236: Use -n instead of ! -z. * Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr. * Fix SC2086: Double quote to prevent globbing and word splitting. Co-authored-by: Mehrdad <noreply@github.com> * [Autoscaler] CLI Logger docs (#9690) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update rllib-algorithms.rst (#9640) * [tune] move jenkins tests to travis (#9609) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com> * [RLlib] Implement DQN PyTorch distributional head. 
(#9589) * Add placement group java api (#9611) * add part code * add part code * add part code * fix code style * fix review comment * fix review comment * add part code * add part code * add part code * add part code * fix review comment * fix review comment * fix code style * fix review comment * fix lint error * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [Stats] Improve Stats::Init & Add it to GCS server (#9563) * [Core] Try remove all windows compat shims (#9671) * try remove compat for arrow * remove unistd.h * remove socket compat * delete arrow windows patch * Fix a few flaky tests (#9709) Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency * [GCS]Open test_gcs_fault_tolerance testcase (#9677) * enable test_gcs_fault_tolerance * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * [Tests]lock vector to avoid potential flaky test (#9656) * [tune] distributed torch wrapper (#9550) * changes * add-working * checkpoint * ccleanu * fix * ok * formatting * ok * tests * some-good-stuff * fix-torch * ddp-torch * torch-test * sessions * add-small-test * fix * remove * gpu-working * update-tests * ok * try-test * formgat * ok * ok * [GCS] Fix actor task hang when its owner exits before local dependencies resolved (#8045) * Only update raylet map when autoscaler configured (#9435) * [Dashboard] New dashboard skeleton (#9099) * Fixing multiple building issues * Make wait_for_condition raise exception when timing out. (#9710) * [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (#9718) * Package and upload ray cross-platform jar (#9540) * Revert "Package and upload ray cross-platform jar (#9540)" (#9730) This reverts commit 881032593d3c1b9360ea641c24d50a022677a25e. 
* Only build docker wheels in LINUX_WHEELS env (#9729) * Keep build-autoscaler-images.sh alive in CI (#9720) * [core] Removes Error when Internal Config is not set (#9700) * [Cluster Launcher] Re Org the cluster launcher pages. (#9687) * [RLlib] Offline Type Annotations (#9676) * Offline Annotations * Modifications * Fixed circular dependencies * Linter fix * Python api of placement group (#9243) * Include open-ssh-client for transparency (#9693) * Fix remote-watch.py (#9625) Co-authored-by: Mehrdad <noreply@github.com> * [docker] Uses Latest Conda & Py 3.7 (#9732) * Fix broken actor failure tests. (#9737) * [Stats] fix stats shutdown crash if opencensus exporter not initialized (#9727) * Fix package and upload ray jar (#9742) * Introduce file_mounts_sync_continuously cluster option (#9544) * Separate out file_mounts contents hashing into its own separate hash Add an option to continuously sync file_mounts from head node to worker nodes: monitor.py will re-sync file mounts whenver contents change but will only run setup_commands if the config also changes * add test and default value for file_mounts_sync_continuously * format code * Update comments * Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick Fixed so setup commands run when ray up is run and file_mounts content changes * Refactor so that runtime_hash retains previous behavior runtime_hash is almost identical as before this PR. It is used to determine if setup_commands need to run file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur. 
Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization * fix issue with hashing a hash * fix bug where trying to set contents hash when it wasn't generated * Fix lint error Fix bug in command_runner where check_output was no longer returning the output of the command * clear out provider between tests to get rid of flakyness * reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call * [dist] swap mac/linux wheel build order (#9746) * [RLlib] Enhance reward clipping test; add action_clipping tests. (#9684) * [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (#9680) * [Metrics]Ray java worker metric registry (#9636) * ray worker metrics gauge init * ray java metric mapping * add jni source files for gauge and tagkey * mapping all metric classes to stats object * check non-null for tags and name * lint * add symbol for native metric JNI * extern c for symbol * add tests for all metrics * Update Metric.java use metricNativePointer instead. * unify metric native stuff to one class * fix jni file * add comments for metric transform function in jni utils * move metric function to native metric file * remove unused disconnect jni * Add a metric registry for java metircs * Restore install-bazel.sh * Add some comments for metric registry * Fix thread safe problem of metrics * Fix metric tests and remove sleep code from tests * Fix comments of metrics Co-authored-by: lingxuan.zlx <skyzlxuan@gmail.com> * fix windows compile bug (#9741) Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> * Run _with_interactive in Docker (#9747) * [New scheduler] First unit test for task manager (#9696) * . * . * refactor WorkerInterface * . * Basic unit test structure complete? * . * bad git >:-( * small clean up * CR * . * . * One more fixture * One more fixture * . * . * bazel-format * . 
* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607) * [Release] Fix release tests (#9733) * Register function race (#9346) * Revert "[dist] swap mac/linux wheel build order (#9746)" and "Fix package and upload ray jar (#9742)" (#9758) * Revert "[dist] swap mac/linux wheel build order (#9746)" This reverts commit a9340565ff46626b18fd36f22a37d0380ae18d85. * Revert "Fix package and upload ray jar (#9742)" This reverts commit c290c308fe1e496480db5c37489df619cff6168f. * Fix some Windows CI issues (#9708) Co-authored-by: Mehrdad <noreply@github.com> * Pin pytest version (#9767) * [Java] Use test groups to filter tests of different run modes (#9703) * [Java] Fix MetricTest.java due to incomplete changes from #9703 (#9770) * Fix leased worker leak bug if lease worker requests that are still waiting to be scheduled when GCS restarts (#9719) * [Stats] enable core worker stats (#9355) * [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416) * use a sole thread to handle heartbeat * separate signal thread * use work to avoid exiting when task is underway * protect shared data structure to avoid deadlock * add comments * decrease io service num * minor changes * fix test * per stephanie's comments * use single io service instead of 1-size io service pool * typo * [GCS Actor Management] Fix flaky test_dead_actors. (#9715) * Fix. * Add logs. * Add an unit test. * [TUNE] Tune Docs re-organization (#9600) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [RLlib] Trajectory View API (preparatory cleanup and enhancements). 
(#9678) * [Core] Socket creation race condition bug fixes (#9764) * fix issues * hot fixes * test * test * Always info log * Fixed stderr logging (9765) * [Core] Custom socket name (#9766) * fix issues * hot fixes * test * test * socket name change only * Fix src/ray/core_worker/common.h deleted constructor (#9785) Co-authored-by: Mehrdad <noreply@github.com> * [Stats] Fix harvestor threads + Fix flaky stats shutdown. (#9745) * More fixes * Applying latest changes in travis.yml * Fixing fixture data exclusions * Disable some java tests * Fix some CI errors * Update hash * Fixing more build issues * Fixing more build issues * Fix pipeline cache path * More fixes * Fix bazel test command * Fix bazel test * Fix general info steps * Custom env var for docker build * Trying a different way to install bazel * Bazel fix * Updating hash Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com> Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com> Co-authored-by: Mehrdad <noreply@github.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Alisa <wuminyan0607@gmail.com> Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@vip.qq.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Stefan Schneider <stefan.schneider@upb.de> Co-authored-by: Patrick Ames <pdames@amazon.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> Co-authored-by: fangfengbin <869218239a@zju.edu.cn> Co-authored-by: 灵洵 <fengbin.ffb@antfin.com> Co-authored-by: Tao Wang <dooku.wt@antfin.com> Co-authored-by: Kai Yang <kfstorm@outlook.com> Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Ian Rodney <ian.rodney@gmail.com> Co-authored-by: Henk Tillman 
<henktillman@gmail.com> Co-authored-by: Tanay Wakhare <twakhare@gmail.com> Co-authored-by: Nicolaus93 <nicolo.campolongo@unimi.it> Co-authored-by: Vasily Litvinov <45396231+vnlitvinov@users.noreply.github.com> Co-authored-by: krfricke <krfricke@users.noreply.github.com> Co-authored-by: Max Fitton <maxfitton@gmail.com> Co-authored-by: Max Fitton <max@semprehealth.com> Co-authored-by: kisuke95 <2522134184@qq.com> Co-authored-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Simon Mo <xmo@berkeley.edu> Co-authored-by: Michael Mui <68102089+heyitsmui@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: Michael Luo <michael.luo123456789@gmail.com> Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu> Co-authored-by: Tom <veniat.tom@gmail.com> Co-authored-by: jerrylee.io <JerryDeKo@gmail.com> Co-authored-by: Raphael Avalos <raphael@avalos.fr> Co-authored-by: William Falcon <waf2107@columbia.edu> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Arne Sachtler <arne.sachtler@gmail.com> Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: ZhuSenlin <wumuzi520@126.com> Co-authored-by: Max Fitton <mfitton@berkeley.edu> Co-authored-by: Maksim Smolin <maximsmol@gmail.com> Co-authored-by: Dean Wampler <dean@polyglotprogramming.com> Co-authored-by: Dean Wampler <dean@concurrentthought.com> Co-authored-by: Bill Chambers <bill@anyscale.com> Co-authored-by: Petros Christodoulou <p.christodoulou2@gmail.com> Co-authored-by: Petros Christodoulou <petrochr@amazon.com> Co-authored-by: Justin Terry <justinkterry@gmail.com> Co-authored-by: Tao Wang <wangtaothetonic@163.com> Co-authored-by: fyrestone 
<fyrestone@outlook.com> Co-authored-by: Alan Guo <aguo@anyscale.com> Co-authored-by: bermaker <495571751@qq.com> * Sync Upstream master (#50) * [core] Pull Manager exponential backoff (#13024) * [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793) * [release tests] test_many_tasks fix (#12984) * Add "beta" documentation for enabling object spilling manually (#13047) * [Serve] Handle Bug Fixes (#12971) * [Dashboard] Add GET /logical/actors API (#12913) * [GCS]Decouple gcs resource manager and gcs node manager (#13012) * [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031) * [GCS] Delete redis gcs client and redis_xxx_accessor (#12996) * [RLlib] Fix broken unity3d_env import in example server script. (#13040) * [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039) * [joblib] Fix flaky joblib test. (#13046) * [Tune]Add integer loguniform support (#12994) * Add integer quantization and loguniform support * Fix hyperopt qloguniform not being np.log'd first * Add tests, __init__ * Try to fix tests, better exceptions * Tweak docstrings * Type checks in SearchSpaceTest * Update docs * Lint, tests * Update doc/source/tune/api_docs/search_space.rst Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> * [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048) * Add index for tasks to dispatch * Task dependency manager interface * Unsubscribe dependencies and tests * NodeManager * Revert "Add index for tasks to dispatch" This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea. 
* tmp * Move back to waiting if args not ready * update * Update to new form of brew cask install command * [Autoscaler] New output log format (#12772) * Fix typo RMSProp -> RMSprop (#13063) * [serve] Centralize HTTP-related logic in HTTPState (#13020) * Remove suppress output to see why wheel is not building * Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006) * New dependency manager * Switch raylet to new DependencyManager * PullManager accepts bundles * Cleanup, remove old task dependency manager * x * PullManager unit tests * lint * Unit tests * Rename * lint * test * Update src/ray/raylet/dependency_manager.cc Co-authored-by: SangBin Cho <rkooo567@gmail.com> * Update src/ray/raylet/dependency_manager.cc Co-authored-by: SangBin Cho <rkooo567@gmail.com> * x * lint Co-authored-by: SangBin Cho <rkooo567@gmail.com> * [docs] Fix args + kwargs instead of docstrings (#13068) * functools wraps * Fix typo (functoools -> functools) * Fix OS X Wheel Build - Update brew cask install (#13062) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * speed up local mode object store get (#13052) Co-authored-by: senlin.zsl <senlin.zsl@antfin.com> * [RLlib] Execution Annotation (#13036) * [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943) * [C++ API] Added reference counting to ObjectRef (#13058) * Added reference counting to ObjectRef * Addressed the comments * [Core] Remove cuda support in plasma store (#13070) * remove cuda support in plasma store * [Core] Remote outdated external store (#13080) * remove outdated external store * [GCS] Move resource usage info to gcs resource manager (#13059) * [RLlib] JAXPolicy prep. PR #1. (#13077) * [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083) * [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. 
(#13064) * [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935) * other collectives all work * auto-linting * mannual linting #1 * mannual linting 2 * bugfix * add send/recv point-to-point calls * add some initial code for communicator caching * auto linting * optimize imports * minor fix * fix unpassed tests * support more dtypes * rerun some distributed tests for send/recv * linting * [Serve] [Doc] Front page update (#13032) * Deprecate experimental / dynamic resources (#13019) * [docs] fix wandb url (#13094) * [Serve] Implement Graceful Shutdown (#13028) * [Serve] Use ServeHandle in HTTP proxy (#12523) * [Java] Format ray java code (#13056) * [docker] Fix restart behavior with Docker (#12898) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: ijrsvt <ilr@anyscale.com> * Disable broken streaming tests (#13095) * [autoscaler] Make placement groups bypass max launch limit (#13089) * Serve metrics docs (#13096) * [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097) * [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035) * [Doc] Fix Sphinx.add_stylesheet deprecation (#13067) * Fix streaming ci failure (#12830) * [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118) * [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113) * [RLlib] Deflake test case: 2-step game MADDPG. (#13121) * [RLlib] Trajectory view API docs. (#12718) * Job module without submission (#13081) Co-authored-by: 刘宝 <po.lb@antfin.com> * [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091) * [Java] Avoid failure of serializing a user-defined unserializable exception. 
(#13119) * [Tune] Update URL to fix 403 not found error in PBT tranformers test case (#13131) * [serve] Async controller (#13111) * [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948) * [Serve] Use a small object to track requests (#13125) * [docs][kubernetes][minor] Update K8s examples in doce (#13129) * [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698) * [docs] Documentation + example for the C++ language API (#13138) * [Java] Support `wasCurrentActorRestarted` in actor task. (#13120) * Remove check. * Add test * fix lint * lint * Fix spotless lint * Address comments. * Fix lint Co-authored-by: Qing Wang <jovany.wq@antgroup.com> * [docs] Minor change to formating C++ docs. (#13151) * Deprecate setResource java api (#13117) * [docs] Small fix in C++ documentation. (#13154) * prepare for head node * move command runner interface outside _private * remove space * Eric * flake * min_workers in multi node type * fixing edge cases * eric not idle * fix target_workers to consider min_workers of node types * idle timeout * minor * minor fix * test * lint * eric v2 * eric 3 * min_workers constraint before bin packing * Update resource_demand_scheduler.py * Revert "Update resource_demand_scheduler.py" This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5. 
* reducing diff * make get_nodes_to_launch return a dict * merge * weird merge fix * auto fill instance types for AWS * Alex/Eric * Update doc/source/cluster/autoscaling.rst * merge autofill and input from user * logger.exception * make the yaml use the default autofill * docs Eric * remove test_autoscaler_yaml from windows tests * lets try changing the test a bit * return test * lets see * edward * Limit max launch concurrency * commenting frac TODO * move to resource demand scheduler * use STATUS UP TO DATE * Eric * make logger of gc freed refs debug instead of info * add cluster name to docker mount prefix directory * grrR * fix tests * moving docker directory to sdk * move the import to prevent circular dependency * smallf fix * ian * fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running * small fix * deflake test_joblib * lint * placement groups bypass * remove space * Eric * first ocmmit * lint * exmaple * documentation * hmm * file path fix * fix test * some format issue in docs * modified docs Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan> Co-authored-by: Alex Wu <alex@anyscale.io> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal> * [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127) * [kubernetes][docs][minor] Kubernetes version warning (#13161) * [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817) * Locality-aware leasing for owned refs (pinned locations). * LessorPicker --> LeasePolicy. * Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects. * Update comments. * Turn on locality-aware leasing feature flag by default. 
* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy. * Add lease policy consulting assertions to the direct task submitter tests. * Add lease policy tests. * LocalityLeasePolicy --> LocalityAwareLeasePolicy. * Add missing const declarations. Co-authored-by: SangBin Cho <rkooo567@gmail.com> * Add RAY_CHECK for raylet address nullptr when creating lease client. * Make the fact that LocalLeasePolicy always returns the local node more explicit. * Flatten GetLocalityData conditionals to make it more readable. * Add ReferenceCounter::GetLocalityData() unit test. * Add data-intensive microbenchmarks for single-node perf testing. * Add data-intensive microbenchmarks for simulated cluster perf testing. * Remove redundant comment. * Remove data-intensive benchmarks. * Add locality-aware leasing Python test. * Formatting changes in ray_perf.py. Co-authored-by: SangBin Cho <rkooo567@gmail.com> * Enabling the cancellation of non-actor tasks in a worker's queue (#12117) * wrote code to enable cancellation of queued non-actor tasks * minor changes * bug fixes * added comments * rev1 * linting * making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error * bug fix * added two unit tests * linting * iterating through pending_normal_tasks starting from end * fixup! iterating through pending_normal_tasks starting from end * fixup! fixup! 
iterating through pending_normal_tasks starting from end * post merge fixes * added debugging instructions, pulled Accept() out of guarded loop * removed debugging instructions, linting * [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061) * [Release] Update Release Process Documentation (#13123) * [Core] Remove Arrow dependencies (#13157) * remove arrow ubsan * remove arrow build depend * remove arrow buffer * [XGboost] Update Documentation (#13017) Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [SGD] Fix Docstring for `as_trainable` (#13173) * Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178) This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2. * Surface object store spilling statistics in `ray memory` (#13124) * [ray_client]: Move from experimental to util (#13176) Change-Id: I9f054881f0429092d265cd6944d89804cce9d946 * Remove unused file(object_manager_integration_test.cc) (#12989) * Notify listeners after registered node stored (#13069) * [build]Update description and add some keywords (#13163) * [Collective][PR 2/6] Driver program declarative interfaces (#12874) * scaffold of the code * some scratch and options change * NCCL mostly done, supporting API#1 * interface 2.1 2.2 scratch * put code into ray and fix some importing issues * add an addtional Rendezvous class to safely meet at named actor * fix some small bugs in nccl_util * some small fix * scaffold of the code * some scratch and options change * NCCL mostly done, supporting API#1 * interface 2.1 2.2 scratch * put code into ray and fix some importing issues * add an addtional Rendezvous class to safely meet at named actor * fix some small bugs in nccl_util * some small fix * add a Backend class to make Backend string more robust * add several useful APIs * add some tests * added allreduce test * fix typos * fix several bugs found via unittests * fix and update torch test * changed back actor * rearange a bit before 
importing distributed test * add distributed test * remove scratch code * auto-linting * linting 2 * linting 2 * linting 3 * linting 4 * linting 5 * linting 6 * 2.1 2.2 * fix small bugs * minor updates * linting again * auto linting * linting 2 * final linting * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update python/ray/util/collective_utils.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * added actor test * lint * remove local sh * address most of richard's comments * minor update * remove the actor.option() interface to avoid changes in ray core * minor updates Co-authored-by: YLJALDC <dal177@ucsd.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [serve] Merge ActorReconciler and BackendState (#13139) * [tune] better signature check for `tune.sample_from` (#13171) * [tune] better signature check for `tune.sample_from` * Update python/ray/tune/sample.py Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> * Disable atexit test on windows (#13207) * [serve] Move controller state into separate files (#13204) * Update multi_agent_independent_learning.py (#13196) pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead * [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162) * [Tune] Fix PBT Transformers Example (#13174) * [Serve] HTTPOptions for deployment modes (#13142) * [tests] Fix Autoscaler Test failure on Windows (#13211) * skip create_or_update tests * Update python/ray/tests/test_autoscaler.py Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu> * [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158) * [GCS]Fix TestActorSubscribeAll bug (#13193) * [Metrics] Record per node and raylet cpu / 
mem usage (#12982) * Record per node and raylet cpu / mem usage * Add comments. * Addressed code review. * [Tune] Fix tune serve integration example (#13233) * [Redis] Note that each Redis Connect retry takes two minutes (#12183) * Slightly alter error message so it's the same in both cases. * Each retry takes about two minutes. * [Log] fix spdlog init race (#12973) * fix spdlog init race * use global logger * refine logger name and constructor * [Release] Add 1.1.0 release test logs (#13054) * Add microbenchmark to release logs * check in many_tasks stress test result * Add results of placement group stress test for 1.1.0 * Add result for test_dead_actors test and correct the name of test_many_tasks.txt * Add rllib regression test result * Add pytorch test results for rllib * remove extraneous log entries * [Core] Fix incorrect comment (#13228) * [Serialization] Fix cloudpickle (#13242) * [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195) * Start ray client server with 'ray start' (#13217) * [GCS]Add gcs actor schedule strategy (#13156) * Publish job/worker info with Hex format instead of Binary (#13235) * [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126) * [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247) Now that `HeadOnly` becomes the new default HTTP location, we can re-enable the long running tests to use local multi-clusters. (also fixed the controller's API to match up to date, we should have caught these, I will open issues for this.) * Update autoscaler-cluster yaml files for release tests (#13114) * [Release] Use ray-ml image for logn running test (#13267) * [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. 
(#13237) * [Tune] Improve error message for Session Detection (#13255) * Improve error message * log once * [Tune] Pin Tune Dependencies (#13027) Co-authored-by: Ian <ian.rodney@gmail.com> * [Dependabot] Add Dependabot (#13278) Co-authored-by: Ian <ian.rodney@gmail.com> * [docker] Pull if image is not present (#13136) * [GCS] Remove old lightweight resource usage report code path (#13192) * [Dashboard] Add GET /log_proxy API (#13165) * Fix a crash problem caused by GetActorHandle in ActorManager (#13164) * [ray_client] Add metadata to gRPC requests (#13167) * [RLlib] Preparatory PR for: Documentation on Model Building. (#13260) * [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286) * [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287) * Remove top-level ray.connect() and ray.disconnect() APIs (#13273) * [Pull manager] Only pull once per retry period (#13245) * . * docs * cleanup * . * . * . * . Co-authored-by: Alex <alex@anyscale.com> * [Cancellation] Make Test Cancel Easier to Debug (#13243) * first commit * lint-fix * [ray_client]: first draft of documentation (#13216) * Do not give an error if both `RAY_ADDRESS` and `address` is specified on initialization (#13305) * Finalize handling of RAY_ADDRESS * lint * [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215) * [RLlib] SlateQ Documentation (#13266) * [RLlib] Add more detailed Documentation on Model building API (#13261) * [tune] convert search spaces: parse spec before flattening (#12785) * Parse spec before flattening * flatten after parse * Test for ValueError if grid search is passed to search algorithms * remove empty extras streaming deps (#12933) * add the method annotation and a comment explaining what's happening (#13306) Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a * Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210) * [RLlib] Issue 13330: No TF installed 
causes crash in `ModelCatalog.get_action_shape()` (#13332) * [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298) * fix removal of task dependencies (#13333) Co-authored-by: senlin.zsl <senlin.zsl@antfin.com> * [Serve] Support Starlette streaming response (#13328) * [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339) * [client] Report number of currently active clients on connect (#13326) * wip * update * update * reset worker * fix conn * fix * disable pycodestyle * Implement internal kv in ray client (#13344) * kv internal * fix * [Tune] Rename MLFlow to MLflow (#13301) * Forgot overwrite parameter in Ray client internal kv * Fix typo in Tune Docs (Checkpointing) (#13348) See issue #13299 * [Kubernetes][Docs] GPU usage (#13325) * gpu-note * gpu-note * More info * lint? * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/cluster/kubernetes.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * GKE->Kubernetes Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361) This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419. 
* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359) * [tune] buffer trainable results (#13236) * Working prototype * Pass buffer length, fix tests * Don't buffer per default * Dispatch and process save in one go, added tests * Fix tests * Pass adaptive seconds to train_buffered, stop result processing after STOP decision * Fix tests, add release test * Update tests * Added detailed logs for slow operations * Update python/ray/tune/trial_runner.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Apply suggestions from code review * Revert tests and go back to old tuning loop * nit Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [Serve] Add dependency management support for driver not running in a conda env (#13269) * [RLlib] Add `__len__()` method to SampleBatch (#13371) * [Serve] Backend state unit tests (#13319) * trigger doc build for serve updates (#13373) * [Object Spilling] Long running object spilling test (#13331) * done. * formatting. * Remove unimplemented GetAll method in actor info accessor (#13362) * [Doc] Remove trailing whitespaces (#13390) * Enable Ray client server by default (#13350) * update * fix * fix test * update * [RLlib] Trajectory View API: Atari framestacking. 
(#13315) * [ray_client]: Wait for ready and retry on ray.connect() (#13376) * [ray_client]: wait until connection ready Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6 * lint Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0 * docs and retry minimum Change-Id: I43f5378322029267ddd69f518ce8206876e2129d * [Dashboard] Fix missing actor pid (#13229) * [ray_client]: Fix multiple attempts at checking connection (#13422) * Plumb retries update (#13411) * [Serve] [Doc] Improve batching doc (#13389) * [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514) * Fix Serve release test (#13385) * Add bazel logs upload to GHA (#13251) * [tune] Fix f-string in error message (#13423) * [serve] Pull out goal management logic into AsyncGoalManager class (#13341) * Make request_resources() use internal kv instead of redis pub sub (#13410) * Remove unused handler methods (#13394) * [Tune] Pin Transitive Dependencies (#13358) * Split out the part of get_node_ip_address for which the docstring is correct (#12796) * Fix raylet::MockWorker::GetProcess crashes (#13440) Co-authored-by: 刘宝 <po.lb@antfin.com> * Revert "Enable Ray client server by default (#13350)" (#13429) This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d. * Fix linter error (#13451) * [GCS]Add gcs resource scheduler (#13072) * [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). 
(#13363) * [Core]Fix raylet scheduling bug (#13452) * [Core]Fix raylet scheduling bug * fix lint error * fix lint error Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com> * [joblib] joblib strikes again but this time on windows (#13212) * [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424) * [kubernetes][minor] Operator garbage collection fix (#13392) * [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391) * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Make status and error args required in commands.py#debug.status * Remove unnecessary imports * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init() * Modify ray status cli so that it doesn't start a new job via ray.init() * Remove local test file * Make status and error args required in commands.py#debug.status * Remove unnecessary imports * Job 38482.1 should now pass * Resolve merge conflict * [RLlib] Deflake 2x remote & local inference tests (external env). (#13459) * [docs] Add more guideline on using ray in slurm cluster (#12819) Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com> Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * [Dashboard] Fix GPU resource rendering issue (#13388) * [Release] Fix Serve release test (#13303) The Docker image we were using now uses `ray` users so we have to call sudo. 
* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460) * Fix getting runtime context dict in driver (#13417) * [xgb] re-enable xgboost_ray tests (#13416) * re-enable * fix * update xgb_ray version * [Serialization] New custom serialization API (#13291) * new serialization API with doc & test * add more notes * refine notes * doc * [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220) * Added owned object reference before Plasma put on Create() + Seal() path. * Consolidated location table and reference table in reference counter. * Restore type in definition. * Clean up owned reference on failed Seal(). * Added RemoveOwnedObject test for reference counter. * Guard against ref going out of scope before location RPCs. * Add 'owner must have ref in scope' precondition to documentation for object location methods. * Move to separate Create() + Seal() methods for existing objects. * Clearer distinction between Create() and Seal() methods. * Make it clear that references will normally be cleaned up by reference counting. * [ray_client]: Support runtime_context as metadata (#13428) * [GCS]Remove unused class variable (#13454) * [Object Spilling] Dedup restore objects (#13470) * done. * Addressed code review. 
* [CI] Enable Dashboard tests for master (#13425) * [docker/dashboard] Fix ray dashboard (#12899) * [CI] Fix Windows Bazel Upload (#13436) * Return version info from Ray client connect, to allow for discovering version mismatches * Update ID specification doc (#13356) * [ray_client]: fix wrong reference in server_pickler (#13474) Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf * Bump dev branch to 2.0 to avoid endless version bump toil (#13497) * wip * fix * fix * Remove an unnecessary file (#13499) * [Tests] Skip failing windows tests (#13495) * skip failing windows tests * skip more * remove * updates * [tune] fix small docs typo (#13355) Signed-off-by: Richard Liaw <rliaw@berkeley.edu> * move message to debug (#13472) * Minimal version of piping autoscaler events to driver logs (#13434) * sync write internal config in gcs (#13197) * Refactor node manager to eliminate `new_scheduler_enabled_` (#12936) * [GCS]Only publish changed field when node dead (#13364) * Only update changed field when node dead * node_id missed * [CI] Buildkite PR Environment for Simple Tests (#13130) * [GCS] Remove task info publish as nowhere uses it (#13509) * Remove task info publish as nowhere uses it * simplify right publish channel * [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467) * [tune] placement group support (#13370) * [Serve] Allow ObjectRef for Composition (#12592) * Add Dashboard Python Test to Buildkite (#13530) * Add ability to not start Monitor when calling `ray start` (#13505) * [tune] support experiment checkpointing for grid search (#13357) * Fix typo (#13098) * Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544) * [RLlib] MARWIL loss function test case and cleanup. (#134…
Why are these changes needed?
This enables automatic object reconstruction for actor tasks, up to the specified number of `max_task_retries`. The main change required is to reset the task counter in the actor's task spec, since the resubmitted task should not be executed according to the original order of submission. This also adds a unit test for actor constructor tasks that depend on a plasma object.
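To illustrate why the counter needs to be reset, here is a minimal toy model in plain Python. It is not Ray's implementation: `ActorTaskSpec`, `ToyActor`, and `ToySubmitter` are hypothetical names made up for this sketch, and the choice to reject a stale counter outright is an assumption made for simplicity. It only shows that an actor which executes tasks in counter order cannot run a resubmitted task that still carries its original counter, whereas a task given a fresh counter runs normally.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class ActorTaskSpec:
    """Hypothetical, simplified stand-in for an actor task spec."""
    name: str
    counter: int  # position in the caller's submission order


class ToyActor:
    """Toy actor that executes tasks strictly in counter order."""

    def __init__(self) -> None:
        self.next_counter = 0
        self.pending: Dict[int, ActorTaskSpec] = {}
        self.executed: List[str] = []

    def receive(self, spec: ActorTaskSpec) -> bool:
        """Queue a task; return False if its counter was already consumed."""
        if spec.counter < self.next_counter:
            # Assumption of this sketch: a stale counter is simply rejected.
            return False
        self.pending[spec.counter] = spec
        # Run every queued task that is now next in line.
        while self.next_counter in self.pending:
            self.executed.append(self.pending.pop(self.next_counter).name)
            self.next_counter += 1
        return True


class ToySubmitter:
    """Toy caller that stamps each task with an increasing counter."""

    def __init__(self, actor: ToyActor) -> None:
        self.actor = actor
        self.next_counter = 0

    def submit(self, name: str) -> ActorTaskSpec:
        spec = ActorTaskSpec(name, self.next_counter)
        self.next_counter += 1
        self.actor.receive(spec)
        return spec

    def resubmit(self, spec: ActorTaskSpec, reset_counter: bool) -> bool:
        """Resubmit a task, optionally stamping it with a fresh counter."""
        if reset_counter:
            spec = replace(spec, counter=self.next_counter)
            self.next_counter += 1
        return self.actor.receive(spec)


if __name__ == "__main__":
    actor = ToyActor()
    submitter = ToySubmitter(actor)
    first = submitter.submit("produce_object")
    submitter.submit("other_task")
    # The original counter was already consumed, so the retry cannot run.
    print(submitter.resubmit(first, reset_counter=False))
    # With a fresh counter the retry is executed after the earlier tasks.
    print(submitter.resubmit(first, reset_counter=True))
    print(actor.executed)
```

In this toy model the reset amounts to treating the reconstruction as a new submission that is ordered after everything the caller has already sent, instead of trying to replay it in its original slot.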
Checks
I've run `scripts/format.sh` to lint the changes in this PR.