sven1977 (Contributor) commented Jul 24, 2020

This PR fixes issue #9631: TF 1.14 does not have tf.config.list_physical_devices, so tf.config.experimental.list_physical_devices must be used instead.

Closes issue #9631
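
A minimal sketch of the kind of version-tolerant lookup the description implies (not necessarily the code merged in this PR). The helper name `list_physical_devices` and the `old_style` / `new_style` stand-in objects are assumptions made for illustration; only `tf.config.list_physical_devices` and `tf.config.experimental.list_physical_devices` are real TensorFlow names.

```python
from types import SimpleNamespace


def list_physical_devices(tf_config, device_type=None):
    """Call whichever device-listing API the given tf.config object exposes."""
    # Newer releases expose tf.config.list_physical_devices directly.
    lister = getattr(tf_config, "list_physical_devices", None)
    if lister is None:
        # TF 1.14 only has the variant under tf.config.experimental.
        lister = tf_config.experimental.list_physical_devices
    return lister(device_type)


# Hypothetical stand-ins for the two API layouts, so the sketch runs
# without TensorFlow installed.
old_style = SimpleNamespace(
    experimental=SimpleNamespace(
        list_physical_devices=lambda device_type=None: ["GPU:0"]
    )
)
new_style = SimpleNamespace(
    list_physical_devices=lambda device_type=None: ["GPU:0", "GPU:1"]
)
```

With TensorFlow installed, the same helper would be called as `list_physical_devices(tf.config, "GPU")` after `import tensorflow as tf`.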

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins

Can one of the admins verify this patch?

sven1977 added the `tests-ok` label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Jul 24, 2020
sven1977 merged commit e4c5d35 into ray-project:master on Jul 24, 2020
Edilmo added a commit to BonsaiAI/ray that referenced this pull request Aug 20, 2020
* [Core] Enhance common client connection (ray-project#9367)

* enhance client connection

* add write buffer async

* read message

* add test

* Bazel move more shell to native rules (ray-project#9314)

Co-authored-by: Mehrdad <noreply@github.com>

* [tune] Fix github readme (ray-project#9365)

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* Combine different severities into the same log files (ray-project#9230)

* Combine different severities into the same log files

Co-authored-by: Mehrdad <noreply@github.com>

* [core] Pass owner address from the workers to the raylet (ray-project#9299)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (ray-project#9063)"

This reverts commit 275da2e.

* Fix free

* fix tests

* Fix tests

* build

* build

* fix

* Change assertion to warning to fix java

* [Core] Add placement group scheduler and some api in resource scheduler (ray-project#9039)

* Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (ray-project#8984).

* change the bundle id and delete unit count in bundle

change vector<bundle_spec> to vector<shared_ptr<bundle_spec>>

Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (ray-project#8984).

change the bundle id and delete unit count in bundle

remove CheckIfSchedulable()

add comments and fix the bug in resource

* fix placement group schedule

* add placement group scheduler and change some api in resource scheduler

* fix by the comments

* fix conflict

* fix lint

* fix lint

* fix bug in merge

* fix lint

Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>

* [Core] New scheduler fixes (ray-project#9186)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* Fixed scheduling tests

* .

* .

* [Core] put small objects in memory store (ray-project#8972)

* remove the put in memory store

* put small objects directly in memory store

* cast data type

* fix another place that uses Put to spill to plasma store

* fix multiple tests related to memory limits

* partially fix test_metrics

* remove not functioning codes

* fix core_worker_test

* refactor put to plasma codes

* add a flag for the new feature

* add flag to more places

* do a warmup round for the plasma store

* lint

* lint again

* fix warmup store

* Update _raylet.pyx

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* [autoscaler] Move command runners into separate file and clean up interface. (ray-project#9340)

* cleanup

* wip

* fix imports

* fix lint

* [docs][rllib] Recommended workflow for training, saving, and testing (ray-project#9319)

* [autoscaler] Allow users to disable the cluster config cache (ray-project#8117)

* [autoscaler] Remove autoscaler config cache.

* [autoscaler] Add flag allowing users to explicitly disable the config cache.

* Update hiredis and remove Windows patches (ray-project#9289)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix flaky test_dynres.py (ray-project#9310)

* Fix gcs_table_storage testcase bug (ray-project#9393)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [HOTFIX] Fix compile direct_actor_transport_test on mac (ray-project#9403)

* Change Python's `ObjectID` to `ObjectRef` (ray-project#9353)

* [Java] Improve JNI performance when submitting and executing tasks (ray-project#9032)

* Remove the RAY_CHECK in Worker::Port() (ray-project#9348)

* [RLlib] Issue ray-project#9366 (DQN w/o dueling produces invalid actions). (ray-project#9386)

* Fix macos compilation bug (ray-project#9391)

* Fix.

* [Core] Plasma RAII support (ray-project#9370)

* [Serve] Merge router with HTTPProxy (ray-project#9225)

* Pass run args to DockerCommandRunner (ray-project#9411)

* Fix copy to workspace (ray-project#9400)

* [RLlib] Tf2.x native. (ray-project#8752)

* Update conda and ray wheel on GCP images (ray-project#9388)

* [Core] Simplify Raylet Client (ray-project#9420)

* Masking error. With t*valid_mask, we get the error np.inf*0 = np.inf (ray-project#9407)

* [RLLib] WindowStat bug fix (ray-project#9213)

* WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue ray-project#7910.

* [tune] handling nan values (ray-project#9381)

* TRAVIS_PULL_REQUEST is false for non-PRs, not empty (ray-project#9439)

Co-authored-by: Mehrdad <noreply@github.com>

* [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (ray-project#9422)

* [Tune] Trainable documentation fix (ray-project#9448)

* Allow --lru-evict to be passed into `ray start` (ray-project#8959)

* GCP authentication using oauth tokens (ray-project#9279)

* Bazel selects compiler flags based on compiler (ray-project#9313)



Co-authored-by: Mehrdad <noreply@github.com>

* [Core] Build raylet client as an independent component (ray-project#9434)

* [tune] sklearn comment out (ray-project#9454)

* Add ability to specify SOCKS proxy for SSH connections (ray-project#8833)

* [docs] Render ActorPool documentation, etc (ray-project#9433)

* [tune] Put examples under proper version control (ray-project#9427)

Co-authored-by: krfricke <krfricke@users.noreply.github.com>

* Fix test-multi-node (ray-project#9453)

* Machine View Sorting / Grouping (ray-project#9214)

* Convert NodeInfo.tsx to a functional component

* Update NodeRowGroup to be a functional component

* lint

* Convert TotalRow to functional component.

* lint

* move node info over to using the sortable table head component. spacing is still a little wonky.

* Factor a NoewWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping

* Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer

* Add sort accessors for CPU

* Add sort accessors for Disk

* Add sort accessors for RAM

* add a table sort util for function based accessors (rather than flat attribute-based accessor)

* wip refactor node info features

* wip

* Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic

* wip

* wip

* wip

* Finish adding sorting and grouping of machine view

* lint

* fix bug in filtration of logs and errors by worker from recent refactor.

* Add export of Cluster Disk feature

* fix some merge issues

Co-authored-by: Max Fitton <max@semprehealth.com>

* [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (ray-project#9269)

* [RLlib] Issue 9402 MARWIL producing nan rewards. (ray-project#9429)

* Fix gcs_pubsub_test bug (ray-project#9438)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* change error code name of boost timer (ray-project#9417)

* [tune] PyTorch CIFAR10 example (ray-project#9338)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>

* Remove legacy C++ code (ray-project#9459)

* Fix ObjectRef and ActorHandle serialization (ray-project#9462)

* [Stats] metrics agent exporter (ray-project#9361)

* [Core] Support GCS server port assignment. (ray-project#8962)

* Add scripts symlink back (ray-project#9219) (ray-project#9475)

(cherry picked from commit 77933c9)

Co-authored-by: Simon Mo <xmo@berkeley.edu>

* [tune] Issue 8821: ExperimentAnalysis doesn't expand user (ray-project#9461)

* [docker] Include base-deps image in rayproject Docker Hub (ray-project#9458)

* [Core] remove create_and_seal and create_and_seal_batch (ray-project#9457)

* Speedups for GitHub Actions (ray-project#9343)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix flaky test_object_manager.py (ray-project#9472)

* [Java] fix redis-server binary path (ray-project#9398)

* [core] Handle out-of-order actor table notifications (ray-project#9449)

* Drop stale actor table notifications

* build

* Add num_restarts to disconnect handler

* Unit test and increment num_restarts on ALIVE, not RESTARTING

* Wait for pid to exit

* Fix name clash on Windows (ray-project#9412)

Co-authored-by: Mehrdad <noreply@github.com>

* Add job configs to gcs (ray-project#9374)

* Make pip install verbose (ray-project#9496)

Co-authored-by: Mehrdad <noreply@github.com>

* Make more tests compatible with Windows (ray-project#9303)

* [tune] extend PTL template (GPU, typing fixes, tensorboard) (ray-project#9451)

Co-authored-by: Kai Fricke <kai@anyscale.com>

* [core] Replace task resubmission in raylet with ownership protocol (ray-project#9394)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (ray-project#9063)"

This reverts commit 275da2e.

* Fix free

* Regression tests - shorten timeouts in reconstruction unit tests

* Remove timeout for non-actor tasks

* Modify tests using ray.internal.free

* Clean up future resolution code

* Raylet polls the owner

* todo

* comment

* Update src/ray/core_worker/core_worker.cc

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* Drop stale actor table notifications

* Fix bug where actor restart hangs

* Revert buggy code for duplicate tasks

* build

* Fix errors for lru_evict and internal.free

* Revert "Drop stale actor table notifications"

This reverts commit 193c5d2.

* Revert "build"

This reverts commit 5644edb.

* Fix free test

* Fixes for freed objects

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* release gil in global state accessor (ray-project#9357)

* [Java] Named java actor (ray-project#9037)

* Fix clang-cl build (ray-project#9494)

Co-authored-by: Mehrdad <noreply@github.com>

* [GCS Actor Management] Gcs actor management broken detached actor (ray-project#9473)

* [RLlib] Issue ray-project#9437 (PyTorch converts to CPU tensor, even if on GPU). (ray-project#9497)

* Get rid of build shell scripts and move them to Python (ray-project#6082)

* Fix broken test_raylet_info_endpoint (ray-project#9511)

* Fix. (ray-project#9464)

* [Autoscaler] Making bootstrap config part of the node provider interface (ray-project#9443)

* supporting custom bootstrap config for external node providers

* bootstrap config

* renamed config to cluster_config

* lint

* remove 2 args from importer

* complete move of bootstrap to node_provider

* renamed provider_cls

* move imports outside functions

* lint

* Update python/ray/autoscaler/node_provider.py

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* final fixes

* keeping lines to reduce diff

* lint

* lamba config

* filling in -> adding for lint

Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Fix flaky test_actor_failures::test_actor_restart (ray-project#9509)

* Fix flaky test

* os exit

* [rllib] MAML Transform (ray-project#9463)

* MAML Transform

* Moved Inner Adapt to Method in Execution Plan

* Cleanup Plasma Store (hash utilities) (ray-project#9524)

* [Serve] Improve buffering for simple cases (ray-project#9485)

* [Serve] Use pickle instead of cloudpickle (ray-project#9479)

* Fix pip and Bazel interaction messing up CI (ray-project#9506)

Co-authored-by: Mehrdad <noreply@github.com>

* [Core] Fix Java detached error (ray-project#9526)

* fix java createActor NPE bug (ray-project#9532)

* [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (ray-project#9516)

* [Stats] Fix metric exporter test (ray-project#9376)

* Hotfix Lint for Serve (ray-project#9535)

* Windows cleanup (ray-project#9508)

* Remove unneeded code for Windows

* Get rid of usleep()

* Make platform_shims includes non-transitive

Co-authored-by: Mehrdad <noreply@github.com>

* [RLlib] Issue 8384: QMIX doesn't learn anything. (ray-project#9527)

* Add placement group manager and some code in core_worker (ray-project#9120)

Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>

* [core] Add flag to enable object reconstruction during ray start (ray-project#9488)

* Add flag

* doc

* Fix tests

* Pipelining task submission to workers (ray-project#9363)

* first step of pipelining

* pipelining tests & default configs
- added pipelining unit tests in direct_task_transport_test.cc
- added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker
- consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_

* post-review revisions

* linting, following naming/style convention

* linting

* [New scheduler] Queueing refactor (ray-project#9491)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* .

* .

* .

* .

* .

* .

* .

* cleanup

* address reviews

* address reviews

* more refactor

* :)

* travis pls

* .

* travis pls

* .

* [Serve] Add internal instruction for running benchmarks (ray-project#9531)

* MADDPG learning confirmation test. (ray-project#9538)

* Fix Bazel in Docker (ray-project#9530)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (ray-project#9539)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [tune] Unflattened lookup for ProgressReporter (ray-project#9525)

Co-authored-by: Kai Fricke <kai@anyscale.com>

* Add plasma store benchmark for small objects (ray-project#9549)

* [Tune] Copy default_columns in new ProgressReporter instances (ray-project#9537)

* quickfix (ray-project#9552)

* [tune] pin tune-sklearn (ray-project#9498)

* [cli] ray memory: added redis_password (ray-project#9492)

* [GCS]Fix lease worker leak bug when gcs server restarts (ray-project#9315)

* add part code

* fix compile bug

* fix review comments

* fix review comments

* fix review comments

* fix review comments

* fix review comment

* fix ut bug

* fix lint error

* fix review comment

* fix review comments

* add testcase

* add testcase

* fix bug

* fix review comments

* fix review comment

* fix review comment

* refine comments

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>

* [tune] fix pbt checkpoint_freq (ray-project#9517)

* Only delete old checkpoint if it is not the same as the new one

* Return early if old checkpoint value coincides with new checkpoint value

Co-authored-by: Kai Fricke <kai@anyscale.com>

* [Core] Remove socket pair exchange in Plasma Store (ray-project#9565)

* try use boost::asio for notification processing

* [Metric] new cython interface for python worker metric (ray-project#9469)

* Bazel fixes (ray-project#9519)

* GCS client add fetch operation before subscribe (ray-project#9564)

* [RLlib] Fix combination of lockstep and multiple agents controlled by the same policy. (ray-project#9521)

* Change aggregation when lockstep is activated.

Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy.

fix ray-project#9295

* Line too long.

* [Core] Replace the Plasma eventloop with boost::asio (ray-project#9431)

* Fix Java named actor bug (ray-project#9580)

* Fix setup.py bug (ray-project#9581)

Co-authored-by: Mehrdad <noreply@github.com>

* [Serve] Serialize Query object directly (ray-project#9490)

* Add dashboard dependencies to default ray installation (ray-project#9447)

* Dashboard next-version API support in backend (ray-project#9345)

* Fix log losses (ray-project#9559)

* Close log on shutdown

* Disable log buffering

Co-authored-by: Mehrdad <noreply@github.com>

* [docker] run Ubuntu 20.04 as base image (ray-project#9556)

* Add PTL to README.rst (ray-project#9594)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Skip unneeded steps on CI (ray-project#9582)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix Windows CI (ray-project#9588)

Co-authored-by: Mehrdad <noreply@github.com>

* [serve] Rename to `Controller` (ray-project#9566)

* Handle warnings in core (ray-project#9575)

* [New scheduler] Fix new scheduler bug (ray-project#9467)

* fix new scheduler bug

* add testcase for soft resource allocation

* modify RemoveNode

* Ensure unique log file names across same-node raylets. (ray-project#9561)

* fix tag key typo (ray-project#9606)

* Rename path variable due to zsh conflict (ray-project#9610)

* [doc] [minor] Make API docs easier to find. (ray-project#9604)

* Issue 9568: `rllib train` framework in config gets overridden with tf. (ray-project#9572)

* Use UTF-8 for encoding of python code for collision hashing (ray-project#9586)

Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de>
Co-authored-by: simon-mo <simon.mo@hey.com>

* Add bazel to the PATH in setup.py (ray-project#9590)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix Lint in setup.py (ray-project#9618)

Co-authored-by: Mehrdad <noreply@github.com>

* Shellcheck comments (ray-project#9595)

* [Serve] Document Metric Infrastructure (ray-project#9389)

* [CI] Do not run jenkins test on GHA (ray-project#9621)

* Support ray task type checking (ray-project#9574)

* [Metrics] Java metric API (ray-project#9377)

* [GCS] fix the fault tolerance about gcs node manager (ray-project#9380)

* Shellcheck quoting (ray-project#9596)

* Fix SC2006: Use $(...) notation instead of legacy backticked `...`.

* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that.

* Fix SC2046: Quote this to prevent word splitting.

* Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching.

* Fix SC2068: Double quote array expansions to avoid re-splitting elements.

* Fix SC2086: Double quote to prevent globbing and word splitting.

* Fix SC2102: Ranges can only match single chars (mentioned due to duplicates).

* Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"?

* Fix SC2145: Argument mixes string and array. Use * or separate argument.

* Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string).

Co-authored-by: Mehrdad <noreply@github.com>

* Fix bug in Bazel version check (ray-project#9626)

Co-authored-by: Mehrdad <noreply@github.com>

* [Java] Avoid data copy from C++ to Java for ByteBuffer type (ray-project#9033)

* Revert "Dashboard next-version API support in backend (ray-project#9345)" (ray-project#9639)

This reverts commit fca1fb1.

* [Autoscaler] Command Line Interface improvements (ray-project#9322)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [Core] GCS Actor management on by default. (ray-project#8845)

* GCS Actor management on by default.

* Fix travis config.

* Change condition.

* Remove unnecessary CI.

* [Core] Fix concurrency issues in plasma store runner (ray-project#9642)

* fix window jni unhappy compiler (ray-project#9635)

* Fix TestObjectTableResubscribe testcase bug (ray-project#9650)

* fix named actor single process mode bug (ray-project#9652)

* [core] Fix Ray service startup when logging redirection is disabled. (ray-project#9547)

* Fix TorchDeterministic (ray-project#9241)

* [RaySGD] revised existing transformer example to work with transformers>=3.0 (ray-project#9661)

Co-authored-by: Kai Fricke <kai@anyscale.com>

* [rllib] Fix torch TD error, IMPALA LR updates (ray-project#9477)

* update

* add test

* lint

* fix super call

* speed es test up

* Auto-cancel build when a new commit is pushed (ray-project#8043)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix lint in remote-watch.py (ray-project#9668)

* [Core] Remove unnecessary windows syscall in plasma store (ray-project#9602)

* Remove unused windows shims (ray-project#9583)

* Temporarily disable remote watcher (ray-project#9669)

* Drop support for Python 3.5. (ray-project#9622)

* Drop support for Python 3.5.

* Update setup.py

* [Core] WorkerInterface refactor (ray-project#9655)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* .

* .

* .

* Fixed tests

* Fixed tests

* .

* [core] Enable object reconstruction for retryable actor tasks (ray-project#9557)

* Test actor plasma reconstruction

* Allow resubmission of actor tasks

* doc

* Test for actor constructor

* Kill PID before removing node

* Kill pid before node

* fix java coreworker crash (ray-project#9674)

* use help proto-init-macro for streaming config (ray-project#9272)

* Update release information from 0.8.6. (ray-project#9124)

* [BRING BACK TO MASTER] Update release information.

* [MERGE TO MASTER] Add microbenchmark result.

* Update asan tests to the doc.

* Refinements to the Serve documentation (ray-project#9587)

Co-authored-by: Dean Wampler <dean@concurrentthought.com>

* [tune] survey (ray-project#9670)

* Fix ERROR logging not being printed to standard error (ray-project#9633)

Co-authored-by: Mehrdad <noreply@github.com>

* [Tune Docs] Logging doc fix (ray-project#9691)

* [rllib] Type annotations for model classes (ray-project#9646)

* [Serve] Allow multiple HTTP servers. (ray-project#9523)

* Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. (ray-project#9681)

* [Serve] Fix Formatting, stale docs (ray-project#9617)

* fixed simplex initialisation seeding bug (ray-project#9660)

Co-authored-by: Petros Christodoulou <petrochr@amazon.com>

* Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (ray-project#9697)

Co-authored-by: Mehrdad <noreply@github.com>

* Add Ray Serve to README.rst (ray-project#9688)

* Shellcheck rewrites (ray-project#9597)

* Fix SC2001: See if you can use ${variable//search/replace} instead.

* Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.

* Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames.

* Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true.

* Fix SC2028: echo may not expand escape sequences. Use printf.

* Fix SC2034: variable appears unused. Verify use (or export if used externally).

* Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.

* Fix SC2071: > is for string comparisons. Use -gt instead.

* Fix SC2154: variable is referenced but not assigned

* Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

* Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).

* Fix SC2236: Use -n instead of ! -z.

* Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr.

* Fix SC2086: Double quote to prevent globbing and word splitting.

Co-authored-by: Mehrdad <noreply@github.com>

* [Autoscaler] CLI Logger docs (ray-project#9690)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update rllib-algorithms.rst (ray-project#9640)

* [tune] move jenkins tests to travis (ray-project#9609)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>

* [RLlib] Implement DQN PyTorch distributional head. (ray-project#9589)

* Add placement group java api (ray-project#9611)

* add part code

* add part code

* add part code

* fix code style

* fix review comment

* fix review comment

* add part code

* add part code

* add part code

* add part code

* fix review comment

* fix review comment

* fix code style

* fix review comment

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [Stats] Improve Stats::Init & Add it to GCS server (ray-project#9563)

* [Core] Try remove all windows compat shims (ray-project#9671)

* try remove compat for arrow

* remove unistd.h

* remove socket compat

* delete arrow windows patch

* Fix a few flaky tests (ray-project#9709)

Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency

* [GCS]Open test_gcs_fault_tolerance testcase (ray-project#9677)

* enable test_gcs_fault_tolerance

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [Tests]lock vector to avoid potential flaky test (ray-project#9656)

* [tune] distributed torch wrapper (ray-project#9550)

* changes

* add-working

* checkpoint

* ccleanu

* fix

* ok

* formatting

* ok

* tests

* some-good-stuff

* fix-torch

* ddp-torch

* torch-test

* sessions

* add-small-test

* fix

* remove

* gpu-working

* update-tests

* ok

* try-test

* formgat

* ok

* ok

* [GCS] Fix actor task hang when its owner exits before local dependencies resolved (ray-project#8045)

* Only update raylet map when autoscaler configured (ray-project#9435)

* [Dashboard] New dashboard skeleton (ray-project#9099)

* Fixing multiple building issues

* Make wait_for_condition raise exception when timing out. (ray-project#9710)

* [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (ray-project#9718)

* Package and upload ray cross-platform jar (ray-project#9540)

* Revert "Package and upload ray cross-platform jar (ray-project#9540)" (ray-project#9730)

This reverts commit 8810325.

* Only build docker wheels in LINUX_WHEELS env (ray-project#9729)

* Keep build-autoscaler-images.sh alive in CI (ray-project#9720)

* [core] Removes Error when Internal Config is not set (ray-project#9700)

* [Cluster Launcher] Re Org the cluster launcher pages. (ray-project#9687)

* [RLlib] Offline Type Annotations (ray-project#9676)

* Offline Annotations

* Modifications

* Fixed circular dependencies

* Linter fix

* Python api of placement group (ray-project#9243)

* Include open-ssh-client for transparency (ray-project#9693)

* Fix remote-watch.py (ray-project#9625)

Co-authored-by: Mehrdad <noreply@github.com>

* [docker] Uses Latest Conda & Py 3.7 (ray-project#9732)

* Fix broken actor failure tests. (ray-project#9737)

* [Stats] fix stats shutdown crash if opencensus exporter not initialized (ray-project#9727)

* Fix package and upload ray jar (ray-project#9742)

* Introduce file_mounts_sync_continuously cluster option (ray-project#9544)

* Separate out file_mounts contents hashing into its own separate hash

Add an option to continuously sync file_mounts from head node to worker nodes:
monitor.py will re-sync file mounts whenever contents change but will only run setup_commands if the config also changes

* add test and default value for file_mounts_sync_continuously

* format code

* Update comments

* Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick

Fixed so setup commands run when ray up is run and file_mounts content changes

* Refactor so that runtime_hash retains previous behavior

runtime_hash is almost identical as before this PR. It is used to determine if setup_commands need to run
file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur.

Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization

* fix issue with hashing a hash

* fix bug where trying to set contents hash when it wasn't generated

* Fix lint error

Fix bug in command_runner where check_output was no longer returning the output of the command

* clear out provider between tests to get rid of flakiness

* reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call
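
The split between runtime_hash and file_mounts_contents_hash described for ray-project#9544 above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the autoscaler's actual code: the function names `contents_hash`, `runtime_hash`, and `plan_update`, the `{path: bytes}` input, and the returned action strings are all hypothetical.

```python
import hashlib
import json


def contents_hash(files):
    """Hash only the synced file contents (hypothetical {path: bytes} mapping)."""
    digest = hashlib.sha1()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(files[path])
    return digest.hexdigest()


def runtime_hash(config):
    """Hash the config fields that decide whether setup commands must rerun."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def plan_update(old, new):
    """Pick the action for one update tick from (runtime, contents) hash pairs."""
    old_runtime, old_contents = old
    new_runtime, new_contents = new
    if new_runtime != old_runtime:
        # Config changed: files are re-synced and setup commands rerun.
        return "sync_files_and_run_setup"
    if new_contents != old_contents:
        # Only file contents changed: re-sync, but skip setup commands.
        return "sync_files_only"
    return "nothing"
```

Keeping the two hashes separate is what lets a content-only change trigger a file sync without rerunning setup commands.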

* [dist] swap mac/linux wheel build order (ray-project#9746)

* [RLlib] Enhance reward clipping test; add action_clipping tests. (ray-project#9684)

* [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (ray-project#9680)

* [Metrics]Ray java worker metric registry (ray-project#9636)

* ray worker metrics gauge init

* ray java metric mapping

* add jni source files for gauge and tagkey

* mapping all metric classes to stats object

* check non-null for tags and name

* lint

* add symbol for native metric JNI

* extern c for symbol

* add tests for all metrics

* Update Metric.java

use metricNativePointer instead.

* unify metric native stuff to one class

* fix jni file

* add comments for metric transform function in jni utils

* move metric function to native metric file

* remove unused disconnect jni

* Add a metric registry for java metrics

* Restore install-bazel.sh

* Add some comments for metric registry

* Fix thread safe problem of metrics

* Fix metric tests and remove sleep code from tests

* Fix comments of metrics

Co-authored-by: lingxuan.zlx <skyzlxuan@gmail.com>

* fix windows compile bug (ray-project#9741)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* Run _with_interactive in Docker (ray-project#9747)

* [New scheduler] First unit test for task manager (ray-project#9696)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* bad git >:-(

* small clean up

* CR

* .

* .

* One more fixture

* One more fixture

* .

* .

* bazel-format

* .

* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (ray-project#9607)

* [Release] Fix release tests (ray-project#9733)

* Register function race (ray-project#9346)

* Revert "[dist] swap mac/linux wheel build order (ray-project#9746)" and "Fix package and upload ray jar (ray-project#9742)" (ray-project#9758)

* Revert "[dist] swap mac/linux wheel build order (ray-project#9746)"

This reverts commit a934056.

* Revert "Fix package and upload ray jar (ray-project#9742)"

This reverts commit c290c30.

* Fix some Windows CI issues (ray-project#9708)

Co-authored-by: Mehrdad <noreply@github.com>

* Pin pytest version (ray-project#9767)

* [Java] Use test groups to filter tests of different run modes (ray-project#9703)

* [Java] Fix MetricTest.java due to incomplete changes from ray-project#9703 (ray-project#9770)

* Fix leased worker leak bug when lease worker requests are still waiting to be scheduled as GCS restarts (ray-project#9719)

* [Stats] enable core worker stats (ray-project#9355)

* [GCS]Use a separate thread in node failure detector to handle heartbeat (ray-project#9416)

* use a sole thread to handle heartbeat

* separate signal thread

* use work to avoid exiting when task is underway

* protect shared data structure to avoid deadlock

* add comments

* decrease io service num

* minor changes

* fix test

* per stephanie's comments

* use single io service instead of 1-size io service pool

* typo

* [GCS Actor Management] Fix flaky test_dead_actors. (ray-project#9715)

* Fix.

* Add logs.

* Add a unit test.

* [TUNE] Tune Docs re-organization (ray-project#9600)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [RLlib] Trajectory View API (preparatory cleanup and enhancements). (ray-project#9678)

* [Core] Socket creation race condition bug fixes (ray-project#9764)

* fix issues

* hot fixes

* test

* test

* Always info log

* Fixed stderr logging (9765)

* [Core] Custom socket name (ray-project#9766)

* fix issues

* hot fixes

* test

* test

* socket name change only

* Fix src/ray/core_worker/common.h deleted constructor (ray-project#9785)

Co-authored-by: Mehrdad <noreply@github.com>

* [Stats] Fix harvester threads + Fix flaky stats shutdown. (ray-project#9745)

* More fixes

* Applying latest changes in travis.yml

* Fixing fixture data exclusions

* Disable some java tests

* Fix some CI errors

* Update hash

* Fixing more build issues

* Fixing more build issues

* Fix pipeline cache path

* More fixes

* Fix bazel test command

* Fix bazel test

* Fix general info steps

* Custom env var for docker build

* Trying a different way to install bazel

* Bazel fix

* Updating hash

Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com>
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alisa <wuminyan0607@gmail.com>
Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@vip.qq.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Stefan Schneider <stefan.schneider@upb.de>
Co-authored-by: Patrick Ames <pdames@amazon.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Tao Wang <dooku.wt@antfin.com>
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Henk Tillman <henktillman@gmail.com>
Co-authored-by: Tanay Wakhare <twakhare@gmail.com>
Co-authored-by: Nicolaus93 <nicolo.campolongo@unimi.it>
Co-authored-by: Vasily Litvinov <45396231+vnlitvinov@users.noreply.github.com>
Co-authored-by: krfricke <krfricke@users.noreply.github.com>
Co-authored-by: Max Fitton <maxfitton@gmail.com>
Co-authored-by: Max Fitton <max@semprehealth.com>
Co-authored-by: kisuke95 <2522134184@qq.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
Co-authored-by: Michael Mui <68102089+heyitsmui@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: Michael Luo <michael.luo123456789@gmail.com>
Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu>
Co-authored-by: Tom <veniat.tom@gmail.com>
Co-authored-by: jerrylee.io <JerryDeKo@gmail.com>
Co-authored-by: Raphael Avalos <raphael@avalos.fr>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Arne Sachtler <arne.sachtler@gmail.com>
Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: ZhuSenlin <wumuzi520@126.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
Co-authored-by: Dean Wampler <dean@polyglotprogramming.com>
Co-authored-by: Dean Wampler <dean@concurrentthought.com>
Co-authored-by: Bill Chambers <bill@anyscale.com>
Co-authored-by: Petros Christodoulou <p.christodoulou2@gmail.com>
Co-authored-by: Petros Christodoulou <petrochr@amazon.com>
Co-authored-by: Justin Terry <justinkterry@gmail.com>
Co-authored-by: Tao Wang <wangtaothetonic@163.com>
Co-authored-by: fyrestone <fyrestone@outlook.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: bermaker <495571751@qq.com>
@sven1977 sven1977 deleted the issue_9631_tf14_does_not_have_list_physical_devices branch August 21, 2020 07:42
Edilmo added a commit to BonsaiAI/ray that referenced this pull request May 14, 2021
* Set up CI with Azure Pipelines

Specifically, we are setting up a
Travis-like ADO pipeline following
what is already present in the .travis.yml
file in the root of the repo.

* Separating travis like pipeline from main pipeline

* Adding Jenkins jobs equivalent

* Making some improvements

* Adding validation of the upstream CI

* Disabling Tune and large memory tests

* Changing threshold for simple reservoir sampling test

* Addressing comments

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with more travis updates

* Updating CI with new cpp worker tests

* Setting code owners

* Fixing the version number generation

* Making main pipeline also our release pipeline

* Updating Azure Pipelines with travis updates

* Fixing wheels test

* Fixing codeowners

* Updating Azure Pipelines with travis updates

* Bumping up MACOSX_DEPLOYMENT_TARGET

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Disabling Serve tests

* Making explicit which branches GitHubActions workflows should watch

* Disabling Ray serve tests

* Installing numpy explicitly

* consolidating Ray test steps in one yml

* Syncing with upstream master 2020-07-30 (#21)

* [Core] Enhance common client connection (#9367)

* enhance client connection

* add write buffer async

* read message

* add test

* Bazel move more shell to native rules (#9314)

Co-authored-by: Mehrdad <noreply@github.com>

* [tune] Fix github readme (#9365)

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* Combine different severities into the same log files (#9230)

* Combine different severities into the same log files

Co-authored-by: Mehrdad <noreply@github.com>

* [core] Pass owner address from the workers to the raylet (#9299)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.

* Fix free

* fix tests

* Fix tests

* build

* build

* fix

* Change assertion to warning to fix java

* [Core] Add placement group scheduler and some api in resource scheduler (#9039)

* Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).

* change the bundle id and delete unit count in bundle

change vector<bundle_spec> to vector<shared_ptr<bundle_spec>>

Add placement group scheduler and some api of resource scheduler.
Merge fix cv hang in multithread variables race (#8984).

change the bundle id and delete unit count in bundle

remove CheckIfSchedulable()

add comments and fix the bug in resource

* fix placement group schedule

* add placement group scheduler and change some api in resource scheduler

* fix by the comments

* fix conflict

* fix lint

* fix lint

* fix bug in merge

* fix lint

Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>

* [Core] New scheduler fixes (#9186)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* Fixed scheduling tests

* .

* .

* [Core] put small objects in memory store (#8972)

* remove the put in memory store

* put small objects directly in memory store

* cast data type

* fix another place that uses Put to spill to plasma store

* fix multiple tests related to memory limits

* partially fix test_metrics

* remove not functioning codes

* fix core_worker_test

* refactor put to plasma codes

* add a flag for the new feature

* add flag to more places

* do a warmup round for the plasma store

* lint

* lint again

* fix warmup store

* Update _raylet.pyx

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* [autoscaler] Move command runners into separate file and clean up interface. (#9340)

* cleanup

* wip

* fix imports

* fix lint

* [docs][rllib] Recommended workflow for training, saving, and testing (#9319)

* [autoscaler] Allow users to disable the cluster config cache (#8117)

* [autoscaler] Remove autoscaler config cache.

* [autoscaler] Add flag allowing users to explicitly disable the config cache.

* Update hiredis and remove Windows patches (#9289)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix flaky test_dynres.py (#9310)

* Fix gcs_table_storage testcase bug (#9393)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [HOTFIX] Fix compile direct_actor_transport_test on mac (#9403)

* Change Python's `ObjectID` to `ObjectRef` (#9353)

* [Java] Improve JNI performance when submitting and executing tasks (#9032)

* Remove the RAY_CHECK in Worker::Port() (#9348)

* [RLlib] Issue #9366 (DQN w/o dueling produces invalid actions). (#9386)

* Fix macos compilation bug (#9391)

* Fix.

* [Core] Plasma RAII support (#9370)

* [Serve] Merge router with HTTPProxy (#9225)

* Pass run args to DockerCommandRunner (#9411)

* Fix copy to workspace (#9400)

* [RLlib] Tf2.x native. (#8752)

* Update conda and ray wheel on GCP images (#9388)

* [Core] Simplify Raylet Client (#9420)

* Masking error. With t*valid_mask, we get the error np.inf*0 = np.nan (#9407)
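
The masking pitfall named in this commit title can be reproduced without any tensor library. Under IEEE 754 arithmetic, multiplying infinity by zero yields NaN, so a multiplicative mask does not zero out infinite entries. The sketch below is a pure-Python illustration only; the function names are hypothetical and it is not the code changed in the PR.

```python
import math


def masked_sum_multiply(values, valid_mask):
    # Masks by multiplication: an infinite value times a 0 mask entry is NaN,
    # and that NaN then poisons the whole sum.
    return sum(v * m for v, m in zip(values, valid_mask))


def masked_sum_select(values, valid_mask):
    # Masks by selection: invalid entries are replaced before any arithmetic,
    # so infinities in masked-out positions never enter the sum.
    return sum(v if m else 0.0 for v, m in zip(values, valid_mask))


values = [1.0, math.inf, 2.0]
valid_mask = [1, 0, 1]

broken = masked_sum_multiply(values, valid_mask)
fixed = masked_sum_select(values, valid_mask)
```

Selecting values (for example with a `where`-style operation in a tensor library) rather than multiplying by the mask is one common way to avoid the problem.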

* [RLLib] WindowStat bug fix (#9213)

* WindowStat error catching, which processes NaNs properly instead of erroring. This ought to resolve issue #7910.
https://github.com/ray-project/ray/issues/7910

* [tune] handling nan values (#9381)

* TRAVIS_PULL_REQUEST is false for non-PRs, not empty (#9439)

Co-authored-by: Mehrdad <noreply@github.com>

* [GCS] Fix the bug about raylet receiving duplicate actor creation tasks (#9422)

* [Tune] Trainable documentation fix (#9448)

* Allow --lru-evict to be passed into `ray start` (#8959)

* GCP authentication using oauth tokens (#9279)

* Bazel selects compiler flags based on compiler (#9313)



Co-authored-by: Mehrdad <noreply@github.com>

* [Core] Build raylet client as an independent component (#9434)

* [tune] sklearn comment out (#9454)

* Add ability to specify SOCKS proxy for SSH connections (#8833)

* [docs] Render ActorPool documentation, etc (#9433)

* [tune] Put examples under proper version control (#9427)

Co-authored-by: krfricke <krfricke@users.noreply.github.com>

* Fix test-multi-node (#9453)

* Machine View Sorting / Grouping (#9214)

* Convert NodeInfo.tsx to a functional component

* Update NodeRowGroup to be a functional component

* lint

* Convert TotalRow to functional component.

* lint

* move node info over to using the sortable table head component. spacing is still a little wonky.

* Factor a NodeWorkerRow class out of NodeRowGroup that will be usable when grouping / ungrouping

* Compilation checkpoint, I factored the worker filtering logic out of node info into the reducer

* Add sort accessors for CPU

* Add sort accessors for Disk

* Add sort accessors for RAM

* add a table sort util for function based accessors (rather than flat attribute-based accessor)

* wip refactor node info features

* wip

* Rendering Checkpoint. I've refactored the features and how they are called to add sorting support. Also reworks the way error counts and log counts are passed to the front-end to remove some ugly logic

* wip

* wip

* wip

* Finish adding sorting and grouping of machine view

* lint

* fix bug in filtration of logs and errors by worker from recent refactor.

* Add export of Cluster Disk feature

* fix some merge issues

Co-authored-by: Max Fitton <max@semprehealth.com>

* [RLlib] Layout of Trajectory View API (new class: Trajectory; not used yet). (#9269)

* [RLlib] Issue 9402 MARWIL producing nan rewards. (#9429)

* Fix gcs_pubsub_test bug (#9438)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* change error code name of boost timer (#9417)

* [tune] PyTorch CIFAR10 example (#9338)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>

* Remove legacy C++ code (#9459)

* Fix ObjectRef and ActorHandle serialization (#9462)

* [Stats] metrics agent exporter (#9361)

* [Core] Support GCS server port assignment. (#8962)

* Add scripts symlink back (#9219) (#9475)

(cherry picked from commit 77933c922d5136c5c2e2f0ac2edb4da67111d690)

Co-authored-by: Simon Mo <xmo@berkeley.edu>

* [tune] Issue 8821: ExperimentAnalysis doesn't expand user (#9461)

* [docker] Include base-deps image in rayproject Docker Hub (#9458)

* [Core] remove create_and_seal and create_and_seal_batch (#9457)

* Speedups for GitHub Actions (#9343)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix flaky test_object_manager.py (#9472)

* [Java] fix redis-server binary path (#9398)

* [core] Handle out-of-order actor table notifications (#9449)

* Drop stale actor table notifications

* build

* Add num_restarts to disconnect handler

* Unit test and increment num_restarts on ALIVE, not RESTARTING

* Wait for pid to exit

* Fix name clash on Windows (#9412)

Co-authored-by: Mehrdad <noreply@github.com>

* Add job configs to gcs (#9374)

* Make pip install verbose (#9496)

Co-authored-by: Mehrdad <noreply@github.com>

* Make more tests compatible with Windows (#9303)

* [tune] extend PTL template (GPU, typing fixes, tensorboard) (#9451)

Co-authored-by: Kai Fricke <kai@anyscale.com>

* [core] Replace task resubmission in raylet with ownership protocol (#9394)

* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Add ObjectRefs to task dependency manager, pass from task spec args

* tmp

* tmp

* Fix

* Add ownership info for task arguments

* Convert WaitForDirectActorCallArgs

* lint

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e4003b56e5c315ceae53a2e5f5ad7874c1.

* Fix free

* Regression tests - shorten timeouts in reconstruction unit tests

* Remove timeout for non-actor tasks

* Modify tests using ray.internal.free

* Clean up future resolution code

* Raylet polls the owner

* todo

* comment

* Update src/ray/core_worker/core_worker.cc

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* Drop stale actor table notifications

* Fix bug where actor restart hangs

* Revert buggy code for duplicate tasks

* build

* Fix errors for lru_evict and internal.free

* Revert "Drop stale actor table notifications"

This reverts commit 193c5d20e5577befd43f166e16c972e2f9247c91.

* Revert "build"

This reverts commit 5644edbac906ff6ef98feb40b6f62c9e63698c29.

* Fix free test

* Fixes for freed objects

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* release gil in global state accessor (#9357)

* [Java] Named java actor (#9037)

* Fix clang-cl build (#9494)

Co-authored-by: Mehrdad <noreply@github.com>

* [GCS Actor Management] Gcs actor management broken detached actor (#9473)

* [RLlib] Issue #9437 (PyTorch converts to CPU tensor, even if on GPU). (#9497)

* Get rid of build shell scripts and move them to Python (#6082)

* Fix broken test_raylet_info_endpoint (#9511)

* Fix. (#9464)

* [Autoscaler] Making bootstrap config part of the node provider interface (#9443)

* supporting custom bootstrap config for external node providers

* bootstrap config

* renamed config to cluster_config

* lint

* remove 2 args from importer

* complete move of bootstrap to node_provider

* renamed provider_cls

* move imports outside functions

* lint

* Update python/ray/autoscaler/node_provider.py

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* final fixes

* keeping lines to reduce diff

* lint

* lambda config

* filling in -> adding for lint

Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Fix flaky test_actor_failures::test_actor_restart (#9509)

* Fix flaky test

* os exit

* [rllib] MAML Transform (#9463)

* MAML Transform

* Moved Inner Adapt to Method in Execution Plan

* Cleanup Plasma Store (hash utilities) (#9524)

* [Serve] Improve buffering for simple cases (#9485)

* [Serve] Use pickle instead of cloudpickle (#9479)

* Fix pip and Bazel interaction messing up CI (#9506)

Co-authored-by: Mehrdad <noreply@github.com>

* [Core] Fix Java detached error (#9526)

* fix java createActor NPE bug (#9532)

* [RLlib] Issue 9218: PyTorch Policy places Model on GPU even with num_gpus=0 (#9516)

* [Stats] Fix metric exporter test (#9376)

* Hotfix Lint for Serve (#9535)

* Windows cleanup (#9508)

* Remove unneeded code for Windows

* Get rid of usleep()

* Make platform_shims includes non-transitive

Co-authored-by: Mehrdad <noreply@github.com>

* [RLlib] Issue 8384: QMIX doesn't learn anything. (#9527)

* Add placement group manager and some code in core_worker (#9120)

Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>

* [core] Add flag to enable object reconstruction during ray start (#9488)

* Add flag

* doc

* Fix tests

* Pipelining task submission to workers (#9363)

* first step of pipelining

* pipelining tests & default configs
- added pipelining unit tests in direct_task_transport_test.cc
- added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker
- consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_

* post-review revisions

* linting, following naming/style convention

* linting

* [New scheduler] Queueing refactor (#9491)

* .

* test_args passes

* .

* test_basic.py::test_many_fractional_resources causes ray to hang

* test_basic.py::test_many_fractional_resources causes ray to hang

* .

* .

* useful

* test_many_fractional_resources fails instead of hanging now :)

* Passes test_fractional_resources

* .

* .

* Some cleanup

* git is hard

* cleanup

* .

* .

* .

* .

* .

* .

* .

* cleanup

* address reviews

* address reviews

* more refactor

* :)

* travis pls

* .

* travis pls

* .

* [Serve] Add internal instruction for running benchmarks (#9531)

* MADDPG learning confirmation test. (#9538)

* Fix Bazel in Docker (#9530)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix bug that `test_multi_node.py::test_multi_driver_logging` hangs when GCS actor management is turned on (#9539)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [tune] Unflattened lookup for ProgressReporter (#9525)

Co-authored-by: Kai Fricke <kai@anyscale.com>

* Add plasma store benchmark for small objects (#9549)

* [Tune] Copy default_columns in new ProgressReporter instances (#9537)

* quickfix (#9552)

* [tune] pin tune-sklearn (#9498)

* [cli] ray memory: added redis_password (#9492)

* [GCS]Fix lease worker leak bug when gcs server restarts (#9315)

* add part code

* fix compile bug

* fix review comments

* fix review comments

* fix review comments

* fix review comments

* fix review comment

* fix ut bug

* fix lint error

* fix review comment

* fix review comments

* add testcase

* add testcase

* fix bug

* fix review comments

* fix review comment

* fix review comment

* refine comments

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>

* [tune] fix pbt checkpoint_freq (#9517)

* Only delete old checkpoint if it is not the same as the new one

* Return early if old checkpoint value coincides with new checkpoint value

Co-authored-by: Kai Fricke <kai@anyscale.com>

* [Core] Remove socket pair exchange in Plasma Store (#9565)

* try use boost::asio for notification processing

* [Metric] new cython interface for python worker metric (#9469)

* Bazel fixes (#9519)

* GCS client add fetch operation before subscribe (#9564)

* [RLlib] Fix combination of lockstep and multiple agents controlled by the same policy. (#9521)

* Change aggregation when lockstep is activated.

Modification of MultiAgentBatch.timeslices to support the combination of lockstep and multiple agents controlled by the same policy.

fix ray-project/ray#9295

* Line too long.

* [Core] Replace the Plasma eventloop with boost::asio (#9431)

* Fix Java named actor bug (#9580)

* Fix setup.py bug (#9581)

Co-authored-by: Mehrdad <noreply@github.com>

* [Serve] Serialize Query object directly (#9490)

* Add dashboard dependencies to default ray installation (#9447)

* Dashboard next-version API support in backend (#9345)

* Fix log losses (#9559)

* Close log on shutdown

* Disable log buffering

Co-authored-by: Mehrdad <noreply@github.com>

* [docker] run Ubuntu 20.04 as base image (#9556)

* Add PTL to README.rst (#9594)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Skip unneeded steps on CI (#9582)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix Windows CI (#9588)

Co-authored-by: Mehrdad <noreply@github.com>

* [serve] Rename to `Controller` (#9566)

* Handle warnings in core (#9575)

* [New scheduler] Fix new scheduler bug (#9467)

* fix new scheduler bug

* add testcase for soft resource allocation

* modify RemoveNode

* Ensure unique log file names across same-node raylets. (#9561)

* fix tag key typo (#9606)

* Rename path variable due to zsh conflict (#9610)

* [doc] [minor] Make API docs easier to find. (#9604)

* Issue 9568: `rllib train` framework in config gets overridden with tf. (#9572)

* Use UTF-8 for encoding of python code for collision hashing (#9586)

Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de>
Co-authored-by: simon-mo <simon.mo@hey.com>

* Add bazel to the PATH in setup.py (#9590)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix Lint in setup.py (#9618)

Co-authored-by: Mehrdad <noreply@github.com>

* Shellcheck comments (#9595)

* [Serve] Document Metric Infrastructure (#9389)

* [CI] Do not run jenkins test on GHA (#9621)

* Support ray task type checking (#9574)

* [Metrics] Java metric API (#9377)

* [GCS] fix the fault tolerance about gcs node manager (#9380)

* Shellcheck quoting (#9596)

* Fix SC2006: Use $(...) notation instead of legacy backticked `...`.

* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that.

* Fix SC2046: Quote this to prevent word splitting.

* Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching.

* Fix SC2068: Double quote array expansions to avoid re-splitting elements.

* Fix SC2086: Double quote to prevent globbing and word splitting.

* Fix SC2102: Ranges can only match single chars (mentioned due to duplicates).

* Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"?

* Fix SC2145: Argument mixes string and array. Use * or separate argument.

* Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string).

Co-authored-by: Mehrdad <noreply@github.com>

* Fix bug in Bazel version check (#9626)

Co-authored-by: Mehrdad <noreply@github.com>

* [Java] Avoid data copy from C++ to Java for ByteBuffer type (#9033)

* Revert "Dashboard next-version API support in backend (#9345)" (#9639)

This reverts commit fca1fb18f366ebff6016978cb6440dd1ed8637fe.

* [Autoscaler] Command Line Interface improvements (#9322)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [Core] GCS Actor management on by default. (#8845)

* GCS Actor management on by default.

* Fix travis config.

* Change condition.

* Remove unnecessary CI.

* [Core] Fix concurrency issues in plasma store runner (#9642)

* fix window jni unhappy compiler (#9635)

* Fix TestObjectTableResubscribe testcase bug (#9650)

* fix named actor single process mode bug (#9652)

* [core] Fix Ray service startup when logging redirection is disabled. (#9547)

* Fix TorchDeterministic (#9241)

* [RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661)

Co-authored-by: Kai Fricke <kai@anyscale.com>

* [rllib] Fix torch TD error, IMPALA LR updates (#9477)

* update

* add test

* lint

* fix super call

* speed es test up

* Auto-cancel build when a new commit is pushed (#8043)

Co-authored-by: Mehrdad <noreply@github.com>

* Fix lint in remote-watch.py (#9668)

* [Core] Remove unnecessary windows syscall in plasma store (#9602)

* Remove unused windows shims (#9583)

* Temporarily disable remote watcher (#9669)

* Drop support for Python 3.5. (#9622)

* Drop support for Python 3.5.

* Update setup.py

* [Core] WorkerInterface refactor (#9655)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* .

* .

* .

* Fixed tests

* Fixed tests

* .

* [core] Enable object reconstruction for retryable actor tasks (#9557)

* Test actor plasma reconstruction

* Allow resubmission of actor tasks

* doc

* Test for actor constructor

* Kill PID before removing node

* Kill pid before node

* fix java coreworker crash (#9674)

* use help proto-init-macro for streaming config (#9272)

* Update release information from 0.8.6. (#9124)

* [BRING BACK TO MASTER] Update release information.

* [MERGE TO MASTER] Add microbenchmark result.

* Update asan tests to the doc.

* Refinements to the Serve documentation (#9587)

Co-authored-by: Dean Wampler <dean@concurrentthought.com>

* [tune] survey (#9670)

* Fix ERROR logging not being printed to standard error (#9633)

Co-authored-by: Mehrdad <noreply@github.com>

* [Tune Docs] Logging doc fix (#9691)

* [rllib] Type annotations for model classes (#9646)

* [Serve] Allow multiple HTTP servers. (#9523)

* Issue 9631: Tf1.14 does not have tf.config.list_physical_devices. (#9681)
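
The issue behind this commit is that `tf.config.list_physical_devices` is missing in TF 1.14, while `tf.config.experimental.list_physical_devices` is available. A minimal sketch of a version-tolerant lookup follows. The helper name and the stand-in objects are hypothetical, used so the example runs without TensorFlow installed; it is not necessarily how the PR implements the fix.

```python
from types import SimpleNamespace


def list_physical_devices(tf_config, device_type=None):
    """Call list_physical_devices on a `tf.config`-like object.

    Prefers the stable location and falls back to the `experimental`
    namespace when the stable attribute does not exist.
    """
    fn = getattr(tf_config, "list_physical_devices", None)
    if fn is None:
        fn = tf_config.experimental.list_physical_devices
    return fn(device_type)


# Stand-ins for `tf.config` (assumptions for illustration only).
legacy_config = SimpleNamespace(
    experimental=SimpleNamespace(
        list_physical_devices=lambda device_type=None: ["experimental", device_type]
    )
)
modern_config = SimpleNamespace(
    list_physical_devices=lambda device_type=None: ["stable", device_type],
    experimental=legacy_config.experimental,
)

legacy_result = list_physical_devices(legacy_config, "GPU")
modern_result = list_physical_devices(modern_config, "GPU")
```

With a real installation, one would pass `tf.config` as the first argument.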

* [Serve] Fix Formatting, stale docs (#9617)

* fixed simplex initialisation seeding bug (#9660)

Co-authored-by: Petros Christodoulou <petrochr@amazon.com>

* Switch from GitHub checkout@v2 to checkout@v1 due to bugs in checkout (#9697)

Co-authored-by: Mehrdad <noreply@github.com>

* Add Ray Serve to README.rst (#9688)

* Shellcheck rewrites (#9597)

* Fix SC2001: See if you can use ${variable//search/replace} instead.

* Fix SC2010: Don't use ls | grep. Use a glob or a for loop with a condition to allow non-alphanumeric filenames.

* Fix SC2012: Use find instead of ls to better handle non-alphanumeric filenames.

* Fix SC2015: Note that A && B || C is not if-then-else. C may run when A is true.

* Fix SC2028: echo may not expand escape sequences. Use printf.

* Fix SC2034: variable appears unused. Verify use (or export if used externally).

* Fix SC2035: Use ./*glob* or -- *glob* so names with dashes won't become options.

* Fix SC2071: > is for string comparisons. Use -gt instead.

* Fix SC2154: variable is referenced but not assigned

* Fix SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

* Fix SC2188: This redirection doesn't have a command. Move to its command (or use 'true' as no-op).

* Fix SC2236: Use -n instead of ! -z.

* Fix SC2242: Can only exit with status 0-255. Other data should be written to stdout/stderr.

* Fix SC2086: Double quote to prevent globbing and word splitting.

Co-authored-by: Mehrdad <noreply@github.com>

* [Autoscaler] CLI Logger docs (#9690)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update rllib-algorithms.rst (#9640)

* [tune] move jenkins tests to travis (#9609)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>

* [RLlib] Implement DQN PyTorch distributional head. (#9589)

* Add placement group java api (#9611)

* add part code

* add part code

* add part code

* fix code style

* fix review comment

* fix review comment

* add part code

* add part code

* add part code

* add part code

* fix review comment

* fix review comment

* fix code style

* fix review comment

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [Stats] Improve Stats::Init & Add it to GCS server (#9563)

* [Core] Try remove all windows compat shims (#9671)

* try remove compat for arrow

* remove unistd.h

* remove socket compat

* delete arrow windows patch

* Fix a few flaky tests (#9709)

Fix test_custom_resources, Remove test_pandas_parquet_serialization, Better error message for test_output.py, Potentially fix test_dynres::test_dynamic_res_creation_scheduler_consistency

* [GCS]Open test_gcs_fault_tolerance testcase (#9677)

* enable test_gcs_fault_tolerance

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* [Tests]lock vector to avoid potential flaky test (#9656)

* [tune] distributed torch wrapper (#9550)

* changes

* add-working

* checkpoint

* cleanup

* fix

* ok

* formatting

* ok

* tests

* some-good-stuff

* fix-torch

* ddp-torch

* torch-test

* sessions

* add-small-test

* fix

* remove

* gpu-working

* update-tests

* ok

* try-test

* format

* ok

* ok

* [GCS] Fix actor task hang when its owner exits before local dependencies resolved (#8045)

* Only update raylet map when autoscaler configured (#9435)

* [Dashboard] New dashboard skeleton (#9099)

* Fixing multiple building issues

* Make wait_for_condition raise exception when timing out. (#9710)
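
The behaviour described in this commit title, a polling helper that raises instead of silently returning on timeout, can be sketched as below. The signature, default values, and exception type are assumptions for illustration and may differ from Ray's test utility.

```python
import time


def wait_for_condition(condition, timeout=10.0, retry_interval=0.1):
    """Poll `condition` until it returns True, or raise once `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            # Raising makes a test fail loudly at the point of the timeout
            # instead of continuing with an unmet precondition.
            raise RuntimeError("The condition wasn't met before the timeout expired.")
        time.sleep(retry_interval)
```

Raising here means callers no longer need to check a boolean return value after every wait.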

* [GCS]GCS client support multi-thread subscribe&resubscribe&unsubscribe (#9718)

* Package and upload ray cross-platform jar (#9540)

* Revert "Package and upload ray cross-platform jar (#9540)" (#9730)

This reverts commit 881032593d3c1b9360ea641c24d50a022677a25e.

* Only build docker wheels in LINUX_WHEELS env (#9729)

* Keep build-autoscaler-images.sh alive in CI (#9720)

* [core] Removes Error when Internal Config is not set (#9700)

* [Cluster Launcher] Re Org the cluster launcher pages. (#9687)

* [RLlib] Offline Type Annotations (#9676)

* Offline Annotations

* Modifications

* Fixed circular dependencies

* Linter fix

* Python api of placement group (#9243)

* Include open-ssh-client for transparency (#9693)

* Fix remote-watch.py (#9625)

Co-authored-by: Mehrdad <noreply@github.com>

* [docker] Uses Latest Conda & Py 3.7 (#9732)

* Fix broken actor failure tests. (#9737)

* [Stats] fix stats shutdown crash if opencensus exporter not initialized (#9727)

* Fix package and upload ray jar (#9742)

* Introduce file_mounts_sync_continuously cluster option (#9544)

* Separate out file_mounts contents hashing into its own separate hash

Add an option to continuously sync file_mounts from head node to worker nodes:
monitor.py will re-sync file mounts whenever contents change but will only run setup_commands if the config also changes

* add test and default value for file_mounts_sync_continuously

* format code

* Update comments

* Add param to skip setup commands when only file_mounts content changed during monitor.py's update tick

Fixed so setup commands run when ray up is run and file_mounts content changes

* Refactor so that runtime_hash retains previous behavior

runtime_hash is almost identical to what it was before this PR. It is used to determine if setup_commands need to run.
file_mounts_contents_hash is an additional hash of the file_mounts content that is used to detect when only file syncing has to occur.

Note: runtime_hash value will have changed from before the PR because we hash the hash of the contents of the file_mounts as a performance optimization

* fix issue with hashing a hash

* fix bug where trying to set contents hash when it wasn't generated

* Fix lint error

Fix bug in command_runner where check_output was no longer returning the output of the command

* clear out provider between tests to get rid of flakiness

* reduce chance of race condition from node_launcher launching a node in the middle of an autoscaler.update call
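
The two-hash scheme described in the commit above (a runtime hash that decides whether setup commands run, plus a separate contents hash that detects when only a file sync is needed) might look roughly like the sketch below. It is a minimal illustration under stated assumptions: the function names, the config keys, and the choice to hash only the config into the runtime hash are hypothetical and are not taken from the autoscaler code.

```python
import hashlib
import json


def compute_hashes(config, file_contents):
    """Return (runtime_hash, file_mounts_contents_hash).

    config: dict with hypothetical keys such as "setup_commands" and
        "file_mounts" (the mount mapping, not the file bytes).
    file_contents: dict mapping each mounted file path to its bytes.
    """
    # Hash of the configuration only (assumption for this sketch).
    runtime_hash = hashlib.sha1(
        json.dumps(config, sort_keys=True).encode("utf-8")
    ).hexdigest()

    # Separate hash over the mounted files' contents, in a stable order.
    contents_hasher = hashlib.sha1()
    for path in sorted(file_contents):
        contents_hasher.update(path.encode("utf-8"))
        contents_hasher.update(file_contents[path])
    return runtime_hash, contents_hasher.hexdigest()


def plan_update(old_hashes, new_hashes):
    """Decide what an update tick has to do, given old and new hash pairs."""
    old_runtime, old_contents = old_hashes
    new_runtime, new_contents = new_hashes
    if new_runtime != old_runtime:
        return "sync_and_run_setup"  # config changed: full setup
    if new_contents != old_contents:
        return "sync_files_only"  # only mounted files changed
    return "noop"
```

Keeping the two hashes separate is what lets a content-only change trigger a re-sync without paying for the setup commands again.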

* [dist] swap mac/linux wheel build order (#9746)

* [RLlib] Enhance reward clipping test; add action_clipping tests. (#9684)

* [RLlib] Issue 9667 DDPG Torch bugs and enhancements. (#9680)

* [Metrics]Ray java worker metric registry (#9636)

* ray worker metrics gauge init

* ray java metric mapping

* add jni source files for gauge and tagkey

* mapping all metric classes to stats object

* check non-null for tags and name

* lint

* add symbol for native metric JNI

* extern c for symbol

* add tests for all metrics

* Update Metric.java

use metricNativePointer instead.

* unify metric native stuff to one class

* fix jni file

* add comments for metric transform function in jni utils

* move metric function to native metric file

* remove unused disconnect jni

* Add a metric registry for java metrics

* Restore install-bazel.sh

* Add some comments for metric registry

* Fix thread safe problem of metrics

* Fix metric tests and remove sleep code from tests

* Fix comments of metrics

Co-authored-by: lingxuan.zlx <skyzlxuan@gmail.com>

* fix windows compile bug (#9741)

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>

* Run _with_interactive in Docker (#9747)

* [New scheduler] First unit test for task manager (#9696)

* .

* .

* refactor WorkerInterface

* .

* Basic unit test structure complete?

* .

* bad git >:-(

* small clean up

* CR

* .

* .

* One more fixture

* One more fixture

* .

* .

* bazel-format

* .

* [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607)

* [Release] Fix release tests (#9733)

* Register function race (#9346)

* Revert "[dist] swap mac/linux wheel build order (#9746)" and "Fix package and upload ray jar (#9742)" (#9758)

* Revert "[dist] swap mac/linux wheel build order (#9746)"

This reverts commit a9340565ff46626b18fd36f22a37d0380ae18d85.

* Revert "Fix package and upload ray jar (#9742)"

This reverts commit c290c308fe1e496480db5c37489df619cff6168f.

* Fix some Windows CI issues (#9708)

Co-authored-by: Mehrdad <noreply@github.com>

* Pin pytest version (#9767)

* [Java] Use test groups to filter tests of different run modes (#9703)

* [Java] Fix MetricTest.java due to incomplete changes from #9703 (#9770)

* Fix leased worker leak bug if lease worker requests that are still waiting to be scheduled when GCS restarts (#9719)

* [Stats] enable core worker stats (#9355)

* [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416)

* use a sole thread to handle heartbeat

* separate signal thread

* use work to avoid exiting when task is underway

* protect shared data structure to avoid deadlock

* add comments

* decrease io service num

* minor changes

* fix test

* per stephanie's comments

* use single io service instead of 1-size io service pool

* typo

* [GCS Actor Management] Fix flaky test_dead_actors. (#9715)

* Fix.

* Add logs.

* Add a unit test.

* [TUNE] Tune Docs re-organization (#9600)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [RLlib] Trajectory View API (preparatory cleanup and enhancements). (#9678)

* [Core] Socket creation race condition bug fixes (#9764)

* fix issues

* hot fixes

* test

* test

* Always info log

* Fixed stderr logging (9765)

* [Core] Custom socket name (#9766)

* fix issues

* hot fixes

* test

* test

* socket name change only

* Fix src/ray/core_worker/common.h deleted constructor (#9785)

Co-authored-by: Mehrdad <noreply@github.com>

* [Stats] Fix harvestor threads + Fix flaky stats shutdown. (#9745)

* More fixes

* Applying latest changes in travis.yml

* Fixing fixture data exclusions

* Disable some java tests

* Fix some CI errors

* Update hash

* Fixing more build issues

* Fixing more build issues

* Fix pipeline cache path

* More fixes

* Fix bazel test command

* Fix bazel test

* Fix general info steps

* Custom env var for docker build

* Trying a different way to install bazel

* Bazel fix

* Updating hash

Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com>
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alisa <wuminyan0607@gmail.com>
Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@vip.qq.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Stefan Schneider <stefan.schneider@upb.de>
Co-authored-by: Patrick Ames <pdames@amazon.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Tao Wang <dooku.wt@antfin.com>
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Henk Tillman <henktillman@gmail.com>
Co-authored-by: Tanay Wakhare <twakhare@gmail.com>
Co-authored-by: Nicolaus93 <nicolo.campolongo@unimi.it>
Co-authored-by: Vasily Litvinov <45396231+vnlitvinov@users.noreply.github.com>
Co-authored-by: krfricke <krfricke@users.noreply.github.com>
Co-authored-by: Max Fitton <maxfitton@gmail.com>
Co-authored-by: Max Fitton <max@semprehealth.com>
Co-authored-by: kisuke95 <2522134184@qq.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
Co-authored-by: Michael Mui <68102089+heyitsmui@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: Michael Luo <michael.luo123456789@gmail.com>
Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu>
Co-authored-by: Tom <veniat.tom@gmail.com>
Co-authored-by: jerrylee.io <JerryDeKo@gmail.com>
Co-authored-by: Raphael Avalos <raphael@avalos.fr>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Arne Sachtler <arne.sachtler@gmail.com>
Co-authored-by: Arne Sachtler <arne.sachtler@dlr.de>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: ZhuSenlin <wumuzi520@126.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
Co-authored-by: Dean Wampler <dean@polyglotprogramming.com>
Co-authored-by: Dean Wampler <dean@concurrentthought.com>
Co-authored-by: Bill Chambers <bill@anyscale.com>
Co-authored-by: Petros Christodoulou <p.christodoulou2@gmail.com>
Co-authored-by: Petros Christodoulou <petrochr@amazon.com>
Co-authored-by: Justin Terry <justinkterry@gmail.com>
Co-authored-by: Tao Wang <wangtaothetonic@163.com>
Co-authored-by: fyrestone <fyrestone@outlook.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: bermaker <495571751@qq.com>

* Sync Upstream master (#50)

* [core] Pull Manager exponential backoff (#13024)

* [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793)

* [release tests] test_many_tasks fix (#12984)

* Add "beta" documentation for enabling object spilling manually (#13047)

* [Serve] Handle Bug Fixes (#12971)

* [Dashboard] Add GET /logical/actors API (#12913)

* [GCS]Decouple gcs resource manager and gcs node manager (#13012)

* [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031)

* [GCS] Delete redis gcs client and redis_xxx_accessor (#12996)

* [RLlib] Fix broken unity3d_env import in example server script. (#13040)

* [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039)

* [joblib] Fix flaky joblib test. (#13046)

* [Tune]Add integer loguniform support (#12994)

* Add integer quantization and loguniform support

* Fix hyperopt qloguniform not being np.log'd first

* Add tests, __init__

* Try to fix tests, better exceptions

* Tweak docstrings

* Type checks in SearchSpaceTest

* Update docs

* Lint, tests

* Update doc/source/tune/api_docs/search_space.rst

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
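
The sub-bullet about hyperopt's qloguniform not being "np.log'd first" points at a common pitfall: log-uniform samplers draw uniformly in log space, so the bounds have to be converted with a logarithm before sampling. A minimal standard-library sketch of integer, quantized log-uniform sampling follows. The function name and signature are hypothetical and are not the Tune or hyperopt API.

```python
import math
import random


def sample_int_loguniform(lower, upper, q=1, rng=random):
    """Draw an integer log-uniformly from [lower, upper], rounded to a multiple of q.

    Hypothetical helper for illustration only.
    """
    if not 0 < lower <= upper:
        raise ValueError("bounds must satisfy 0 < lower <= upper")
    # Take the log of the bounds FIRST, then sample uniformly in log space.
    log_value = rng.uniform(math.log(lower), math.log(upper))
    value = math.exp(log_value)
    # Quantize to the nearest multiple of q.
    quantized = int(round(value / q) * q)
    # Clamp in case rounding stepped outside the bounds; a clamped value
    # may not be a multiple of q when the bounds themselves are not.
    return min(max(quantized, lower), upper)
```

Passing the raw bounds to a sampler that expects log-space bounds would exponentiate them a second time, which is the kind of bug that sub-bullet describes.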

* [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)

* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update

* Update to new form of brew cask install command

* [Autoscaler] New output log format (#12772)

* Fix typo RMSProp -> RMSprop (#13063)

* [serve] Centralize HTTP-related logic in HTTPState (#13020)

* Remove suppress output to see why wheel is not building

* Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)

* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* x

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* [docs] Fix args + kwargs instead of docstrings (#13068)

* functools wraps

* Fix typo (functoools -> functools)

* Fix OS X Wheel Build - Update brew cask install (#13062)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* speed up local mode object store get (#13052)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* [RLlib] Execution Annotation (#13036)

* [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)

* [C++ API] Added reference counting to ObjectRef (#13058)

* Added reference counting to ObjectRef

* Addressed the comments

* [Core] Remove cuda support in plasma store (#13070)

* remove cuda support in plasma store

* [Core] Remove outdated external store (#13080)

* remove outdated external store

* [GCS] Move resource usage info to gcs resource manager (#13059)

* [RLlib] JAXPolicy prep. PR #1. (#13077)

* [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083)

* [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. (#13064)

* [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935)

* other collectives all work

* auto-linting

* manual linting #1

* manual linting 2

* bugfix

* add send/recv point-to-point calls

* add some initial code for communicator caching

* auto linting

* optimize imports

* minor fix

* fix unpassed tests

* support more dtypes

* rerun some distributed tests for send/recv

* linting

* [Serve] [Doc] Front page update (#13032)

* Deprecate experimental / dynamic resources (#13019)

* [docs] fix wandb url (#13094)

* [Serve] Implement Graceful Shutdown (#13028)

* [Serve] Use ServeHandle in HTTP proxy (#12523)

* [Java] Format ray java code (#13056)

* [docker] Fix restart behavior with Docker (#12898)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: ijrsvt <ilr@anyscale.com>

* Disable broken streaming tests (#13095)

* [autoscaler] Make placement groups bypass max launch limit (#13089)

* Serve metrics docs (#13096)

* [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097)

* [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035)

* [Doc] Fix Sphinx.add_stylesheet deprecation (#13067)

* Fix streaming ci failure (#12830)

* [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118)

* [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113)

* [RLlib] Deflake test case: 2-step game MADDPG. (#13121)

* [RLlib] Trajectory view API docs. (#12718)

* Job module without submission (#13081)

Co-authored-by: 刘宝 <po.lb@antfin.com>

* [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091)

* [Java] Avoid failure of serializing a user-defined unserializable exception. (#13119)

* [Tune] Update URL to fix 403 not found error in PBT transformers test case (#13131)

* [serve] Async controller (#13111)

* [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)

* [Serve] Use a small object to track requests (#13125)

* [docs][kubernetes][minor] Update K8s examples in docs (#13129)

* [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698)

* [docs] Documentation + example for the C++ language API (#13138)

* [Java] Support `wasCurrentActorRestarted` in actor task. (#13120)

* Remove check.

* Add test

* fix lint

* lint

* Fix spotless lint

* Address comments.

* Fix lint

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>

* [docs] Minor change to formatting C++ docs. (#13151)

* Deprecate setResource java api (#13117)

* [docs] Small fix in C++ documentation. (#13154)

* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>

* [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127)

* [kubernetes][docs][minor] Kubernetes version warning (#13161)

* [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)

* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Enabling the cancellation of non-actor tasks in a worker's queue (#12117)

* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting

* [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061)

* [Release] Update Release Process Documentation (#13123)

* [Core] Remove Arrow dependencies (#13157)

* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer

* [XGboost] Update Documentation (#13017)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [SGD] Fix Docstring for `as_trainable` (#13173)

* Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)

This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2.

* Surface object store spilling statistics in `ray memory` (#13124)

* [ray_client]: Move from experimental to util (#13176)

Change-Id: I9f054881f0429092d265cd6944d89804cce9d946

* Remove unused file(object_manager_integration_test.cc) (#12989)

* Notify listeners after registered node stored (#13069)

* [build]Update description and add some keywords (#13163)

* [Collective][PR 2/6] Driver program declarative interfaces (#12874)

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* add a Backend class to make Backend string more robust

* add several useful APIs

* add some tests

* added allreduce test

* fix typos

* fix several bugs found via unittests

* fix and update torch test

* changed back actor

* rearrange a bit before importing distributed test

* add distributed test

* remove scratch code

* auto-linting

* linting 2

* linting 2

* linting 3

* linting 4

* linting 5

* linting 6

* 2.1 2.2

* fix small bugs

* minor updates

* linting again

* auto linting

* linting 2

* final linting

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* added actor test

* lint

* remove local sh

* address most of richard's comments

* minor update

* remove the actor.option() interface to avoid changes in ray core

* minor updates

Co-authored-by: YLJALDC <dal177@ucsd.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [serve] Merge ActorReconciler and BackendState (#13139)

* [tune] better signature check for `tune.sample_from` (#13171)

* [tune] better signature check for `tune.sample_from`

* Update python/ray/tune/sample.py

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>

* Disable atexit test on windows (#13207)

* [serve] Move controller state into separate files (#13204)

* Update multi_agent_independent_learning.py (#13196)

pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead

* [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162)

* [Tune] Fix PBT Transformers Example (#13174)

* [Serve] HTTPOptions for deployment modes (#13142)

* [tests] Fix Autoscaler Test failure on Windows (#13211)

* skip create_or_update tests

* Update python/ray/tests/test_autoscaler.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158)

* [GCS]Fix TestActorSubscribeAll bug (#13193)

* [Metrics] Record per node and raylet cpu / mem usage (#12982)

* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.

* [Tune] Fix tune serve integration example (#13233)

* [Redis] Note that each Redis Connect retry takes two minutes (#12183)

* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.

* [Log] fix spdlog init race (#12973)

* fix spdlog init race

* use global logger

* refine logger name and constructor

* [Release] Add 1.1.0 release test logs (#13054)

* Add microbenchmark to release logs

* check in many_tasks stress test result

* Add results of placement group stress test for 1.1.0

* Add result for test_dead_actors test and correct the name of test_many_tasks.txt

* Add rllib regression test result

* Add pytorch test results for rllib

* remove extraneous log entries

* [Core] Fix incorrect comment (#13228)

* [Serialization] Fix cloudpickle (#13242)

* [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195)

* Start ray client server with 'ray start' (#13217)

* [GCS]Add gcs actor schedule strategy (#13156)

* Publish job/worker info with Hex format instead of Binary (#13235)

* [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126)

* [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)

Now that `HeadOnly` is the new default HTTP location, we can
re-enable the long-running tests to use local multi-clusters.
(Also updated the controller API calls to match the current API. We should
have caught these earlier; I will open issues for this.)

* Update autoscaler-cluster yaml files for release tests (#13114)

* [Release] Use ray-ml image for long running test (#13267)

* [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. (#13237)

* [Tune] Improve error message for Session Detection (#13255)

* Improve error message

* log once

* [Tune] Pin Tune Dependencies (#13027)

Co-authored-by: Ian <ian.rodney@gmail.com>

* [Dependabot] Add Dependabot (#13278)

Co-authored-by: Ian <ian.rodney@gmail.com>

* [docker] Pull if image is not present (#13136)

* [GCS] Remove old lightweight resource usage report code path (#13192)

* [Dashboard] Add GET /log_proxy API (#13165)

* Fix a crash problem caused by GetActorHandle in ActorManager (#13164)

* [ray_client] Add metadata to gRPC requests (#13167)

* [RLlib] Preparatory PR for: Documentation on Model Building. (#13260)

* [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286)

* [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287)

* Remove top-level ray.connect() and ray.disconnect() APIs (#13273)

* [Pull manager] Only pull once per retry period (#13245)

* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <alex@anyscale.com>

* [Cancellation] Make Test Cancel Easier to Debug (#13243)

* first commit

* lint-fix

* [ray_client]: first draft of documentation (#13216)

* Do not give an error if both `RAY_ADDRESS` and `address` are specified on initialization (#13305)

* Finalize handling of RAY_ADDRESS

* lint

* [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215)

* [RLlib] SlateQ Documentation (#13266)

* [RLlib] Add more detailed Documentation on Model building API (#13261)

* [tune] convert search spaces: parse spec before flattening (#12785)

* Parse spec before flattening

* flatten after parse

* Test for ValueError if grid search is passed to search algorithms

* remove empty extras streaming deps (#12933)

* add the method annotation and a comment explaining what's happening (#13306)

Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a

* Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210)

* [RLlib] Issue 13330: No TF installed causes crash in `ModelCatalog.get_action_shape()` (#13332)

* [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298)

* fix removal of task dependencies (#13333)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* [Serve] Support Starlette streaming response (#13328)

* [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)

* [client] Report number of currently active clients on connect (#13326)

* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle

* Implement internal kv in ray client (#13344)

* kv internal

* fix

* [Tune] Rename MLFlow to MLflow (#13301)

* Forgot overwrite parameter in Ray client internal kv

* Fix typo in Tune Docs (Checkpointing) (#13348)

See issue #13299

* [Kubernetes][Docs] GPU usage (#13325)

* gpu-note

* gpu-note

* More info

* lint?

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* GKE->Kubernetes

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361)

This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419.

* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359)

* [tune] buffer trainable results (#13236)

* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [Serve] Add dependency management support for driver not running in a conda env (#13269)

* [RLlib] Add `__len__()` method to SampleBatch (#13371)

* [Serve] Backend state unit tests (#13319)

* trigger doc build for serve updates (#13373)

* [Object Spilling] Long running object spilling test (#13331)

* done.

* formatting.

* Remove unimplemented GetAll method in actor info accessor (#13362)

* [Doc] Remove trailing whitespaces (#13390)

* Enable Ray client server by default (#13350)

* update

* fix

* fix test

* update

* [RLlib] Trajectory View API: Atari framestacking. (#13315)

* [ray_client]: Wait for ready and retry on ray.connect() (#13376)

* [ray_client]: wait until connection ready

Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6

* lint

Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0

* docs and retry minimum

Change-Id: I43f5378322029267ddd69f518ce8206876e2129d

* [Dashboard] Fix missing actor pid (#13229)

* [ray_client]: Fix multiple attempts at checking connection (#13422)

* Plumb retries update (#13411)

* [Serve] [Doc] Improve batching doc (#13389)

* [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514)

* Fix Serve release test (#13385)

* Add bazel logs upload to GHA (#13251)

* [tune] Fix f-string in error message (#13423)

* [serve] Pull out goal management logic into AsyncGoalManager class (#13341)

* Make request_resources() use internal kv instead of redis pub sub (#13410)

* Remove unused handler methods (#13394)

* [Tune] Pin Transitive Dependencies (#13358)

* Split out the part of get_node_ip_address for which the docstring is correct (#12796)

* Fix raylet::MockWorker::GetProcess crashes (#13440)

Co-authored-by: 刘宝 <po.lb@antfin.com>

* Revert "Enable Ray client server by default (#13350)" (#13429)

This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d.

* Fix linter error (#13451)

* [GCS]Add gcs resource scheduler (#13072)

* [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). (#13363)

* [Core]Fix raylet scheduling bug (#13452)

* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>

* [joblib] joblib strikes again but this time on windows (#13212)

* [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424)

* [kubernetes][minor] Operator garbage collection fix (#13392)

* [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391)

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Job 38482.1 should now pass

* Resolve merge conflict

* [RLlib] Deflake 2x remote & local inference tests (external env). (#13459)

* [docs] Add more guideline on using ray in slurm cluster (#12819)

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [Dashboard] Fix GPU resource rendering issue (#13388)

* [Release] Fix Serve release test (#13303)

The Docker image we were using now runs as the `ray` user, so we have to call
sudo.

* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460)

* Fix getting runtime context dict in driver (#13417)

* [xgb] re-enable xgboost_ray tests (#13416)

* re-enable

* fix

* update xgb_ray version

* [Serialization] New custom serialization API (#13291)

* new serialization API with doc & test

* add more notes

* refine notes

* doc

* [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)

* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.

* [ray_client]: Support runtime_context as metadata (#13428)

* [GCS]Remove unused class variable (#13454)

* [Object Spilling] Dedup restore objects (#13470)

* done.

* Addressed code review.

* [CI] Enable Dashboard tests for master (#13425)

* [docker/dashboard] Fix ray dashboard (#12899)

* [CI] Fix Windows Bazel Upload (#13436)

* Return version info from Ray client connect, to allow for discovering version mismatches

* Update ID specification doc (#13356)

* [ray_client]: fix wrong reference in server_pickler (#13474)

Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf

* Bump dev branch to 2.0 to avoid endless version bump toil (#13497)

* wip

* fix

* fix

* Remove an unnecessary file (#13499)

* [Tests] Skip failing windows tests (#13495)

* skip failing windows tests

* skip more

* remove

* updates

* [tune] fix small docs typo (#13355)

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* move message to debug (#13472)

* Minimal version of piping autoscaler events to driver logs (#13434)

* sync write internal config in gcs (#13197)

* Refactor node manager to eliminate `new_scheduler_enabled_` (#12936)

* [GCS]Only publish changed field when node dead (#13364)

* Only update changed field when node dead

* node_id missed

* [CI] Buildkite PR Environment for Simple Tests (#13130)

* [GCS] Remove task info publish as nowhere uses it (#13509)

* Remove task info publish as nowhere uses it

* simplify right publish channel

* [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467)

* [tune] placement group support (#13370)

* [Serve] Allow ObjectRef for Composition (#12592)

* Add Dashboard Python Test to Buildkite (#13530)

* Add ability to not start Monitor when calling `ray start` (#13505)

* [tune] support experiment checkpointing for grid search (#13357)

* Fix typo (#13098)

* Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544)

* [RLlib] MARWI…