[Object Spilling] Remove retries and use a timer instead. #13175

Merged: 10 commits, Jan 19, 2021

@rkooo567 rkooo567 commented Jan 4, 2021

Why are these changes needed?

Following @ericl's suggestion, this PR uses a timer instead of retries.

It removes unnecessary timer config values too.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 changed the title from "[WIP] Remove retries and use a timer instead." to "[WIP][Object Spilling] Remove retries and use a timer instead." Jan 4, 2021
@rkooo567 rkooo567 changed the title from "[WIP][Object Spilling] Remove retries and use a timer instead." to "[Object Spilling] Remove retries and use a timer instead." Jan 5, 2021
if (create_ok) {
  FinishRequest(request_it);
  last_success_ns_ = now;
Contributor

Why do we need to assign success here?

Contributor Author

Otherwise it can OOM without any grace period if spilling is not invoked (since last_success_ns_ is not renewed at all).

Then the code comment

// We need a grace period since (1) global GC takes a bit of time to
// kick in

is not reflected. Let me know if I am missing something!

Contributor

Shouldn't we be setting last_success_ns_ in the else branch? It seems like we need another state like "waiting for OOM" so we know whether to set vs check the last_success_ns_.

And we should add a unit test for this case because I think it's currently broken:

  1. Add a create request that succeeds.
  2. More than the oom_grace_period later, add another create request that fails. Check that this doesn't raise OOM <-- I think this will throw OOM right now.

Contributor

@ericl ericl Jan 5, 2021

We also set it if spilling is successful, so it's set whenever a creation/spill succeeds. If neither has succeeded for the grace period, OOM is raised.

So the case raised above is fine: spilling is attempted in the second create case, and that resets the timer.
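The rule described in this comment can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the class name `SuccessTimer`, its method names, and the choice of `>=` for the expiry check are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): remembers the last time a create or
// a spill succeeded and reports expiry once neither has succeeded for a full
// grace period.
class SuccessTimer {
 public:
  explicit SuccessTimer(int64_t grace_period_ns, int64_t start_ns = 0)
      : grace_period_ns_(grace_period_ns), last_success_ns_(start_ns) {
    assert(grace_period_ns >= 0);
  }

  // Call whenever an object creation or a spill succeeds.
  void RecordSuccess(int64_t now_ns) { last_success_ns_ = now_ns; }

  // True when no success has been recorded within the grace period.
  bool GracePeriodExpired(int64_t now_ns) const {
    return now_ns - last_success_ns_ >= grace_period_ns_;
  }

 private:
  const int64_t grace_period_ns_;
  int64_t last_success_ns_;
};
```

In this sketch a caller would invoke `RecordSuccess` from both the create path and the spill path, and raise OOM only when `GracePeriodExpired` returns true.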

Contributor

I don't think that the spill reset is enough because it won't happen if the spill callback is unsuccessful.

Contributor

You don't want to reset it on OOM, right? Otherwise, it will take a long time to OOM many objects (e.g., 10 seconds * N objects).

Contributor

Ah, right.

Contributor Author

Please check the code again (I addressed the comment).

Contributor Author

Also note: I am using -1 as the default value of oom_start_time_ns_ instead of 0 because otherwise tests will fail (the initial fake clock time is 0).
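One way to read the design discussed in this thread is sketched below. It is an illustration under stated assumptions, not the PR's implementation: only the name `oom_start_time_ns_` and the -1 default come from the discussion, while `OomGraceTracker`, `kNotWaiting`, and the method names are hypothetical. The sketch shows why a sentinel of 0 would be ambiguous when a fake clock starts at 0, and it keeps the start time unchanged after OOM so that later requests do not each wait for a fresh grace period.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: tracks when the "waiting for OOM" state began.
class OomGraceTracker {
 public:
  explicit OomGraceTracker(int64_t grace_period_ns)
      : grace_period_ns_(grace_period_ns) {
    assert(grace_period_ns >= 0);
  }

  // A create succeeded: leave the "waiting for OOM" state.
  void OnCreateSuccess() { oom_start_time_ns_ = kNotWaiting; }

  // A create failed at time now_ns. Returns true if the grace period has
  // elapsed since the first failure, meaning OOM should be reported.
  bool OnCreateFailure(int64_t now_ns) {
    if (oom_start_time_ns_ == kNotWaiting) {
      oom_start_time_ns_ = now_ns;  // First failure starts the grace period.
    }
    // The start time is deliberately not reset when OOM is reported, so
    // later queued requests fail right away instead of waiting again.
    return now_ns - oom_start_time_ns_ >= grace_period_ns_;
  }

 private:
  // -1 rather than 0: a fake test clock may legitimately report time 0.
  static constexpr int64_t kNotWaiting = -1;
  const int64_t grace_period_ns_;
  int64_t oom_start_time_ns_ = kNotWaiting;
};
```

With this shape, a failure that arrives long after the last success still gets a full grace period, because the period is measured from the first failure and not from the last success.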

Contributor Author

@rkooo567 rkooo567 Jan 18, 2021

Also, I added a test case that checks that objects are created successfully after no new object has been created for a long time.

@ericl ericl added the @author-action-required label Jan 5, 2021
@rkooo567 rkooo567 removed the @author-action-required label Jan 5, 2021
Contributor

@ericl ericl left a comment

Thanks! Can we also call it get_time to be consistent with other uses in ray?

@ericl ericl added the @author-action-required label Jan 5, 2021
  // Spilling is failing.
  /*spill_object_callback=*/[&]() { return false; },
- /*on_global_gc=*/[&]() { num_global_gc_++; });
+ /*on_global_gc=*/[&]() { num_global_gc_++; },
+ /*timer_callback=*/[&]() { return 0; });

  auto oom_request = [&](bool evict_if_full, PlasmaObject *result) {
Contributor

Shouldn't we advance the clock during the test?

Contributor Author

I now advance the clock by 1 second at a time, for 10 seconds in total.
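A fake clock that a test advances by hand might look like the sketch below. With a clock callback that always returns 0, as in the quoted test snippet, elapsed time stays at zero and a time-based grace period can presumably never expire, which is the concern raised here. The name `get_time` follows the review suggestion; `TicksUntilExpired`, `advance`, and the concrete values are hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical helper: polls an injected clock and counts how many times the
// clock must be advanced before grace_period_ns has elapsed. advance() stands
// in for whatever moves time forward (a manual bump in tests). It must make
// get_time() increase, otherwise the loop never terminates.
int TicksUntilExpired(const std::function<int64_t()> &get_time,
                      const std::function<void()> &advance,
                      int64_t grace_period_ns) {
  assert(grace_period_ns >= 0);
  const int64_t start_ns = get_time();
  int ticks = 0;
  while (get_time() - start_ns < grace_period_ns) {
    advance();
    ++ticks;
  }
  return ticks;
}
```

In a test, `get_time` can be a lambda that returns a local `int64_t` captured by reference, and `advance` a lambda that adds one second in nanoseconds to it, so a 10-second grace period expires after ten advances without any real waiting.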

Contributor

@stephanie-wang stephanie-wang left a comment

I think there are a couple issues with the logic and tests (see comments above).

@rkooo567 rkooo567 removed the @author-action-required label Jan 18, 2021
@ericl ericl added the @author-action-required label Jan 18, 2021
@rkooo567
Contributor Author

cc @stephanie-wang Can you take a look? The "changes requested" status needs to be cleared before merging!

@ericl ericl merged commit 99375c4 into ray-project:master Jan 19, 2021
Edilmo added a commit to BonsaiAI/ray that referenced this pull request Jan 22, 2021
* [core] Pull Manager exponential backoff (#13024)

* [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793)

* [release tests] test_many_tasks fix (#12984)

* Add "beta" documentation for enabling object spilling manually (#13047)

* [Serve] Handle Bug Fixes (#12971)

* [Dashboard] Add GET /logical/actors API (#12913)

* [GCS]Decouple gcs resource manager and gcs node manager (#13012)

* [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031)

* [GCS] Delete redis gcs client and redis_xxx_accessor (#12996)

* [RLlib] Fix broken unity3d_env import in example server script. (#13040)

* [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039)

* [joblib] Fix flaky joblib test. (#13046)

* [Tune]Add integer loguniform support (#12994)

* Add integer quantization and loguniform support

* Fix hyperopt qloguniform not being np.log'd first

* Add tests, __init__

* Try to fix tests, better exceptions

* Tweak docstrings

* Type checks in SearchSpaceTest

* Update docs

* Lint, tests

* Update doc/source/tune/api_docs/search_space.rst

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>

* [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)

* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update

* Update to new form of brew cask install command

* [Autoscaler] New output log format (#12772)

* Fix typo RMSProp -> RMSprop (#13063)

* [serve] Centralize HTTP-related logic in HTTPState (#13020)

* Remove suppress output to see why wheel is not building

* Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)

* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* x

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* [docs] Fix args + kwargs instead of docstrings (#13068)

* functools wraps

* Fix typo (functoools -> functools)

* Fix OS X Wheel Build - Update brew cask install (#13062)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* speed up local mode object store get (#13052)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* [RLlib] Execution Annotation (#13036)

* [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)

* [C++ API] Added reference counting to ObjectRef (#13058)

* Added reference counting to ObjectRef

* Addressed the comments

* [Core] Remove cuda support in plasma store (#13070)

* remove cuda support in plasma store

* [Core] Remote outdated external store (#13080)

* remove outdated external store

* [GCS] Move resource usage info to gcs resource manager (#13059)

* [RLlib] JAXPolicy prep. PR #1. (#13077)

* [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083)

* [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. (#13064)

* [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935)

* other collectives all work

* auto-linting

* mannual linting #1

* mannual linting 2

* bugfix

* add send/recv point-to-point calls

* add some initial code for communicator caching

* auto linting

* optimize imports

* minor fix

* fix unpassed tests

* support more dtypes

* rerun some distributed tests for send/recv

* linting

* [Serve] [Doc] Front page update (#13032)

* Deprecate experimental / dynamic resources (#13019)

* [docs] fix wandb url (#13094)

* [Serve] Implement Graceful Shutdown (#13028)

* [Serve] Use ServeHandle in HTTP proxy (#12523)

* [Java] Format ray java code (#13056)

* [docker] Fix restart behavior with Docker (#12898)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: ijrsvt <ilr@anyscale.com>

* Disable broken streaming tests (#13095)

* [autoscaler] Make placement groups bypass max launch limit (#13089)

* Serve metrics docs (#13096)

* [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097)

* [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035)

* [Doc] Fix Sphinx.add_stylesheet deprecation (#13067)

* Fix streaming ci failure (#12830)

* [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118)

* [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113)

* [RLlib] Deflake test case: 2-step game MADDPG. (#13121)

* [RLlib] Trajectory view API docs. (#12718)

* Job module without submission (#13081)

Co-authored-by: 刘宝 <po.lb@antfin.com>

* [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091)

* [Java] Avoid failure of serializing a user-defined unserializable exception. (#13119)

* [Tune] Update URL to fix 403 not found error in PBT tranformers test case (#13131)

* [serve] Async controller (#13111)

* [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)

* [Serve] Use a small object to track requests (#13125)

* [docs][kubernetes][minor] Update K8s examples in doce (#13129)

* [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698)

* [docs] Documentation + example for the C++ language API (#13138)

* [Java] Support `wasCurrentActorRestarted` in actor task. (#13120)

* Remove check.

* Add test

* fix lint

* lint

* Fix spotless lint

* Address comments.

* Fix lint

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>

* [docs] Minor change to formating C++ docs. (#13151)

* Deprecate setResource java api (#13117)

* [docs] Small fix in C++ documentation. (#13154)

* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>

* [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127)

* [kubernetes][docs][minor] Kubernetes version warning (#13161)

* [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)

* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Enabling the cancellation of non-actor tasks in a worker's queue (#12117)

* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting

* [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061)

* [Release] Update Release Process Documentation (#13123)

* [Core] Remove Arrow dependencies (#13157)

* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer

* [XGboost] Update Documentation (#13017)

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [SGD] Fix Docstring for `as_trainable` (#13173)

* Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)

This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2.

* Surface object store spilling statistics in `ray memory` (#13124)

* [ray_client]: Move from experimental to util (#13176)

Change-Id: I9f054881f0429092d265cd6944d89804cce9d946

* Remove unused file(object_manager_integration_test.cc) (#12989)

* Notify listeners after registered node stored (#13069)

* [build]Update description and add some keywords (#13163)

* [Collective][PR 2/6] Driver program declarative interfaces (#12874)

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an addtional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an addtional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* add a Backend class to make Backend string more robust

* add several useful APIs

* add some tests

* added allreduce test

* fix typos

* fix several bugs found via unittests

* fix and update torch test

* changed back actor

* rearange a bit before importing distributed test

* add distributed test

* remove scratch code

* auto-linting

* linting 2

* linting 2

* linting 3

* linting 4

* linting 5

* linting 6

* 2.1 2.2

* fix small bugs

* minor updates

* linting again

* auto linting

* linting 2

* final linting

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* added actor test

* lint

* remove local sh

* address most of richard's comments

* minor update

* remove the actor.option() interface to avoid changes in ray core

* minor updates

Co-authored-by: YLJALDC <dal177@ucsd.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [serve] Merge ActorReconciler and BackendState (#13139)

* [tune] better signature check for `tune.sample_from` (#13171)

* [tune] better signature check for `tune.sample_from`

* Update python/ray/tune/sample.py

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>

* Disable atexit test on windows (#13207)

* [serve] Move controller state into separate files (#13204)

* Update multi_agent_independent_learning.py (#13196)

pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead

* [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162)

* [Tune] Fix PBT Transformers Example (#13174)

* [Serve] HTTPOptions for deployment modes (#13142)

* [tests] Fix Autoscaler Test failure on Windows (#13211)

* skip create_or_update tests

* Update python/ray/tests/test_autoscaler.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158)

* [GCS]Fix TestActorSubscribeAll bug (#13193)

* [Metrics] Record per node and raylet cpu / mem usage (#12982)

* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.

* [Tune] Fix tune serve integration example (#13233)

* [Redis] Note that each Redis Connect retry takes two minutes (#12183)

* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.

* [Log] fix spdlog init race (#12973)

* fix spdlog init race

* use global logger

* refine logger name and constructor

* [Release] Add 1.1.0 release test logs (#13054)

* Add microbenchmark to release logs

* check in many_tasks stress test result

* Add results of placement group stress test for 1.1.0

* Add result for test_dead_actors test and correct the name of test_many_tasks.txt

* Add rllib regression test result

* Add pytorch test results for rllib

* remove extraneous log entries

* [Core] Fix incorrect comment (#13228)

* [Serialization] Fix cloudpickle (#13242)

* [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195)

* Start ray client server with 'ray start' (#13217)

* [GCS]Add gcs actor schedule strategy (#13156)

* Publish job/worker info with Hex format instead of Binary (#13235)

* [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126)

* [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)

Now that `HeadOnly` becomes the new default HTTP location, we can
re-enable the long running tests to use local multi-clusters.
(also fixed the controller's API to match up to date, we should
have caught these, I will open issues for this.)

* Update autoscaler-cluster yaml files for release tests (#13114)

* [Release] Use ray-ml image for logn running test (#13267)

* [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. (#13237)

* [Tune] Improve error message for Session Detection (#13255)

* Improve error message

* log once

* [Tune] Pin Tune Dependencies (#13027)

Co-authored-by: Ian <ian.rodney@gmail.com>

* [Dependabot] Add Dependabot (#13278)

Co-authored-by: Ian <ian.rodney@gmail.com>

* [docker] Pull if image is not present (#13136)

* [GCS] Remove old lightweight resource usage report code path (#13192)

* [Dashboard] Add GET /log_proxy API (#13165)

* Fix a crash problem caused by GetActorHandle in ActorManager (#13164)

* [ray_client] Add metadata to gRPC requests (#13167)

* [RLlib] Preparatory PR for: Documentation on Model Building. (#13260)

* [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286)

* [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287)

* Remove top-level ray.connect() and ray.disconnect() APIs (#13273)

* [Pull manager] Only pull once per retry period (#13245)

* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <alex@anyscale.com>

* [Cancellation] Make Test Cancel Easier to Debug (#13243)

* first commit

* lint-fix

* [ray_client]: first draft of documentation (#13216)

* Do not give an error if both `RAY_ADDRESS` and `address` is specified on initialization (#13305)

* Finalize handling of RAY_ADDRESS

* lint

* [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215)

* [RLlib] SlateQ Documentation (#13266)

* [RLlib] Add more detailed Documentation on Model building API (#13261)

* [tune] convert search spaces: parse spec before flattening (#12785)

* Parse spec before flattening

* flatten after parse

* Test for ValueError if grid search is passed to search algorithms

* remove empty extras streaming deps (#12933)

* add the method annotation and a comment explaining what's happening (#13306)

Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a

* Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210)

* [RLlib] Issue 13330: No TF installed causes crash in `ModelCatalog.get_action_shape()` (#13332)

* [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298)

* fix removal of task dependencies (#13333)

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* [Serve] Support Starlette streaming response (#13328)

* [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)

* [client] Report number of currently active clients on connect (#13326)

* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle

* Implement internal kv in ray client (#13344)

* kv internal

* fix

* [Tune] Rename MLFlow to MLflow (#13301)

* Forgot overwrite parameter in Ray client internal kv

* Fix typo in Tune Docs (Checkpointing) (#13348)

See issue #13299

* [Kubernetes][Docs] GPU usage (#13325)

* gpu-note

* gpu-note

* More info

* lint?

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* GKE->Kubernetes

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361)

This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419.

* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359)

* [tune] buffer trainable results (#13236)

* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [Serve] Add dependency management support for driver not running in a conda env (#13269)

* [RLlib] Add `__len__()` method to SampleBatch (#13371)

* [Serve] Backend state unit tests (#13319)

* trigger doc build for serve updates (#13373)

* [Object Spilling] Long running object spilling test (#13331)

* done.

* formatting.

* Remove unimplemented GetAll method in actor info accessor (#13362)

* [Doc] Remove trailing whitespaces (#13390)

* Enable Ray client server by default (#13350)

* update

* fix

* fix test

* update

* [RLlib] Trajectory View API: Atari framestacking. (#13315)

* [ray_client]: Wait for ready and retry on ray.connect() (#13376)

* [ray_client]: wait until connection ready

Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6

* lint

Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0

* docs and retry minimum

Change-Id: I43f5378322029267ddd69f518ce8206876e2129d

* [Dashboard] Fix missing actor pid (#13229)

* [ray_client]: Fix multiple attempts at checking connection (#13422)

* Plumb retries update (#13411)

* [Serve] [Doc] Improve batching doc (#13389)

* [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514)

* Fix Serve release test (#13385)

* Add bazel logs upload to GHA (#13251)

* [tune] Fix f-string in error message (#13423)

* [serve] Pull out goal management logic into AsyncGoalManager class (#13341)

* Make request_resources() use internal kv instead of redis pub sub (#13410)

* Remove unused handler methods (#13394)

* [Tune] Pin Transitive Dependencies (#13358)

* Split out the part of get_node_ip_address for which the docstring is correct (#12796)

* Fix raylet::MockWorker::GetProcess crashes (#13440)

Co-authored-by: 刘宝 <po.lb@antfin.com>

* Revert "Enable Ray client server by default (#13350)" (#13429)

This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d.

* Fix linter error (#13451)

* [GCS]Add gcs resource scheduler (#13072)

* [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). (#13363)

* [Core]Fix raylet scheduling bug (#13452)

* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>

* [joblib] joblib strikes again but this time on windows (#13212)

* [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424)

* [kubernetes][minor] Operator garbage collection fix (#13392)

* [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391)

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Job 38482.1 should now pass

* Resolve merge conflict

* [RLlib] Deflake 2x remote & local inference tests (external env). (#13459)

* [docs] Add more guideline on using ray in slurm cluster (#12819)

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [Dashboard] Fix GPU resource rendering issue (#13388)

* [Release] Fix Serve release test (#13303)

The Docker image we were using now uses `ray` users so we have to call
sudo.

* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460)

* Fix getting runtime context dict in driver (#13417)

* [xgb] re-enable xgboost_ray tests (#13416)

* re-enable

* fix

* update xgb_ray version

* [Serialization] New custom serialization API (#13291)

* new serialization API with doc & test

* add more notes

* refine notes

* doc

* [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)

* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.

* [ray_client]: Support runtime_context as metadata (#13428)

* [GCS]Remove unused class variable (#13454)

* [Object Spilling] Dedup restore objects (#13470)

* done.

* Addressed code review.

* [CI] Enable Dashboard tests for master (#13425)

* [docker/dashboard] Fix ray dashboard (#12899)

* [CI] Fix Windows Bazel Upload (#13436)

* Return version info from Ray client connect, to allow for discovering version mismatches

* Update ID specification doc (#13356)

* [ray_client]: fix wrong reference in server_pickler (#13474)

Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf

* Bump dev branch to 2.0 to avoid endless version bump toil (#13497)

* wip

* fix

* fix

* Remove an unnecessary file (#13499)

* [Tests] Skip failing windows tests (#13495)

* skip failing windows tests

* skip more

* remove

* updates

* [tune] fix small docs typo (#13355)

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* move message to debug (#13472)

* Minimal version of piping autoscaler events to driver logs (#13434)

* sync write internal config in gcs (#13197)

* Refactor node manager to eliminate `new_scheduler_enabled_` (#12936)

* [GCS]Only publish changed field when node dead (#13364)

* Only update changed field when node dead

* node_id missed

* [CI] Buildkite PR Environment for Simple Tests (#13130)

* [GCS] Remove task info publish as nowhere uses it (#13509)

* Remove task info publish as nowhere uses it

* simplify right publish channel

* [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467)

* [tune] placement group support (#13370)

* [Serve] Allow ObjectRef for Composition (#12592)

* Add Dashboard Python Test to Buildkite (#13530)

* Add ability to not start Monitor when calling `ray start` (#13505)

* [tune] support experiment checkpointing for grid search (#13357)

* Fix typo (#13098)

* Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544)

* [RLlib] MARWIL loss function test case and cleanup. (#13455)

* [RLlib] Deprecate `vf_share_layers` in top-level PPO/MAML/MB-MPO configs. (#13397)

* [RLlib] Env directory cleanup and tests. (#13082)

* [RLlib] Issue 9071 A3C w/ RNN not working due to VF assuming no RNN. (#13238)

* Fix passing env on windows (#13253)

* [Object Spilling] Remove retries and use a timer instead. (#13175)

* [metrics] Better validation for tags (#13421)

* [Tune] MLflow Credentials (#13533)

* Make AWSNodeProvider.create_node return nodes created (#13498)

* Make AWSNodeProvider.create_node return node config

* return-dict

* Node provider interface create node return type Any

* Type clarification.

* Delete debug code

* Oops reset example-full changes

* Return type specified. GCP create node returns None.

* Article

* Fix Docker Permission for Serve release test again (#13543)

* Pipe monitor.err logs to driver

* Debug info to GCS pub sub (#13564)

* Fix restoration request dedup issues. (#13546)

* [core] refactor disconnect message processing and enrich WorkExitType (#13527)

* [core] refactor disconnect message processing and enrich WorkExitType

add changes from refactor pr

fix type typo

fix typo

fix

* address comments

* also update WorkerTableData

* fix tests

* [GCS]Only publish fields used by sub clients in WorkerTableData (#13508)

* Revert "Pipe monitor.err logs to driver" (#13574)

This reverts commit a0d08c2cc638c1766a08e2030642c9b434609efa.

* [tune] wandb - WandbLogger now also accepts wandb.data_types.Video (#13169)

* [tune] Allow actor reuse for new trials (#13549)

* Allow actor reuse for new trials

* Fix tests and update conf when starting new trial

* Move magic config to `reset_trial`

* [Core] add thread name to help performance profiling (#13506)

* Extra fix ray client newline (#13577)

* [xgboost] Add XGBoost release tests (#13456)

* Add XGBoost release tests

* Add more xgboost release tests

* Use failure state manager

* Add release test documentation

* Fix wording

* Automate fault tolerance tests

* Fix for operator role definition to add raycluster/finalizer (#13567)

* [metrics] Check that all tag_keys are set when recording (#13420)

* [Core] Remove 'PlasmaBuffer' in the buffer header (#13188)

* Sync Bonsai Changes in 1.1.0 (#49)

* [autoscaler/AWS] Updated AWS Node Provider threading logic (#11422)

* [autoscaler] Add rsync_exclude and rsync_filter options to cluster config (#11512)

* Add --worker-port-list option to ray start (#11481)

* [hotfix] Pin node version (fix linux wheel build) (#11532)

Co-authored-by: Max Fitton <max@semprehealth.com>

* [Core] Allow creating tasks/actors in a detached actor when driver has exited (#11493)

* Allow creating tasks/actors in a detached actor when driver has exited

* lint

* Address comment

* [Autoscaler] Do not count unmanaged nodes in load metrics (#11458)

* fixed

* lint

* fixed other test case

* .

Co-authored-by: Alex Wu <alex@anyscale.com>

* [RaySGD] Docs for SGD+Tune usage (#11479)

* Clean up release tests (#11420)

* [tune] a tiny ptl example (#11497)

* [yaml] HotFix for correct example full (#11584)

* [releng]: Quiet Docker Push (and explain why) (#11623)

* [release] Do not tag docker latest on release builds  (#11694)

* fix

* Added comment

Co-authored-by: Alex Wu <alex@anyscale.com>

* [tune] fixed validation for search metrics (#11583)

* fixed validation for search metrics

* formatting

* made error report better

* if only one metric is missing extract it from list

* any can take a generator

* Fix asyncio plasma integration in cluster mode (#11665)

* [tune] PB2 (#11466)

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Version bump 1.0.1

* Disable validation of cluster config on the cluster to allow for cluster configs with new properties. (#11693)

* [Hotfix] Pin Pydantic Version (#11622)

* [docker] Fix docker regex (#11726)

Co-authored-by: Alex Wu <alex@anyscale.com>

* [GCS]Decouple node failure detector with resource related operations (#11465)

* [Placement Group] Placement group automatic cleanup. (#11546)

* In progress. Done with all placement group manager code.

* It is working with job.

* Finished detached actor implementation.

* Fix minor issue.

* In progress.

* Addressed code review.

* Addressed code review.

* Addressed code review.

* Fix a build error.

* [docker] Push to DockerHub in CI (#11442)

* [docker] Disable Readme push to avoid errors (#11770)

* Release testing things

* rllib regression results

* [Metrics] Implement basic metrics changes (#11769)

* Implement basic metrics changes

* Addressed code review.

* Fix build issue.

* Fix build issue.

* [Core] Fix ray start failure due to bug in redis address detection (#11735)

* Fix ray start failure due to redis address detection bug

* Address comment

* [Test] Ignore setproctitle for local mode (#11819)

* [Dashboard] Patch issue in 1.0.1 release where worker stats are not present for a node (#12062)

* [autoscaler] Add the cluster_name to docker file mounts directory prefix to make it more unique (#11600)

* Set version to 1.0.1.post1

* Sync Bonsai Changes in 1.0.1 (#47)

* Bump up the version to 0.8.6

* Linting fix.

* Add release test running full asan python test (#8836)

* [MERGE TO MASTER] Add microbenchmark result.

* Fix asyncio re-entry error message (#8842)

* Change os.uname()[1] and socket.gethostname() to the portable and faster platform.node_ip() (#8839)

Co-authored-by: Mehrdad <noreply@github.com>

* [serve] Fix long running failure test (#8863)

* [Serve] Serve long running test fix (#8864)

* Replace ps call with psutil (#8851)

* Replace ps call with psutil

* Minor formatting

Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>

* [Core] Fix a detached actor bug fix when GCS actor management is off. (#8843)

* [Testing] Fix LINT/sphinx errors. (#8874)

* Node failure test fix (#8882)

* [core] Check that port is unused before assigning to worker (#8773)

* [rllib] Set framework to tf by default and remove import checks; "Auto" option (#8748)

* tf by default

* Update rllib/agents/trainer.py

Co-authored-by: Sven Mika <sven@anyscale.io>

* remove it

* fix

* remove

* fix

* lint

Co-authored-by: Sven Mika <sven@anyscale.io>

* [RLlib] Issue 8889: action clipping bug ppo not learning mujoco (#8898)

* Fix Windows build (#8905)

Co-authored-by: Mehrdad <noreply@github.com>

* Use no_restart=False for ray.kill in Serve failure test (#8952)

* Display GPU Utilization in the Dashboard (#8564)

* Update incorrect detached actor docs (#8930)

* [Dashboard] Dashboard pubsub hotfix. (#8944)

* [CI] Fix Conda Permission on MacOS Github Action(#9004)


Co-authored-by: Mehrdad <noreply@github.com>

* Update pandas to 1.0.5 (#9065)

Co-authored-by: Mehrdad <noreply@github.com>

* Do not add reference count when it is local mode. (#8979)

* [Dashboard] Update the Ray dashboard documentation to explain memory view. (#8945)

* Windows compatibility (#93)

Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com>
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Preparing 0.8.6 (#26)

* Updated Version to 0.8.5.

* Formatting.

* Fix Serve long running test (#8223)

* Fix release 0.8.5 tests for PPO torch Breakout. (#8226)

* Remove logging (#8211)

* [BRING BACK TO MASTER] Fix cluster.yaml config.

* [rllib] Copy plasma memory before adding data to replay buffer

* [sgd] Resource limit lift for GPU test (#8238)

* Fix resource_ids_ data race (#8253)

* [rllib] [hotfix] Remove assert that trips on pytorch multiagent (#8241)

* [BRING BACK TO MASTER] add torch download for rllib regression test.

* [serve] Master actor fault tolerance (#8116)

* [serve] Add delete_backend call (#8252)

* Fix resource_ids_ data race (#8253)

* [serve] Add delete_endpoint call (#8256)

* [serve] Refactor BackendConfig (#8202)

* Delete example files.

* Fix serve long running test (#8268)

* [tune] Avoid breakage - soft deprecation warning for search algs (#8258)

* [tune] Hotfix Ax breakage when fixing backwards-compat (#8285)

* Async actor microbenchmark Script (#8275)

* [core] Disable GCS actor management (#8271)

* Pin redis-py version (#8290)

* [BRING BACK TO MASTER] add pip install upgrade to the command.

* Add ipython as dependency for autoscaler container (#8297)

Co-authored-by: rbusche <rbusche@inserve.de>

* Revert "Async actor microbenchmark Script (#8275)"

This reverts commit 6a6eead1fe45c774ce75da0d5f90f443ac3748ec.

* Docs and LINT.

* [RLlib] Increasing reusability v0 (#8)

* Set up CI with Azure Pipelines

Specifically, we are setting a
travis like ADO pipeline following
what is already present in the .travis.yml
file in the root of the repo.

* Separating travis like pipeline from main pipeline

* Adding Jenkins jobs equivalent

* Making some improvements

* Adding validation of the upstream CI

* Disabling Tune and large memory tests

* Changing threshold for simple reservoir sampling test

* Addressing comments

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with more travis updates

* Updating CI with new cpp worker tests

* Setting code owners

* Fixing the version number generation

* Making main pipeline also our release pipeline

* Updating Azure Pipelines with travis updates

* Fixing wheels test

* Fixing codeowners

* Updating Azure Pipelines with travis updates

* Bumping up MACOSX_DEPLOYMENT_TARGET

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Disabling Serve tests

* Making explicit which branches GitHubActions workflows should watch

* Disabling Ray serve tests

* Installing numpy explicitly

* consolidating Ray test steps in one yml

* Making worker set, apex and ppo a little bit more reusable for custom agents

* Making Dynamic TF policy more reusable

* Allow the actions dict to carry user data defined for the episodes

* Forcing RLlib tests to run always

* Making SAC model more extensible

* Adapting exploration API

* Reverting the random worker index change

* Making epsilon configurable

* Fixing method doc

* Fixing arguments check in reset_schedule

* Fixing per worker epsilon greedy

* Activating logs for failing test

* Making original_space check more robust

* Allow normalized action rescaling to happen outside RLlib

* Passing infos values from agents to callbacks

* Installing node js using a task

* Adding kwargs in TFModels

* Fixing npm and node in mac

* Fixing the num workers value passed

* Forcing RLlib tests

* Merging 0.8.5

* Running some RLlib test in custom agent

* Adding echo bazelisk

* Force CI

* Force CI

* Relaxing an installation

* Using container jobs

* Fixing container jobs

* Change base image for container job

* Install with sudo

* Exec with sudo

* Test

* Changing agent pool

* Remove python selection

* Fix version replacement

* Fix version replacement

* Trying Bazel

* Installing node with sudo

* Run all install as sudo

* Reverting sudo -s

* Fixing omitted param

* install python manually

* Fixing missing param

* Making NVM available

* Fix nvm installation

* Fix copy-paste

* renaming to req file

* fix typo

* Install JDK 8

* Install req in other jobs

* Install JDK with sudo

* Removing docker clean up

* Install Docker

* fix installation issue

* Adding azure package source

* Fix docker permissions

* Install jq

* downloading with sudo

* Install llvm as root

* Skipping flaky test

* copy artifacts as sudo

* Fix Bazel build in MacOS (#23)

* Fixing mac os building issue

* Bazelisk check

* Increase bazel version

* Fixing typos

* Update hash

* Include unzip

* Improved compilation and convergence tests

Added compilation tests that follow proper PyTest conventions.
These tests use parametrized settings, and allow for multiple algorithms to be
tested with a single test.
I've commented out the tests that these two tests can replace, to show the improvement.
Only about half of the algorithms have been transitioned to the new tests in
the interest of keeping the PR small.

* Increasing bazel version

* Increasing bazel version only mac pipelines

* Printing system info in Ubuntu wheels pipeline

* making docker install optional

* Compilation and convergence tests for more algos

Added compilation and convergence tests for Apex DQN, Apex DDPG
Added convergence tests for SAC
Removed old (commented out) compilation test code from
`rllib.agents.dqn.tests.test_apex`

* Clean up

Deleted old (commented out) test code

* Updated BUILD file

Split tests into test_compilation and test_learning.py to work with BAZEL build files.

* Updated BUILD file

Fixed bug in BUILD - wrong files passed in.

* BugFix: Improper imports causing test failures

* BugFix: Improper imports causing test failures

* Removed test_appo from BUILD file

* Fixing copy-paste error

* Applying some bazel fixes

* Fixing installation issues

* Update hash

* Fixing NVM/NODE installation

* Applying latest changes in travis.yml

* Fixing fixture data exclusions

* Disable some java tests

* Adgudime/apex sac (#25)

* WIP: Compilation tests work

* Fixed bugs with Apex SAC continuous action spaces

* Bugfix: Bad imports

* Fixing PyArrow issue

* Fixing guava check

* Fix datetime java format

* Fixing Bazel issues finding or loading conftest

* Fixing pytest module loading order

* Trying different approach to pickle check

* Installing latest pickle5 explicitly

* Fixing conftest resolution

* Temporarily disabling pickle5 validation

* Fixing fixture data exclusions

* Fixing data files treated as src

* Disable some java tests

Co-authored-by: Edilmo Palencia <edilmo@gmail.com>

* Fix multiple CI errors

* Update hash

* Fixing more build issues

* Fixing more build issues

* Fix pipeline cache path

* More fixes

* Fix cache

* Fixing bazel test command

* Fix bazel test

* Allowing custom summarize episodes

* Adding custom metrics ops in exec plan

* Apex SAC exploration should be stochastic

* Letting DQN deal with reshaping for Discrete spaces

* Commenting the cache

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: ijrsvt <ian.rodney@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Rüdiger Busche <rbusche@posteo.net>
Co-authored-by: rbusche <rbusche@inserve.de>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com>

* Fix system info step (#29)

* Fix system info step (#30)

* adding testing framework (#28)

* adding testing framework

* install kubebuilder for testing

* adding correct hash

Co-authored-by: Ali Kanso <ali.kanso@microsoft.com>

* add shared mem max flag

* change readme

* Tuned hyperparams for ApexSAC

* Bugfix for exploration config.

* Allowing PPO to handle async sampling (#34)

* Making ppo ParallelRollouts mode configurable

* Making dqn ParallelRollouts mode configurable

* Making RolloutWorker generator function public

* Missing argument

* Stop iteration if round robin proportion is not met

* fixing wheels parsing

* Improving iter union stop-iteration conditions

* Fixing DDPG

* Fixing MADDPG

* Fix tflite compat issue (#35)

* Fix tflite compat issue

* Fixing iter corner case

* Manual stride with ellipsis

* Fix unnecessary stop iteration

* Allow replay ops to stop if they are unhealthy (#36)

* Allow the replay ops to stop if they are unhealthy

* Allowing to configure dqn execution plan consistently

* Making configurable concurrency mode in DQN and metric collection in Apex (#37)

* Fixing concurrency op in dqn (#38)

* Replaced Prioritized Experience Replay with normal Experience replay to create AsyncSAC.

* Setting prioritized_replay in config now uses PrioritizedReplay correctly.

* Renamed LocalAsyncReplayBuffer and AsyncReplayActor to better reflect usage

* Added test with prioritized_replay set to True

* Cleaned up code.

* Fixing manual slicing (#40)

* Fixing manual slicing

* Handling the Box space explicitly

* Including the force stop in gather_async (#41)

* Including the force stop in gather_async

* Fix missing bar

* Fix for gather across shards

* Fix for gather async extreme case

* Making env-runner an explicit iterator and Local Iterator regenerable  (#42)

* Making env-runner an explicit iterator
And also making the LocalIterator able to regenerate.

* Fix multi agent test

* Fix union

* Making infinite sequence explicit

For the sake of the parallel iterators, an iterator that holds an infinite sequence can be called again after a StopIteration.
In other words, a StopIteration from an infinite sequence must be treated as a "no items available" message.

* Fix unexpected error

* Fixing gym version

* Update hash

* Addressing comments

* Improve gathering async and by shards (#44)

* Improve gathering async and by shards

* Making ParallelIteratorWorker an explicit Iterator in all cases

* Making ParallelIteratorWorker an explicit Iterator in all cases

* Fixing inverted condition

* Removing ForceStopIteration

* Make seeding possible even if env cannot be seeded.

* Fix grep versions (#46)

* Fix grep versions

* Splitting the stages

* Using pool for all rllib

* Update hash

* fixing path permissions

* Changing node version

* Reverting some OS changes

* Fixing compilation errors

* More compilation errors

* More compilation errors

* Fix node installation

* Fixing some package versions

* Using right bazel version

* Fix mac os version in wheels

* Fix mac os version in wheels

* Some minor fixes

* Force the target mac os

* Fix path

* Disable stress test temporarily

* Fixing gitignore

* Fixing Sampler merge mistakes

* Fixing epsilon greedy merge mistakes and requirements versions

* Fix merge error

* Apply changes in travis.yml

* Fix several issues

* Fixing more compatibility bugs

* Fix more incompatibilities

* More incompatibilities

* Fixing more compat issues

* Disable tune horovod torch tests

* Fixing more tests

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com>
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Rüdiger Busche <rbusche@posteo.net>
Co-authored-by: rbusche <rbusche@inserve.de>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com>
Co-authored-by: Ali Kanso <akanso@us.ibm.com>
Co-authored-by: Ali Kanso <ali.kanso@microsoft.com>

* Applying travis.yml changes

* Use latest pip

* Update the hash

* Fix rllib issues

* Fix rllib issues 2

* Fix tune errors

* Fix ray issues

* Remove old operator

* revert some rllib test deletions

* revert changes on release folder

* Revert more changes

* Logging dashboard building

* Use previous docker image

* Use centos docker image

* more logging

* Comment step

* hash

* installing node 14

* Fix hash

Co-authored-by: Gekho457 <62982571+Gekho457@users.noreply.github.com>
Co-authored-by: Alan Guo <aguo@aguo.software>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Max Fitton <maxfitton@anyscale.com>
Co-authored-by: Max Fitton <max@semprehealth.com>
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Alex Wu <alex@anyscale.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Barak Michener <me@barakmich.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Raoul Khouri <69156393+raoul-khour-ts@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Jack Parker-Holder <jackph@robots.ox.ac.uk>
Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Tao Wang <wangtaothetonic@163.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com>
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
Co-authored-by: Rüdiger Busche <rbusche@posteo.net>
Co-authored-by: rbusche <rbusche@inserve.de>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com>
Co-authored-by: Ali Kanso <akanso@us.ibm.com>
Co-authored-by: Ali Kanso <ali.kanso@microsoft.com>

* Apply changes in travis.yml

* Apply changes in travis.yml

* Fix hash

* Fix sampler

* node 14

* Fix sampler 2

* Disable flaky test

* Fix tune test

Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: fyrestone <fyrestone@outlook.com>
Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
Co-authored-by: Barak Michener <me@barakmich.com>
Co-authored-by: DK.Pino <loushang.ls@antfin.com>
Co-authored-by: Ameer Haj Ali <ameer@anyscale.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Max Fitton <maxfitton@anyscale.com>
Co-authored-by: Corey Lowman <coreylowman@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: ZhuSenlin <wumuzi520@126.com>
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
Co-authored-by: Michael Luo <michael.luo123456789@gmail.com>
Co-authored-by: Alind Khare <alindkhare@gatech.edu>
Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>
Co-authored-by: Hao Zhang <zhisbug@users.noreply.github.com>
Co-authored-by: architkulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Lavanya Shukla <lavanya.shukla12@gmail.com>
Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: ijrsvt <ilr@anyscale.com>
Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Qing Wang <kingchin1218@126.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Gabriele Oliaro <gabriele_oliaro@college.harvard.edu>
Co-authored-by: Raed Shabbir <raedshabbir@gmail.com>
Co-authored-by: Tao Wang <dooku.wt@antfin.com>
Co-authored-by: YLJALDC <dal177@ucsd.edu>
Co-authored-by: Basu Jindal <42815171+basujindal@users.noreply.github.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Co-authored-by: dHannasch <David.A.Hannasch@gmail.com>
Co-authored-by: Lingxuan Zuo <skyzlxuan@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Akash Patel <17132214+acxz@users.noreply.github.com>
Co-authored-by: Edwin Goh <37746563+edwinytgoh@users.noreply.github.com>
Co-authored-by: Maltimore <git@maltimore.info>
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
Co-authored-by: Micah Yong <micahtyong@gmail.com>
Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk>
Co-authored-by: SameerF <sameer@blueplastic.com>
Co-authored-by: Todd A. Anderson <drtodd13@comcast.net>
Co-authored-by: Keqiu Hu <khu@linkedin.com>
Co-authored-by: Daan Klijn <daanklijn0@gmail.com>
Co-authored-by: dmatch01 <dmatch01@users.noreply.github.com>
Co-authored-by: Gekho457 <62982571+Gekho457@users.noreply.github.com>
Co-authored-by: Alan Guo <aguo@aguo.software>
Co-authored-by: Max Fitton <max@semprehealth.com>
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Raoul Khouri <69156393+raoul-khour-ts@users.noreply.github.com>
Co-authored-by: Jack Parker-Holder <jackph@robots.ox.ac.uk>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Tao Wang <wangtaothetonic@163.com>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
Co-authored-by: mehrdadn <mehrdadn@users.noreply.github.com>
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
Co-authored-by: Rüdiger Busche <rbusche@posteo.net>
Co-authored-by: rbusche <rbusche@inserve.de>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Aditya Gudimella <aditya.gudimella@gmail.com>
Co-authored-by: Ali Kanso <akanso@us.ibm.com>
Co-authored-by: Ali Kanso <ali.kanso@microsoft.com>
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

3 participants