Test example applications and rllib in jenkins tests. #707

Merged
pcmoritz merged 5 commits into ray-project:master from the jenkinsexamples branch on Jul 16, 2017

robertnishihara
Collaborator

@robertnishihara robertnishihara commented Jul 4, 2017

This currently tests A3C and evolution strategies in CI. For some reason the policy gradient example doesn't seem to work in Docker.

This should address #558.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1169/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1172/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1185/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1187/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1188/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1194/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1197/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1198/
Test FAILed.

@shaneknapp
Contributor

test this please

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1202/
Test FAILed.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1207/
Test FAILed.

@shaneknapp
Contributor

btw, when the test ran this morning (https://amplab.cs.berkeley.edu/jenkins/job/Ray-PRB/1198/), it hung and left myriad zombie processes on the worker:

root@amp-jenkins-staging-worker-01:~# ps auxww|grep Z
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     136544  0.0  0.0      0     0 ?        Z    09:05   0:00 [redis-server] <defunct>
root     136548  0.0  0.0      0     0 ?        Z    09:05   0:00 [redis-server] <defunct>
root     136552  0.0  0.0      0     0 ?        Z    09:05   0:00 [python] <defunct>
root     136553  0.0  0.0      0     0 ?        Z    09:05   0:00 [python] <defunct>
root     136561  0.0  0.0      0     0 ?        Z    09:05   0:00 [local_scheduler] <defunct>
root     136562  0.2  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136563  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136564  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136565  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136566  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136567  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136568  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136569  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136570  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136571  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136572  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136573  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136574  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136575  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136576  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136577  0.2  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136578  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136579  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136580  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136581  0.3  0.0      0     0 ?        Z    09:05   0:17 [python] <defunct>
root     136582  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136583  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136584  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136585  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136586  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136587  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136588  0.2  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136589  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136590  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136591  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136592  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136593  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136594  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136595  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136596  0.2  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136597  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136598  0.2  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136599  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136600  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136601  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136602  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136603  0.2  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136604  0.3  0.0      0     0 ?        Z    09:05   0:17 [python] <defunct>
root     136605  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136606  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136607  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136608  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136609  0.3  0.0      0     0 ?        Z    09:05   0:16 [python] <defunct>
root     136610  0.1  0.0      0     0 ?        Z    09:05   0:06 [jupyter-noteboo] <defunct>
root     136853  1.4  0.0      0     0 ?        Z    09:05   1:17 [python] <defunct>
root     136854  1.4  0.0      0     0 ?        Z    09:05   1:19 [python] <defunct>
root     136855  1.4  0.0      0     0 ?        Z    09:05   1:17 [python] <defunct>
root     136856  1.3  0.0      0     0 ?        Z    09:05   1:15 [python] <defunct>
root     136857  1.3  0.0      0     0 ?        Z    09:05   1:13 [python] <defunct>

this is, um, suboptimal and really bad, as the only way to recover is a hard reboot of the server. it can also keep builds from other projects from running on the same machine... a clipper PRB build that fired off immediately after these zombies appeared hung indefinitely, and i discovered the zombies while investigating that hang.
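
To summarize a listing like the one above without reading it row by row, the defunct entries can be tallied per command name. The following is only a sketch: `count_zombies` is a hypothetical helper, not part of Ray or the Jenkins setup, and it assumes the standard 11-column `ps aux` layout shown in the output above.

```python
from collections import Counter


def count_zombies(ps_output: str) -> Counter:
    """Count defunct (STAT starting with 'Z') processes per command name."""
    counts = Counter()
    for line in ps_output.splitlines():
        # COMMAND is the 11th column and may itself contain spaces.
        fields = line.split(None, 10)
        # Skip the header row, shell prompt lines, and anything too short.
        if len(fields) < 11 or fields[0] == "USER":
            continue
        stat, command = fields[7], fields[10]
        if stat.startswith("Z"):
            # Defunct entries look like "[python] <defunct>"; keep the name only.
            counts[command.split()[0].strip("[]")] += 1
    return counts


# A few rows copied from the listing above.
sample = """\
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     136544  0.0  0.0      0     0 ?        Z    09:05   0:00 [redis-server] <defunct>
root     136552  0.0  0.0      0     0 ?        Z    09:05   0:00 [python] <defunct>
root     136561  0.0  0.0      0     0 ?        Z    09:05   0:00 [local_scheduler] <defunct>
root     136610  0.1  0.0      0     0 ?        Z    09:05   0:06 [jupyter-noteboo] <defunct>
"""
print(count_zombies(sample))
```

Filtering on the STAT column rather than grepping for the letter Z avoids matching unrelated rows that merely contain a "Z" somewhere in the command.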

@robertnishihara
Collaborator Author

Thanks @shaneknapp, it looks like there may have been some corruption in the local scheduler. Or maybe the problem is something else. I'll see if I can reproduce it locally.
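
One possible guard against a hung test run, independent of what caused this particular hang, is to wrap the test command in a hard timeout and kill its whole process group when the timeout expires. This is a minimal POSIX-only sketch; `run_with_timeout` is a hypothetical helper and not something this PR adds.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own session; kill its process group on timeout.

    Returns the exit code, or None if the command timed out.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its pid doubles as the process group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

A call such as `run_with_timeout(["sleep", "30"], 1)` returns `None` after about one second. Note that if the command is a `docker run` client, killing the client may not stop the container itself, so a real CI job would probably also need explicit container cleanup.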

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1209/
Test FAILed.

@shaneknapp
Contributor

shaneknapp commented Jul 8, 2017 via email

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1306/
Test PASSed.

@robertnishihara
Collaborator Author

@pcmoritz, @ericl let me know if you have any comments about this.

# --iterations=2

docker run --shm-size=10G --memory=10G $DOCKER_SHA \
python /ray/python/ray/rllib/evolution_strategies/example.py \
Contributor

@ericl ericl Jul 16, 2017

Maybe set --env-name as well here? We could also run rllib/train.py instead, which somewhat supersedes the example.py files.

Otherwise this looks good to me.

Collaborator Author

Fixed. I agree switching to rllib/train.py would be a good change, perhaps in a subsequent PR, to make things more uniform.
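
To illustrate the uniformity point, the per-example Docker invocations could be generated from one helper so that every example gets the same resource limits. This is a sketch under assumptions: `docker_example_command` is a hypothetical helper, the image tag is a placeholder for `$DOCKER_SHA`, and the environment name passed to `--env-name` is only an illustrative value, not necessarily the one used in this PR.

```python
import shlex


def docker_example_command(image, script, extra_args=(), shm_size="10G", memory="10G"):
    """Build the argument list for running one example script in Docker."""
    return [
        "docker", "run",
        f"--shm-size={shm_size}",
        f"--memory={memory}",
        image,
        "python", script,
        *extra_args,
    ]


cmd = docker_example_command(
    image="ray-test-image",  # placeholder for the value of $DOCKER_SHA
    script="/ray/python/ray/rllib/evolution_strategies/example.py",
    extra_args=["--env-name", "CartPole-v0"],  # illustrative environment name
)
# Print a shell-safe rendering of the command for logging.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the limits in one place means a later switch from the example.py scripts to rllib/train.py would only change the `script` and `extra_args` values.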

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/1309/
Test PASSed.

@pcmoritz pcmoritz merged commit 80e8426 into ray-project:master Jul 16, 2017
@pcmoritz pcmoritz deleted the jenkinsexamples branch July 16, 2017 18:51
5 participants