[Serve] Add verbose log for nightly test only #20088

Merged: 2 commits, Nov 4, 2021

simon-mo (Contributor) commented Nov 4, 2021

Why are these changes needed?

I can't debug the mysterious failure at the moment, and it is hard to reproduce. This PR therefore adds verbose logs that are emitted only in release tests.

I tested that this log is printed only when the env var is enabled.
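
For readers who want the gating pattern spelled out, here is a minimal sketch of env-var-gated logging. It is not the code from this PR: the variable name `EXAMPLE_SERVE_VERBOSE_LOG`, the logger name, and both function names are hypothetical, and the debug dict is passed in rather than collected from the cluster.

```python
import json
import logging
import os

# Hypothetical variable name, chosen for this sketch only.
VERBOSE_LOG_ENV_VAR = "EXAMPLE_SERVE_VERBOSE_LOG"

logger = logging.getLogger("verbose_scaling_sketch")


def verbose_logging_enabled() -> bool:
    """Return True only when the variable is set to something other than "" or "0"."""
    return os.environ.get(VERBOSE_LOG_ENV_VAR, "0") not in ("", "0")


def maybe_log_scaling_info(debug_info: dict) -> bool:
    """Log the debug dict as indented JSON if verbose logging is enabled.

    Returns True when a log record was emitted and False otherwise,
    which makes the gate easy to check in a test.
    """
    if not verbose_logging_enabled():
        return False
    logger.error("Scaling information\n%s", json.dumps(debug_info, indent=2))
    return True
```

With the variable unset (or set to `0`), `maybe_log_scaling_info` returns without logging, so regular runs stay quiet and only jobs that export the variable pay for the extra output.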

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

simon-mo (Contributor, Author) commented Nov 4, 2021

Example output:

(ServeController pid=7655) 2021-11-04 14:36:16,955      WARNING backend_state.py:1145 -- Deployment 'f' has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {'CPU': 1, 'GPU': 1}, resources available: {'CPU': 16.0}. component=serve deployment=f
(ServeController pid=7655) 2021-11-04 14:36:16,959      ERROR backend_state.py:74 -- Scaling information
(ServeController pid=7655) {
(ServeController pid=7655)   "nodes": [
(ServeController pid=7655)     {
(ServeController pid=7655)       "NodeID": "6b9f244ce298936b71c8b242d1b747501062cbef0158b02aea7803d8",
(ServeController pid=7655)       "Alive": true,
(ServeController pid=7655)       "NodeManagerAddress": "127.0.0.1",
(ServeController pid=7655)       "NodeManagerHostname": "Simons-MacBook-Pro.local",
(ServeController pid=7655)       "NodeManagerPort": 63337,
(ServeController pid=7655)       "ObjectManagerPort": 63336,
(ServeController pid=7655)       "ObjectStoreSocketName": "/tmp/ray/session_2021-11-04_14-13-07_169048_5470/sockets/plasma_store",
(ServeController pid=7655)       "RayletSocketName": "/tmp/ray/session_2021-11-04_14-13-07_169048_5470/sockets/raylet",
(ServeController pid=7655)       "MetricsExportPort": 65344,
(ServeController pid=7655)       "alive": true,
(ServeController pid=7655)       "Resources": {
(ServeController pid=7655)         "CPU": 16.0,
(ServeController pid=7655)         "memory": 23238119424.0,
(ServeController pid=7655)         "node:127.0.0.1": 1.0,
(ServeController pid=7655)         "object_store_memory": 11619059712.0
(ServeController pid=7655)       }
(ServeController pid=7655)     }
(ServeController pid=7655)   ],
(ServeController pid=7655)   "available_resources": {
(ServeController pid=7655)     "memory": 23238119424.0,
(ServeController pid=7655)     "CPU": 16.0,
(ServeController pid=7655)     "object_store_memory": 11619059712.0,
(ServeController pid=7655)     "node:127.0.0.1": 0.98
(ServeController pid=7655)   },
(ServeController pid=7655)   "total_resources": {
(ServeController pid=7655)     "memory": 23238119424.0,
(ServeController pid=7655)     "node:127.0.0.1": 1.0,
(ServeController pid=7655)     "CPU": 16.0,
(ServeController pid=7655)     "object_store_memory": 11619059712.0
(ServeController pid=7655)   },
(ServeController pid=7655)   "autoscaler_logs": [
(ServeController pid=7655)     "Usage:\n",
(ServeController pid=7655)     " 0.0/16.0 CPU\n",
(ServeController pid=7655)     " 0.00/21.642 GiB memory\n",
(ServeController pid=7655)     " 0.00/10.821 GiB object_store_memory\n",
(ServeController pid=7655)     "\n",
(ServeController pid=7655)     "Demands:\n",
(ServeController pid=7655)     " {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors (1+ using placement groups)\n",
(ServeController pid=7655)     " {'GPU': 1.0, 'CPU': 1.0} * 1 (PACK): 1+ pending placement groups\n",
(ServeController pid=7655)     "2021-11-04 14:36:10,335\tINFO autoscaler.py:267 -- \n",
(ServeController pid=7655)     "======== Autoscaler status: 2021-11-04 14:36:10.335583 ========\n",
(ServeController pid=7655)     "Node status\n",
(ServeController pid=7655)     "---------------------------------------------------------------\n",
(ServeController pid=7655)     "Healthy:\n",
(ServeController pid=7655)     " 1 node_6b9f244ce298936b71c8b242d1b747501062cbef0158b02aea7803d8\n",
(ServeController pid=7655)     "Pending:\n",
(ServeController pid=7655)     " (no pending nodes)\n",
(ServeController pid=7655)     "Recent failures:\n",
(ServeController pid=7655)     " (no failures)\n",
(ServeController pid=7655)     "\n",
(ServeController pid=7655)     "Resources\n",
(ServeController pid=7655)     "---------------------------------------------------------------\n",
(ServeController pid=7655)     "Usage:\n",
(ServeController pid=7655)     " 0.0/16.0 CPU\n",
(ServeController pid=7655)     " 0.00/21.642 GiB memory\n",
(ServeController pid=7655)     " 0.00/10.821 GiB object_store_memory\n",
(ServeController pid=7655)     "\n",
(ServeController pid=7655)     "Demands:\n",
(ServeController pid=7655)     " {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors (1+ using placement groups)\n",
(ServeController pid=7655)     " {'GPU': 1.0, 'CPU': 1.0} * 1 (PACK): 1+ pending placement groups\n",
(ServeController pid=7655)     "2021-11-04 14:36:15,342\tINFO autoscaler.py:267 -- \n",
(ServeController pid=7655)     "======== Autoscaler status: 2021-11-04 14:36:15.342040 ========\n",
(ServeController pid=7655)     "Node status\n",
(ServeController pid=7655)     "---------------------------------------------------------------\n",
(ServeController pid=7655)     "Healthy:\n",
(ServeController pid=7655)     " 1 node_6b9f244ce298936b71c8b242d1b747501062cbef0158b02aea7803d8\n",
(ServeController pid=7655)     "Pending:\n",
(ServeController pid=7655)     " (no pending nodes)\n",
(ServeController pid=7655)     "Recent failures:\n",
(ServeController pid=7655)     " (no failures)\n",
(ServeController pid=7655)     "\n",
(ServeController pid=7655)     "Resources\n",
(ServeController pid=7655)     "---------------------------------------------------------------\n",
(ServeController pid=7655)     "Usage:\n",
(ServeController pid=7655)     " 0.0/16.0 CPU\n",
(ServeController pid=7655)     " 0.00/21.642 GiB memory\n",
(ServeController pid=7655)     " 0.00/10.821 GiB object_store_memory\n",
(ServeController pid=7655)     "\n",
(ServeController pid=7655)     "Demands:\n",
(ServeController pid=7655)     " {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors (1+ using placement groups)\n",
(ServeController pid=7655)     " {'CPU': 1.0, 'GPU': 1.0} * 1 (PACK): 1+ pending placement groups\n"
(ServeController pid=7655)   ]
(ServeController pid=7655) }
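
Reading the output above: the replica requests `{'CPU': 1, 'GPU': 1}`, but neither `available_resources` nor `total_resources` lists a GPU, so the request stays pending. The sketch below makes that comparison explicit using the values from the log. The helper `unmet_resources` is hypothetical and compares against cluster-wide totals only; it does not model per-node placement or placement groups.

```python
def unmet_resources(required: dict, available: dict) -> dict:
    """Return {name: (needed, available)} for every resource that falls short."""
    return {
        name: (amount, available.get(name, 0.0))
        for name, amount in required.items()
        if available.get(name, 0.0) < amount
    }


# Values copied from the warning line and the "available_resources" entry above.
required = {"CPU": 1, "GPU": 1}
available = {
    "memory": 23238119424.0,
    "CPU": 16.0,
    "object_store_memory": 11619059712.0,
    "node:127.0.0.1": 0.98,
}

print(unmet_resources(required, available))  # {'GPU': (1, 0.0)}
```

A check like this only shows that a GPU is missing from the cluster totals; the `autoscaler_logs` entries ("no pending nodes") are what indicate that no GPU node is on its way.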

simon-mo merged commit 4d583da into ray-project:master on Nov 4, 2021