
Auto recovery failed again #2423

Open
@moono

Description


🐛 Describe the bug

My standalone TorchServe instance starts (initializes) normally.
If a request takes too long (longer than responseTimeout), TorchServe tries to restart the worker about 5 times, but the restart always fails.
I expected the restart to run the initialize() function again, but it does not seem to.

Error logs

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-06-25T23:14:16,114 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-06-25T23:14:16,151 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2023-06-25T23:14:16,218 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.8.1
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 8
Max heap size: 16064 M
Python executable: /home/venv/bin/python
Config file: /home/model-server/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: my_model=my_model.mar
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.|http(s)?://.]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: log
Disable system metrics: true
Workflow Store: /home/model-server/model-store
Model config: N/A
2023-06-25T23:14:16,223 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2023-06-25T23:14:16,235 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: my_model.mar
2023-06-25T23:14:16,247 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model my_model
2023-06-25T23:14:16,248 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my_model
2023-06-25T23:14:16,248 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model my_model loaded.
2023-06-25T23:14:16,248 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: my_model, count: 1
2023-06-25T23:14:16,253 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:14:16,254 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-06-25T23:14:16,303 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2023-06-25T23:14:16,303 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-06-25T23:14:16,304 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2023-06-25T23:14:16,304 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-06-25T23:14:16,305 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2023-06-25T23:14:17,200 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=32
2023-06-25T23:14:17,200 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:14:17,208 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:14:17,208 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]32
2023-06-25T23:14:17,209 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:14:17,209 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:14:17,209 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change null -> WORKER_STARTED
2023-06-25T23:14:17,212 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:14:17,216 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:14:17,217 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1687734857217
2023-06-25T23:14:17,234 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - model_name: my_model, batchSize: 1
2023-06-25T23:14:17,235 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [xxx] Model initialization ... !!
2023-06-25T23:14:17,235 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [xxx] Model initialization ... DONE !!
2023-06-25T23:14:17,242 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-06-25T23:14:17,242 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 8
2023-06-25T23:14:17,243 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-06-25T23:14:17,244 [INFO ] W-9000-my_model_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:992.0|#WorkerName:W-9000-my_model_1.0,Level:Host|#hostname:6cfabc02384c,timestamp:1687734857
2023-06-25T23:14:17,244 [INFO ] W-9000-my_model_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:19.0|#Level:Host|#hostname:6cfabc02384c,timestamp:1687734857
2023-06-25T23:15:11,171 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:my_model,model_version:default|#hostname:6cfabc02384c,timestamp:1687734911
2023-06-25T23:15:11,172 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1687734911172
2023-06-25T23:15:11,173 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Backend received inference at: 1687734911
2023-06-25T23:15:11,174 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [xxx] run ... !!
2023-06-25T23:15:21,173 [ERROR] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2023-06-25T23:15:21,174 [ERROR] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:224) [model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:21,181 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2023-06-25T23:15:21,183 [INFO ] W-9000-my_model_1.0 ACCESS_LOG - /172.20.40.248:59169 "POST /predictions/my_model HTTP/1.1" 500 10013
2023-06-25T23:15:21,183 [INFO ] W-9000-my_model_1.0 TS_METRICS - Requests5XX.Count:1.0|#Level:Host|#hostname:6cfabc02384c,timestamp:1687734921
2023-06-25T23:15:21,183 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.job.Job - Waiting time ns: 126385, Inference time ns: 10011131942
2023-06-25T23:15:21,184 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2023-06-25T23:15:21,184 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1687734921184
2023-06-25T23:15:21,184 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:21,184 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:21,185 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2023-06-25T23:15:21,194 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:21,194 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:22,186 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:23,161 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=57
2023-06-25T23:15:23,161 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]57
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:23,170 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:23,171 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:23,171 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:23,171 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:23,171 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:23,172 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:23,172 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:23,172 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:23,172 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:23,173 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2023-06-25T23:15:23,185 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:23,185 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:24,173 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:25,124 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=69
2023-06-25T23:15:25,124 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]69
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:25,132 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:25,133 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:25,133 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:25,133 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:25,133 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:25,134 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:25,134 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:25,134 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:25,134 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:25,134 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:25,134 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 2 seconds.
2023-06-25T23:15:25,143 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:25,143 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:27,135 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:28,063 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=80
2023-06-25T23:15:28,064 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]80
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:28,072 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:28,073 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:28,073 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:28,073 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:28,073 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:28,074 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:28,074 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:28,074 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:28,074 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:28,074 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 3 seconds.
2023-06-25T23:15:28,083 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:28,083 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:31,074 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:32,013 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=91
2023-06-25T23:15:32,014 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]91
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:32,022 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:32,023 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:32,023 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:32,023 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:32,023 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:32,023 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:32,023 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:32,024 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:32,024 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:32,024 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 5 seconds.
2023-06-25T23:15:32,033 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:32,033 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:37,025 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:37,953 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=102
2023-06-25T23:15:37,954 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]102
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:37,962 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:37,963 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:37,963 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:37,963 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:37,963 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:37,963 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:37,963 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:37,964 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:37,964 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:37,964 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 8 seconds.
2023-06-25T23:15:37,973 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:37,973 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:45,964 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:46,880 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=113
2023-06-25T23:15:46,881 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]113
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:46,889 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:46,890 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:46,890 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:46,890 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:46,890 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:46,890 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:46,890 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:46,891 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:46,891 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:46,891 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 13 seconds.
2023-06-25T23:15:46,900 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:46,900 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:59,892 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:16:00,821 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=124
2023-06-25T23:16:00,822 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:00,829 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]124
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:16:00,830 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:00,831 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:16:00,831 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:16:00,831 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:16:00,831 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:16:00,831 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:16:00,831 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:16:00,831 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:16:00,831 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:16:00,831 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 21 seconds.
2023-06-25T23:16:00,841 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:16:00,841 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:16:21,833 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:16:22,789 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=135
2023-06-25T23:16:22,790 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]135
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:16:22,799 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:16:22,800 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:22,800 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:16:22,800 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:16:22,800 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:16:22,800 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:16:22,801 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:16:22,801 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:16:22,801 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:16:22,801 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:16:22,801 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 34 seconds.
2023-06-25T23:16:22,811 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:16:22,811 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr

Installation instructions

Install torchserve from source? No
Using Docker? Yes (pytorch/torchserve:0.8.1-cpu)

Model Packaging

my_handler.py

import time
import torch

from typing import List, Dict, Any, Tuple
from ts.context import Context


class MyTSModel(object):
    def __init__(self) -> None:
        self.model = None
        self.mapping = None
        self.device = None
        self.initialized = False
        self.context = None
        self.manifest = None
        self.map_location = None

        # model specific
        self.model = None
        return

    @staticmethod
    def _check_device(properties: Dict[str, Any]) -> Tuple[str, torch.device]:
        # default is "cpu"
        map_location = "cpu"
        device = torch.device("cpu")
        if torch.cuda.is_available() and properties.get("gpu_id") is not None:
            map_location = "cuda"
            device = torch.device(map_location + ":" + str(properties.get("gpu_id")))
        return (map_location, device)

    # must have: Torchserve
    def initialize(self, context: Context) -> None:
        print(f"[xxx] Model initialization ... !!")
        properties = context.system_properties
        self.map_location, self.device = self._check_device(properties)
        self.manifest = context.manifest
        model_dir = properties.get("model_dir")

        self.initialized = True
        print(f"[xxx] Model initialization ... DONE !!")
        return

    # must have: Torchserve
    def handle(self, data: List[Dict[str, Any]], context: Context):
        # It can be used for pre or post processing if needed as additional request
        # information is available in context

        self.context = context
        # metrics = self.context.metrics

        output = list()
        for idx, row in enumerate(data):
            # run
            print(f"[xxx] run ... !!")
            time.sleep(15)  # sleeps longer than the 10 s response timeout configured below, triggering the worker restart
            print(f"[xxx] run ... DONE !!")
            output.append(f"sample output {idx}")
        return output
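
For completeness, the my_model.mar referenced above was presumably built with torch-model-archiver roughly along these lines; the paths are assumptions, since the exact archiving command is not included in the report:

torch-model-archiver \
    --model-name my_model \
    --version 1.0 \
    --handler my_handler.py \
    --export-path ./ts_examples/error_case/exported \
    --force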

config.properties

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/model-server/model-store
default_workers_per_model=1
load_models=my_model=my_model.mar
disable_system_metrics=true

# override the default response timeout of 120 s
default_response_timeout=10
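
As a side note (not part of the original report), the response timeout can also be set per model at registration time through the management API's response_timeout parameter; a hypothetical example, assuming the model is registered via the API rather than preloaded with load_models:

curl -X POST "http://localhost:8081/models?url=my_model.mar&initial_workers=1&response_timeout=10"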

Versions


Environment headers

Torchserve branch:

torchserve==0.8.1
torch-model-archiver==0.8.1

Python version: 3.9 (64-bit runtime)
Python executable: /home/venv/bin/python

Versions of relevant python libraries:
captum==0.6.0
numpy==1.24.3
psutil==5.9.5
requests==2.31.0
torch==2.0.1+cpu
torch-model-archiver==0.8.1
torch-workflow-archiver==0.2.9
torchaudio==2.0.2+cpu
torchdata==0.6.1
torchserve==0.8.1
torchtext==0.15.2+cpu
torchvision==0.15.2+cpu
wheel==0.40.0

Java Version:

OS: Ubuntu 20.04.6 LTS
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: N/A

Repro instructions

  1. Start TorchServe with Docker Compose:
# example docker-compose.yml
version: "3.9"
services:
  torchserve_error:
    image: pytorch/torchserve:0.8.1-cpu
    shm_size: 1gb
    ports:
      - 8080:8080
      - 8081:8081
      - 8082:8082
    volumes:
      - ./ts_examples/error_case/config.properties:/home/model-server/config.properties
      - ./ts_examples/error_case/exported:/home/model-server/model-store
    command: [
      "torchserve", "--start", 
      "--ts-config", "/home/model-server/config.properties"
    ]
  2. Send a request with curl (or Postman):
curl --location --request POST 'http://<ip-address>:8080/predictions/my_model'
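
To observe that the worker never returns to a healthy state after the auto-recovery attempts, the worker status can be queried through the management API (assuming the default port mapping from the compose file above):

curl http://localhost:8081/models/my_model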

Possible Solution

No response

Metadata

Labels: bug (Something isn't working), debugging, p0 (high priority), triaged (Issue has been reviewed and triaged)
