Description
🐛 Describe the bug
My standalone TorchServe instance starts (initializes) normally.
If a request takes too long (longer than responseTimeout), TorchServe repeatedly tries to restart the worker, but the restart always fails.
I expect restarting to run the initialize() function again, but it seems it does not.
Error logs
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-06-25T23:14:16,114 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-06-25T23:14:16,151 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2023-06-25T23:14:16,218 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.8.1
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 8
Max heap size: 16064 M
Python executable: /home/venv/bin/python
Config file: /home/model-server/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: my_model=my_model.mar
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.|http(s)?://.]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: log
Disable system metrics: true
Workflow Store: /home/model-server/model-store
Model config: N/A
2023-06-25T23:14:16,223 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2023-06-25T23:14:16,235 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: my_model.mar
2023-06-25T23:14:16,247 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model my_model
2023-06-25T23:14:16,248 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my_model
2023-06-25T23:14:16,248 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model my_model loaded.
2023-06-25T23:14:16,248 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: my_model, count: 1
2023-06-25T23:14:16,253 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:14:16,254 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-06-25T23:14:16,303 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2023-06-25T23:14:16,303 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-06-25T23:14:16,304 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2023-06-25T23:14:16,304 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-06-25T23:14:16,305 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2023-06-25T23:14:17,200 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=32
2023-06-25T23:14:17,200 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:14:17,208 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:14:17,208 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]32
2023-06-25T23:14:17,209 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:14:17,209 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:14:17,209 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change null -> WORKER_STARTED
2023-06-25T23:14:17,212 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:14:17,216 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:14:17,217 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1687734857217
2023-06-25T23:14:17,234 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - model_name: my_model, batchSize: 1
2023-06-25T23:14:17,235 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [xxx] Model initialization ... !!
2023-06-25T23:14:17,235 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [xxx] Model initialization ... DONE !!
2023-06-25T23:14:17,242 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-06-25T23:14:17,242 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 8
2023-06-25T23:14:17,243 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-06-25T23:14:17,244 [INFO ] W-9000-my_model_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:992.0|#WorkerName:W-9000-my_model_1.0,Level:Host|#hostname:6cfabc02384c,timestamp:1687734857
2023-06-25T23:14:17,244 [INFO ] W-9000-my_model_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:19.0|#Level:Host|#hostname:6cfabc02384c,timestamp:1687734857
2023-06-25T23:15:11,171 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:my_model,model_version:default|#hostname:6cfabc02384c,timestamp:1687734911
2023-06-25T23:15:11,172 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1687734911172
2023-06-25T23:15:11,173 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Backend received inference at: 1687734911
2023-06-25T23:15:11,174 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [xxx] run ... !!
2023-06-25T23:15:21,173 [ERROR] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2023-06-25T23:15:21,174 [ERROR] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:224) [model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:21,181 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2023-06-25T23:15:21,183 [INFO ] W-9000-my_model_1.0 ACCESS_LOG - /172.20.40.248:59169 "POST /predictions/my_model HTTP/1.1" 500 10013
2023-06-25T23:15:21,183 [INFO ] W-9000-my_model_1.0 TS_METRICS - Requests5XX.Count:1.0|#Level:Host|#hostname:6cfabc02384c,timestamp:1687734921
2023-06-25T23:15:21,183 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.job.Job - Waiting time ns: 126385, Inference time ns: 10011131942
2023-06-25T23:15:21,184 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2023-06-25T23:15:21,184 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1687734921184
2023-06-25T23:15:21,184 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:21,184 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:21,185 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2023-06-25T23:15:21,194 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:21,194 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:22,186 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:23,161 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=57
2023-06-25T23:15:23,161 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]57
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:23,170 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:23,170 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:23,171 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:23,171 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:23,171 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:23,171 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:23,172 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:23,172 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:23,172 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:23,172 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:23,173 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2023-06-25T23:15:23,185 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:23,185 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:24,173 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:25,124 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=69
2023-06-25T23:15:25,124 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]69
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:25,132 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:25,132 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:25,133 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:25,133 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:25,133 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:25,133 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:25,134 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:25,134 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:25,134 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:25,134 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:25,134 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:25,134 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 2 seconds.
2023-06-25T23:15:25,143 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:25,143 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:27,135 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:28,063 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=80
2023-06-25T23:15:28,064 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]80
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:28,072 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:28,072 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:28,073 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:28,073 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:28,073 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:28,073 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:28,074 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:28,074 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:28,074 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:28,074 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:28,074 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 3 seconds.
2023-06-25T23:15:28,083 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:28,083 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:31,074 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:32,013 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=91
2023-06-25T23:15:32,014 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]91
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:32,022 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:32,022 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:32,023 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:32,023 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:32,023 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:32,023 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:32,023 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:32,023 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:32,024 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:32,024 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:32,024 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 5 seconds.
2023-06-25T23:15:32,033 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:32,033 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:37,025 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:37,953 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=102
2023-06-25T23:15:37,954 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]102
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:37,962 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:37,962 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:37,963 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:37,963 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:37,963 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:37,963 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:37,963 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:37,963 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:37,964 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:37,964 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:37,964 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 8 seconds.
2023-06-25T23:15:37,973 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:37,973 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:45,964 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:15:46,880 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=113
2023-06-25T23:15:46,881 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]113
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:15:46,889 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:15:46,889 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:15:46,890 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:15:46,890 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:15:46,890 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:15:46,890 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:15:46,890 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:15:46,890 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:15:46,891 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:15:46,891 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:15:46,891 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 13 seconds.
2023-06-25T23:15:46,900 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:15:46,900 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:15:59,892 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:16:00,821 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=124
2023-06-25T23:16:00,822 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:00,829 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]124
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:16:00,830 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:16:00,830 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:00,831 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:16:00,831 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:16:00,831 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:16:00,831 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:16:00,831 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:16:00,831 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:16:00,831 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:16:00,831 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:16:00,831 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 21 seconds.
2023-06-25T23:16:00,841 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:16:00,841 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
2023-06-25T23:16:21,833 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-06-25T23:16:22,789 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=135
2023-06-25T23:16:22,790 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - [PID]135
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Torch worker started.
2023-06-25T23:16:22,799 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-06-25T23:16:22,799 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Python runtime: 3.9.17
2023-06-25T23:16:22,800 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-06-25T23:16:22,800 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-06-25T23:16:22,800 [INFO ] W-9000-my_model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-25T23:16:22,800 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-06-25T23:16:22,800 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:276) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:34) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:186) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-06-25T23:16:22,801 [DEBUG] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-06-25T23:16:22,801 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2023-06-25T23:16:22,801 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stderr
2023-06-25T23:16:22,801 [WARN ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my_model_1.0-stdout
2023-06-25T23:16:22,801 [INFO ] W-9000-my_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 34 seconds.
2023-06-25T23:16:22,811 [INFO ] W-9000-my_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stdout
2023-06-25T23:16:22,811 [INFO ] W-9000-my_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my_model_1.0-stderr
Installation instructions
Install torchserve from source? No
Using Docker? Yes (pytorch/torchserve:0.8.1-cpu)
Model Packaging
my_handler.py
import time
import torch
from typing import List, Dict, Any, Tuple
from ts.context import Context


class MyTSModel(object):
    def __init__(self) -> None:
        self.model = None
        self.mapping = None
        self.device = None
        self.initialized = False
        self.context = None
        self.manifest = None
        self.map_location = None
        # model specific
        self.model = None
        return

    @staticmethod
    def _check_device(properties: Dict[str, Any]) -> Tuple[str, torch.device]:
        # default is "cpu"
        map_location = "cpu"
        device = torch.device("cpu")
        if torch.cuda.is_available() and properties.get("gpu_id") is not None:
            map_location = "cuda"
            device = torch.device(map_location + ":" + str(properties.get("gpu_id")))
        return (map_location, device)

    # must have: Torchserve
    def initialize(self, context: Context) -> None:
        print(f"[xxx] Model initialization ... !!")
        properties = context.system_properties
        self.map_location, self.device = self._check_device(properties)
        self.manifest = context.manifest
        model_dir = properties.get("model_dir")
        self.initialized = True
        print(f"[xxx] Model initialization ... DONE !!")
        return

    # must have: Torchserve
    def handle(self, data: List[Dict[str, Any]], context: Context):
        # It can be used for pre or post processing if needed as additional request
        # information is available in context
        self.context = context
        # metrics = self.context.metrics
        output = list()
        for idx, row in enumerate(data):
            # run
            print(f"[xxx] run ... !!")
            time.sleep(15)
            print(f"[xxx] run ... DONE !!")
            output.append(f"sample output {idx}")
        return output
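The issue does not show how my_model.mar was packaged; assuming the handler above is archived on its own (no serialized weights) into the directory mounted as the model store in the compose file below, a command along these lines would produce it (paths and flags are illustrative):

torch-model-archiver \
  --model-name my_model \
  --version 1.0 \
  --handler my_handler.py \
  --export-path ./ts_examples/error_case/exported \
  --force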
config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/model-server/model-store
default_workers_per_model=1
load_models=my_model=my_model.mar
disable_system_metrics=true
# override the default response timeout (120s) with a short value
default_response_timeout=10
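Note: a per-model timeout can presumably also be set at registration time via the management API's response_timeout parameter (in seconds); this is only an illustrative sketch, not part of the actual setup, which loads the model at startup via load_models:

curl -X POST "http://<ip-address>:8081/models?url=my_model.mar&initial_workers=1&response_timeout=10"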
Versions
Environment headers
Torchserve branch:
torchserve==0.8.1
torch-model-archiver==0.8.1
Python version: 3.9 (64-bit runtime)
Python executable: /home/venv/bin/python
Versions of relevant python libraries:
captum==0.6.0
numpy==1.24.3
psutil==5.9.5
requests==2.31.0
torch==2.0.1+cpu
torch-model-archiver==0.8.1
torch-workflow-archiver==0.2.9
torchaudio==2.0.2+cpu
torchdata==0.6.1
torchserve==0.8.1
torchtext==0.15.2+cpu
torchvision==0.15.2+cpu
wheel==0.40.0
Java Version:
OS: Ubuntu 20.04.6 LTS
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: N/A
Repro instructions
- start torchserve
# example docker-compose.yml
version: "3.9"
services:
  torchserve_error:
    image: pytorch/torchserve:0.8.1-cpu
    shm_size: 1gb
    ports:
      - 8080:8080
      - 8081:8081
      - 8082:8082
    volumes:
      - ./ts_examples/error_case/config.properties:/home/model-server/config.properties
      - ./ts_examples/error_case/exported:/home/model-server/model-store
    command: [
      "torchserve", "--start",
      "--ts-config", "/home/model-server/config.properties"
    ]
- send a request (e.g. with Postman or curl)
curl --location --request POST 'http://<ip-address>:8080/predictions/my_model'
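- optionally, watch the worker being restarted by polling the standard DescribeModel endpoint on the management API (illustrative observation step, not part of the original report)
curl --location --request GET 'http://<ip-address>:8081/models/my_model'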
Possible Solution
No response