TorchServe no prediction when input data gets bigger (Backend worker did not respond in given time) #1812

@tednaseri

Description

🐛 Describe the bug

I am passing JSON data via python-requests. For simplicity, you can assume the following input (the endpoint URL below is illustrative):

import requests

# TorchServe prediction endpoint; the model name "hardnews" is taken from the worker logs.
url = "http://127.0.0.1:8080/predictions/hardnews"
dic1 = {"main": "this is a main", "categories": "this is a categories"}
count = 2
input_data = [dic1 for i in range(count)]
response = requests.post(url, json=input_data)

Issue:
When count <= 8, it works well.
As soon as count > 8, the request gets stuck and never returns.

As you can see, the input is just a simple Python dictionary, and even with input_data = [dic1 for i in range(10)] the total size of the payload is very small.

I am using:

  • a custom handler
  • a trained ML model based on simpletransformers
  • Ubuntu 22.04 and 20.04
  • GPU (local: RTX 3060; Kubernetes: T4)
  • tested on local machines and on Kubernetes; the issue is the same

When the issue shows itself:
TorchServe on the GPU: whether a request succeeds depends critically on the input data size.

When it works well:
TorchServe on the CPU: it works regardless of the input size.
PyTorch without TorchServe: direct inference works fine even when I pass input_data = [dic1 for i in range(1000)].
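
For reference, the direct (no TorchServe) test is roughly the sketch below. The model type, model directory, and the way the two text fields are joined are assumptions for illustration, not taken from the actual code.

from simpletransformers.classification import ClassificationModel

# Sketch of the direct-inference test; "roberta" and "model_folder" are assumptions
# based on the tokenizer files (vocab.json, merges.txt) listed in the archive.
model = ClassificationModel("roberta", "model_folder", use_cuda=True)

dic1 = {"main": "this is a main", "categories": "this is a categories"}
input_data = [dic1 for i in range(1000)]

# simpletransformers expects a list of strings; joining the two fields is an assumption.
texts = [d["main"] + " " + d["categories"] for d in input_data]
predictions, raw_outputs = model.predict(texts)
print(predictions[:5])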

Error logs

2022-08-23 16:34:14,800 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change null -> WORKER_STARTED
2022-08-23 16:34:14,804 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2022-08-23 16:34:22,821 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 7931
2022-08-23 16:34:22,821 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2022-08-23 16:34:48,850 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1093
2022-08-23 16:34:48,852 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 239159, Backend time ns: 1094847821
2022-08-23 16:34:53,317 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 156
2022-08-23 16:34:53,318 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 144077, Backend time ns: 157878185
2022-08-23 16:35:01,126 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 224
2022-08-23 16:35:01,127 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 140180, Backend time ns: 225271057
2022-08-23 16:35:38,326 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 30000
2022-08-23 16:35:38,326 [ERROR] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2022-08-23 16:35:38,327 [ERROR] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:198)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
2022-08-23 16:35:38,328 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2022-08-23 16:35:38,335 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 64717, Inference time ns: 30009467849
2022-08-23 16:35:38,335 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2022-08-23 16:35:38,335 [WARN ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-hardnews_1.0-stderr
2022-08-23 16:35:38,335 [WARN ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-hardnews_1.0-stdout
2022-08-23 16:35:38,336 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
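
The log shows the worker being stopped after roughly 30 seconds (Backend response time: 30000). One thing worth trying, though not a confirmed fix, is re-registering the model with a larger response_timeout through the management API. The sketch below assumes the management API on the default port 8081, that the model is registered as hardnews, and that the archive is hardnews.mar; adjust to the actual names.

import requests

MANAGEMENT = "http://127.0.0.1:8081"

# Unregister, then re-register with a larger response timeout (values are illustrative).
requests.delete(f"{MANAGEMENT}/models/hardnews")
resp = requests.post(
    f"{MANAGEMENT}/models",
    params={
        "url": "hardnews.mar",       # adjust to the actual .mar file in the model store
        "initial_workers": 1,
        "batch_size": 1,
        "max_batch_delay": 10,
        "response_timeout": 600,     # seconds; the default is 120
        "synchronous": "true",
    },
)
print(resp.status_code, resp.text)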

Installation instructions

I don't use Docker for installation.

Model Packaging

# Running archiver
torch-model-archiver -f --model-name model \
--version 1.0 \
--serialized-file model_folder/pytorch_model.bin \
--export-path model-store \
--requirements-file requirements.txt \
--extra-files "model_folder/config.json,model_folder/merges.txt,model_folder/model_args.json,model_folder/special_tokens_map.json,model_folder/tokenizer.json,model_folder/tokenizer_config.json,model_folder/training_args.bin,model_folder/vocab.json" \
--handler handler.py
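
For context, handler.py is roughly of the following shape. This is a minimal sketch assuming a simpletransformers classification model and the two-field JSON input shown above; it is not the actual handler used in this issue.

# handler.py -- minimal sketch, not the reporter's actual custom handler.
from ts.torch_handler.base_handler import BaseHandler
from simpletransformers.classification import ClassificationModel


class NewsHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        use_cuda = properties.get("gpu_id") is not None
        # "roberta" is an assumption based on the tokenizer files in the archive.
        self.model = ClassificationModel("roberta", model_dir, use_cuda=use_cuda)
        self.initialized = True

    def preprocess(self, data):
        # Each element of data is one request; its body is assumed to be the
        # already-decoded JSON list of {"main": ..., "categories": ...} dicts.
        texts = []
        for row in data:
            body = row.get("body") or row.get("data")
            for item in body:
                texts.append(item["main"] + " " + item["categories"])
        return texts

    def inference(self, texts):
        predictions, _ = self.model.predict(texts)
        return predictions

    def postprocess(self, predictions):
        # One response element per request in the batch (batch_size=1 here);
        # integer class ids are an assumption about the model's labels.
        return [[int(p) for p in predictions]]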

config.properties

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
install_py_dep_per_model=true
NUM_WORKERS=1
number_of_gpu=1
number_of_netty_threads=4
netty_client_threads=1
MKL_NUM_THREADS=1
batch_size=1
max_batch_delay=10
job_queue_size=1000
model_store=/home/model-server/shared/model-store
model_snapshot={"name": "startup.cfg","modelCount": 1,"models": {"news": {"1.0": {"defaultVersion": true,"marName": "news.mar","minWorkers": 1,"maxWorkers": 1,"batchSize": 1,"maxBatchDelay": 10,"responseTimeout": 120}}}}


- I have tried different values for the threading settings, but it did not help; see the check below for verifying which per-model settings the server actually applied.
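
To confirm which per-model settings (workers, batchSize, maxBatchDelay) the running server actually applied, the management API's describe endpoint can be queried. A minimal check, assuming the model is registered as hardnews on the default management port:

import requests

# Describe the registered model to see the effective worker and batching settings.
resp = requests.get("http://127.0.0.1:8081/models/hardnews")
print(resp.json())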

Versions

Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.4.0b20210521
torch-model-archiver==0.4.0b20210521

Python version: 3.8 (64-bit runtime)
Python executable: /home/ted/anaconda3/envs/myland/bin/python3

Versions of relevant python libraries:
captum==0.5.0
future==0.18.2
numpy==1.23.1
psutil==5.9.1
pytest==4.6.11
pytest-forked==1.4.0
pytest-timeout==1.4.2
pytest-xdist==1.34.0
requests==2.28.1
requests-mock==1.9.3
requests-oauthlib==1.3.1
sentencepiece==0.1.95
simpletransformers==0.62.0
torch==1.12.1
torch-model-archiver==0.4.0b20210521
torch-workflow-archiver==0.1.0b20210521
torchaudio==0.12.1
torchserve==0.4.0b20210521
torchvision==0.13.1
transformers==4.20.1
wheel==0.37.1
Warning: torchtext not present.

Java Version:


OS: Ubuntu 22.04 LTS
GCC version: (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Clang version: N/A
CMake version: N/A

Is CUDA available: Yes
CUDA runtime version: N/A
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 515.65.01
cuDNN version: None

Repro instructions

# Run the same archiver command shown under Model Packaging above, then start TorchServe:

torchserve --start --model-store model-store --models model=hardnews --ncs --ts-config config.properties

Running prediction: a request like the sketch below (count > 8) reproduces the hang on GPU.
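
The endpoint name is assumed from the worker logs (W-9000-hardnews_1.0); adjust it if the model is registered under a different name.

import requests

dic1 = {"main": "this is a main", "categories": "this is a categories"}
input_data = [dic1 for i in range(10)]  # count > 8 triggers the hang on GPU

response = requests.post(
    "http://127.0.0.1:8080/predictions/hardnews",
    json=input_data,
    timeout=120,  # raises requests.exceptions.Timeout instead of hanging forever
)
print(response.status_code, response.text)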

Possible Solution

No response

Metadata

Labels: gpu, triaged_wait