After forward propagation, yolov5 produces three output layers corresponding to its FPN architecture. This is my predict, a Python list with 3 elements: [ torch.Size([16, 3, 80, 80, 6]), torch.Size([16, 3, 40, 40, 6]), torch.Size([16, 3, 20, 20, 6]) ].
My target is a PyTorch tensor of shape torch.Size([36, 6]). Each row is one ground truth: [for_which_img, x, y, h, w, category_id]
I trained my yolov5 code without Analytics Zoo (az) and it works fine, but when I try to use az local mode, an error is raised:
Traceback (most recent call last):
File "yolov5.py", line 1910, in <module>
main()
File "yolov5.py", line 1907, in main
train(opt, device)
File "yolov5.py", line 1808, in train
est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/estimator.py", line 178, in fit
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 230, in train
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 262, in _train_epochs
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 49, in check_for_failure
File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::TorchRunner.train_epochs() (pid=6775, ip=172.16.212.214)
File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/torch_runner.py", line 268, in train_epochs
stats = self.train_epoch(loader, profile=profile, info=info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/torch_runner.py", line 289, in train_epoch
train_stats = self.training_operator.train_epoch(data_loader, info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/training_operator.py", line 199, in train_epoch
metrics = self.train_batch(batch, batch_info=batch_info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/training_operator.py", line 265, in train_batch
loss = self.criterion(*output, *target)
TypeError: __call__() takes 3 positional arguments but 40 were given
It passes 40 arguments to my ComputeLoss function! I printed them to see what they are: the 1st is the self object. The next 3 are my predict, so the predict list has been split into 3 separate elements. The last 36 are my target, which was originally a single torch tensor but has been divided into 36 tensors, one per row!
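The count of 40 follows directly from the call `self.criterion(*output, *target)` shown in the traceback: star-unpacking a 3-element list yields 3 arguments, and star-unpacking a tensor of shape [36, 6] iterates over its first dimension and yields 36 more. A minimal sketch of that arithmetic, using plain Python lists as stand-ins for the tensors (the stand-in values and the helper `count_positional` are assumptions for illustration, not code from the issue):

```python
class ComputeLoss:
    """Stand-in for the real loss; only the call signature matters here."""

    def __call__(self, predict, target):
        return 0.0


def count_positional(*args):
    """Hypothetical helper: report how many positional arguments arrived."""
    return len(args)


# Stand-ins: 3 prediction layers and 36 ground-truth rows of 6 values each.
# A torch tensor of shape [36, 6] star-unpacks the same way, row by row.
predict = ["layer_80x80", "layer_40x40", "layer_20x20"]
target = [[0.0] * 6 for _ in range(36)]

n_args = count_positional(*predict, *target)  # 3 + 36 = 39, plus self = 40

try:
    ComputeLoss()(*predict, *target)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(n_args + 1)
print(error_message)
```

The message printed at the end ends with the same "takes 3 positional arguments but 40 were given" text as the traceback.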
I solved this issue by modifying my ComputeLoss function:
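The modified function itself is not reproduced here. As a rough sketch of that kind of change, assuming the loss receives the 3 prediction layers followed by the 36 target rows as separate positional arguments, a thin wrapper can regroup them before calling a two-argument loss. The class name `RegroupedLoss` and its `num_layers` and `stack` parameters are hypothetical; `stack` falls back to `torch.stack` when not given, and is injectable only so the regrouping can be tried without tensors.

```python
class RegroupedLoss:
    """Hypothetical wrapper: rebuild (predict, target) from unpacked arguments."""

    def __init__(self, loss_fn, num_layers=3, stack=None):
        self.loss_fn = loss_fn        # callable taking (predict, target)
        self.num_layers = num_layers  # number of prediction layers (3 here)
        self.stack = stack            # how to rejoin the target rows

    def __call__(self, *args):
        # The first num_layers arguments are the prediction layers.
        predict = list(args[:self.num_layers])
        # Everything after that is one ground-truth row per argument.
        rows = list(args[self.num_layers:])
        stack = self.stack
        if stack is None:
            import torch  # imported lazily; only needed for real tensors
            stack = torch.stack  # 36 tensors of shape [6] -> one [36, 6] tensor
        target = stack(rows)
        return self.loss_fn(predict, target)


def toy_loss(predict, target):
    """Toy loss used only to show what the wrapper hands over."""
    return len(predict), len(target)


# Demonstration with plain lists standing in for tensors.
wrapped = RegroupedLoss(toy_loss, num_layers=3, stack=list)
result = wrapped("layer_80x80", "layer_40x40", "layer_20x20",
                 *[[0.0] * 6 for _ in range(36)])
print(result)
```

With real tensors the `stack` argument would simply be left out. Note that `torch.stack` raises on an empty list, so a batch with no ground-truth rows would need separate handling. Whether the estimator accepts a plain callable like this as its loss is an assumption that would need checking.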
My loss computation function requires two parameters, predict and target. Below is a brief description of my loss function:
In my function, my loss is defined as:
But how can I solve this issue/bug without changing my original code?