Parameter transfer issue in loss function #3409

gganduu · 2021-11-05T01:41:32Z

My loss computation function takes two parameters, predict and target. Here is a brief outline of it:

class ComputeLoss:
    def __init__(self, *args):
        ...
    def __call__(self, predict, target):
        ...

In my training code, the loss is created and passed to the estimator as follows:

def loss_creator(config):
	loss = ComputeLoss(...)
	return loss

init_orca_context(cluster_mode="local", cores=8, num_nodes=1, memory='30g', init_ray_on_spark=False, object_store_memory='30g')
est = Estimator.from_torch(model=model_creator, optimizer=optim_creator, loss=loss_creator, backend="torch_distributed")
est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size)

After forward propagation, yolov5 produces three output layers corresponding to its FPN architecture. This is my predict, a Python list with 3 elements: [torch.Size([16, 3, 80, 80, 6]), torch.Size([16, 3, 40, 40, 6]), torch.Size([16, 3, 20, 20, 6])].

My target is a PyTorch tensor of shape torch.Size([36, 6]). Each row is one ground-truth box: [for_which_img, x, y, h, w, category_id]

My yolov5 code trains fine without Analytics Zoo (az), but when I use az local mode, the following error is raised:

Traceback (most recent call last):
File "yolov5.py", line 1910, in <module>
main()
File "yolov5.py", line 1907, in main
train(opt, device)
File "yolov5.py", line 1808, in train
est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/estimator.py", line 178, in fit
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 230, in train
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 262, in _train_epochs
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 49, in check_for_failure
File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::TorchRunner.train_epochs() (pid=6775, ip=172.16.212.214)
File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/torch_runner.py", line 268, in train_epochs
stats = self.train_epoch(loader, profile=profile, info=info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/torch_runner.py", line 289, in train_epoch
train_stats = self.training_operator.train_epoch(data_loader, info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/training_operator.py", line 199, in train_epoch
metrics = self.train_batch(batch, batch_info=batch_info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/training_operator.py", line 265, in train_batch
loss = self.criterion(*output, *target)
TypeError: __call__() takes 3 positional arguments but 40 were given

It passes 40 arguments to ComputeLoss.__call__! I printed them to see what they are. The 1st is the self object. The next 3 are my predict: the list was split into 3 separate elements. The last 36 are my target, which was originally a single torch.Tensor but was split into 36 tensors, one per row. This matches the call shown in the traceback, self.criterion(*output, *target), which unpacks both the output and the target along their first dimension.
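The argument count can be reproduced outside the estimator. The following is a minimal sketch that uses plain Python lists as stand-ins for the real tensors (the names count_args, n_unpacked and n_grouped are only illustrative). A torch.Tensor iterates along its first dimension in the same way, so a [36, 6] target unpacks into 36 rows.

```python
# Stand-ins for the real values: 3 prediction layers and a target with 36 rows of 6 values.
predict = ["layer_80x80", "layer_40x40", "layer_20x20"]
target = [[0, 0.5, 0.5, 0.1, 0.1, 1] for _ in range(36)]


def count_args(*args):
    """Return how many positional arguments were received."""
    return len(args)


# Star-unpacking both values, as in criterion(*output, *target): 3 + 36 arguments.
n_unpacked = count_args(*predict, *target)

# Passing them as two objects, which is what ComputeLoss.__call__ expects.
n_grouped = count_args(predict, target)

print(n_unpacked, n_grouped)  # 39 2
```

For a bound method such as __call__, self is counted as well, which gives the 40 reported in the TypeError.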

I solved this issue by modifying my ComputeLoss function:

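def __call__(self, *args):
    # The estimator calls criterion(*output, *target), so regroup the flat arguments.
    predict = list(args[:3])  # the 3 prediction layers

    target = args[3:]  # one tensor per ground-truth row
    target = torch.stack(target)  # torch.Size([36, 6])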

But how can I solve this issue/bug without changing my original code?
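One possible way to leave ComputeLoss untouched is to wrap it in a small adapter inside loss_creator. The sketch below is only an illustration, not an Analytics Zoo feature: RegroupArgs, num_outputs and restack are hypothetical names, and it assumes the estimator keeps calling criterion(*output, *target) as shown in the traceback. Plain lists stand in for tensors so the example runs without torch.

```python
class RegroupArgs:
    """Hypothetical adapter: regroups star-unpacked arguments for a two-argument loss."""

    def __init__(self, loss_fn, num_outputs, restack=None):
        self.loss_fn = loss_fn          # the unmodified loss, called as loss_fn(predict, target)
        self.num_outputs = num_outputs  # how many leading arguments belong to predict
        self.restack = restack          # optional callable that rebuilds target from its rows

    def __call__(self, *args):
        predict = list(args[:self.num_outputs])
        rows = args[self.num_outputs:]
        # With tensors, restack would typically be torch.stack; here rows stay a list.
        target = self.restack(rows) if self.restack is not None else list(rows)
        return self.loss_fn(predict, target)


def fake_loss(predict, target):
    """Stand-in for ComputeLoss: reports how many items each argument holds."""
    return len(predict), len(target)


adapted = RegroupArgs(fake_loss, num_outputs=3)
rows = [[0.0] * 6 for _ in range(36)]
result = adapted("layer_80x80", "layer_40x40", "layer_20x20", *rows)
print(result)  # (3, 36)
```

In loss_creator this would mean returning RegroupArgs(ComputeLoss(...), num_outputs=3, restack=torch.stack) instead of the bare ComputeLoss(...). Note that torch.stack needs at least one tensor, so a batch with no ground-truth rows would need separate handling.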


glorysdj assigned qiuxin2012 Nov 5, 2021

hkvision mentioned this issue Nov 5, 2021

Pytorch Dataloader issue #3410

Open

helenlly added the user issue label Nov 12, 2021
