Parameter transfer issue in loss function #3409

Open
gganduu opened this issue Nov 5, 2021 · 0 comments
gganduu commented Nov 5, 2021

My loss computation function requires two parameters, predict and target. Below is a brief description of my loss function:

class ComputeLoss:
    def __init__(self, *args):
        ...
    def __call__(self, predict, target):
        ...

In my training script, the loss is created and passed to the estimator as follows:

def loss_creator(config):
    loss = ComputeLoss(...)
    return loss

init_orca_context(cluster_mode="local", cores=8, num_nodes=1, memory='30g', init_ray_on_spark=False, object_store_memory='30g')
est = Estimator.from_torch(model=model_creator, optimizer=optim_creator, loss=loss_creator, backend="torch_distributed")
est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size)

After forward propagation, YOLOv5 produces three output layers corresponding to its FPN architecture. This is my predict, a Python list with 3 elements: [torch.Size([16, 3, 80, 80, 6]), torch.Size([16, 3, 40, 40, 6]), torch.Size([16, 3, 20, 20, 6])].

My target is a PyTorch tensor of shape torch.Size([36, 6]). Each row is one ground-truth box: [for_which_img, x, y, h, w, category_id].

Training my YOLOv5 code without Analytics Zoo (az) works fine, but when I use az local mode, the following error is raised:

Traceback (most recent call last):
File "yolov5.py", line 1910, in <module>
main()
File "yolov5.py", line 1907, in main
train(opt, device)
File "yolov5.py", line 1808, in train
est.fit(data=train_loader_creator, epochs=epochs, batch_size=batch_size)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/estimator.py", line 178, in fit
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 230, in train
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 262, in _train_epochs
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 49, in check_for_failure
File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::TorchRunner.train_epochs() (pid=6775, ip=172.16.212.214)
File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/torch_runner.py", line 268, in train_epochs
stats = self.train_epoch(loader, profile=profile, info=info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/torch_runner.py", line 289, in train_epoch
train_stats = self.training_operator.train_epoch(data_loader, info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/training_operator.py", line 199, in train_epoch
metrics = self.train_batch(batch, batch_info=batch_info)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/pytorch/training_operator.py", line 265, in train_batch
loss = self.criterion(*output, *target)
TypeError: __call__() takes 3 positional arguments but 40 were given

It passes 40 arguments to my ComputeLoss.__call__! I printed them to see what they are. The first is self. The next 3 are my predict: the list was split into its 3 separate elements. The last 36 are my target, which was originally a single torch.Tensor but was split into 36 tensors, one per row.
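The count of 40 follows from the star-unpacking in `self.criterion(*output, *target)` shown in the traceback. Here is a minimal sketch that reproduces the arithmetic. It uses plain nested lists as stand-ins for the tensors (an assumption made so the snippet runs without PyTorch). Iterating a torch.Tensor yields slices along its first dimension in the same way, so a [36, 6] tensor unpacks into 36 rows. `CountArgs` is a hypothetical helper, not part of my code.

```python
class ComputeLoss:
    """Stand-in with the same two-argument signature as in the issue."""

    def __call__(self, predict, target):
        return 0.0


class CountArgs:
    """Hypothetical helper that reports how many positional args it received."""

    def __call__(self, *args):
        return len(args)


# Stand-ins: 3 FPN outputs, and 36 ground-truth rows of 6 values each.
predict = ["p80", "p40", "p20"]
target = [[0.0] * 6 for _ in range(36)]

# Star-unpacking spreads the list into 3 args and the rows into 36 args.
n_args = CountArgs()(*predict, *target)  # 3 + 36 explicit arguments

# With the two-argument signature, the same call fails; Python counts
# self as well, which gives 1 + 3 + 36 = 40.
try:
    ComputeLoss()(*predict, *target)
    message = ""
except TypeError as exc:
    message = str(exc)

print(n_args)
print(message)
```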

I solved this issue by modifying my ComputeLoss function:

def __call__(self, *args):
    predict = args[:3]  # the 3 unpacked prediction tensors
    target = args[3:]  # the 36 unpacked target rows
    target = torch.stack(target)  # back to torch.Size([36, 6])

But how can I solve this issue/bug without changing my original code?
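One possible direction, sketched under assumptions, is to leave ComputeLoss untouched and have loss_creator return a thin wrapper that regroups the unpacked arguments. `UnpackedArgsAdapter` and `original_loss` are hypothetical names used for illustration, and the demo again uses nested lists instead of tensors so that it runs without PyTorch. Whether the estimator accepts such a wrapper in every backend has not been verified here.

```python
class UnpackedArgsAdapter:
    """Hypothetical wrapper: regroups star-unpacked args for a 2-arg loss."""

    def __init__(self, loss_fn, num_outputs, regroup_target):
        self.loss_fn = loss_fn                # the unmodified loss callable
        self.num_outputs = num_outputs        # how many leading args are predictions
        self.regroup_target = regroup_target  # rebuilds the target from its rows

    def __call__(self, *args):
        predict = list(args[:self.num_outputs])
        target = self.regroup_target(args[self.num_outputs:])
        return self.loss_fn(predict, target)


def original_loss(predict, target):
    # Stand-in for the real loss; reports the sizes it received.
    return len(predict), len(target)


adapter = UnpackedArgsAdapter(original_loss, num_outputs=3, regroup_target=list)

predict = ["p80", "p40", "p20"]
target = [[0.0] * 6 for _ in range(36)]

# Mimics the framework call self.criterion(*output, *target).
result = adapter(*predict, *target)
print(result)
```

With real tensors, loss_creator could return `UnpackedArgsAdapter(ComputeLoss(...), num_outputs=3, regroup_target=torch.stack)`, since torch.stack accepts a sequence of equally shaped tensors and would rebuild the [36, 6] target.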
