@m3ngyang
Member

fix #8954


1. Network connection errors in the log during muliti-node cluster training
------------------------------------------------
The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`.
Contributor

This sentence has only a subject, with no predicate and no object.


* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk.

* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
Contributor

"If network connection gave rise to the first error in the log" => If the first error in server.log says "Address already used"
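As a rough illustration of the first troubleshooting step under review (finding the first error in `train.log` and `server.log`), a minimal sketch might look like the following. The helper names and the error patterns are assumptions for illustration only; the real logs may use different wording, so the pattern list should be adjusted to match.

```python
import re
from pathlib import Path
from typing import Iterable, Optional, Tuple

# Assumed markers, not taken from any real log format: adjust as needed.
ERROR_PATTERN = re.compile(
    r"error|address already|connection reset by peer|"
    r"floating point exception|out of memory|no space left",
    re.IGNORECASE,
)


def first_error(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, text) of the first matching line, or None."""
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            return number, line.rstrip("\n")
    return None


def first_error_in_file(path: str) -> Optional[Tuple[int, str]]:
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return first_error(handle)


if __name__ == "__main__":
    # The two log names come from the doc under review.
    for name in ("train.log", "server.log"):
        if Path(name).exists():
            print(name, first_error_in_file(name))
```

Only the earliest match is reported, since later connection errors are often just a consequence of the first failure.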


* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk.

* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
Contributor

Connect with the operator => Contact the sys-admin


* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk.

* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
Contributor

If so, change the port of job. => If the current MPI cluster does not support this parameter, change the server port and try again.
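For the "change the server port and try again" step, one way to check whether a candidate port is already taken on a node is to try binding to it. This is only a sketch: `port_is_free` and `pick_free_port` are hypothetical helpers, and how the chosen port is passed to the training job is not covered here.

```python
import socket
from typing import Iterable, Optional


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
        return True


def pick_free_port(candidates: Iterable[int], host: str = "0.0.0.0") -> Optional[int]:
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None
```

The check is inherently racy: another job on a shared node can grab the port between the probe and the real server start, so a failed start should still be retried with a different port.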


* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.

* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster.
Contributor

people may want to know what the "exclusive pattern" is.

Contributor

operator => cluster administrator

Member Author

@m3ngyang Mar 25, 2018

people may want to know what the "exclusive pattern" is.

Which doc should we refer to for this term?

TBD
.. contents::

1. Network connection errors in the log during muliti-node cluster training
Contributor

muliti-node -> multi-node

.. contents::

1. Network connection errors in the log during muliti-node cluster training
------------------------------------------------
Contributor

mulilti-node -> multi-node

1. Network connection errors in the log during muliti-node cluster training
------------------------------------------------
The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`.
This kind of error is usually caused by the abnormal exit of the training process in some node, and the others cannot connect with this node any longer. Steps to troubleshoot the problem as follows:
Contributor

of the training process -> of a training process

Contributor

others -> other nodes

Contributor

the problem as follows -> the problem are as follows


* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.

* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster.
Contributor

currnet -> current

@m3ngyang
Member Author

done

Contributor

@abhinavarora left a comment

LGTM

@typhoonzero typhoonzero merged commit 33614ed into PaddlePaddle:develop Mar 28, 2018
@m3ngyang m3ngyang deleted the cluster_trai_pred_trans branch November 23, 2021 07:10
blacksheep-Aristotle pushed a commit to blacksheep-Aristotle/Paddle that referenced this pull request Nov 22, 2024
* support optimizer state offload (PaddlePaddle#8715)

* support optimizer offload

* update doc

* [FleetY]offload optimizer state after load optmizer state (PaddlePaddle#9352)

* add offload optimizer

* fix memory

* [FleetY] Add reload/offload for optimizer (PaddlePaddle#9356)

* add reload/offload for optimizer

* fix bug

---------

Co-authored-by: Guoxia Wang <mingzilaochongtu@gmail.com>

Development

Successfully merging this pull request may close these issues.

Translation Plan: Cluster Training and Prediction (Chinese to English)

3 participants