
Converges on a 16-node Platinum 8180 cluster, but does not converge on a 32-node cluster with the same hyper-parameters #2000

Open
ZeweiChen11 opened this issue Dec 9, 2017 · 0 comments

We used the BigDL 0.3 release package to train Inception V1. Training converges on a 16-node Platinum 8180 cluster with batch size 3584, but on a 32-node cluster the loss becomes NaN with the same hyper-parameters. The issue was reproduced twice: once starting from iteration 16218 and once from iteration 11885.
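
For reference, BigDL's data-parallel DistriOptimizer splits each global mini-batch across the nodes, so with the same batchSize each node processes half as many records per iteration on 32 nodes as on 16. A trivial sketch of that arithmetic (the cluster sizes and batch size are the ones from this issue; the even split is the assumption):

```scala
// Sketch of how the same global batch size maps to per-node work,
// assuming BigDL's data-parallel DistriOptimizer splits batches evenly.
object BatchSplit extends App {
  val batchSize = 3584
  println(s"per-node slice on 16 nodes: ${batchSize / 16}") // 224 records
  println(s"per-node slice on 32 nodes: ${batchSize / 32}") // 112 records
}
```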

Logs showing the NaN loss:
========1=============
2017-12-06 10:13:01 INFO DistriOptimizer$:330 - [Epoch 34 254464/1281167][Iteration 11885][Wall Clock 18617.882038226s] Trained 3584 records in 1.679330981 seconds. Throughput is 2134.1833 records/second. Loss is NaN. Current learning rate is 0.0805564585184345. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 10:13:02 INFO DistriOptimizer$:330 - [Epoch 34 258048/1281167][Iteration 11886][Wall Clock 18619.528894786s] Trained 3584 records in 1.64685656 seconds. Throughput is 2176.2673 records/second. Loss is NaN. Current learning rate is 0.08055565481442407. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 10:13:04 INFO DistriOptimizer$:330 - [Epoch 34 261632/1281167][Iteration 11887][Wall Clock 18621.056131478s] Trained 3584 records in 1.527236692 seconds. Throughput is 2346.722 records/second. Loss is NaN. Current learning rate is 0.08055485110239502. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 10:13:05 INFO DistriOptimizer$:330 - [Epoch 34 265216/1281167][Iteration 11888][Wall Clock 18622.412598597s] Trained 3584 records in 1.356467119 seconds. Throughput is 2642.1577 records/second. Loss is NaN. Current learning rate is 0.08055404738234707. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 10:13:06 INFO DistriOptimizer$:330 - [Epoch 34 268800/1281167][Iteration 11889][Wall Clock 18623.759393037s] Trained 3584 records in 1.34679444 seconds. Throughput is 2661.1335 records/second. Loss is NaN. Current learning rate is 0.08055324365428003. Current weight decay is 1.0E-4. Current momentum is 0.9.

======2============
2017-12-06 03:42:39 INFO DistriOptimizer$:330 - [Epoch 46 387072/1281167][Iteration 16218][Wall Clock 22671.003862105s] Trained 3584 records in 1.350137834 seconds. Throughput is 2654.5437 records/second. Loss is NaN. Current learning rate is 0.07699531293652587. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 03:42:40 INFO DistriOptimizer$:330 - [Epoch 46 390656/1281167][Iteration 16219][Wall Clock 22672.403222219s] Trained 3584 records in 1.399360114 seconds. Throughput is 2561.1707 records/second. Loss is NaN. Current learning rate is 0.07699447205963514. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 03:42:41 INFO DistriOptimizer$:330 - [Epoch 46 394240/1281167][Iteration 16220][Wall Clock 22673.742959043s] Trained 3584 records in 1.339736824 seconds. Throughput is 2675.1523 records/second. Loss is NaN. Current learning rate is 0.07699363117356087. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 03:42:43 INFO DistriOptimizer$:330 - [Epoch 46 397824/1281167][Iteration 16221][Wall Clock 22675.107806219s] Trained 3584 records in 1.364847176 seconds. Throughput is 2625.935 records/second. Loss is NaN. Current learning rate is 0.07699279027830275. Current weight decay is 1.0E-4. Current momentum is 0.9.
2017-12-06 03:42:44 INFO DistriOptimizer$:330 - [Epoch 46 401408/1281167][Iteration 16222][Wall Clock 22676.366339708s] Trained 3584 records in 1.258533489 seconds. Throughput is 2847.759 records/second. Loss is NaN. Current learning rate is 0.07699194937386049. Current weight decay is 1.0E-4. Current momentum is 0.9.
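
The logged hyper-parameters look consistent with BigDL's SGD optim method using a polynomial learning-rate schedule: the rate decays by roughly 8e-7 per iteration, and SGD.Poly(0.5, 62000) with a base rate of 0.0896 gives 0.0896 × √(1 − 11885/62000) ≈ 0.0805557, matching the logged values to within one iteration's decay. Below is a minimal sketch of such an optimizer setup against the BigDL 0.3 Scala API; the base rate and maxIteration are inferred from the logs above, not taken from the actual job, and the model/dataset are assumed to be built elsewhere:

```scala
import com.intel.analytics.bigdl.Module
import com.intel.analytics.bigdl.dataset.{DataSet, MiniBatch}
import com.intel.analytics.bigdl.nn.ClassNLLCriterion
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.optim.{Optimizer, SGD, Trigger}

// Sketch only: wires up SGD with the hyper-parameters seen in the logs.
// Base LR (0.0896) and maxIteration (62000) are inferred from the logged
// per-iteration decay, assuming a Poly(0.5, maxIteration) schedule.
def buildOptimizer(
    model: Module[Float],                 // e.g. Inception v1 built elsewhere
    trainSet: DataSet[MiniBatch[Float]]   // distributed ImageNet mini-batches
): Optimizer[Float, MiniBatch[Float]] = {
  val maxIteration = 62000
  val sgd = new SGD[Float](
    learningRate = 0.0896,                // inferred from the logs, not confirmed
    weightDecay = 1e-4,                   // matches the logs
    momentum = 0.9,                       // matches the logs
    learningRateSchedule = SGD.Poly(0.5, maxIteration)
  )
  Optimizer(model, trainSet, new ClassNLLCriterion[Float]())
    .setOptimMethod(sgd)
    .setEndWhen(Trigger.maxIteration(maxIteration))
}
```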
