
java.lang.IllegalArgumentException: requirement failed: self element number... Error during saving checkpoint #2978

Closed
LoannGio opened this issue Dec 4, 2019 · 6 comments · Fixed by #2981

Comments


LoannGio commented Dec 4, 2019

Hello,
I'm trying to train/validate a neural network with Analytics Zoo (0.6) / BigDL (0.9) as follows:

symbol_model = build_symbol_model(params)
train_symbol_rdd = create_bigdl_samples(sc, train_path, ...) #returns RDD<Sample>
val_symbol_rdd = create_bigdl_samples(sc, val_path, ...) # returns RDD<Sample>

symbol_optimizer = Optimizer(model=symbol_model, 
                    training_rdd=train_symbol_rdd, 
                    optim_method=Adam(),
                    end_trigger=MaxEpoch(5),
                    criterion=MSECriterion(),
                    batch_size=batchSize,
                    bigdl_type="float")
symbol_optimizer.set_validation(
    val_rdd=val_symbol_rdd,
    batch_size=batchSize,
    trigger=EveryEpoch(),
    val_method=[Loss(MSECriterion())]
)
symbol_optimizer.set_checkpoint(EveryEpoch(), save_model_path, isOverWrite=True)
symbol_optimizer.optimize()

The optimization's training works fine, but when the (first) validation comes, the following error is raised:
java.lang.IllegalArgumentException: requirement failed: self element number(16) is not equal to source element number(32)

I've double-checked my creation of the RDD of samples; there seems to be no trouble there.
I've tried truncating my validation set so its size is a multiple of batchSize.
It doesn't seem to come from my model, since the training works fine.

I have no clue where to look anymore.
Any help would be appreciated.

Thanks.

Full logs :

2019-12-04 17:19:36 INFO  DistriOptimizer$:791 - caching training rdd ...
2019-12-04 17:19:52 INFO  DistriOptimizer$:629 - Cache thread models...
2019-12-04 17:19:52 INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 652
2019-12-04 17:19:52 INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 652
2019-12-04 17:19:52 INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 652
2019-12-04 17:19:52 INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 652
2019-12-04 17:19:52 INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 652
2019-12-04 17:19:52 INFO  DistriOptimizer$:612 - model thread pool size is 1
2019-12-04 17:19:52 INFO  DistriOptimizer$:631 - Cache thread models... done
2019-12-04 17:19:52 INFO  DistriOptimizer$:148 - Count dataset
2019-12-04 17:19:52 INFO  DistriOptimizer$:152 - Count dataset complete. Time elapsed: 0.128190703s
2019-12-04 17:19:52 INFO  DistriOptimizer$:160 - config  {
	computeThresholdbatchSize: 100
	maxDropPercentage: 0.0
	warmupIterationNum: 200
	isLayerwiseScaled: false
	dropPercentage: 0.0
 }
2019-12-04 17:19:52 INFO  DistriOptimizer$:164 - Shuffle data
2019-12-04 17:19:52 INFO  DistriOptimizer$:167 - Shuffle data complete. Takes 0.012889339s
2019-12-04 17:19:53 INFO  DistriOptimizer$:406 - [Epoch 1 64/18880][Iteration 1][Wall Clock 0.501946693s] Trained 64 records in 0.501946693 seconds. Throughput is 127.50358 records/second. Loss is 0.24601407. 
2019-12-04 17:19:53 INFO  DistriOptimizer$:406 - [Epoch 1 128/18880][Iteration 2][Wall Clock 0.628867876s] Trained 64 records in 0.126921183 seconds. Throughput is 504.24997 records/second. Loss is 0.22389516. 
2019-12-04 17:19:53 INFO  DistriOptimizer$:406 - [Epoch 1 192/18880][Iteration 3][Wall Clock 0.744553932s] Trained 64 records in 0.115686056 seconds. Throughput is 553.2214 records/second. Loss is 0.21530338. 
2019-12-04 17:19:53 INFO  DistriOptimizer$:406 - [Epoch 1 256/18880][Iteration 4][Wall Clock 0.858995715s] Trained 64 records in 0.114441783 seconds. Throughput is 559.2363 records/second. Loss is 0.20601867. 
2019-12-04 17:19:53 INFO  DistriOptimizer$:406 - [Epoch 1 320/18880][Iteration 5][Wall Clock 0.962278502s] Trained 64 records in 0.103282787 seconds. Throughput is 619.65796 records/second. Loss is 0.19575086. 

[...]

2019-12-04 17:20:18 INFO  DistriOptimizer$:406 - [Epoch 1 18880/18880][Iteration 295][Wall Clock 25.264317586s] Trained 64 records in 0.072651999 seconds. Throughput is 880.9118 records/second. Loss is 0.016169826. 
2019-12-04 17:20:18 INFO  DistriOptimizer$:451 - [Epoch 1 18880/18880][Iteration 295][Wall Clock 25.264317586s] Epoch finished. Wall clock time is 25378.094366 ms
2019-12-04 17:20:18 INFO  DistriOptimizer$:111 - [Epoch 1 18880/18880][Iteration 295][Wall Clock 25.264317586s] Validate model...
2019-12-04 17:20:19 INFO  DistriOptimizer$:177 - [Epoch 1 18880/18880][Iteration 295][Wall Clock 25.264317586s] validate model throughput is 239.91006 records/second
2019-12-04 17:20:19 INFO  DistriOptimizer$:180 - [Epoch 1 18880/18880][Iteration 295][Wall Clock 25.264317586s] Loss is (Loss: 46.11404, count: 256, Average Loss: 0.18013297)
Traceback (most recent call last):
  File "/home/toto/Documents/giot/colinware1/app.py", line 67, in <module>
    symbol_optimizer.optimize()
  File "/home/toto/.local/lib/python3.6/site-packages/bigdl/share/lib/bigdl-0.9.0-python-api.zip/bigdl/optim/optimizer.py", line 764, in optimize
  File "/home/toto/.local/lib/python3.6/site-packages/bigdl/share/lib/bigdl-0.9.0-python-api.zip/bigdl/util/common.py", line 634, in callJavaFunc
  File "/home/toto/.local/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/toto/.local/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o380.optimize.
: java.lang.IllegalArgumentException: requirement failed: self element number(16) is not equal to source element number(32)
	at scala.Predef$.require(Predef.scala:224)
	at com.intel.analytics.bigdl.tensor.DenseTensor$.copy(DenseTensor.scala:2623)
	at com.intel.analytics.bigdl.tensor.DenseTensor.copy(DenseTensor.scala:435)
	at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.setExtraParameter(AbstractModule.scala:375)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$.getModel(DistriOptimizer.scala:659)
	at com.intel.analytics.bigdl.optim.AbstractOptimizer$$anonfun$checkpoint$1$$anonfun$apply$13.apply(AbstractOptimizer.scala:218)
	at com.intel.analytics.bigdl.optim.AbstractOptimizer$$anonfun$checkpoint$1$$anonfun$apply$13.apply(AbstractOptimizer.scala:216)
	at scala.Option.foreach(Option.scala:257)
	at com.intel.analytics.bigdl.optim.AbstractOptimizer$$anonfun$checkpoint$1.apply(AbstractOptimizer.scala:216)
	at com.intel.analytics.bigdl.optim.AbstractOptimizer$$anonfun$checkpoint$1.apply(AbstractOptimizer.scala:215)
	at scala.Option.foreach(Option.scala:257)
	at com.intel.analytics.bigdl.optim.AbstractOptimizer.checkpoint(AbstractOptimizer.scala:215)
	at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:491)
	at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:881)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

@LoannGio LoannGio changed the title Error during validation java.lang.IllegalArgumentException: requirement failed: self element number... Error during validation Dec 4, 2019

i8run commented Dec 5, 2019

Hi @LoannGio , it doesn't seem to be a problem with the RDD[Sample] creation. Is there any BatchNormalization in your model?


LoannGio commented Dec 5, 2019

Hi @i8run , thanks for your reply.
Yes, there are some. But I don't understand why they would raise such an error.

My model's layers (a rather classic encoder):

Input(image32x32x3)
Conv2D
BatchNormalization
MaxPool2D
Conv2D
BatchNormalization
MaxPool2D
Conv2D
BatchNormalization
MaxPool2D
Conv2D
BatchNormalization
MaxPool2D
Reshape
Dense(16)

The encoder is merged with a decoder before training, but the latter doesn't have any BatchNormalization.

Edit: after some debugging, the error seems to be raised by the presence of my first 2 BatchNormalizations (not the following ones).

encoder_input (Input)                   (None, 32, 32, 3)         0                                                   
________________________________________________________________________________________________________________________
Convolution2D38f90e9b (Convolution2D)   (None, 32, 32, 16)        448           encoder_input                         
________________________________________________________________________________________________________________________
BatchNormalization24e9e3b5 (BatchNormal (None, 32, 32, 16)        32            Convolution2D38f90e9b                 
________________________________________________________________________________________________________________________
MaxPooling2D25fbedb6 (MaxPooling2D)     (None, 16, 16, 16)        0             BatchNormalization24e9e3b5            
________________________________________________________________________________________________________________________
Convolution2D4addcc55 (Convolution2D)   (None, 16, 16, 8)         1160          MaxPooling2D25fbedb6                  
________________________________________________________________________________________________________________________
BatchNormalization2de838a3 (BatchNormal (None, 16, 16, 8)         16            Convolution2D4addcc55                 
________________________________________________________________________________________________________________________
MaxPooling2D541e609a (MaxPooling2D)     (None, 8, 8, 8)           0             BatchNormalization2de838a3            
________________________________________________________________________________________________________________________


i8run commented Dec 9, 2019

It's a very strange issue. Have you set the dim_ordering to tf? Your model summary suggests the channel is in the last dimension.

The thrown exception says the tensor shapes are not the same, where the tensor in question holds the extra parameters of BatchNormalization/SpatialBatchNormalization. On the master, the first BN in the model has a runningMean with 16 output channels, but the trained model on the clients has a runningMean with 32 output channels.

A BN layer has 4 parameters (weight, bias, runningMean, runningVariance), whose size should equal the number of output channels.

  • If the dim_ordering is th, the default input format is NCHW, and the output channel is dimension 1 (0-based).
  • If the dim_ordering is tf, the default input format is NHWC, so the output channel is dimension 3 (0-based).

If possible, please take a look at the channel dimension.
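A plain NumPy sketch (not BigDL code; the shapes are hypothetical, taken from the model summary above) of how reading dimension 1 of an NHWC tensor yields the 32 and 16 seen in the error message:

```python
import numpy as np

# First Conv2D output from the summary: 16 channels on a 32x32 feature map.
feat_nchw = np.zeros((1, 16, 32, 32))  # th / NCHW: channels at dimension 1
feat_nhwc = np.zeros((1, 32, 32, 16))  # tf / NHWC: channels at dimension 3

# runningMean/runningVariance should be sized by the channel dimension.
# Reading dimension 1 of an NHWC tensor picks up the height (32)
# instead of the channel count (16):
expected = feat_nhwc.shape[3]  # 16, as in the model on the master
actual = feat_nhwc.shape[1]    # 32, as in the trained worker model

print(expected, actual)  # 16 32 -- the two numbers in the error message
```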

@i8run i8run changed the title java.lang.IllegalArgumentException: requirement failed: self element number... Error during validation java.lang.IllegalArgumentException: requirement failed: self element number... Error during saving checkpoint Dec 9, 2019

i8run commented Dec 9, 2019

Hi @LoannGio , I have found the root cause of your case. It's a BigDL bug.

Currently, for the versions you are using (Analytics Zoo 0.6, BigDL 0.9.0), you should set the model format to NCHW, i.e. th, not tf.

Thanks for your issue. :)


LoannGio commented Dec 9, 2019

Hi @i8run , thanks for your investigation :)
Indeed, I was using dim_ordering="tf". I'll try switching to th as you advised. I'm surprised this kind of bug only shows up during validation in my case.
Again, many thanks for your help.


i8run commented Dec 10, 2019

In fact, the validation has completed; the error occurs at the checkpoint-saving stage. Because of this bug, the runningMean and runningVariance are changed to the wrong shape during training, so the model saved on the driver is not the same as the models in the executors.
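A minimal sketch of why saving the checkpoint then fails (plain Python/NumPy standing in for BigDL's DenseTensor.copy, whose require check produces the message in the stack trace; the shapes are the ones from this issue):

```python
import numpy as np

def copy_(self_tensor, source):
    # Mimics the requirement check in DenseTensor.copy:
    # an in-place copy demands identical element counts.
    if self_tensor.size != source.size:
        raise ValueError(
            "requirement failed: self element number(%d) is not equal "
            "to source element number(%d)" % (self_tensor.size, source.size))
    self_tensor[:] = source

driver_running_mean = np.zeros(16)  # shape the driver-side model expects
worker_running_mean = np.ones(32)   # wrongly shaped extra parameter from a worker

try:
    copy_(driver_running_mean, worker_running_mean)
except ValueError as e:
    print(e)  # requirement failed: self element number(16) is not equal to source element number(32)
```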

Le-Zheng pushed a commit to Le-Zheng/BigDL that referenced this issue Oct 20, 2021
* Fix code blocks indents in .md files

Previously a lot of the code blocks in markdown files were horribly indented with bad white spaces in the beginning of lines. Users can't just select, copy, paste, and run (in the case of python). I have fixed all these, so there is no longer any code block with bad white space at the beginning of the lines.
It would be nice if you could try to make sure in future commits that all code blocks are properly indented inside and have the right amount of white space in the beginning!

* Fix small style issue

* Fix indents

* Fix indent and add \ for multiline commands

Change indent from 3 spaces to 4, and add "\" for multiline bash commands

Co-authored-by: Yifan Zhu <fanzhuyifan@gmail.com>
dding3 pushed a commit to dding3/BigDL that referenced this issue Nov 17, 2021
* add hyperzoo for k8s support (intel-analytics#2140)

* add hyperzoo for k8s support

* format

* format

* format

* format

* run examples on k8s readme (intel-analytics#2163)

* k8s  readme

* fix jdk download issue (intel-analytics#2219)

* add doc for submit jupyter notebook and cluster serving to k8s (intel-analytics#2221)

* add hyperzoo doc

* add hyperzoo doc

* add hyperzoo doc

* add hyperzoo doc

* fix jdk download issue (intel-analytics#2223)

* bump to 0.9s (intel-analytics#2227)

* update jdk download url (intel-analytics#2259)

* update some previous docs (intel-analytics#2284)

* K8docsupdate (intel-analytics#2306)

* Update README.md

* Update s3 related links in readme  and documents (intel-analytics#2489)

* Update s3 related links in readme  and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* Update s3 related links in readme and documents

* update

* update

* modify line length limit

* update

* Update mxnet-mkl version in hyper-zoo dockerfile (intel-analytics#2720)

Co-authored-by: gaoping <pingx.gao@intel.com>

* update bigdl version (intel-analytics#2743)

* update bigdl version

* hyperzoo dockerfile add cluster-serving (intel-analytics#2731)

* hyperzoo dockerfile add cluster-serving

* update

* update

* update

* update jdk url

* update jdk url

* update

Co-authored-by: gaoping <pingx.gao@intel.com>

* Support init_spark_on_k8s (intel-analytics#2813)

* initial

* fix

* code refactor

* bug fix

* update docker

* style

* add conda to docker image (intel-analytics#2894)

* add conda to docker image

* Update Dockerfile

* Update Dockerfile

Co-authored-by: glorysdj <glorysdj@gmail.com>

* Fix code blocks indents in .md files (intel-analytics#2978)

* Fix code blocks indents in .md files

Previously a lot of the code blocks in markdown files were horribly indented with bad white spaces in the beginning of lines. Users can't just select, copy, paste, and run (in the case of python). I have fixed all these, so there is no longer any code block with bad white space at the beginning of the lines.
It would be nice if you could try to make sure in future commits that all code blocks are properly indented inside and have the right amount of white space in the beginning!

* Fix small style issue

* Fix indents

* Fix indent and add \ for multiline commands

Change indent from 3 spaces to 4, and add "\" for multiline bash commands

Co-authored-by: Yifan Zhu <fanzhuyifan@gmail.com>

* enable bigdl 0.12 (intel-analytics#3101)

* switch to bigdl 0.12

* Hyperzoo example ref (intel-analytics#3143)

* specify pip version to fix oserror 0 of proxy (intel-analytics#3165)

* Bigdl0.12.1 (intel-analytics#3155)

* bigdl 0.12.1

* bump 0.10.0-Snapshot (intel-analytics#3237)

* update runtime image name (intel-analytics#3250)

* update jdk download url (intel-analytics#3316)

* update jdk8 url (intel-analytics#3411)

Co-authored-by: ardaci <dongjie.shi@intel.com>

* update hyperzoo docker image (intel-analytics#3429)

* update hyperzoo image (intel-analytics#3457)

* fix jdk in az docker (intel-analytics#3478)

* fix jdk in az docker

* fix jdk for hyperzoo

* fix jdk in jenkins docker

* fix jdk in cluster serving docker

* fix jdk

* fix readme

* update python dep to fit cnvrg (intel-analytics#3486)

* update ray version doc (intel-analytics#3568)

* fix deploy hyperzoo issue (intel-analytics#3574)

Co-authored-by: gaoping <pingx.gao@intel.com>

* add spark fix and net-tools and status check (intel-analytics#3742)

* intsall netstat and add check status

* add spark fix for graphene

* bigdl 0.12.2 (intel-analytics#3780)

* bump to 0.11-S and fix version issues except ipynb

* add multi-stage build Dockerfile (intel-analytics#3916)

* add multi-stage build Dockerfile

* multi-stage build dockerfile

* multi-stage build dockerfile

* Rename Dockerfile.multi to Dockerfile

* delete Dockerfile.multi

* remove comments, add TINI_VERSION to common arg, remove Dockerfile.multi

* multi-stage add tf_slim

Co-authored-by: shaojie <shaojiex.bai@intel.com>

* update hyperzoo doc and k8s doc (intel-analytics#3959)

* update userguide of k8s

* update k8s guide

* update hyperzoo doc

* Update k8s.md

add note

* Update k8s.md

add note

* Update k8s.md

update notes

* fix 4087 issue (intel-analytics#4097)

Co-authored-by: shaojie <shaojiex.bai@intel.com>

* fixed 4086 and 4083 issues (intel-analytics#4098)

Co-authored-by: shaojie <shaojiex.bai@intel.com>

* Reduce image size (intel-analytics#4132)

* Reduce Dockerfile size
1. del redis stage
2. del flink stage
3. del conda & exclude some python packages
4. add copies layer stage

* update numpy version to 1.18.1

Co-authored-by: zzti-bsj <shaojiex.bai@intel.com>

* update hyperzoo image (intel-analytics#4250)

Co-authored-by: Adria777 <Adria777@github.com>

* bigdl 0.13 (intel-analytics#4210)

* bigdl 0.13

* update

* print exception

* pyspark2.4.6

* update release PyPI script

* update

* flip snapshot-0.12.0 and spark2.4.6 (intel-analytics#4254)

* s-0.12.0 master

* Update __init__.py

* Update python.md

* fix docker issues due to version update (intel-analytics#4280)

* fix docker issues

* fix docker issues

* update Dockerfile to support spark 3.1.2 && 2.4.6 (intel-analytics#4436)

Co-authored-by: shaojie <otnw_bsj@163.com>

* update hyperzoo, add lib for tf2 (intel-analytics#4614)

* delete tf 1.15.0 (intel-analytics#4719)

Co-authored-by: Le-Zheng <30695225+Le-Zheng@users.noreply.github.com>
Co-authored-by: pinggao18 <44043817+pinggao18@users.noreply.github.com>
Co-authored-by: pinggao187 <44044110+pinggao187@users.noreply.github.com>
Co-authored-by: gaoping <pingx.gao@intel.com>
Co-authored-by: Kai Huang <huangkaivision@gmail.com>
Co-authored-by: GavinGu07 <55721214+GavinGu07@users.noreply.github.com>
Co-authored-by: Yifan Zhu <zhuyifan@stanford.edu>
Co-authored-by: Yifan Zhu <fanzhuyifan@gmail.com>
Co-authored-by: Song Jiaming <litchy233@gmail.com>
Co-authored-by: ardaci <dongjie.shi@intel.com>
Co-authored-by: Yang Wang <yang3.wang@intel.com>
Co-authored-by: zzti-bsj <2779090360@qq.com>
Co-authored-by: shaojie <shaojiex.bai@intel.com>
Co-authored-by: Lingqi Su <33695124+Adria777@users.noreply.github.com>
Co-authored-by: Adria777 <Adria777@github.com>
Co-authored-by: shaojie <otnw_bsj@163.com>