
Running Standalone cluster stuck while feeding training data  #349

Closed
@Tyron2308

Description

Hello,

Here's my problem: the job gets stuck while feeding training data, then fails with a timeout.

2018-10-04 04:14:35,233 INFO (MainThread-29426) node: {'executor_id': 1, 'host': '192.168.179.144', 'job_name': 'worker', 'task_index': 0, 'port': 33319, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-yb30teu7/listener-i7b5ukps', 'authkey': b'뜅�mdOG��1���'}
2018-10-04 04:14:35,233 INFO (MainThread-29426) Starting TensorFlow worker:0 as worker on cluster node 1 on background process
18/10/04 04:14:35 INFO PythonRunner: Times: total = 3479, boot = 479, init = 976, finish = 2024
2018-10-04 04:14:35,245 INFO (MainThread-29466) 1: ======== worker:0 ========
2018-10-04 04:14:35,245 INFO (MainThread-29466) 1: Cluster spec: {'ps': ['192.168.179.144:38999'], 'worker': ['192.168.179.144:33319']}
2018-10-04 04:14:35,245 INFO (MainThread-29466) 1: Using CPU
2018-10-04 04:14:35.246785: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-04 04:14:35.248870: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 192.168.179.144:38999}
2018-10-04 04:14:35.248887: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:33319}
2018-10-04 04:14:35.249647: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:33319
18/10/04 04:14:35 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1801 bytes result sent to driver
18/10/04 04:14:35 INFO CoarseGrainedExecutorBackend: Got assigned task 2
18/10/04 04:14:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
18/10/04 04:14:35 INFO TorrentBroadcast: Started reading broadcast variable 3
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 7.8 KB, free 366.3 MB)
18/10/04 04:14:35 INFO TorrentBroadcast: Reading broadcast variable 3 took 20 ms
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 14.0 KB, free 366.3 MB)
18/10/04 04:14:35 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/images/part-00000:0+9338236
18/10/04 04:14:35 INFO TorrentBroadcast: Started reading broadcast variable 0
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.0 KB, free 366.2 MB)
18/10/04 04:14:35 INFO TorrentBroadcast: Reading broadcast variable 0 took 19 ms
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 320.9 KB, free 365.9 MB)
tensorflow model path: hdfs://localhost:9000/user/tyron/mnist_model
18/10/04 04:14:36 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/labels/part-00000:0+204800
18/10/04 04:14:36 INFO TorrentBroadcast: Started reading broadcast variable 1
18/10/04 04:14:36 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.0 KB, free 365.9 MB)
18/10/04 04:14:36 INFO TorrentBroadcast: Reading broadcast variable 1 took 13 ms
18/10/04 04:14:36 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 320.9 KB, free 365.6 MB)
2018-10-04 04:14:36,418 INFO (MainThread-29513) Connected to TFSparkNode.mgr on 192.168.179.144, executor=1, state='running'
2018-10-04 04:14:36,432 INFO (MainThread-29513) mgr.state='running'
2018-10-04 04:14:36,432 INFO (MainThread-29513) Feeding partition <itertools.chain object at 0x7f06aa8da048> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f06aa8da358>
18/10/04 04:14:37 INFO PythonRunner: Times: total = 1123, boot = 4, init = 73, finish = 1046
18/10/04 04:14:37 INFO PythonRunner: Times: total = 194, boot = 9, init = 28, finish = 157
18/10/04 04:24:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 794, in func
  File "/usr/local/lib/python3.6/dist-packages/tensorflowonspark/TFSparkNode.py", line 414, in _train
    raise Exception("Timeout while feeding partition")
Exception: Timeout while feeding partition

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

18/10/04 04:24:38 INFO CoarseGrainedExecutorBackend: Got assigned task 3
18/10/04 04:24:38 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
18/10/04 04:24:38 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/images/part-00001:0+11231804
18/10/04 04:24:38 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/labels/part-00001:0+245760
2018-10-04 04:24:38,940 INFO (MainThread-29986) Connected to TFSparkNode.mgr on 192.168.179.144, executor=1, state='running'
2018-10-04 04:24:38,964 INFO (MainThread-29986) mgr.state='running'
2018-10-04 04:24:38,964 INFO (MainThread-29986) Feeding partition <itertools.chain object at 0x7f06aa8da048> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f06aa8da358>
18/10/04 04:24:40 INFO PythonRunner: Times: total = 1574, boot = 4, init = 31, finish = 1539
18/10/04 04:24:40 INFO PythonRunner: Times: total = 119, boot = 14, init = 52, finish = 53
18/10/04 04:34:41 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
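For context on the error: the log lines show the executor feeding an RDD partition into a multiprocessing.JoinableQueue and then, ten minutes later, raising "Timeout while feeding partition" — i.e. nothing on the TensorFlow side consumed the queued items in time. A minimal sketch of that feed-with-timeout pattern (hypothetical helper names, not the actual TFSparkNode code):

```python
import multiprocessing
import threading

def feed_partition(items, queue, timeout=600):
    """Push one partition's items into the queue, then wait (bounded)
    for the consumer to mark them all done. Sketch only; this mirrors
    the failure mode in the log, not the real TFSparkNode internals."""
    for item in items:
        queue.put(item)
    # queue.join() blocks until every put() has a matching task_done(),
    # so run it in a helper thread to bound the wait.
    joiner = threading.Thread(target=queue.join, daemon=True)
    joiner.start()
    joiner.join(timeout)
    if joiner.is_alive():
        # Nothing consumed the items in time: same failure as the log.
        raise Exception("Timeout while feeding partition")

def consumer(queue, n):
    # Stand-in for the TensorFlow process reading from the input queue.
    for _ in range(n):
        queue.get()
        queue.task_done()

q = multiprocessing.JoinableQueue()
items = list(range(5))
threading.Thread(target=consumer, args=(q, len(items)), daemon=True).start()
feed_partition(items, q, timeout=5)  # consumer drains the queue, so this returns
print("partition fed")
```

If no consumer ever attaches to the queue (for example, the TF graph never starts pulling data), `feed_partition` blocks for the full timeout and then raises — which matches the ten-minute gap between "Feeding partition" and the exception above.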

Here are my exports:
export LD_LIBRARY_PATH=${PATH}
export HADOOP_HOME=/usr/local/lib/hadoop-2.7.4/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

Here's how I run the program:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model
