
Running Standalone cluster stuck while feeding training data  #349

Closed
@Tyron2308

Description

Hello,

Here's my problem: the job gets stuck while feeding training data, then fails with a timeout.

2018-10-04 04:14:35,233 INFO (MainThread-29426) node: {'executor_id': 1, 'host': '192.168.179.144', 'job_name': 'worker', 'task_index': 0, 'port': 33319, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-yb30teu7/listener-i7b5ukps', 'authkey': b'뜅�mdOG��1���'}
2018-10-04 04:14:35,233 INFO (MainThread-29426) Starting TensorFlow worker:0 as worker on cluster node 1 on background process
18/10/04 04:14:35 INFO PythonRunner: Times: total = 3479, boot = 479, init = 976, finish = 2024
2018-10-04 04:14:35,245 INFO (MainThread-29466) 1: ======== worker:0 ========
2018-10-04 04:14:35,245 INFO (MainThread-29466) 1: Cluster spec: {'ps': ['192.168.179.144:38999'], 'worker': ['192.168.179.144:33319']}
2018-10-04 04:14:35,245 INFO (MainThread-29466) 1: Using CPU
2018-10-04 04:14:35.246785: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-04 04:14:35.248870: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 192.168.179.144:38999}
2018-10-04 04:14:35.248887: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:33319}
2018-10-04 04:14:35.249647: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:33319
18/10/04 04:14:35 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1801 bytes result sent to driver
18/10/04 04:14:35 INFO CoarseGrainedExecutorBackend: Got assigned task 2
18/10/04 04:14:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
18/10/04 04:14:35 INFO TorrentBroadcast: Started reading broadcast variable 3
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 7.8 KB, free 366.3 MB)
18/10/04 04:14:35 INFO TorrentBroadcast: Reading broadcast variable 3 took 20 ms
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 14.0 KB, free 366.3 MB)
18/10/04 04:14:35 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/images/part-00000:0+9338236
18/10/04 04:14:35 INFO TorrentBroadcast: Started reading broadcast variable 0
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.0 KB, free 366.2 MB)
18/10/04 04:14:35 INFO TorrentBroadcast: Reading broadcast variable 0 took 19 ms
18/10/04 04:14:35 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 320.9 KB, free 365.9 MB)
tensorflow model path: hdfs://localhost:9000/user/tyron/mnist_model
18/10/04 04:14:36 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/labels/part-00000:0+204800
18/10/04 04:14:36 INFO TorrentBroadcast: Started reading broadcast variable 1
18/10/04 04:14:36 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.0 KB, free 365.9 MB)
18/10/04 04:14:36 INFO TorrentBroadcast: Reading broadcast variable 1 took 13 ms
18/10/04 04:14:36 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 320.9 KB, free 365.6 MB)
2018-10-04 04:14:36,418 INFO (MainThread-29513) Connected to TFSparkNode.mgr on 192.168.179.144, executor=1, state='running'
2018-10-04 04:14:36,432 INFO (MainThread-29513) mgr.state='running'
2018-10-04 04:14:36,432 INFO (MainThread-29513) Feeding partition <itertools.chain object at 0x7f06aa8da048> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f06aa8da358>
18/10/04 04:14:37 INFO PythonRunner: Times: total = 1123, boot = 4, init = 73, finish = 1046
18/10/04 04:14:37 INFO PythonRunner: Times: total = 194, boot = 9, init = 28, finish = 157
18/10/04 04:24:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2423, in pipeline_func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/usr/local/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 794, in func
  File "/usr/local/lib/python3.6/dist-packages/tensorflowonspark/TFSparkNode.py", line 414, in _train
    raise Exception("Timeout while feeding partition")
Exception: Timeout while feeding partition

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

18/10/04 04:24:38 INFO CoarseGrainedExecutorBackend: Got assigned task 3
18/10/04 04:24:38 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
18/10/04 04:24:38 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/images/part-00001:0+11231804
18/10/04 04:24:38 INFO HadoopRDD: Input split: hdfs://localhost:9000/user/tyron/examples/mnist/csv/train/labels/part-00001:0+245760
2018-10-04 04:24:38,940 INFO (MainThread-29986) Connected to TFSparkNode.mgr on 192.168.179.144, executor=1, state='running'
2018-10-04 04:24:38,964 INFO (MainThread-29986) mgr.state='running'
2018-10-04 04:24:38,964 INFO (MainThread-29986) Feeding partition <itertools.chain object at 0x7f06aa8da048> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f06aa8da358>
18/10/04 04:24:40 INFO PythonRunner: Times: total = 1574, boot = 4, init = 31, finish = 1539
18/10/04 04:24:40 INFO PythonRunner: Times: total = 119, boot = 14, init = 52, finish = 53
18/10/04 04:34:41 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
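For context on the error: the log lines show the executor feeding an RDD partition into a multiprocessing.JoinableQueue and then, ten minutes later, raising "Timeout while feeding partition" — i.e. nothing on the TensorFlow side consumed the queued items in time. A minimal sketch of that feed-with-timeout pattern (hypothetical helper names, not the actual TFSparkNode code):

```python
import multiprocessing
import threading

def feed_partition(items, queue, timeout=600):
    """Push one partition's items into the queue, then wait (bounded)
    for the consumer to mark them all done. Sketch only; this mirrors
    the failure mode in the log, not the real TFSparkNode internals."""
    for item in items:
        queue.put(item)
    # queue.join() blocks until every put() has a matching task_done(),
    # so run it in a helper thread to bound the wait.
    joiner = threading.Thread(target=queue.join, daemon=True)
    joiner.start()
    joiner.join(timeout)
    if joiner.is_alive():
        # Nothing consumed the items in time: same failure as the log.
        raise Exception("Timeout while feeding partition")

def consumer(queue, n):
    # Stand-in for the TensorFlow process reading from the input queue.
    for _ in range(n):
        queue.get()
        queue.task_done()

q = multiprocessing.JoinableQueue()
items = list(range(5))
threading.Thread(target=consumer, args=(q, len(items)), daemon=True).start()
feed_partition(items, q, timeout=5)  # consumer drains the queue, so this returns
print("partition fed")
```

If no consumer ever attaches to the queue (for example, the TF graph never starts pulling data), `feed_partition` blocks for the full timeout and then raises — which matches the ten-minute gap between "Feeding partition" and the exception above.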

Here are my exports:
export LD_LIBRARY_PATH=${PATH}
export HADOOP_HOME=/usr/local/lib/hadoop-2.7.4/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

Here's how I run the program:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model
