How to finetune on TPU v3-8 nodes? It runs without error but does not seem to progress. #38
Comments
Are you using GKE or GCE? Can you try with the […] and in the patched file: […]
I am using GCE with the 256 version. Previously I used a TF 1.14 VM and a TF 1.14 v3-8 TPU node. Now I tried again with a TF 1.14 VM and a TF 1.14.1.dev20190518 TPU v3-8 node (the version was confirmed in the console) and the settings you provided, but I still get an OOM error.
Are you able to train a smaller model (and point it to an empty model directory instead of the 48-layer model)?
I have run this and get the following error (related to #65): […]
+1
Hi!
thanks for the great paper and for providing the code and models. I am trying to finetune the model on a TPU v3-8 node in the Google cloud. I made the following changes to `training.py`: I wrapped the optimizer as `optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)`, set `use_tpu=True` and `batch_size=8`, added `num_cores_per_replica=8` and `iterations_per_loop=1`, and added `cluster=tf.contrib.cluster_resolver.TPUClusterResolver()` in the call to `tf.contrib.tpu.RunConfig`. This should distribute the model across the 8 cores of the TPU. I found that with lower numbers for `num_cores_per_replica` I get an out-of-memory error. This is the exact code:

```python
run_config = tf.contrib.tpu.RunConfig(
    cluster=tf.contrib.cluster_resolver.TPUClusterResolver(),
    model_dir=args.model_dir,
    session_config=tf.ConfigProto(allow_soft_placement=True,
                                  log_device_placement=True),
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=1,
        num_cores_per_replica=8,
        per_host_input_for_training=3))
```
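For context, here is a minimal TF 1.14 sketch of how a RunConfig like the one above is typically wired into a `tf.contrib.tpu.TPUEstimator`. This is not the repository's actual `training.py`; the `model_fn` and `input_fn` below are toy placeholders standing in for whatever the real file defines.

```python
# Minimal TF 1.14 sketch (not the repo's actual training.py): a model_fn whose
# optimizer is wrapped in CrossShardOptimizer, fed to a TPUEstimator that is
# built from the RunConfig shown above.
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Toy model_fn standing in for the real one in training.py.
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.AdamOptimizer(1e-4)
    # On TPU the optimizer is wrapped so gradients are aggregated across shards.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def input_fn(params):
    # Toy input_fn; TPUEstimator passes the per-call batch size in params.
    x = tf.random.uniform([1024, 16])
    y = tf.zeros([1024], dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices(({"x": x}, y))
    return dataset.repeat().batch(params["batch_size"], drop_remainder=True)

# run_config is the tf.contrib.tpu.RunConfig from the snippet above.
estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=8)          # global batch size

estimator.train(input_fn=input_fn, max_steps=100)
```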
With these changes I can get `training.py` to run with the seq256_v1 model without error. However, it doesn't seem to be doing anything after the model has been compiled, initialized from the checkpoint, and the batches are being fed to the TPU. Even with a batch_size of only 8 and a total of 256 TFRecords in the input file, it never completes. The output I get is:

[…]

The last WARNING line keeps repeating.
With TensorBoard I wasn't able to get a trace, which may indicate nothing is happening on the TPU.
By my simple calculation based on the numbers presented in the paper, I should be able to get 1024 (examples per batch) × 800,000 (iterations) / 32 (= 256/8, the number of cores in the TPU v3-256 pod used in the paper divided by the number of cores in a TPU v3-8 node) / 14 (days) / 24 (hours per day) / 3600 (seconds per hour) ≈ 20 examples per second.
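Spelled out as a short Python check, using only the figures quoted above (batch 1024, 800,000 iterations, roughly two weeks on a v3-256 pod):

```python
# Throughput estimate from the numbers quoted above.
examples_per_batch = 1024
iterations = 800_000
seconds = 14 * 24 * 3600                                            # two weeks

pod_examples_per_sec = examples_per_batch * iterations / seconds    # ≈ 677 on the v3-256 pod
v3_8_examples_per_sec = pod_examples_per_sec / (256 / 8)            # ≈ 21 on a v3-8 node
print(round(pod_examples_per_sec), round(v3_8_examples_per_sec))
```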
I have been able to run other (much smaller) Keras models in TF 1.14 on a TPU v3-8 using the same RunConfig, where I also parallelized the model across the 8 TPU cores.
Do you have any idea why the training does not seem to work (or at best is extremely slow)? Am I parallelizing the model across the 8 TPU cores in the correct way? How was this done for the paper?
Any help would be greatly appreciated!
Many thanks,
Kees
PS: I get the same result when I add `input_partition_dims=[[1, 1], [1, 1]]` as an option to the TPUConfig.
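For reference, this is where that option sits in the TPUConfig from the snippet above; the `[[1, 1], [1, 1]]` value is copied from the PS, and whether it matches the actual feature/label shapes in `training.py` is an assumption.

```python
# Same TPUConfig as above, with the input_partition_dims option from the PS added.
# [[1, 1], [1, 1]] means neither the features nor the labels are split across cores.
tpu_config = tf.contrib.tpu.TPUConfig(
    iterations_per_loop=1,
    num_cores_per_replica=8,
    per_host_input_for_training=3,
    input_partition_dims=[[1, 1], [1, 1]])
```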