Although you can train an aitextgen model on TPUs by setting `n_tpu_cores=8` in an appropriate runtime, and the training loss does decrease, several blocking problems remain (a minimal sketch of the workflow follows the list):
- The `model` stored in `aitextgen` does not update, even after training.
- Saving the model via `save_pretrained()` causes a hang, even with `xm.rendezvous()`.
- Memory leaks on the host system (especially with a large batch size).
- `fp16` doesn't work at all, and there's no training loss decrease.
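For context, a rough sketch of the workflow that hits these problems (assuming a Colab TPU runtime with `torch_xla` installed; the training file, step count, batch size, and rendezvous tag below are placeholders):

```python
# Minimal sketch of the TPU training workflow described above.
# "input.txt", num_steps, and batch_size are placeholder values.
import torch_xla.core.xla_model as xm
from aitextgen import aitextgen

ai = aitextgen()  # default 124M GPT-2

# Training loss decreases with n_tpu_cores=8 (but not with fp16=True) ...
ai.train("input.txt", num_steps=500, batch_size=16, n_tpu_cores=8)

# ... however ai.model does not reflect the trained weights afterwards,
# and saving hangs even after forcing a rendezvous across TPU processes.
xm.rendezvous("save_model")
ai.model.save_pretrained("trained_model")
```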
Will gladly take any suggestions/PRs to help resolve these!