Hello,
Thank you for the very interesting paper. I have a question regarding the training setup of TabFlex-S100, TabFlex-L100, and TabFlex-H1K.
In Appendix C.2 (Model Training), it is stated that the models were trained with batch sizes of 1210, 110, and 1410 for 8, 4, and 4 epochs, respectively. While experimenting with pre-training, I found that such batch sizes seem to require significantly more GPU memory than the 80 GB A100 reported in the paper provides.
Am I missing something, or does the reported batch size correspond to the effective batch size, including gradient accumulation (i.e., batch_size × aggregate_k_gradients)? If so, I would be very interested in the concrete values used for batch_size and aggregate_k_gradients, and in the reasoning behind the very large overall batch size.
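For clarity, here is a minimal sketch of the gradient-accumulation setup I have in mind. The name aggregate_k_gradients follows the PFN-style training loop; the particular split between per-step batch size and accumulation steps is purely my assumption and not something stated in the paper.

```python
# Minimal sketch of gradient accumulation (my assumption about the setup;
# `aggregate_k_gradients` is named after the PFN-style training loop and
# may not match the actual TabFlex training code).
import torch


def train_epoch(model, optimizer, loader, criterion, aggregate_k_gradients=8):
    """One epoch with gradient accumulation.

    The effective batch size is per_step_batch_size * aggregate_k_gradients,
    while peak GPU memory is governed only by the per-step batch size.
    """
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        # Scale the loss so accumulated gradients average over the effective batch.
        loss = criterion(model(x), y) / aggregate_k_gradients
        loss.backward()  # accumulate gradients across micro-batches
        if (step + 1) % aggregate_k_gradients == 0:
            optimizer.step()       # one update per k micro-batches
            optimizer.zero_grad()
```

Under this interpretation, a reported batch size of, say, 1210 would only need 1210 / aggregate_k_gradients samples in memory per forward pass, which is why I suspect the reported numbers might be effective rather than per-step batch sizes.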
Thank you very much in advance.