
Clarification on effective batch size in TabFlex training setup #23

@schnurrd

Description


Hello,

Thank you for the very interesting paper. I have a question regarding the training setup of TabFlex-S100, TabFlex-L100, and TabFlex-H1K.

In Appendix C.2 (Model Training), it is stated that the models were trained with batch sizes of 1210, 110, and 1410 for 8, 4, and 4 epochs, respectively. While experimenting with pre-training, I found that batch sizes of this magnitude appear to require significantly more GPU memory than the single 80 GB A100 reported in the paper.

Am I missing something, or do the reported values correspond to the effective batch size, i.e. including gradient accumulation (batch_size × aggregate_k_gradients)? If so, I would be very interested in the concrete values used for batch_size and aggregate_k_gradients, and in the reasoning behind such a large overall batch size.
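
For clarity, this is what I mean by "effective batch size": one optimizer step is taken only after accumulating gradients over aggregate_k_gradients forward/backward passes, so the samples per update equal batch_size × aggregate_k_gradients while peak memory scales only with the per-pass batch_size. A minimal PyTorch sketch (placeholder model, data, and values; not the actual TabFlex training code):

```python
import torch

# Placeholder model and loss; the TabFlex architecture is not reproduced here.
model = torch.nn.Linear(100, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

batch_size = 11              # per-forward-pass batch that actually fits in GPU memory (assumed value)
aggregate_k_gradients = 110  # number of accumulation steps (assumed value)
# effective batch size = batch_size * aggregate_k_gradients = 1210

def get_batch():
    # Placeholder: one synthetic mini-batch of features and class labels.
    return torch.randn(batch_size, 100), torch.randint(0, 10, (batch_size,))

optimizer.zero_grad()
for step in range(aggregate_k_gradients):
    x, y = get_batch()
    # Divide by the accumulation count so the accumulated gradient is an average,
    # matching what a single large batch of 1210 samples would produce.
    loss = loss_fn(model(x), y) / aggregate_k_gradients
    loss.backward()          # gradients accumulate in .grad across iterations
optimizer.step()             # one parameter update per effective batch
optimizer.zero_grad()
```

Is this roughly the setup used, and if so, which split between batch_size and aggregate_k_gradients did you use?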

Thank you very much in advance.
