
[BUG] Incorrect batch_size used for ThroughputTimer #2498

Closed
@clumsy

Description

Describe the bug
ThroughputTimer is started/stopped around full model steps, but when it is constructed in deepspeed/runtime/engine.py it is passed only the micro-batch size (train_micro_batch_size_per_gpu), so the reported throughput misses the gradient-accumulation factor.

Compare that construction to the corresponding one in deepspeed/runtime/pipe/engine.py, which passes the correct per-step batch size.
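
For a concrete sense of the impact, a minimal back-of-the-envelope sketch (the numbers are illustrative, not from this setup):

```python
# Illustrative values; the names mirror the DeepSpeed config keys.
micro_batch_per_gpu = 4   # train_micro_batch_size_per_gpu
grad_accum_steps = 8      # gradient_accumulation_steps
elapsed_s = 0.5           # hypothetical wall time for one full model step

# One timed step (one optimizer step) consumes all accumulated micro-batches.
samples_per_step = micro_batch_per_gpu * grad_accum_steps  # 32

reported = micro_batch_per_gpu / elapsed_s  # 8.0 samples/s with the buggy batch_size
actual = samples_per_step / elapsed_s       # 64.0 samples/s with the correct batch_size
print(f"under-reported by {actual / reported:.0f}x")  # 8x, the accumulation factor
```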

To Reproduce

  1. Enable wall_clock_breakdown in the DeepSpeed config and watch the per-step throughput it logs; a minimal config sketch follows below.
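
A minimal sketch of such a config, assuming a toy model and the standard deepspeed.initialize entry point (exact arguments may vary by version; run it under the deepspeed launcher, per the Launcher context below):

```python
import torch
import deepspeed

# Toy model purely for illustration.
model = torch.nn.Linear(512, 512)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "wall_clock_breakdown": True,  # enables the timing/throughput logging in question
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

With this config, each timed step actually consumes 32 samples, but the timer is seeded with 4.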

Expected behavior
This timer should use self.train_micro_batch_size_per_gpu() * self.gradient_accumulation_steps().
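
Equivalently, as a standalone sketch (the helper name is ours, for illustration only):

```python
def effective_timer_batch_size(train_micro_batch_size_per_gpu: int,
                               gradient_accumulation_steps: int) -> int:
    """Samples consumed by one optimizer step, i.e. one timed interval."""
    return train_micro_batch_size_per_gpu * gradient_accumulation_steps

# With the sketch config above: 4 * 8 = 32 samples per timed step.
assert effective_timer_batch_size(4, 8) == 32
```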

ds_report output
N/A

Screenshots
N/A

System info (please complete the following information):

  • OS: Amazon Linux 2
  • GPU count and types: 8 x A100
  • Interconnects (if applicable): 4 x 100 Gbps
  • Python version: 3.8

Launcher context
Launching the experiment with the deepspeed launcher.

Docker context
Can't share

Additional context
DeepSpeed v0.7.3

Labels

bug, training
