[BUG] Incorrect batch_size used for ThroughputTimer #2498
Closed
Description
Describe the bug
ThroughputTimer is started/stopped for model steps, but in runtime/engine it uses only the micro_batch_size: here
Compare the above line to the following in runtime/pipe/engine, which uses the right value: here
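To illustrate the effect, here is a minimal sketch with assumed numbers (not DeepSpeed code): when the timer measures a full optimizer step but is only credited with a single micro batch, the reported throughput is low by a factor of gradient_accumulation_steps.

```python
# Minimal sketch with assumed numbers (not DeepSpeed code).
micro_batch_size = 4              # train_micro_batch_size_per_gpu()
gradient_accumulation_steps = 8
step_time_s = 0.5                 # assumed wall time of one full optimizer step

# What the timer currently credits the step with vs. what was actually processed.
reported_samples_per_sec = micro_batch_size / step_time_s
actual_samples_per_sec = micro_batch_size * gradient_accumulation_steps / step_time_s

print(f"reported: {reported_samples_per_sec:.0f} samples/s")  # 8
print(f"actual:   {actual_samples_per_sec:.0f} samples/s")    # 64
```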
To Reproduce
- Enable wall_clock_breakdown
Expected behavior
This timer should use self.train_micro_batch_size_per_gpu() * self.gradient_accumulation_steps().
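With that value, the timer's samples/sec would match what each GPU actually processes per step (continuing the assumed numbers from the sketch above):

```python
# Continuing the assumed numbers above: micro batch 4, 8 accumulation steps, 0.5 s per step.
effective_batch_per_gpu = 4 * 8   # train_micro_batch_size_per_gpu() * gradient_accumulation_steps()
samples_per_sec = effective_batch_per_gpu / 0.5
print(samples_per_sec)            # 64.0, matching the samples actually processed per second
```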
ds_report output
N/A
Screenshots
N/A
System info (please complete the following information):
- OS: Amazon Linux 2
- GPU count and types: 8 x A100
- Interconnects (if applicable): 4 x 100 Gbps
- Python version: 3.8
- Any other relevant info about your setup:
Launcher context
Launching the experiment with the deepspeed launcher.
Docker context
Can't share
Additional context
DeepSpeed v0.7.3