
[BUG] overflow warning needs to be different for fp16 and non-fp16 #2911

@stas00


Describe the bug

This code produces a misleading message when it is run in a non-fp16 regime.

if dist.get_rank() == 0:
    logger.info(
        "[deepspeed] OVERFLOW! Rank {} Skipping step. Attempted loss scale: {}, "
        "reducing to {}".format(dist.get_rank(),
                                prev_scale,
                                self.loss_scale))

There is no loss scaler under bf16/fp32, so this warning is alarming to see: we rushed to check whether the config was somehow broken, but it wasn't.

It should only say the Attempted loss scale:... part under fp16.

Most likely the same applies to its counterpart in stage 1/2.
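To make the request concrete, here is a minimal sketch of how the message could be assembled so that the loss-scale part only shows up under fp16. The helper name format_overflow_message and its fp16_enabled flag are hypothetical and used only for illustration; they are not existing DeepSpeed APIs, and the real fix would need to use whatever flag the engine already tracks.

```python
from typing import Optional


def format_overflow_message(rank: int,
                            fp16_enabled: bool,
                            prev_scale: Optional[float] = None,
                            new_scale: Optional[float] = None) -> str:
    """Build the skip-step message (hypothetical helper, not a DeepSpeed API).

    The loss-scale details are appended only when fp16 is enabled, since
    bf16/fp32 runs have no loss scaler to report on.
    """
    msg = "[deepspeed] OVERFLOW! Rank {} Skipping step.".format(rank)
    if fp16_enabled:
        msg += " Attempted loss scale: {}, reducing to {}".format(
            prev_scale, new_scale)
    return msg


# fp16 run: the loss-scale part is included.
fp16_msg = format_overflow_message(0, True, 65536, 32768)

# bf16/fp32 run: only the skip-step part is reported.
bf16_msg = format_overflow_message(0, False)
```

With something along these lines, a bf16/fp32 user would still learn that the step was skipped, without being pointed at a loss scale that does not exist in their setup.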

Also, do you think it'd be helpful to tell the user specifically whether it's Inf or NaN? NaN isn't really an overflow, or is it? Perhaps one of you with a more rigorous math background knows better. I think overflow is just one of several ways to end up with NaN, so NaN doesn't always mean an overflow. Please correct me if I'm wrong.

The reason I'm asking is to help the user know what to look for: NaNs, infinities, or something else.
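As a rough illustration of the Inf vs. NaN distinction, here is a small sketch in plain Python floats (IEEE 754 doubles). The function describe_nonfinite is a hypothetical name made up for this example; a real check would run on gradient tensors rather than a Python list. It also shows that an overflow yields Inf first, and that NaN appears only after an invalid operation such as Inf - Inf.

```python
import math
from typing import Iterable


def describe_nonfinite(values: Iterable[float]) -> str:
    """Report which kind of non-finite value is present (hypothetical helper)."""
    has_nan = False
    has_inf = False
    for v in values:
        if math.isnan(v):
            has_nan = True
        elif math.isinf(v):
            has_inf = True
    if has_nan and has_inf:
        return "NaN and Inf"
    if has_nan:
        return "NaN"
    if has_inf:
        return "Inf"
    return "finite"


# Exceeding the largest representable double overflows to Inf, not NaN.
overflowed = 1e308 * 10

# NaN comes from an invalid operation, e.g. subtracting Inf from Inf.
invalid = overflowed - overflowed

inf_report = describe_nonfinite([1.0, overflowed])
nan_report = describe_nonfinite([1.0, invalid])
both_report = describe_nonfinite([overflowed, invalid])
```

Reporting the two cases separately would tell the user whether to look for values growing too large (Inf) or for an invalid operation somewhere in the computation (NaN).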

@tjruwase


Labels: bug, training
