DeepSpeed Communication Profiling and Logging #2012

Merged · 54 commits · merged Jul 25, 2022

Changes from 1 commit

Commits (54)
867a853
Staging comms v1 (#301)
Quentin-Anthony May 27, 2022
c93fcfe
Delete stage1.py
awan-10 May 27, 2022
7f8ca01
Delete distributed.py
awan-10 May 27, 2022
977ee32
revert deepspeed/__init__.py logging calls
Quentin-Anthony May 28, 2022
68eb9f4
Delete test.py
Quentin-Anthony May 28, 2022
54796bb
Update comments and move custom comm ops to internal functions
Quentin-Anthony May 28, 2022
c06c72d
Merge branch 'staging-comms-next' of https://github.com/microsoft/Dee…
Quentin-Anthony May 28, 2022
f070a0c
Remove unnecessary print and update backend description
Quentin-Anthony May 28, 2022
9976681
Relax assertion to allow Megatron-DeepSpeed MoE to use ZeRO 1
Quentin-Anthony May 31, 2022
09063a3
Simplify ZeRO stage 1 check for previous commit
Quentin-Anthony May 31, 2022
656b415
Remove misleading world_size prints
Quentin-Anthony May 31, 2022
2e7129c
Add commslogger class, and introduce rough prototype comms logging
Quentin-Anthony Jun 1, 2022
0023b3e
Clean up logger
Quentin-Anthony Jun 1, 2022
e55c8e9
Add more robust arg checks
Quentin-Anthony Jun 3, 2022
31c7dcf
Add labels to common collective calls for logger
Quentin-Anthony Jun 3, 2022
8e23f50
Add more annotations
Quentin-Anthony Jun 3, 2022
7998350
Fix up log_summary_new and fix logging bug for barrier
Quentin-Anthony Jun 7, 2022
227874e
Clean up arg sweep logic and add isend/irecv
Quentin-Anthony Jun 7, 2022
27c38f9
Merge branch 'master' into staging-comms-logging-v1
Quentin-Anthony Jun 13, 2022
26e15ae
Clean up logging branch
Quentin-Anthony Jun 13, 2022
3aa3e38
Unify naming and fix circular import
Quentin-Anthony Jun 13, 2022
d2561dc
Fix deepspeed comm imports for logging.py
Quentin-Anthony Jun 13, 2022
c85f3c1
Added comms config support, removed some log names
Quentin-Anthony Jun 14, 2022
f70addb
Add comms config file
Quentin-Anthony Jun 14, 2022
a153331
Add pydantic to requirements
Quentin-Anthony Jun 14, 2022
351f384
Add configure non-op to old torch
Quentin-Anthony Jun 14, 2022
bcb3afd
Update logging call for old torch
Quentin-Anthony Jun 14, 2022
2f8320a
Add log_name placeholder args for old torch
Quentin-Anthony Jun 14, 2022
95aa7d8
Add basic verbosity setup
Quentin-Anthony Jun 15, 2022
93d1a31
Complete verbosity setup
Quentin-Anthony Jun 18, 2022
4a6236d
move comms logging to separate file and clean up
Quentin-Anthony Jun 18, 2022
393c90a
Change debug message design
Quentin-Anthony Jun 25, 2022
527d1c8
refactor debug helper and clean up
Quentin-Anthony Jun 25, 2022
40482a8
Refactor a bit and clean up prints
Quentin-Anthony Jun 25, 2022
a6beecf
Merge branch 'master' into staging-comms-logging-v1
Quentin-Anthony Jun 25, 2022
9343f87
config docs, remove old log_summary func, fix imports
Quentin-Anthony Jun 25, 2022
c07bc13
Finished docs, added import, fixed non-debug calls
Quentin-Anthony Jun 25, 2022
f5fd1f2
Ran pre-commit
Quentin-Anthony Jun 25, 2022
1b31798
Removed old comments
Quentin-Anthony Jun 25, 2022
298349d
Updated fn signatures for torch1.2
Quentin-Anthony Jun 27, 2022
102ae1d
Remove lingering prof arg
Quentin-Anthony Jun 27, 2022
2185f16
Merge branch 'master' into staging-comms-logging-v1
jeffra Jun 29, 2022
4faf3b9
Update logging tutorial
Quentin-Anthony Jun 29, 2022
6381187
Removed unnecessary imports and cleaned up comments
Quentin-Anthony Jun 30, 2022
56dbd71
Take master's cleaner comms init logic
Quentin-Anthony Jun 30, 2022
ae524f0
Fixed bw calculations and made all logging calls blocking
Quentin-Anthony Jul 20, 2022
19bcf79
Added comms logging synch disclaimer
Quentin-Anthony Jul 20, 2022
b9cb4d3
Merge branch 'master' into staging-comms-logging-v1
Quentin-Anthony Jul 21, 2022
c6925a1
Added using_mpi flag for logging
Quentin-Anthony Jul 22, 2022
5a0715c
Formatting
Quentin-Anthony Jul 22, 2022
b4449a2
Merge branch 'master' of https://github.com/microsoft/DeepSpeed into …
Quentin-Anthony Jul 22, 2022
b648979
Merge branch 'master' into staging-comms-logging-v1
Quentin-Anthony Jul 22, 2022
9357a16
Merge branch 'master' into staging-comms-logging-v1
Quentin-Anthony Jul 25, 2022
c85e323
Merge branch 'master' into staging-comms-logging-v1
Quentin-Anthony Jul 25, 2022
config docs, remove old log_summary func, fix imports
Quentin-Anthony committed Jun 25, 2022
commit 9343f8789b413a56519beeaaf3c03d1c6e96f92c
25 changes: 1 addition & 24 deletions deepspeed/comm/comm.py
@@ -127,7 +127,6 @@ def log_wrapper(*args, **kwargs):
                 msg_size = get_msg_size_from_args(func, *args, **kwargs)
                 log_name = get_debug_log_name(func_args, comms_logger.debug)
                 timers(log_name).start()
-                #timers(func_args['log_name']).start()
         # Return the op, then stop the op's timer
         try:
             return func(*args, **kwargs)
@@ -137,10 +136,6 @@ def log_wrapper(*args, **kwargs):
                         'log_name' in kwargs
                         and kwargs['log_name'] in comms_logger.prof_ops):
                     log_name = get_debug_log_name(func_args, comms_logger.debug)
-                    #timers(func_args['log_name']).stop()
-                    # need temp var since 'elapsed' resets events
-                    #time_elapsed = timers(func_args['log_name']).elapsed(reset=False)
-                    #comms_logger.append(func_args['log_name'], time_elapsed, msg_size)
                     timers(log_name).stop()
                     # need temp var since 'elapsed' resets events
                     time_elapsed = timers(log_name).elapsed(reset=False)
@@ -149,24 +144,6 @@
     return log_wrapper


-def log_summary(coll_names, ranks=None):
-    global cdb
-    if coll_names == ['all']:
-        coll_names = timers.get_timers()
-    timers.log(names=coll_names, reset=False)
-    # Populate records for averaging and remove empty ones
-    #for name in coll_names:
-    #    print(timers(name).elapsed(reset=False))
-    # Calculate average dict
-    coll_means = timers.get_mean(coll_names, reset=False)
-    # Print averages
-    for coll, mean in coll_means.items():
-        string = f"rank={cdb.get_rank()} avg time (ms)" + " | {}: {:.2f}".format(
-            coll,
-            mean / 1000.0)
-        log_dist(string, ranks=ranks or [0])
-
-
 # For compatibility with torch distributed's init_process_group, we shall retain the signature from PyTorch code.
 # DeepSpeed NCCL/MPI backend may not need all these params as we will have our own implementation.
 # Please read full torch.distributed API docs from https://pytorch.org/docs/stable/distributed.html
@@ -481,7 +458,7 @@ def barrier(group=None, prof=False, log_name='barrier', debug=get_caller_func())
     return cdb.barrier()


-def log_summary_new():
+def log_summary():
     global cdb
     barrier(log_name='log_summary_barrier')
     if cdb.get_rank() == 0:
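For readers following the diff, `log_wrapper` implements a simple decorator pattern: start a timer keyed by the op's log name, run the collective, then stop the timer and record the elapsed time together with the message size. Below is a minimal, self-contained sketch of that pattern; the class and function names are illustrative stand-ins, not DeepSpeed's actual internals.

```python
import time
import functools


class SketchCommsLogger:
    """Illustrative stand-in for a comms logger: stores (elapsed, msg_size) records per op name."""

    def __init__(self):
        self.records = {}

    def append(self, log_name, elapsed, msg_size):
        self.records.setdefault(log_name, []).append((elapsed, msg_size))


comms_logger = SketchCommsLogger()


def timed_op(func):
    """Wrap a communication op so its duration and message size get recorded."""

    @functools.wraps(func)
    def log_wrapper(*args, log_name=None, msg_size=0, **kwargs):
        name = log_name or func.__name__
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record even when the op raises, mirroring the try/finally in the diff above
            comms_logger.append(name, time.perf_counter() - start, msg_size)

    return log_wrapper


@timed_op
def fake_all_reduce(data):
    return sum(data)  # placeholder for a real collective


fake_all_reduce([1, 2, 3], log_name="all_reduce", msg_size=3)
print(comms_logger.records)
```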
2 changes: 1 addition & 1 deletion deepspeed/comm/constants.py
@@ -33,7 +33,7 @@

 # comms logger profile all ops signal
 COMMS_LOGGER_PROF_ALL = "prof_all"
-COMMS_LOGGER_PROF_ALL_DEFAULT = False
+COMMS_LOGGER_PROF_ALL_DEFAULT = True

 # comms logger show all ops signal
 COMMS_LOGGER_DEBUG = "debug"
2 changes: 0 additions & 2 deletions deepspeed/runtime/engine.py
@@ -264,12 +264,10 @@ def __init__(

         self._set_distributed_vars(args)

-
         dist.configure(self._config)

         self.monitor = MonitorMaster(self._config.monitor_config)

-
         see_memory_usage(
             f"DeepSpeed Engine: Before configure distributed model",
             force=self.memory_breakdown(),
6 changes: 2 additions & 4 deletions deepspeed/utils/logging.py
@@ -3,8 +3,6 @@
 import os
 import math

-from deepspeed import comm as dist
-
 log_levels = {
     "debug": logging.DEBUG,
     "info": logging.INFO,
@@ -48,7 +46,7 @@ def create_logger(name=None, level=logging.INFO):


 def log_dist(message, ranks=None, level=logging.INFO):
-    import deepspeed.comm as dist
+    from deepspeed import comm as dist
     """Log message when one of following condition meets

     + not dist.is_initialized()
@@ -72,7 +70,7 @@


 def print_json_dist(message, ranks=None, path=None):
-    import deepspeed.comm as dist
+    from deepspeed import comm as dist
     """Print message when one of following condition meets

     + not dist.is_initialized()
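The in-function import above appears to be there to avoid a circular import: `deepspeed.comm` itself uses this logging module, so pulling `deepspeed.comm` in at the top of `logging.py` would create a cycle at module load time (see the earlier commit "Unify naming and fix circular import"). A minimal sketch of the deferred-import pattern, with illustrative module names rather than DeepSpeed's real layout:

```python
# --- comms.py (illustrative) ----------------------------------------------
# Imports the logging helper at module load time:
#
#     from logging_utils import log_dist
#
#     def get_rank():
#         return 0
#
# --- logging_utils.py -------------------------------------------------------
# Must not import `comms` at module level, or the two modules would form a cycle.

import logging

logger = logging.getLogger(__name__)


def log_dist(message, ranks=None):
    # Deferred import: only resolved when the function runs, after both modules
    # have finished loading, so the circular dependency never bites.
    from comms import get_rank  # stands in for `from deepspeed import comm as dist`
    if ranks is None or get_rank() in ranks:
        logger.info("[rank %d] %s", get_rank(), message)
```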
1 change: 1 addition & 0 deletions docs/_config.yml
@@ -49,6 +49,7 @@ collections:
       - mixture-of-experts-nlg.md
       - mixture-of-experts-inference.md
       - monitor.md
+      - comms-logging.md
       - one-cycle.md
       - onebit-adam.md
       - zero-one-adam.md
4 changes: 4 additions & 0 deletions docs/_data/navigation.yml
@@ -63,6 +63,8 @@ lnav:
         url: /docs/config-json/#sparse-attention
       - title: 'Monitoring'
         url: /docs/config-json/#monitoring-module-tensorboard-wandb-csv
+      - title: 'Communication Logging'
+        url: /docs/config-json/#communication-logging
   - title: 'Tutorials'
     url: /tutorials/
     children:
@@ -102,6 +104,8 @@ lnav:
         url: /tutorials/MoQ-tutorial/
       - title: 'Monitoring'
         url: /tutorials/monitor
+      - title: 'Communication Logging'
+        url: /tutorials/comms-logging
       - title: 'One-Cycle Schedule'
         url: /tutorials/one-cycle/
       - title: 'One-Bit Adam'
29 changes: 29 additions & 0 deletions docs/_pages/config-json.md
@@ -1044,3 +1044,32 @@ Example of <i>**csv_monitor**</i> configuration:
"job_name": "train_bert"
}
```

### Communication Logging


DeepSpeed provides a flexible communication logging tool that can automatically detect and record communication operations launched via `deepspeed.comm`. Once the logs are populated, they can be summarized with `deepspeed.comm.log_summary()`. For more details and example usage, see the [tutorial](/tutorials/comms-logging/).


<i>**comms_logger**</i>: [dictionary]

| Fields | Value |Default |
| ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----- |
| enabled | Whether communication logging is enabled. | `false` |
| verbose | Whether to immediately print every communication operation. | `false` |
| prof_all | Whether to profile all operations. | `true` |
| debug | Appends the caller function to each communication operation's `log_name`. | `false` |
| prof_ops | A list of communication operations to log (only the specified ops will be profiled). | `[]` |


Example of <i>**comms_logger**</i> configuration:

```json
"comms_logger": {
"enabled": true,
"verbose": false,
"prof_all": true,
"debug": false,
"prof_ops": ["all_reduce", "custom_all_reduce_name"]
}
```
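
As a rough end-to-end sketch (assuming a training script that already uses `deepspeed.initialize`, with `model` and `data_loader` defined elsewhere, and a `ds_config.json` containing the `comms_logger` block above), the logger is driven entirely by the config and summarized once at the end of training with `deepspeed.comm.log_summary()`:

```python
import deepspeed
import deepspeed.comm as dist

# ds_config.json is assumed to contain the "comms_logger" block shown above
model_engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                      model_parameters=model.parameters(),
                                                      config="ds_config.json")

for step, batch in enumerate(data_loader):
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()

# Print a summary of the communication ops collected so far (latencies, message sizes)
dist.log_summary()
```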