DeepSpeed Communication Profiling and Logging #2012

Merged on Jul 25, 2022 (54 commits)

@Quentin-Anthony Quentin-Anthony commented Jun 13, 2022

This PR introduces the DeepSpeed Communication Logger, which implements logging for all DeepSpeed communication calls.

After this PR, all communication calls from #1985 are automatically detected and logged (depending on config options). A final summary is then printed. For example:

Comm. Op            Message Size        Count               Total Latency(ms)   Avg Latency(ms)     tput_avg (Gbps)     busbw_avg (Gbps)    
broadcast
                    0B                  2                   0.19                0.10                0.00                0.00                
                    2.0 KB              146                 11.12               0.08                0.43                0.41                
                    6.0 KB              24                  1.84                0.08                1.29                1.21                
                    6.31 KB             1                   0.12                0.12                0.87                0.82                
                    8.0 KB              24                  1.85                0.08                1.73                1.62                
                    2.0 MB              24                  2.05                0.08                397.06              372.24              
                    4.0 MB              1                   0.15                0.15                434.01              406.89              
                    6.0 MB              24                  2.78                0.12                872.00              817.50              
                    8.0 MB              48                  6.36                0.13                1020.36             956.59              
                    98.25 MB            1                   8317.12             8317.12             0.20                0.19                
barrier
                    0B                  3                   237.96              79.32               0.00                0.00                
all_gather
                    128.0 B             146                 70.81               0.38                0.01                0.01                
                    384.0 B             24                  11.45               0.37                0.02                0.02                
                    512.0 B             24                  10.72               0.38                0.02                0.02                
                    128.0 KB            24                  12.44               0.52                4.20                3.94                
                    256.0 KB            1                   0.45                0.45                9.29                8.71                
                    384.0 KB            24                  18.01               0.57                11.50               10.78               
                    512.0 KB            48                  43.42               0.65                13.22               12.40               
                    6.14 MB             2                   17.76               8.88                40.96               38.40               
all_gather_base
                    128.0 B             1460                87.36               0.06                0.03                0.03                
                    256.0 B             147                 3.18                0.02                0.67                0.63                
                    384.0 B             240                 14.28               0.06                0.10                0.10                
                    512.0 B             240                 14.43               0.06                0.14                0.13                
                    128.0 KB            48                  3.38                0.07                29.79               27.93               
                    128.12 KB           24                  0.24                0.00                471.78              442.30              
                    256.0 KB            2                   0.18                0.09                45.52               42.67               
                    384.38 KB           72                  3.69                0.05                353.74              331.63              
                    512.0 KB            96                  5.22                0.06                412.72              386.92              
                    512.12 KB           24                  1.16                0.05                275.55              258.33              
                    512.5 KB            24                  1.70                0.07                120.35              112.83              
                    6.0 MB              219                 16.63               0.07                1401.24             1313.66             
                    6.0 MB              6                   0.51                0.08                1234.17             1157.03             
                    6.14 MB             11                  1.00                0.08                1308.19             1226.43             
                    6.25 MB             9                   0.31                0.03                15263.03            14309.09            
reduce_scatter_base
                    678.86 MB           40                  602.29              9.69                1468.06             1376.31             
all_reduce
                    1.0 B               20                  5572.57             6.37                0.00                0.00                
                    8.0 B               40                  100.00              0.58                0.00                0.00                
log_summary_barrier
                    0B                  1                   0.11                0.11                0.00                0.00     
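
The grouping in the summary above (per op, then per message size, with count, total latency, and average latency) can be illustrated with a small stand-alone sketch. This is not the logger's actual implementation: `timed_comm_op`, `fake_broadcast`, `comm_log`, and `summarize` are hypothetical names, and the "collective" is a stand-in that only returns its payload. A real GPU collective would also need device synchronization before reading the timer.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory log: op name -> message size (bytes) -> latencies (ms).
comm_log = defaultdict(lambda: defaultdict(list))


def timed_comm_op(log_name):
    """Wrap a communication function so every call is timed and recorded."""
    def decorator(func):
        @wraps(func)
        def wrapper(payload, *args, **kwargs):
            start = time.perf_counter()
            result = func(payload, *args, **kwargs)
            latency_ms = (time.perf_counter() - start) * 1000.0
            # Group by the custom log name and by the payload size in bytes.
            comm_log[log_name][len(payload)].append(latency_ms)
            return result
        return wrapper
    return decorator


@timed_comm_op("broadcast")
def fake_broadcast(payload):
    # Stand-in for a real collective; it just returns the payload.
    return payload


def summarize(log):
    """Return rows of (op, size_bytes, count, total_ms, avg_ms)."""
    rows = []
    for op, sizes in log.items():
        for size, latencies in sorted(sizes.items()):
            total = sum(latencies)
            rows.append((op, size, len(latencies), total, total / len(latencies)))
    return rows


# Three 2 KB calls and one empty call, mirroring the shape of the table.
for _ in range(3):
    fake_broadcast(bytes(2048))
fake_broadcast(bytes(0))

for op, size, count, total_ms, avg_ms in summarize(comm_log):
    print(f"{op:<12}{size:<10}{count:<6}{total_ms:<12.4f}{avg_ms:<12.4f}")
```

The sketch stops at latency. The throughput and bus-bandwidth columns depend on op-specific formulas that are not spelled out in this description, so they are left out here.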

This PR contributes the following features:

  • Automatic detection and logging of all comms calls with custom log names
  • A final comms summary (can be printed manually with the log_summary method)
  • Config support
  • Verbosity levels (for automatic grouping within DeepSpeed, e.g. all_gather_zero3)
  • An associated tutorial/documentation
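
As a rough illustration of the config support, the sketch below builds a DeepSpeed config dictionary with a `comms_logger` section. The key names (`enabled`, `verbose`, `prof_all`, `debug`) are assumed from the DeepSpeed communication logging tutorial and should be checked against the documentation added in this PR; `train_batch_size` is only a placeholder.

```python
import json

# Assumed option names; verify against the communication logging tutorial.
ds_config = {
    "train_batch_size": 8,  # placeholder value, unrelated to logging
    "comms_logger": {
        "enabled": True,    # turn communication logging on
        "verbose": False,   # do not print every call as it happens
        "prof_all": True,   # profile all ops rather than a selected subset
        "debug": False,
    },
}

# Serialize to the JSON form normally stored in a DeepSpeed config file.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

With logging enabled this way, the final summary would be printed through the log_summary method listed above.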

Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

@Quentin-Anthony Quentin-Anthony left a comment

This has been reviewed by Ammar

@Quentin-Anthony Quentin-Anthony changed the title from "DeepSpeed Communication Logging" to "DeepSpeed Communication Profiling and Logging" on Jun 30, 2022
deepspeed/comm/comm.py (review thread: outdated, resolved)
@jeffra jeffra left a comment

LGTM

@Quentin-Anthony
@jeffra and @awan-10 -- All comments have been resolved and we're ready to merge from my side.

@jeffra jeffra merged commit 5349347 into master Jul 25, 2022
@jeffra jeffra deleted the staging-comms-logging-v1 branch July 25, 2022 20:35
3 participants