feature(tj): add monitoring for the gradient conflict metric of MoE in ScaleZero #416
base: dev-multitask-balance-clean
from tensorboardX import SummaryWriter

from lzero.entry.utils import log_buffer_memory_usage, TemperatureScheduler
# Add imports related to performance monitoring
Please change the comments to English and consistently follow a standard format like the one in DI-engine/ding/model/common/encoder.py at main · opendilab/DI-engine.
Only change the MoE grad part that differs from the dev-multitask-balance-clean branch; I'll fix the comment conventions inside the dev-multitask-balance-clean branch myself.
@@ -0,0 +1,2207 @@
diff --git a/lzero/entry/train_unizero_multitask_segment_ddp.py b/lzero/entry/train_unizero_multitask_segment_ddp.py
Delete unused files.

import sys
import os
PROJECT_ROOT = os.path.abspath("/fs-computility/niuyazhe/tangjia/github/LightZero")  # or hardcode the path directly
Is this necessary? If it is, write it as a relative path and don't include personal information.
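A minimal sketch of the relative-path alternative, assuming the script sits at `lzero/entry/<script>.py` so the repository root is two directories above the file's folder. The helper names `find_project_root` and `add_project_root_to_path` are hypothetical. If LightZero is installed into the environment (for example with `pip install -e .`), the `sys.path` manipulation is likely unnecessary altogether.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, levels_up: int = 2) -> Path:
    """
    Overview:
        Resolve the repository root relative to a file instead of a hardcoded,
        user-specific absolute path. (Hypothetical helper.)
    Arguments:
        - start (:obj:`Path`): Path of the current file, typically ``Path(__file__)``.
        - levels_up (:obj:`int`): Directories above the file's folder where the root sits.
          Assumption: 2 for a file at ``lzero/entry/<script>.py``.
    Returns:
        - root (:obj:`Path`): Absolute path of the assumed project root.
    """
    # parents[0] is the file's folder, parents[1] its parent, and so on.
    return start.resolve().parents[levels_up]


def add_project_root_to_path(start: Path, levels_up: int = 2) -> Path:
    """Prepend the resolved root to ``sys.path`` once and return it. (Hypothetical helper.)"""
    root = find_project_root(start, levels_up)
    # Only insert once so repeated imports do not keep growing sys.path.
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

In the training script this would be called as `PROJECT_ROOT = add_project_root_to_path(Path(__file__))`, which works on any machine without embedding a personal directory.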
# ------------------------------------------------------------
# MoE expert-selection statistics functions
# ------------------------------------------------------------
def merge_expert_stats_across_ranks(all_expert_stats):
Move general-purpose utility functions into utils.
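A sketch of what a utils-level version of `merge_expert_stats_across_ranks` could look like. The PR diff does not show its body, so the data layout here is an assumption: each rank contributes a dict mapping `task_id -> {expert_id: selection_count}`, for instance the list filled by `torch.distributed.all_gather_object`. The docstrings loosely follow DI-engine's `Overview`/`Arguments`/`Returns` layout; `selection_frequencies` is a hypothetical helper.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Hypothetical layout: task_id -> {expert_id: selection_count} for one rank.
ExpertStats = Dict[int, Dict[int, int]]


def merge_expert_stats_across_ranks(all_expert_stats: List[ExpertStats]) -> ExpertStats:
    """
    Overview:
        Merge per-rank MoE expert-selection counts into one global table by summing
        the counts of identical (task_id, expert_id) pairs.
    Arguments:
        - all_expert_stats (:obj:`List[ExpertStats]`): One stats dict per rank.
    Returns:
        - merged (:obj:`ExpertStats`): Summed counts per task and expert.
    """
    merged: Dict[int, Counter] = defaultdict(Counter)
    for rank_stats in all_expert_stats:
        if not rank_stats:  # a rank may hold no tasks or return None
            continue
        for task_id, expert_counts in rank_stats.items():
            merged[task_id].update(expert_counts)  # Counter.update adds counts
    return {task_id: dict(counts) for task_id, counts in merged.items()}


def selection_frequencies(counts: Dict[int, int], num_experts: int) -> List[float]:
    """Normalize one task's counts into a length-``num_experts`` frequency vector."""
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]
```

Keeping the merge free of plotting and of `torch.distributed` calls makes it easy to unit-test and to reuse from both the training entry and analysis scripts.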
_GLOBAL_HEATMAP_FIG.set_size_inches(figsize)
return _GLOBAL_HEATMAP_FIG, _GLOBAL_HEATMAP_AX

def create_heatmap_with_values_fast(matrix, task_ids, title="Task-Expert Selection Frequencies"):
Move general-purpose utility functions into utils; this one too.
'wasserstein_stats': wasserstein_stats
}

def create_similarity_heatmap_no_diagonal(similarity_matrix, task_ids, metric_name, title_suffix=""):
The utility functions are already in utils, so please delete the unused ones from lzero/entry/train_unizero_multitask_segment_ddp.py.
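For reference, a sketch of the data preparation a "no diagonal" similarity heatmap implies, assuming the matrix holds pairwise cosine similarities between per-task vectors (for example expert-selection frequencies). Self-similarity is always 1.0 for non-zero vectors, so setting the diagonal to NaN keeps it from dominating the color scale; matplotlib's `imshow` typically draws NaN cells in the colormap's "bad" color. The function names are hypothetical and this is not the PR's implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity; returns 0.0 if either vector is all zeros (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix_no_diagonal(vectors: List[Sequence[float]]) -> List[List[float]]:
    """
    Overview:
        Pairwise cosine similarity between per-task vectors, with the diagonal set to NaN
        so trivial self-similarity is excluded from the heatmap and from any summary stats.
    Arguments:
        - vectors (:obj:`List[Sequence[float]]`): One vector per task, all the same length.
    Returns:
        - matrix (:obj:`List[List[float]]`): Symmetric ``n x n`` matrix with NaN diagonal.
    """
    n = len(vectors)
    return [
        [math.nan if i == j else cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]
```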
betas=(0.9, 0.95),
)

# self.a=1
Delete this.
# "max_moe_layer_grad_conflict",
]

# # # If the model uses MoE, add expert gradient conflict variables
Delete unused comments.
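The diff does not define how variables such as `max_moe_layer_grad_conflict` are computed. A common way to quantify gradient conflict, assumed here, is pairwise cosine similarity between per-task gradients of a shared block (one MoE layer or expert): a negative cosine means two tasks push the shared weights in opposing directions. The sketch below uses plain lists for flattened gradients; with PyTorch one would flatten each task's `.grad` tensors and could use `torch.nn.functional.cosine_similarity`. The metric names returned are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)  # eps guards against all-zero gradients


def grad_conflict_stats(task_grads: List[Sequence[float]]) -> Dict[str, float]:
    """
    Overview:
        Hypothetical gradient-conflict summary for one shared block. For each task pair
        the conflict is ``max(0, -cos(g_i, g_j))``: 0 when the gradients agree or are
        orthogonal, up to 1 when they point in exactly opposite directions.
    Arguments:
        - task_grads (:obj:`List[Sequence[float]]`): One flattened gradient per task.
    Returns:
        - stats (:obj:`Dict[str, float]`): ``avg_cos``, ``max_conflict`` and
          ``conflict_ratio`` (fraction of task pairs with negative cosine).
    """
    pairs = list(combinations(range(len(task_grads)), 2))
    if not pairs:  # fewer than two tasks: no pairwise conflict to measure
        return {"avg_cos": 0.0, "max_conflict": 0.0, "conflict_ratio": 0.0}
    cosines = [_cosine(task_grads[i], task_grads[j]) for i, j in pairs]
    return {
        "avg_cos": sum(cosines) / len(cosines),
        "max_conflict": max(max(0.0, -c) for c in cosines),
        "conflict_ratio": sum(c < 0 for c in cosines) / len(cosines),
    }
```

Under this definition, a per-layer maximum such as `max_moe_layer_grad_conflict` would be the largest `max_conflict` across MoE layers in a training step, which is cheap enough to log to TensorBoard every few iterations.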
import os
PROJECT_ROOT = os.path.abspath("/fs-computility/niuyazhe/tangjia/github/LightZero")  # or hardcode the path directly
sys.path.insert(0, PROJECT_ROOT)
# /fs-computility/niuyazhe/tangjia/github/LightZero/zoo/atari/config/atari_unizero_multitask_segment_ddp_config_debug.py
Delete the debug config.

# Manually destroy the process group
# Manually destroy the process group /fs-computility/niuyazhe/tangjia/github/LightZero/zoo/atari/config/atari_unizero_multitask_segment_ddp_config_debug.py
if dist.is_initialized():
In the config, keep only the settings used for the analysis in our ScaleZero paper.