Resnet npu #412
base: main
Changes from all commits
779f4ea
8571548
caa6d4b
ac2ecb7
b1908dd
de581e1
affb3df
8ba3cd8
82620cc
New file (shell script; filename not shown in this view): launches single-NPU, eager-mode ResNet training.

@@ -0,0 +1,46 @@
# set -aux

export PYTHONUNBUFFERED=1
echo PYTHONUNBUFFERED=$PYTHONUNBUFFERED

# Create the checkpoint output directory if it does not exist.
CHECKPOINT_SAVE_PATH="./graph_checkpoints"
if [ ! -d "$CHECKPOINT_SAVE_PATH" ]; then
    mkdir $CHECKPOINT_SAVE_PATH
fi

#OFRECORD_PATH="./mini-imagenet/ofrecord"
OFRECORD_PATH="/data0/datasets/ImageNet/ofrecord/"

# Fall back to downloading the mini-imagenet OFRecord subset if the dataset directory is missing.
if [ ! -d "$OFRECORD_PATH" ]; then
    wget https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/dataset/imagenet/mini-imagenet.zip
    unzip mini-imagenet.zip
fi

# Training hyperparameters.
OFRECORD_PART_NUM=1
LEARNING_RATE=0.256
MOM=0.875
EPOCH=90
TRAIN_BATCH_SIZE=50
VAL_BATCH_SIZE=50

# SRC_DIR=/path/to/models/resnet50
SRC_DIR=$(realpath $(dirname $0)/..)

# Launch single-device eager-mode training on the NPU.
python3 $SRC_DIR/train.py \
    --ofrecord-path $OFRECORD_PATH \
    --ofrecord-part-num $OFRECORD_PART_NUM \
    --num-devices-per-node 1 \
    --lr $LEARNING_RATE \
    --momentum $MOM \
    --num-epochs $EPOCH \
    --warmup-epochs 5 \
    --train-batch-size $TRAIN_BATCH_SIZE \
    --val-batch-size $VAL_BATCH_SIZE \
    --save $CHECKPOINT_SAVE_PATH \
    --scale-grad \
    --print-interval 1 \
    --load checkpoints/init \
    --device npu
    #--use-gpu-decode \
    #--samples-per-epoch 50 \
    #--val-samples-per-epoch 50 \
New file (shell script; filename not shown): identical to the previous launcher except that it adds the --graph flag; see the note and sketch after the script.

@@ -0,0 +1,47 @@
# set -aux

export PYTHONUNBUFFERED=1
echo PYTHONUNBUFFERED=$PYTHONUNBUFFERED

# Create the checkpoint output directory if it does not exist.
CHECKPOINT_SAVE_PATH="./graph_checkpoints"
if [ ! -d "$CHECKPOINT_SAVE_PATH" ]; then
    mkdir $CHECKPOINT_SAVE_PATH
fi

#OFRECORD_PATH="./mini-imagenet/ofrecord"
OFRECORD_PATH="/data0/datasets/ImageNet/ofrecord/"

# Fall back to downloading the mini-imagenet OFRecord subset if the dataset directory is missing.
if [ ! -d "$OFRECORD_PATH" ]; then
    wget https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/dataset/imagenet/mini-imagenet.zip
    unzip mini-imagenet.zip
fi

# Training hyperparameters.
OFRECORD_PART_NUM=1
LEARNING_RATE=0.256
MOM=0.875
EPOCH=90
TRAIN_BATCH_SIZE=50
VAL_BATCH_SIZE=50

# SRC_DIR=/path/to/models/resnet50
SRC_DIR=$(realpath $(dirname $0)/..)

# Same launcher as above, but with --graph to run in static-graph mode on the NPU.
python3 $SRC_DIR/train.py \
    --ofrecord-path $OFRECORD_PATH \
    --ofrecord-part-num $OFRECORD_PART_NUM \
    --num-devices-per-node 1 \
    --lr $LEARNING_RATE \
    --momentum $MOM \
    --num-epochs $EPOCH \
    --warmup-epochs 5 \
    --train-batch-size $TRAIN_BATCH_SIZE \
    --val-batch-size $VAL_BATCH_SIZE \
    --save $CHECKPOINT_SAVE_PATH \
    --scale-grad \
    --print-interval 1 \
    --load checkpoints/init \
    --graph \
    --device npu
    #--use-gpu-decode \
    #--samples-per-epoch 50 \
    #--val-samples-per-epoch 50 \
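Note: the only difference from the previous script is the added --graph flag, which presumably switches train.py from eager execution to OneFlow's static-graph (nn.Graph) mode; the checkpoint directory name "graph_checkpoints" points the same way. As orientation only, a generic nn.Graph training wrapper follows the pattern sketched below; the class and variable names are placeholders, not this repository's actual train.py code.

import oneflow as flow

# Illustrative sketch of the standard nn.Graph training pattern (assumed, not
# taken from this PR): wrap model, loss, and optimizer in a static graph.
class TrainGraph(flow.nn.Graph):
    def __init__(self, model, loss_fn, optimizer):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn
        self.add_optimizer(optimizer)   # register the optimizer with the graph

    def build(self, images, labels):
        logits = self.model(images)
        loss = self.loss_fn(logits, labels)
        loss.backward()                 # backward is captured inside build()
        return loss

# Usage (hypothetical): graph = TrainGraph(model, loss_fn, optimizer)
#                       loss = graph(images, labels)  # first call compiles the graph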
Changed file (Python loss module; filename not shown): the softmax_cross_entropy-based loss is replaced and a flowvision-style label-smoothing loss class is added.

@@ -83,5 +83,30 @@ def forward(self, input, label):
        # log_prob = input.softmax(dim=-1).log()
        # onehot_label = flow.F.cast(onehot_label, log_prob.dtype)
        # loss = flow.mul(log_prob * -1, onehot_label).sum(dim=-1).mean()
        #loss = flow._C.softmax_cross_entropy(input, onehot_label.to(dtype=input.dtype))

Review comment: the softmax_cross_entropy call is commented out because the NPU does not support it yet.

        loss = flow._C.cross_entropy(input, onehot_label.to(dtype=input.dtype), reduction='none')

Review comment: still need to verify that training converges with this change.

        return loss.mean()

Review comments on the oldLabelSmoothLoss class below:
- This is the loss from flowvision.
- softmax_cross_entropy and dim_gather should not be hard to implement; we can add them to the NPU development plan and try again once they are done.
- I checked: dim_gather is already supported on the NPU.
- I implemented a softmax_cross_entropy op, but the backward pass may still have problems. Also, softmax_cross_entropy has no torch counterpart, and I prefer to develop torch-compatible operators, so I chose the flowvision approach and do not use softmax_cross_entropy. I will try dim_gather later.

class oldLabelSmoothLoss(flow.nn.Module):
    """NLL Loss with label smoothing
    """

    #def __init__(self, smoothing=0.1):
    #super(LabelSmoothingCrossEntropy, self).__init__()
    def __init__(self, num_classes=-1, smooth_rate=0.0):
        super().__init__()
        assert smooth_rate < 1.0
        self.smoothing = smooth_rate
        self.confidence = 1.0 - smooth_rate

    def forward(self, x: flow.Tensor, target: flow.Tensor) -> flow.Tensor:
        # TODO: register F.log_softmax() function and switch flow.log(flow.softmax()) to F.log_softmax()
        logprobs = flow.log_softmax(x, dim=-1)
        # TODO: fix gather bug when dim < 0
        # FIXME: only support cls task now
        nll_loss = -logprobs.gather(dim=1, index=target.unsqueeze(1))
        nll_loss = nll_loss.squeeze(1)
        smooth_loss = -logprobs.mean(dim=-1)
        loss = self.confidence * nll_loss + self.smoothing * smooth_loss
        return loss.mean()
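For orientation, a minimal smoke test of the label-smoothing loss above could look like the sketch below. The batch size, class count, and smooth_rate are assumptions chosen for illustration; only the oldLabelSmoothLoss class from this diff is used.

import oneflow as flow

# Hypothetical sanity check: random logits and integer labels, default device.
num_classes = 1000
logits = flow.randn(50, num_classes)              # batch of 50, matching TRAIN_BATCH_SIZE
target = flow.randint(0, num_classes, (50,))      # integer class indices

criterion = oldLabelSmoothLoss(num_classes=num_classes, smooth_rate=0.1)
loss = criterion(logits, target)
print(loss.numpy())  # scalar: 0.9 * NLL + 0.1 * mean(-log_softmax)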
Review comment (on the commented-out --use-gpu-decode option in the scripts above): use CPU decoding for now.