[Gemini] fix the convert_to_torch_module bug #2269

Merged: 3 commits into hpcaitech:main, Jan 3, 2023

Conversation

feifeibear (Contributor)

What's new

Previously, convert_to_torch_module did not work because it could not find the correct place to store param.data. This PR fixes that bug.
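
For context, here is a minimal sketch of the storage step the fix has to get right (a hypothetical helper, not ColossalAI's actual implementation): names from named_parameters() are dotted, so the materialized tensor must be written onto the submodule that owns the leaf attribute, not onto the root module.

import torch
import torch.nn as nn

def store_param_data(root: nn.Module, dotted_name: str, data: torch.Tensor):
    # Hypothetical helper: walk a name like "block.0.linear.weight" down
    # to the owning submodule ("block.0.linear") before touching .data.
    *path, leaf = dotted_name.split('.')
    module = root
    for attr in path:
        module = getattr(module, attr)
    # Overwrite the owning submodule's parameter data in place, so the
    # torch module sees a plain tensor afterwards.
    getattr(module, leaf).data = data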

You can test the feature with the following code.

torchrun --standalone --nproc_per_node=1 XXX.py

import colossalai
from colossalai.utils import get_current_device
from colossalai.nn.parallel import GeminiDDP
from colossalai.nn.parallel.utils import convert_to_torch_module
from colossalai.tensor import ColoTensor, ProcessGroup
from colossalai.utils.model.colo_init_context import ColoInitContext
from transformers import BloomForCausalLM


config = {
    "BATCH": 4,
    "gradient_accumulation_steps": 1,
    "clip_grad_norm": 1,
}

colossalai.launch_from_torch(config=config)

pg = ProcessGroup()

# Initialize the model inside ColoInitContext so its parameters are ColoTensors.
with ColoInitContext():
    model = BloomForCausalLM.from_pretrained("/data2/users/lczht/bloom-560m")

param_num = len([p for n, p in model.named_parameters()])
print(f'param num {param_num}')

# After wrapping with GeminiDDP, the parameters are managed by Gemini chunks
# and are no longer usable as plain torch tensors.
model = GeminiDDP(model,
                  device=get_current_device(),
                  placement_policy="cuda",
                  pin_memory=True,
                  search_range_mb=32)

# Convert back to a plain torch module; every parameter should now be a
# regular torch.nn.Parameter again.
model = convert_to_torch_module(model)

cnt = 0
for name, param in model.named_parameters(recurse=True):
    if isinstance(param, ColoTensor):
        # With the fix in place, this branch should never be taken.
        print('meet ColoTensor', name)
    cnt += 1
print(f'{cnt} params found')
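
If the fix works as intended, the final loop should print no 'meet ColoTensor' lines, and the count it reports should match the param num printed before wrapping with GeminiDDP.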

feifeibear merged commit af32022 into hpcaitech:main on Jan 3, 2023.