llama3 rope #55

Conversation
Hi @RaymondLi0! Functionally this looks like what we want (pending model conversion), but are you confident (i.e. have you checked) that the forward and backward passes of hf-llama and fast-llm-llama are the same?
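For reference, the llama3 rotary variant differs from the default rotary embeddings only in how the inverse frequencies are rescaled before the rotation angles are built, so a mismatch there is exactly what a forward-pass comparison against hf-llama would catch. Below is a minimal sketch of that scaling, using the parameter names from Meta's reference implementation (factor, low_freq_factor, high_freq_factor, original_context_length) rather than Fast-LLM's config fields:

import math

def llama3_scale_frequencies(inv_freqs, factor=8.0, low_freq_factor=1.0,
                             high_freq_factor=4.0, original_context_length=8192):
    # Rescale rotary inverse frequencies the way Llama 3.1 does: keep high
    # frequencies, divide low frequencies by `factor`, interpolate in between.
    low_freq_wavelen = original_context_length / low_freq_factor
    high_freq_wavelen = original_context_length / high_freq_factor
    scaled = []
    for freq in inv_freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:      # high-frequency band: unchanged
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:     # low-frequency band: fully scaled
            scaled.append(freq / factor)
        else:                                # smooth transition between the two bands
            smooth = (original_context_length / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled

If one implementation applies this scaling and the other does not, their outputs will differ for any non-trivial sequence.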
I didn't review for correctness, but looks ok. I'll leave the approval to @tscholak
@@ -127,6 +127,11 @@ class TransformerArchitectureConfig(BaseModelArchitectureConfig):
         desc="Scale for the rotary positional embeddings. Default: -math.log(10000) = -9.210",
         hint=FieldHint.feature,
     )
+    rotary_scaling_type: RotaryScalingType = Field(
How about rotary_embedding_type, so we make it more general? Or better, merge it with use_rotary_embeddings into rotary_embeddings: RotaryEmbeddingType | None.
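A minimal sketch of what that merged option could look like, written as a plain dataclass stand-in rather than the actual Fast-LLM Field/FieldHint machinery shown in the diff above (the enum members are assumptions):

import dataclasses
import enum
from typing import Optional

class RotaryEmbeddingType(str, enum.Enum):
    # Hypothetical members; the real Fast-LLM enum may differ.
    default = "default"
    llama3 = "llama3"

@dataclasses.dataclass
class RotaryConfigSketch:
    # None disables rotary embeddings; an enum value selects the variant,
    # replacing the separate use_rotary_embeddings boolean.
    rotary_embeddings: Optional[RotaryEmbeddingType] = None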
Would make sense. Although that would make this a breaking change, right? Would that be OK?
@tscholak @jlamypoirier I could merge into rotary_embeddings: RotaryEmbeddingType | None, and keep use_rotary_embeddings as a deprecated argument to maintain backwards compatibility. wdyt?
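One hedged way to keep the old flag working during a deprecation window, sketched with plain strings instead of Fast-LLM's actual config machinery (the function name and behaviour are assumptions for illustration):

import warnings
from typing import Optional

def resolve_rotary_setting(use_rotary_embeddings: Optional[bool],
                           rotary_embeddings: Optional[str]) -> Optional[str]:
    # Map the deprecated boolean flag onto the new enum-valued field.
    if use_rotary_embeddings is not None:
        warnings.warn(
            "use_rotary_embeddings is deprecated; set rotary_embeddings instead.",
            DeprecationWarning,
        )
        if use_rotary_embeddings and rotary_embeddings is None:
            rotary_embeddings = "default"  # old True meant the default rotary variant
        elif not use_rotary_embeddings:
            rotary_embeddings = None  # old False meant no rotary embeddings
    return rotary_embeddings

# Example: an old config that only sets the boolean flag still resolves.
assert resolve_rotary_setting(True, None) == "default"
assert resolve_rotary_setting(None, "llama3") == "llama3"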
ok!
ok!
Hmm, in addition to breaking existing checkpoints it will make conversion more complex than needed. Let's postpone to another PR?
Sounds good! We only want to unblock model conversion at this point. Thanks!
Haven't done that check. Do we have existing tests comparing forward/backward of fast-llm and hf-transformers? If not, I can look into adding this.
There is one in test_checkpoint https://github.com/ServiceNow/Fast-LLM/blob/main/tests/test_checkpoint.py#L31. It could work for this one if we added a llama3 model to the testing suite (in
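For context, a rough sketch of what such a forward-equivalence check can look like; the real test in tests/test_checkpoint.py is the authoritative version, and fast_llm_forward here is a hypothetical callable standing in for the converted Fast-LLM model:

import torch
from transformers import AutoModelForCausalLM

def check_forward_equivalence(fast_llm_forward, hf_model_path, atol=1e-3):
    # Compare logits of a converted model against the HF reference on one random batch.
    hf_model = AutoModelForCausalLM.from_pretrained(hf_model_path, torch_dtype=torch.float32)
    hf_model.eval()
    input_ids = torch.randint(0, hf_model.config.vocab_size, (2, 128))
    with torch.no_grad():
        hf_logits = hf_model(input_ids).logits
        fast_llm_logits = fast_llm_forward(input_ids)
    torch.testing.assert_close(fast_llm_logits, hf_logits, atol=atol, rtol=1e-3)

The backward pass can be checked in the same spirit by comparing gradients after running a loss backward on both sides.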
LGTM, thanks @RaymondLi0!
@RaymondLi0 @jlamypoirier merge?
Looks ok, I'll do some adjustments and then merge.
Main changes I made:
Still need to run tests; @RaymondLi0 @tscholak feel free to have a look and comment.
Got the tests to pass for both default and llama3 models. Not sure if that's enough? (Could try loading an old checkpoint to be sure)
I'll continue the training of Llama 3.1 8B with this PR as a further test.
This needs revision, because it doesn't work correctly.
I am currently continuing pretraining (CPT) of Llama 3.1 8B on LambdaLabs, and I see this in the config.yaml in the output folder:
model:
  base_model:
    cross_entropy_impl: fused
    init_method_std_embed: 0.015625
    tie_word_embeddings: false
    transformer:
      activation_type: silu
      add_linear_biases: false
      ffn_hidden_size: 14336
      gated: true
      head_groups: 8
      hidden_size: 4096
      init_method_std: 0.015625
      init_method_std_attn_proj: 0.001953125
      init_method_std_mlp_1: 0.015625
      init_method_std_mlp_2: 0.001953125
      init_method_std_qkv: 0.015625
      kv_channels: 128
      mlp_lr_scale:
      - null
      normalization:
        type: rms_norm
      num_attention_heads: 32
      num_layers: 32
      rotary:
        theta: 500000.0
        type: default
    use_position_embeddings: false
    vocab_size: 128256
Instead of type: default in model.base_model.transformer.rotary, I expected to see llama3. Please fix @RaymondLi0 @jlamypoirier
Hmm, this used to work, let me have a look.
@jlamypoirier The logic here: converter.export_name can be a tuple[str, ...], which is why I had moved it to get_nested_dict_value in fc9ef13. The result is that, currently, value is set to None when importing a nested dict value like rope_scaling.rope_type.
Was there a specific reason to remove that logic from
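For context, a minimal sketch of the tuple-aware nested lookup that get_nested_dict_value needs to provide (this is an illustration, not the actual Fast-LLM implementation):

def get_nested_dict_value(config: dict, key):
    # Look up a value by a plain key or a tuple of keys, e.g.
    # ("rope_scaling", "rope_type") -> config["rope_scaling"]["rope_type"].
    if not isinstance(key, tuple):
        key = (key,)
    value = config
    for part in key:
        if not isinstance(value, dict) or part not in value:
            return None  # a flat lookup would miss nested keys and fall back to None
        value = value[part]
    return value

# Example: the llama3 rope type lives under a nested key in the HF config.
hf_config = {"rope_scaling": {"rope_type": "llama3", "factor": 8.0}}
assert get_nested_dict_value(hf_config, ("rope_scaling", "rope_type")) == "llama3"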
I fixed the config comparison tool that should have caught this error in the tests. But another error appeared in the tests; I'm having a look.
There remain 2 points:
Thanks @RaymondLi0
I removed the
tests/test_checkpoint.py
Outdated
@@ -241,7 +241,7 @@ def _compare_configs(config_ref, config_test):
 @pytest.mark.depends(on=["test_converted_distributed"])
 def test_load_pretrained_distributed_checkpoint():
     config = TEST_ARCHITECTURE_CONFIG_CLS.from_dict(
-        yaml.safe_load((_CKPT_PATH / ".." / ".." / "config.yaml").open("r")), strict=False
+        yaml.safe_load((_CKPT_PATH / ".." / ".." / "config.yaml").open("r"))["model"]["base_model"], strict=False
I don't understand this one. Wasn't it working before? I'd have caught it in tests...
The config comparison was broken, I think (fixed in fa6440e), which is why this one wasn't caught.
Bugs should be fixed, but I still haven't checked beyond the tests. One of the tests fails (test_load_distributed_checkpoint_dp2), but it's very minor and unrelated to this PR; it looks like something goes wrong with the default init_method_std when loading models.
Thanks @RaymondLi0 and @jlamypoirier for making progress on this. Loading the Llama 3.1 model with the correct rope scaling type works fine now:

model:
  base_model:
    transformer:
      rotary:
        type: llama3
        theta: 500000.0

However, exporting the model to Llama 3.1 format still fails:
@tscholak I am not able to reproduce this issue. Could you send more details? Which config did you run?
Hi @RaymondLi0, I ran the Big QuickStart guide on LambdaLabs, so:

kubectl apply -f - <<EOF
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "fast-llm-train"
spec:
  nprocPerNode: "8"
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          tolerations:
            - key: nvidia.com/gpu
              value: "true"
              operator: Equal
              effect: NoSchedule
          containers:
            - name: pytorch
              image: ghcr.io/servicenow/fast-llm:sha-41fa4bd
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1
                  memory: "1024Gi"
                  cpu:
                requests:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1
                  memory: "1024Gi"
                  cpu: 128
              command:
                - /bin/bash
                - -c
                - |
                  torchrun --rdzv_backend=static \
                    --rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
                    --node_rank=\${RANK} \
                    --nproc_per_node=\${PET_NPROC_PER_NODE} \
                    --nnodes=\${PET_NNODES} \
                    --max_restarts=0 \
                    --rdzv_conf=timeout=3600 \
                    --no_python \
                    fast-llm train gpt \
                    --config fast-llm-tutorial/train-config.yaml
              env:
                - name: PYTHONHASHSEED
                  value: "0"
                - name: WANDB_API_KEY_PATH
                  value: "/app/fast-llm-tutorial/.wandb_api_key"
                - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                  value: "1"
                - name: NCCL_DEBUG
                  value: "INFO"
              securityContext:
                capabilities:
                  add:
                    - IPC_LOCK
              volumeMounts:
                - mountPath: /app/fast-llm-tutorial
                  name: fast-llm-inputs
                - mountPath: /dev/shm
                  name: dshm
          volumes:
            - name: fast-llm-inputs
              persistentVolumeClaim:
                claimName: pvc-fast-llm-tutorial
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: "1024Gi"
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        spec:
          tolerations:
            - key: nvidia.com/gpu
              value: "true"
              operator: Equal
              effect: NoSchedule
          containers:
            - name: pytorch
              image: ghcr.io/servicenow/fast-llm:sha-41fa4bd
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1
                  memory: "1024Gi"
                  cpu:
                requests:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1
                  memory: "1024Gi"
                  cpu: 128
              command:
                - /bin/bash
                - -c
                - |
                  torchrun --rdzv_backend=static \
                    --rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
                    --node_rank=\${RANK} \
                    --nproc_per_node=\${PET_NPROC_PER_NODE} \
                    --nnodes=\${PET_NNODES} \
                    --max_restarts=0 \
                    --rdzv_conf=timeout=3600 \
                    --no_python \
                    fast-llm train gpt \
                    --config fast-llm-tutorial/train-config.yaml
              env:
                - name: PYTHONHASHSEED
                  value: "0"
                - name: WANDB_API_KEY_PATH
                  value: "/app/fast-llm-tutorial/.wandb_api_key"
                - name: TORCH_NCCL_ASYNC_ERROR_HANDLING
                  value: "1"
                - name: NCCL_DEBUG
                  value: "INFO"
              securityContext:
                capabilities:
                  add:
                    - IPC_LOCK
              volumeMounts:
                - mountPath: /app/fast-llm-tutorial
                  name: fast-llm-inputs
                - mountPath: /dev/shm
                  name: dshm
          volumes:
            - name: fast-llm-inputs
              persistentVolumeClaim:
                claimName: pvc-fast-llm-tutorial
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: "1024Gi"
EOF

with

training:
  train_iters: 100_000
  logs:
    interval: 10
  validation:
    iterations: 25
    interval: 1000
  checkpoint:
    interval: 1000
    keep: 5
  test_iters: 0
  export:
    format: llama
    interval: 1_000
  wandb:
    project_name: fast-llm-tutorial
    group_name: Big
    entity_name: null
batch:
  micro_batch_size: 2
  sequence_length: 4096
  batch_size: 512
data:
  format: file
  path: fast-llm-tutorial/dataset/fast_llm_dataset.json
  split: [99, 1, 0]
optimizer:
  weight_decay: 0.1
  beta_1: 0.9
  beta_2: 0.95
  learning_rate:
    base: 6.0e-04
    minimum: 6.0e-05
    decay_style: cosine
    decay_iterations: 100_000
    warmup_iterations: 2000
pretrained:
  format: llama
  path: fast-llm-tutorial/pretrained-model
  model_weights: yes
model:
  base_model:
    transformer:
      use_flash_attention: yes
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 2
  distributed:
    training_dtype: bf16
run:
  experiment_dir: fast-llm-tutorial/experiment
Could you check if you get the same error on toolkit? I ran a very similar config on yul201 that exported the model successfully:
Confirmed that checkpoint loading and saving work as designed for Llama-3.1-8B.
Description
Closes #39

Type of change
Select all that apply:

Changes
MODEL=llama3 pytest ./tests/test_checkpoint.py

Testing
MODEL=llama3 pytest ./tests/test_checkpoint.py passes.
When "sabotaging" the conversion here (Fast-LLM/fast_llm/models/gpt/conversion.py, lines 251 to 270 in fc9ef13) with
then test_run_converted_model fails, showing that model outputs differ when using an incorrect rotary config.
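Schematically, that sanity check amounts to the following, written with hypothetical helpers (convert_model, run_forward and reference_logits are placeholders, not the real fixtures in tests/test_checkpoint.py):

import torch

def sabotage_check(reference_logits, run_forward, convert_model, hf_checkpoint):
    # Convert the checkpoint once with the correct rotary config and once with a
    # deliberately wrong one; only the correct conversion should match the reference.
    good = convert_model(hf_checkpoint, rotary_type="llama3")
    bad = convert_model(hf_checkpoint, rotary_type="default")
    input_ids = torch.randint(0, 1000, (1, 64))
    assert torch.allclose(run_forward(good, input_ids), reference_logits(input_ids), atol=1e-3)
    assert not torch.allclose(run_forward(bad, input_ids), reference_logits(input_ids), atol=1e-3)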