```yaml
data:
  instruct_data: "/home/ec2-user/data/train_file.jsonl"  # Fill
  data: ""  # Optionally fill with pretraining data
  eval_instruct_data: "/home/ec2-user/data/test_file.jsonl"  # Optionally fill

model_id_or_path: "/home/ec2-user/mistral_models/"  # Change to downloaded path

lora:
  rank: 32

save_adapters: True  # save only trained LoRA adapters. Set to False to merge LoRA adapter into the base model and save full fine-tuned model

run_dir: "/home/ec2-user/outputs"  # Fill
```
3. I use this command to initiate the fine-tuning process:

```bash
torchrun --nproc-per-node 1 --master_port $RANDOM -m train example/7B.yaml
```

4. After fine-tuning is finished, I download mistral-inference and run this:

```bash
mistral-chat /home/ec2-user/mistral_models/ --max_tokens 256 --temperature 1.0 --instruct --lora_path /home/ec2-user/outputs/checkpoints/checkpoint_000300/consolidated/lora.safetensors
```
This is the error I get:
```
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin/mistral-chat", line 8, in <module>
    sys.exit(mistral_chat())
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/mistral_inference/main.py", line 207, in mistral_chat
    fire.Fire(interactive)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/mistral_inference/main.py", line 91, in interactive
    model.load_lora(Path(lora_path))
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/mistral_inference/lora.py", line 101, in load_lora
    self._load_lora_state_dict(state_dict, scaling=scaling)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/mistral_inference/lora.py", line 135, in _load_lora_state_dict
    + (lora_state_dict[name + ".lora_B.weight"] @ lora_state_dict[name + ".lora_A.weight"])
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 99.38 MiB is free. Including non-PyTorch memory, this process has 21.88 GiB memory in use. Of the allocated memory 21.57 GiB is allocated by PyTorch, and 13.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
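From the traceback, the OOM occurs because `load_lora` materializes `lora_B @ lora_A` on a GPU that is already holding the full model (21.57 GiB of 21.99 GiB allocated by PyTorch). One workaround I considered is merging the adapter into the base weights on the CPU and loading the merged checkpoint instead. A minimal sketch, assuming a `consolidated.safetensors` base file, the usual `*.lora_A.weight` / `*.lora_B.weight` key layout next to each `*.weight`, and a placeholder scaling factor (none of these are confirmed against the actual checkpoint schema):

```python
# Sketch: merge LoRA adapters into base weights on CPU to avoid GPU OOM.
# The base filename, key layout, and scaling value are assumptions.
import torch
from safetensors.torch import load_file, save_file

base_path = "/home/ec2-user/mistral_models/consolidated.safetensors"  # assumed filename
lora_path = "/home/ec2-user/outputs/checkpoints/checkpoint_000300/consolidated/lora.safetensors"

base = load_file(base_path)  # tensors load on CPU by default
lora = load_file(lora_path)

scaling = 2.0  # lora_alpha / rank; assumed, must match what training used

for key in list(lora):
    if not key.endswith(".lora_A.weight"):
        continue
    name = key[: -len(".lora_A.weight")]
    a = lora[name + ".lora_A.weight"].float()  # (rank, in_features)
    b = lora[name + ".lora_B.weight"].float()  # (out_features, rank)
    w = name + ".weight"
    # Same update the failing line computes, but on CPU: W += scaling * (B @ A)
    base[w] = (base[w].float() + scaling * (b @ a)).to(base[w].dtype)

save_file(base, "/home/ec2-user/mistral_models/consolidated_merged.safetensors")
```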
### Expected Behavior
I expect to be able to interact with the fine-tuned model.
### Additional Context
_No response_
### Suggested Solutions
_No response_
### Python Version

### Pip Freeze
```yaml
data:
  instruct_data: "/home/ec2-user/data/train_file.jsonl"  # Fill
  data: ""  # Optionally fill with pretraining data
  eval_instruct_data: "/home/ec2-user/data/test_file.jsonl"  # Optionally fill

model_id_or_path: "/home/ec2-user/mistral_models/"  # Change to downloaded path

lora:
  rank: 32

seq_len: 2048
batch_size: 1
max_steps: 300

optim:
  lr: 6.e-5
  weight_decay: 0.1
  pct_start: 0.05

seed: 0
log_freq: 1
eval_freq: 100
no_eval: False
ckpt_freq: 100

save_adapters: True  # save only trained LoRA adapters. Set to False to merge LoRA adapter into the base model and save full fine-tuned model

run_dir: "/home/ec2-user/outputs"  # Fill
```
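Note that, per the `save_adapters` comment above, rerunning training with `save_adapters: False` should make the trainer write the fully merged model itself, so `mistral-chat` could presumably be pointed at that checkpoint without `--lora_path` and without merging on the GPU at load time.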