Requesting Support for phi-1_5 by Microsoft #3146
Comments
Attempting to convert the pytorch model bin:
$ python3 convert.py ~/models/microsoft/phi-1
Loading model file /Users/gardner/models/microsoft/phi-1/pytorch_model.bin
Traceback (most recent call last):
File "/Users/gardner/src/llama.cpp/convert.py", line 1208, in <module>
main()
File "/Users/gardner/src/llama.cpp/convert.py", line 1157, in main
params = Params.load(model_plus)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gardner/src/llama.cpp/convert.py", line 288, in load
params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gardner/src/llama.cpp/convert.py", line 203, in loadHFTransformerJson
n_embd = config["hidden_size"]
~~~~~~^^^^^^^^^^^^^^^
KeyError: 'hidden_size'
From the model card:
|
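(For context on the error above: convert.py assumes a Llama-style config.json with a hidden_size key, while phi-1's MixFormer-style config uses different names (n_embd, n_layer, and so on, if I recall correctly). An easy way to check which keys are actually present, using the same path as in the traceback:
$ python3 -c "import json; print(sorted(json.load(open('/Users/gardner/models/microsoft/phi-1/config.json')).keys()))"
Whatever names show up there are what a converter for this architecture would need to map.) |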
Yes, this would be very cool if implemented in llama.cpp |
+1 |
Here is an example of how to integrate it in
|
@wsxiaoys, could you please help with this model as well? Thanks |
Is this supported now? |
+1 |
This fine-tune (https://huggingface.co/Open-Orca/oo-phi-1_5) has better ARC than llama2 7b, and is trained on an Orca dataset that also allows ChatML contextualizing. Altogether very powerful. Would be a great addition to the llama.cpp ecosystem! |
Still waiting for support here |
@monatis If you have time to do your magic here. |
@goerch Comments just show that people are still waiting for a solution after a month and that the issue needs attention. |
I'm happy that it has your attention and am waiting for your contribution then. |
Any update on this one? I tried using the converter Python scripts, but without success. What is interesting is that you can find GGUF models that cannot be used by llama.cpp: https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm |
Different architecture, and you are trying to use WASM? |
I'm in the middle of prototyping the conversion script. Does llama.cpp or ggml support MixFormer? Phi-1 and Phi-1_5 need it. |
Microsoft just announced Phi 2 is coming. |
Is Phi better than Mistral? |
A bicycle isn't better than a car; it's different. |
Phi-2 was just released: https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/welcoming-mistral-phi-jais-code-llama-nvidia-nemotron-and-more/ba-p/3982699 Any progress on this? I would like to test this with llama.swiftui, which has the most potential. |
Microsoft is just like, "I want all the models!" 😂 @bachittle So far, no. I've been working on just trying to understand the fundamentals of all of this stuff, if I'm being honest, and it takes time (it's definitely a skill). I haven't seen anyone step up to the plate yet, and I have limited time and resources and constantly have to make trade-offs for what I choose to spend time on. At the moment, my primary goal (which has been constant for the past year) is to figure out how to implement a sane development environment with LLM integration, so that has more priority for me than anything. Basically, I need the LLMs to have a memory without making unnecessary API calls or utilizing overly convoluted pipelines (I already have a working PoC). Once I crack how to streamline the grammar into a Function API Call, I'll be shifting gears from that point forward. I'll be open-sourcing all of it once I do. I did provide a link to the MixFormer paper and the original source code for the data structures, and that should be enough to get started with. I even outlined the conversion process, even though it's still incomplete. I would love to see this model included though. |
Here's how to get the phi-2 weights, for reference. |
https://huggingface.co/SkunkworksAI/phi-2/tree/main/data/model
We put the Azure weights up for phi-2 on Hugging Face |
It looks like they're officially uploaded by Microsoft now: https://huggingface.co/microsoft/phi-2 |
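For anyone who wants to pull those down locally, a plain git clone of the official repo works (assuming git-lfs is installed):
$ git lfs install
$ git clone https://huggingface.co/microsoft/phi-2 |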
@aiaicode let's close this one now that llama-based phi-2 has superseded phi-1.5 ✅ |
And is phi-2 support being worked on?
|
@gardner They're all the same architecture, |
Any additional work required after the merge of #4490 ? |
@WiSaGaN Yes, it would be good to know. It would still be useful to have 1.5 because it would be about 2x faster, yet also very powerful. |
The model architecture for Phi-1.5 is the same as Phi-2 (just a different number of layers), so no additional changes are required. Download the weights from https://huggingface.co/microsoft/phi-1_5 and follow the same steps as for Phi-2. |
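For reference, the end-to-end steps would look roughly like this; it's a sketch that assumes a llama.cpp build which already includes the Phi support from #4490, plus git-lfs for the download:
$ git clone https://huggingface.co/microsoft/phi-1_5
$ python convert-hf-to-gguf.py phi-1_5
$ ./main -m phi-1_5/ggml-model-f16.gguf -e -c 2048 -n 128 -p "Question: What is a ribosome?\nAnswer:"
The converter writes ggml-model-f16.gguf into the model directory by default, as shown in the logs below. |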
Maybe it's better to use phi-2 now instead? |
I'm applying a generic approach because all 3 models are nearly identical, differing only in layer counts and hyperparameters.
14:21:38 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ python convert-hf-to-gguf.py stash/models/microsoft/phi-1
Loading model: phi-1
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to 'stash/models/microsoft/phi-1/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model.bin'
/mnt/valerie/llama.cpp/.venv/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
token_embd.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
# omitted tensor output for brevity
blk.23.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.23.attn_norm.bias, n_dims = 1, torch.float16 --> float32
blk.23.attn_qkv.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_qkv.bias, n_dims = 1, torch.float16 --> float32
blk.23.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_output.bias, n_dims = 1, torch.float16 --> float32
blk.23.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_up.bias, n_dims = 1, torch.float16 --> float32
blk.23.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_down.bias, n_dims = 1, torch.float16 --> float32
output_norm.weight, n_dims = 1, torch.float16 --> float32
output_norm.bias, n_dims = 1, torch.float16 --> float32
output.weight, n_dims = 2, torch.float16 --> float16
output.bias, n_dims = 1, torch.float16 --> float32
Model successfully exported to 'stash/models/microsoft/phi-1/ggml-model-f16.gguf'
I didn't think it would work, but I successfully converted the original 32-bit phi-1.
14:21:51 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ python gguf-py/scripts/gguf-dump.py --no-tensors stash/models/microsoft/phi-1/ggml-model-f16.gguf
* Loading: stash/models/microsoft/phi-1/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 22 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 245
3: UINT64 | 1 | GGUF.kv_count = 19
4: STRING | 1 | general.architecture = 'phi'
5: STRING | 1 | general.name = 'PHI'
6: UINT32 | 1 | phi.context_length = 2048
7: UINT32 | 1 | phi.embedding_length = 2048
8: UINT32 | 1 | phi.feed_forward_length = 8192
9: UINT32 | 1 | phi.block_count = 24
10: UINT32 | 1 | phi.attention.head_count = 32
11: UINT32 | 1 | phi.attention.head_count_kv = 32
12: FLOAT32 | 1 | phi.attention.layer_norm_epsilon = 9.999999747378752e-06
13: UINT32 | 1 | phi.rope.dimension_count = 32
14: UINT32 | 1 | general.file_type = 1
15: BOOL | 1 | tokenizer.ggml.add_bos_token = False
16: STRING | 1 | tokenizer.ggml.model = 'gpt2'
17: [STRING] | 51200 | tokenizer.ggml.tokens
18: [INT32] | 51200 | tokenizer.ggml.token_type
19: [STRING] | 50000 | tokenizer.ggml.merges
20: UINT32 | 1 | tokenizer.ggml.bos_token_id = 50256
21: UINT32 | 1 | tokenizer.ggml.eos_token_id = 50256
22: UINT32 | 1 | tokenizer.ggml.unknown_token_id = 50256
So, I'm currently working on the source code for running inference with all 3 models. Otherwise, it won't work, because the original author never intended to support all 3 models. I'm hoping I'll have it done by tonight. |
Well, it's progress:
14:40:09 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ ./main -m stash/models/microsoft/phi-1/ggml-model-f16.gguf --color -e -s 1337 -c 2048 -n 512 -p "What is the role of ribosomes in cellular biology?"
Log start
main: build = 1667 (1d4bcd2)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1337
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from stash/models/microsoft/phi-1/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight f16 [ 2048, 51200, 1, 1 ]
# tensors omitted for brevity
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi
llama_model_loader: - kv 1: general.name str = PHI
llama_model_loader: - kv 2: phi.context_length u32 = 2048
llama_model_loader: - kv 3: phi.embedding_length u32 = 2048
llama_model_loader: - kv 4: phi.feed_forward_length u32 = 8192
llama_model_loader: - kv 5: phi.block_count u32 = 24
llama_model_loader: - kv 6: phi.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - type f32: 147 tensors
llama_model_loader: - type f16: 98 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.42 B
llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
llm_load_print_meta: general.name = PHI
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.09 MiB
llm_load_tensors: mem required = 2706.37 MiB
................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_build_graph: non-view tensors processed: 582/582
llama_new_context_with_model: compute buffer total size = 159.19 MiB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0
What is the role of ribosomes in cellular biology?",
"The role of ribosomes in living and biological life":
"The role of ribosomes in the development of living organisms and their interactions with other organisms through interactions between different organisms",
}
for sentence, meaning in sentence_meanings.items():
if meaning in sentences:
return sentences[meaning]
return "I don't know the meaning of that sentence."
[end of text]
llama_print_timings: load time = 130.76 ms
llama_print_timings: sample time = 14.08 ms / 90 runs ( 0.16 ms per token, 6390.68 tokens per second)
llama_print_timings: prompt eval time = 77.68 ms / 12 tokens ( 6.47 ms per token, 154.49 tokens per second)
llama_print_timings: eval time = 5579.77 ms / 89 runs ( 62.69 ms per token, 15.95 tokens per second)
llama_print_timings: total time = 5705.25 ms
Log end
Still not working though 😓 I think it's the initial prompt I used?
14:41:29 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ ./main -m stash/models/microsoft/phi-1/ggml-model-f16.gguf --color -e -s 1337 -c 2048 -n 512 -p "Question: What is the role of ribosomes in cellular biology?\nAnswer:"
I ended up tweaking it a bit and got the following:
Question: What is the role of ribosomes in cellular biology?
Answer: Ribosomes are essential for maintaining the genetic and behavior of organisms.
Example 2: Mammals
mammals = ['human', 'dog', 'cat']
dietary_habitat_options = ['ocean', 'lake', 'river']
diet_score = [80, 90, 70]
result = carnivore_diet_assessment(mammals, dietary_habitat_options, diet_score)
print(result) # Output: "The carnivorous mammal with human hair is very happy today."
In the above code snippet, we have defined three lists - mammals, dietary_habitat_options, and diet_score. The function then calculates a diet score for each mammal based on the food it eats and returns a string indicating which mammal is the most carnivorous based on its diet score.
Question: What are the different types of animals that can be considered as carnivores in the given context?
Answer: The different types of animals that can be considered as carnivores are human, dog, cat, rabbit, fox, lizard, elephant, monkey, and giraffe.
Example 3: Birds
birds = ['parrot', 'eagle', 'hawk']
pet_habitat_options = ['canary', 'rare', 'mammal']
pet_behavior_scores = [5, 7, 6]
result = bird_pet_assessment(birds, pet_habitat_options, pet_behavior_scores)
print(result) # Output: "The parrot is very rare and cannot fly in this sky."
In the above code snippet, we have defined three lists - birds, pet_habitat_options, and pet_behavior_scores. The function then calculates a pet behavior score for each bird based on its habitat and returns a string indicating which bird is the most popularly pet based on its behavior score.
Question: What are the different types of animals that can be considered as common pets in the given context?
Answer: The different types of animals that can be considered as common pets are mammal, reptile, fish, animal, and insect.
Example 4: Fish
fish = ['salmon', 'trout', 'tuna']
pet_habitat_options = ['canary', 'rare', 'mammal']
pet_behavior_sc
llama_print_timings: load time = 130.68 ms
llama_print_timings: sample time = 81.25 ms / 512 runs ( 0.16 ms per token, 6301.54 tokens per second)
llama_print_timings: prompt eval time = 142.41 ms / 17 tokens ( 8.38 ms per token, 119.38 tokens per second)
llama_print_timings: eval time = 33155.96 ms / 511 runs ( 64.88 ms per token, 15.41 tokens per second)
llama_print_timings: total time = 33563.33 ms
Would appreciate your input @ggerganov @ebeyabraham |
@teleprint-me The issue you are seeing is with the model itself and not the inference code. From my experiments with the Phi-1.5 base model, it generates the answer and then keeps on rambling afterwards. |
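For what it's worth, one way to keep a base model from running past its answer with the stock main example is to combine interactive mode with a reverse prompt, so generation pauses as soon as the model starts a new question (the prompt here is just an illustration):
./main -m stash/models/microsoft/phi-1/ggml-model-f16.gguf -e -c 2048 -n 256 -i -r "Question:" -p "Question: What is the role of ribosomes in cellular biology?\nAnswer:"
It's not a fix for the model's behaviour, just a way to cut the output off at a sensible point. |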
@ebeyabraham I got Phi-1.5 to respond successfully.
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0
Question: What is the role of ribosomes in cellular biology?
Answer: Ribosomes are responsible for synthesizing proteins, which are essential for various cellular processes. They act as protein factories within cells and play a crucial role in maintaining the overall functionality of living organisms.
[end of text]
llama_print_timings: load time = 131.71 ms
llama_print_timings: sample time = 6.35 ms / 41 runs ( 0.15 ms per token, 6453.64 tokens per second)
llama_print_timings: prompt eval time = 143.38 ms / 17 tokens ( 8.43 ms per token, 118.56 tokens per second)
llama_print_timings: eval time = 2548.84 ms / 40 runs ( 63.72 ms per token, 15.69 tokens per second)
llama_print_timings: total time = 2711.62 ms
Log end
I thought the original issue I faced was due to either my modifications or Phi-1 itself. I wasn't sure, so I appreciate the feedback. I realized after some further testing and prompt modifications that I could get desirable output from Phi-1 as well. @ggerganov I created a PR for consolidating the Phi models: #4552 |
I realized after some light testing that the prompts for each model need to be adjusted accordingly. For example, Phi-1 will respond with Python source code, which makes sense because that's how it was trained. Phi-1.5 is more flexible and can be thought of as an improvement on Phi-1. So, prompting both models with the following command actually improves output:
./main -m stash/models/microsoft/phi-1/phi-1-q8_0.gguf --color -e -s 1337 -c 2048 -n 512 -p "Question: How to create a list of prime numbers in Python?\nAnswer:"
All 3 models are base models, so none of them have been finetuned for chat at all. I've been attempting to gather and aggregate some custom data to build a LoRA, and I think this might be the perfect model to test it out with. I don't know how long it's going to take me, though, because I've had to sacrifice development time for generating an income. No LoRAs (that I know of) have been publicly released, so I was thinking about releasing mine if I succeed. It will probably take a while because it's a personal and custom dataset and I'll need to filter out my personal information from it, due to it being an aggregate of my past chats with other models. I originally estimated 2 weeks and that was over 2 months ago. Datasets are no joke! 😅 I think it'll be worth it though! |
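For reference, a q8_0 file like the one used above can be produced from the f16 conversion with the stock quantize tool, along these lines:
./quantize stash/models/microsoft/phi-1/ggml-model-f16.gguf stash/models/microsoft/phi-1/phi-1-q8_0.gguf q8_0
(Any of the usual quantization types can be substituted for q8_0.) |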
Exciting news - just added support for Phi-1 and Phi-1_5 models! While many may predominantly use Phi-2, this addition is fantastic as it broadens our range of models for experimentation – and they're all MIT licensed! With this update, we now have GPT2, TinyLlama, and various Phi models in our toolkit. 🥳 Currently, fine-tuning is tailored to Llama models, but I'm hopeful that we can develop a more inclusive approach that maintains backward compatibility. Imagine the possibilities if we could extend fine-tuning capabilities to these new models locally – it's an amazing prospect! Looking forward to more collaborative innovations and experiments with these models! |
@teleprint-me Excellent news! The finetuning is essential for business/domain use: it reduces the communication cost (the user doesn't have to explain domain-specific terms etc), and so also the computation cost. Unfortunately the |
@tom-adsfund Yeah, I experienced it firsthand with llama-7b. I never counted the number of tokens I used for training (I should have, in retrospect) and then realized it was going to take a few days to complete. There are some known issues with fine-tuning (#4703) and training from scratch (#4791). I'll be experimenting more with these in the coming months. Not sure if back-propagation is implemented? Waiting for the ggml backend updates to get merged (#4766). Also exploring the Vulkan backends in the meantime. Any supported fine-tuning is done via LoRA (#4645). I remember another PR that was low-priority that's still hanging around, but I can't find it. Most of them haven't been active since July 2023. |
@teleprint-me Thanks for the insights. Yeah, I didn't realize that one of the problems was with quantized models. |
@teleprint-me I think you should continue with this work. These phi models are superb, and it's a shame I can't use the simple convert script to get the GGUF format. I've tried using candle as well, but it requires making custom edits to metadata. |
I think this should be closed, as the support is there and has been for a while. I myself have been using Phi-2 happily for some small tasks with a JSON grammar. It works as well as can be expected when used properly; it's very impressive for a tiny model! Let's close this? |
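For anyone curious, a JSON-grammar setup like that typically just uses the stock grammar file that ships with llama.cpp, along these lines (the model filename here is whatever phi-2 quantization you grabbed):
./main -m models/phi-2.Q4_K_M.gguf --grammar-file grammars/json.gbnf -e -n 256 -p "Describe a book as a JSON object with title, author and year fields:\n"
The grammar constrains sampling to tokens that keep the output inside the JSON grammar, which is a big part of what makes such a small model usable for structured output. |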
Original issue text: It's a 1.3B SOTA model that competes with <10B models.
https://huggingface.co/microsoft/phi-1_5
https://huggingface.co/microsoft/phi-1
Would be blazing fast with Llama.cpp.