
Requesting Support for phi-1_5 by Microsoft #3146

Closed
aiaicode opened this issue Sep 12, 2023 · 43 comments

Comments

@aiaicode

aiaicode commented Sep 12, 2023

It's a 1.3B-parameter SOTA model that competes with models under 10B parameters.

https://huggingface.co/microsoft/phi-1_5
https://huggingface.co/microsoft/phi-1

It would be blazing fast with llama.cpp.

@gardner

gardner commented Sep 13, 2023

Attempting to convert the PyTorch model (pytorch_model.bin):

$ python3 convert.py ~/models/microsoft/phi-1
Loading model file /Users/gardner/models/microsoft/phi-1/pytorch_model.bin
Traceback (most recent call last):
  File "/Users/gardner/src/llama.cpp/convert.py", line 1208, in <module>
    main()
  File "/Users/gardner/src/llama.cpp/convert.py", line 1157, in main
    params = Params.load(model_plus)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gardner/src/llama.cpp/convert.py", line 288, in load
    params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gardner/src/llama.cpp/convert.py", line 203, in loadHFTransformerJson
    n_embd           = config["hidden_size"]
                       ~~~~~~^^^^^^^^^^^^^^^
KeyError: 'hidden_size'

From the model card:

Remark. In the generation function, our model currently does not support beam search (num_beams > 1) and `attention_mask` parameters. Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings (instead of the model's).
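
The KeyError above suggests phi's config.json does not use the Llama-style "hidden_size" key. A minimal sketch of a tolerant lookup, assuming GPT-2/MixFormer-style key names such as n_embd and n_layer (the exact key names and the load_phi_hparams helper are illustrative, not taken from convert.py):

# Illustrative only: read config.json and fall back to alternative key names.
import json
from pathlib import Path

def load_phi_hparams(model_dir: str) -> dict:
    config = json.loads((Path(model_dir) / "config.json").read_text())

    def pick(*keys):
        # Return the value of the first key present in the config, else None.
        for key in keys:
            if key in config:
                return config[key]
        return None

    return {
        "n_embd":  pick("hidden_size", "n_embd"),
        "n_layer": pick("num_hidden_layers", "n_layer"),
        "n_head":  pick("num_attention_heads", "n_head"),
        "n_ctx":   pick("max_position_embeddings", "n_positions"),
        "n_vocab": pick("vocab_size"),
    }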

@x4080

x4080 commented Sep 13, 2023

Yes, this would be very cool if implemented in llama.cpp.

@matankley

+1

@ggerganov
Owner

ggerganov commented Sep 15, 2023

Here is an example of how to integrate it into llama.cpp:

#3187

  • convert script
  • build graph
  • tokenizer (if necessary)
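
For the convert-script item, here is a minimal sketch of writing the phi hyperparameters with gguf-py (the architecture string "phi" and the phi-1 values mirror the GGUF dump further down this thread; the writer calls are a sketch, not the final convert script):

# Hedged sketch: emit phi-1 metadata into a GGUF file with gguf-py.
import gguf

writer = gguf.GGUFWriter("phi-1-f16.gguf", "phi")
writer.add_name("phi-1")
writer.add_context_length(2048)
writer.add_embedding_length(2048)
writer.add_feed_forward_length(8192)
writer.add_block_count(24)
writer.add_head_count(32)
writer.add_head_count_kv(32)
writer.add_layer_norm_eps(1e-5)
writer.add_rope_dimension_count(32)

# ... add tokenizer data and tensors here, then flush everything to disk:
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()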

@gardner

gardner commented Sep 17, 2023

The phi-1 tokenizer is CodeGenTokenizer. This appears to be the same tokenizer used by the first CodeGen models from Salesforce.
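
A quick way to confirm this locally, assuming the transformers package is installed:

# Load the phi-1 tokenizer and print its class; it should report CodeGenTokenizer (or its fast variant).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-1")
print(type(tok).__name__)
print(tok.encode("def hello():"))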

Existing issue: #1299
Existing discussion: #2137

@aiaicode
Author

aiaicode commented Sep 17, 2023

@wsxiaoys, could you please help with this model as well? Thanks.

@v3ss0n

v3ss0n commented Oct 7, 2023

Is this supported now?

@RozanskiT

+1

@tom-adsfund

This fine-tune (https://huggingface.co/Open-Orca/oo-phi-1_5) has better ARC than llama2 7b, and is trained on an Orca dataset that also allows ChatML contextualizing. Altogether very powerful. Would be a great addition to the llama.cpp ecosystem!

@rodas-j

rodas-j commented Oct 10, 2023

Still waiting for support here

@aiaicode
Author

@monatis If you have time to do your magic here.

@goerch
Collaborator

goerch commented Oct 10, 2023

@rodas-j, @aiaicode: do you think these are helpful comments?

@aiaicode
Author

@goerch Comments just show that people are still waiting for a solution after a month and that the issue needs attention.

@goerch
Collaborator

goerch commented Oct 11, 2023

the issue needs attention.

I'm happy that it has your attention and am waiting for your contribution then.

@mtasic85
Contributor

mtasic85 commented Nov 2, 2023

Any update on this one? I tried using the Python converter scripts but without success. What is interesting is that you can find GGUF models that cannot be used by llama.cpp.

https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm
https://huggingface.co/lmz/candle-quantized-phi/tree/main

@v3ss0n

v3ss0n commented Nov 2, 2023

Different architecture, and you are trying to use WASM?

@teleprint-me
Contributor

teleprint-me commented Nov 3, 2023

@ggerganov

I'm in the middle of prototyping the conversion script.

Does llama.cpp or ggml support MixFormer?

Phi-1 and Phi-1_5 need it.

@tom-adsfund

Microsoft just announced Phi 2 is coming.

@x4080

x4080 commented Nov 19, 2023

Is Phi better than Mistral?

@alexander-potemkin

Is Phi better than Mistral?

A bicycle isn't better than a car. It's different.

@bachittle
Contributor

Phi 2 was just released: https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/welcoming-mistral-phi-jais-code-llama-nvidia-nemotron-and-more/ba-p/3982699

Any progress on this? I would like to test this with llama.swiftui; it has the most potential.

@teleprint-me
Contributor

Microsoft is just like, "I want all the models!" 😂

@bachittle So far, no. Honestly, I've been working on understanding the fundamentals of all of this, and it takes time (it's definitely a skill). I haven't seen anyone step up to the plate yet, and I have limited time and resources, so I constantly have to make trade-offs about what I spend time on.

At the moment, my primary goal (which has been constant for the past year) is to figure out how to implement a sane development environment with LLM integration, so that has higher priority for me than anything else. Basically, I need the LLMs to have a memory without making unnecessary API calls or using overly convoluted pipelines (I already have a working PoC). Once I crack how to streamline the grammar into a function API call, I'll be shifting gears. I'll open-source all of it once I do.

I did provide a link to the MixFormer paper and the original source code for the data structures, which should be enough to get started. I also outlined the conversion process, even though it's still incomplete. I would love to see this model included, though.

@bakkot
Contributor

bakkot commented Dec 13, 2023

Here's how to get the phi-2 weights, for reference.

@nisten

nisten commented Dec 13, 2023

https://huggingface.co/SkunkworksAI/phi-2/tree/main/data/model

We put the Azure weights for phi-2 up on Hugging Face.

@TemporalAgent7

It looks like they're officially uploaded by Microsoft now: https://huggingface.co/microsoft/phi-2

@gardner

gardner commented Dec 13, 2023

@aiaicode let's close this one now that llama-based phi-2 has superseded phi-1.5 ✅

@rodas-j

rodas-j commented Dec 13, 2023 via email

@teleprint-me
Contributor

@gardner They're all the same architecture, MixFormerSequentialForCausalLM. What's the rationale behind closing this besides "phi-2 has superseded phi-1.5"? Have the previous architectures grown stale?

@WiSaGaN

WiSaGaN commented Dec 19, 2023

Is any additional work required after the merge of #4490?

@tom-adsfund

@WiSaGaN Yes, it would be good to know. It would still be useful to have 1.5, because it would be about 2x faster yet still very powerful.

@ebeyabraham
Contributor

ebeyabraham commented Dec 19, 2023

The model architecture for Phi-1.5 is the same as Phi-2 (just a different number of layers), so no additional changes are required. Download the weights from https://huggingface.co/microsoft/phi-1_5 and follow the same steps as for Phi-2.
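
Roughly, the steps look like this (the paths and quantization type are placeholders; the convert and run commands follow the same pattern used elsewhere in this thread):

# Convert the HF checkpoint to GGUF, optionally quantize, then run it.
python convert-hf-to-gguf.py models/microsoft/phi-1_5
./quantize models/microsoft/phi-1_5/ggml-model-f16.gguf models/microsoft/phi-1_5/phi-1_5-q8_0.gguf q8_0
./main -m models/microsoft/phi-1_5/phi-1_5-q8_0.gguf -c 2048 -n 256 -e \
  -p "Question: What is the role of ribosomes in cellular biology?\nAnswer:"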

@x4080

x4080 commented Dec 19, 2023

Maybe it's better to use phi-2 now instead?

@teleprint-me
Contributor

teleprint-me commented Dec 20, 2023

I'm applying a generic approach because all 3 models are nearly identical, differing only in layer counts and hyperparameters.

14:21:38 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ python convert-hf-to-gguf.py stash/models/microsoft/phi-1
Loading model: phi-1
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to 'stash/models/microsoft/phi-1/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model.bin'
/mnt/valerie/llama.cpp/.venv/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
token_embd.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
# omitted tensor output for brevity
blk.23.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.23.attn_norm.bias, n_dims = 1, torch.float16 --> float32
blk.23.attn_qkv.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_qkv.bias, n_dims = 1, torch.float16 --> float32
blk.23.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_output.bias, n_dims = 1, torch.float16 --> float32
blk.23.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_up.bias, n_dims = 1, torch.float16 --> float32
blk.23.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_down.bias, n_dims = 1, torch.float16 --> float32
output_norm.weight, n_dims = 1, torch.float16 --> float32
output_norm.bias, n_dims = 1, torch.float16 --> float32
output.weight, n_dims = 2, torch.float16 --> float16
output.bias, n_dims = 1, torch.float16 --> float32
Model successfully exported to 'stash/models/microsoft/phi-1/ggml-model-f16.gguf'

I didn't think it would work, but I successfully converted the original 32-bit phi-1 torch model to a 16-bit ggml model.

14:21:51 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ python gguf-py/scripts/gguf-dump.py --no-tensors stash/models/microsoft/phi-1/ggml-model-f16.gguf
* Loading: stash/models/microsoft/phi-1/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 22 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 245
      3: UINT64     |        1 | GGUF.kv_count = 19
      4: STRING     |        1 | general.architecture = 'phi'
      5: STRING     |        1 | general.name = 'PHI'
      6: UINT32     |        1 | phi.context_length = 2048
      7: UINT32     |        1 | phi.embedding_length = 2048
      8: UINT32     |        1 | phi.feed_forward_length = 8192
      9: UINT32     |        1 | phi.block_count = 24
     10: UINT32     |        1 | phi.attention.head_count = 32
     11: UINT32     |        1 | phi.attention.head_count_kv = 32
     12: FLOAT32    |        1 | phi.attention.layer_norm_epsilon = 9.999999747378752e-06
     13: UINT32     |        1 | phi.rope.dimension_count = 32
     14: UINT32     |        1 | general.file_type = 1
     15: BOOL       |        1 | tokenizer.ggml.add_bos_token = False
     16: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     17: [STRING]   |    51200 | tokenizer.ggml.tokens
     18: [INT32]    |    51200 | tokenizer.ggml.token_type
     19: [STRING]   |    50000 | tokenizer.ggml.merges
     20: UINT32     |        1 | tokenizer.ggml.bos_token_id = 50256
     21: UINT32     |        1 | tokenizer.ggml.eos_token_id = 50256
     22: UINT32     |        1 | tokenizer.ggml.unknown_token_id = 50256

So, I'm currently working on the inference code for all 3 models. Otherwise, it won't work, because the original author never intended to support all 3 models.

I'm hoping I'll have it done by tonight.

@teleprint-me
Contributor

teleprint-me commented Dec 20, 2023

Well, it's progress:

14:40:09 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ ./main -m stash/models/microsoft/phi-1/ggml-model-f16.gguf --color -e -s 1337 -c 2048 -n 512 -p "What is the role of ribosomes in cellular biology?" 
Log start
main: build = 1667 (1d4bcd2)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1337
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from stash/models/microsoft/phi-1/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  2048, 51200,     1,     1 ]
# tensors omitted for brevity
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi
llama_model_loader: - kv   1:                               general.name str              = PHI
llama_model_loader: - kv   2:                         phi.context_length u32              = 2048
llama_model_loader: - kv   3:                       phi.embedding_length u32              = 2048
llama_model_loader: - kv   4:                    phi.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                            phi.block_count u32              = 24
llama_model_loader: - kv   6:                   phi.attention.head_count u32              = 32
llama_model_loader: - kv   7:                phi.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:           phi.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                   phi.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = PHI
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.09 MiB
llm_load_tensors: mem required  = 2706.37 MiB
................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_build_graph: non-view tensors processed: 582/582
llama_new_context_with_model: compute buffer total size = 159.19 MiB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0


What is the role of ribosomes in cellular biology?",
"The role of ribosomes in living and biological life":
"The role of ribosomes in the development of living organisms and their interactions with other organisms through interactions between different organisms",
}

for sentence, meaning in sentence_meanings.items():
if meaning in sentences:
return sentences[meaning]

return "I don't know the meaning of that sentence."

 [end of text]

llama_print_timings:        load time =     130.76 ms
llama_print_timings:      sample time =      14.08 ms /    90 runs   (    0.16 ms per token,  6390.68 tokens per second)
llama_print_timings: prompt eval time =      77.68 ms /    12 tokens (    6.47 ms per token,   154.49 tokens per second)
llama_print_timings:        eval time =    5579.77 ms /    89 runs   (   62.69 ms per token,    15.95 tokens per second)
llama_print_timings:       total time =    5705.25 ms
Log end

Still not working though 😓

I think it's the initial prompt I used?

14:41:29 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ ./main -m stash/models/microsoft/phi-1/ggml-model-f16.gguf --color -e -s 1337 -c 2048 -n 512 -p "Question: What is the role of ribosomes in cellular biology?\nAnswer:"

I ended up tweaking it a bit and got the following.

Question: What is the role of ribosomes in cellular biology?
Answer: Ribosomes are essential for maintaining the genetic and behavior of organisms.

Example 2: Mammals
mammals = ['human', 'dog', 'cat']
dietary_habitat_options = ['ocean', 'lake', 'river']
diet_score = [80, 90, 70]
result = carnivore_diet_assessment(mammals, dietary_habitat_options, diet_score)
print(result) # Output: "The carnivorous mammal with human hair is very happy today."

In the above code snippet, we have defined three lists - mammals, dietary_habitat_options, and diet_score. The function then calculates a diet score for each mammal based on the food it eats and returns a string indicating which mammal is the most carnivorous based on its diet score.

Question: What are the different types of animals that can be considered as carnivores in the given context?
Answer: The different types of animals that can be considered as carnivores are human, dog, cat, rabbit, fox, lizard, elephant, monkey, and giraffe.

Example 3: Birds
birds = ['parrot', 'eagle', 'hawk']
pet_habitat_options = ['canary', 'rare', 'mammal']
pet_behavior_scores = [5, 7, 6]
result = bird_pet_assessment(birds, pet_habitat_options, pet_behavior_scores)
print(result) # Output: "The parrot is very rare and cannot fly in this sky."

In the above code snippet, we have defined three lists - birds, pet_habitat_options, and pet_behavior_scores. The function then calculates a pet behavior score for each bird based on its habitat and returns a string indicating which bird is the most popularly pet based on its behavior score.

Question: What are the different types of animals that can be considered as common pets in the given context?
Answer: The different types of animals that can be considered as common pets are mammal, reptile, fish, animal, and insect.

Example 4: Fish
fish = ['salmon', 'trout', 'tuna']
pet_habitat_options = ['canary', 'rare', 'mammal']
pet_behavior_sc
llama_print_timings:        load time =     130.68 ms
llama_print_timings:      sample time =      81.25 ms /   512 runs   (    0.16 ms per token,  6301.54 tokens per second)
llama_print_timings: prompt eval time =     142.41 ms /    17 tokens (    8.38 ms per token,   119.38 tokens per second)
llama_print_timings:        eval time =   33155.96 ms /   511 runs   (   64.88 ms per token,    15.41 tokens per second)
llama_print_timings:       total time =   33563.33 ms

Would appreciate your input @ggerganov @ebeyabraham

@ebeyabraham
Contributor

ebeyabraham commented Dec 20, 2023

@teleprint-me The issue you are seeing is with the model itself and not the inference code. From my experiments with the Phi-1.5 base model, the model generates the answer and then keeps rambling afterwards.

@teleprint-me
Contributor

teleprint-me commented Dec 20, 2023

@ebeyabraham I got Phi-1.5 to respond successfully.

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0


Question: What is the role of ribosomes in cellular biology?
Answer: Ribosomes are responsible for synthesizing proteins, which are essential for various cellular processes. They act as protein factories within cells and play a crucial role in maintaining the overall functionality of living organisms.
 [end of text]

llama_print_timings:        load time =     131.71 ms
llama_print_timings:      sample time =       6.35 ms /    41 runs   (    0.15 ms per token,  6453.64 tokens per second)
llama_print_timings: prompt eval time =     143.38 ms /    17 tokens (    8.43 ms per token,   118.56 tokens per second)
llama_print_timings:        eval time =    2548.84 ms /    40 runs   (   63.72 ms per token,    15.69 tokens per second)
llama_print_timings:       total time =    2711.62 ms
Log end

I thought the original issue I faced was caused either by my modifications or by Phi-1 itself. I wasn't sure, so I appreciate the feedback. After some further testing and prompt modifications, I realized I could get desirable output from Phi-1 as well.

@ggerganov I created a PR for consolidating the Phi models: #4552

@teleprint-me
Contributor

teleprint-me commented Dec 22, 2023

I realized after some light testing that the prompts for each model need to be adjusted accordingly.

For example, Phi-1 will respond with Python source code, which makes sense because that's how it was trained.

Phi-1.5 is more flexible and can be thought of as an improvement on Phi-1. So, prompting both models with the following command actually improves output:

./main -m stash/models/microsoft/phi-1/phi-1-q8_0.gguf --color -e -s 1337 -c 2048 -n 512 -p "Question: How to create a list of prime numbers in Python?\nAnswer:"

All 3 models are base models, so none of them have been fine-tuned for chat at all. I've been gathering and aggregating some custom data to build a LoRA, and I think this might be the perfect model to test it out with.

I don't know how long it's going to take me, though, because I've had to sacrifice development time to generate an income.

No LoRAs (that I know of) have been publicly released, so I was thinking about releasing mine if I succeed.

It will probably take a while because it's a personal, custom dataset, and I'll need to filter out my personal information since it's an aggregate of my past chats with other models. I originally estimated 2 weeks, and that was over 2 months ago.

Datasets are no joke! 😅 I think it'll be worth it though!

@teleprint-me
Contributor

Exciting news: support for the Phi-1 and Phi-1_5 models has just been added! While many may predominantly use Phi-2, this addition is fantastic because it broadens our range of models for experimentation, and they're all MIT licensed!

With this update, we now have GPT-2, TinyLlama, and various Phi models in our toolkit. 🥳

Currently, fine-tuning is tailored to Llama models, but I'm hopeful that we can develop a more inclusive approach that maintains backward compatibility. Imagine the possibilities if we could extend fine-tuning capabilities to these new models locally; it's an amazing prospect!

Looking forward to more collaborative innovations and experiments with these models!

@tom-adsfund

@teleprint-me Excellent news! Fine-tuning is essential for business/domain use: it reduces the communication cost (the user doesn't have to explain domain-specific terms, etc.) and therefore also the computation cost. Unfortunately, the finetune app only uses the CPU for running the models, so it's unnecessarily slow.

@teleprint-me
Contributor

teleprint-me commented Jan 11, 2024

Unfortunately, the finetune app only uses the CPU for running the models, so it's unnecessarily slow.

@tom-adsfund Yeah, I experienced it first-hand with llama-7b. I never counted the number of tokens I used for training (I should have, in retrospect) and then realized it was going to take a few days to complete.

There are some known issues with fine-tuning (#4703) and training from scratch (#4791). I'll be experimenting more with these in the coming months. I'm not sure whether back-propagation is implemented. I'm waiting for the ggml backend updates to get merged (#4766) and exploring the Vulkan backends in the meantime.

Any supported fine-tuning is done via LoRA (#4645). I remember another low-priority PR that's still hanging around, but I can't find it. Most of them haven't been active since July 2023.
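
For reference, invoking the LoRA finetune example looks roughly like this (the flags are from memory of the example's README and the paths are placeholders; check ./finetune --help, and note that fine-tuning is currently tailored to Llama-family models):

# Hedged sketch: train a LoRA adapter on top of a quantized Llama-family base model.
./finetune \
  --model-base models/open-llama-3b-v2-q8_0.gguf \
  --train-data data/train.txt \
  --lora-out lora-adapter.bin \
  --threads 8 --ctx 256 --batch 4 --adam-iter 64 \
  --save-every 10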

@tom-adsfund

@teleprint-me Thanks for the insights. Yeah, I didn't realize that one of the problems was with quantized models.

@walter-cavinaw

@teleprint-me I think you should continue with this work. These Phi models are superb, and it's a shame I can't use the simple convert script to get the GGUF format. I've tried using candle as well, but it requires making custom edits to the metadata.

@lukestanley

I think this should be closed, as the support is there and has been for a while. I've been happily using Phi 2 for some small tasks with a JSON grammar. It works as well as can be expected when used properly; it's very impressive for a tiny model! Let's close this?
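
For reference, a grammar-constrained run looks roughly like this (the model filename is a placeholder; json.gbnf is the grammar file shipped in the repo's grammars/ directory):

# Constrain Phi-2 output to valid JSON with a GBNF grammar.
./main -m models/phi-2/ggml-model-q4_k_m.gguf --grammar-file grammars/json.gbnf \
  -c 2048 -n 256 -e \
  -p "Extract the name and age from the sentence 'Alice is 30 years old.' as JSON.\n"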
