remove cfg smooth factor #2280

Merged
merged 1 commit into ggerganov:master on Jul 21, 2023

Conversation

Vermeille (Contributor) commented Jul 19, 2023

MPKonst shows here that it is only a reparameterization of the guidance scale. Thus we remove it in order to:

  1. better align with the paper
  2. align with the upcoming huggingface implementation
  3. remove a useless hyperparameter

Related: #2083 #2217 #2135
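For reference, a minimal NumPy sketch of MPKonst's point (the formulas here are illustrative assumptions, not the llama.cpp code): if the guided logits have the form neg + scale * (base - neg), then mixing them back with the unguided logits using a smooth factor s is exactly plain guidance with the rescaled strength 1 + s * (scale - 1).

```python
# Illustrative sketch only, not the llama.cpp implementation.
# Assumptions: guided logits = neg + g * (base - neg), and the "smooth factor" s
# linearly interpolates between the unguided and the guided logits.
import numpy as np

rng  = np.random.default_rng(0)
base = rng.normal(size=32000)   # logits from the regular prompt
neg  = rng.normal(size=32000)   # logits from the negative (CFG) prompt

g, s = 4.0, 0.7                 # guidance scale and smooth factor (example values)

guided   = neg + g * (base - neg)          # plain CFG with scale g
smoothed = s * guided + (1.0 - s) * base   # CFG followed by the smooth factor

g_prime = 1.0 + s * (g - 1.0)              # the same guidance, reparameterized
print(np.allclose(smoothed, neg + g_prime * (base - neg)))   # True
```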

SlyEcho (Collaborator) commented Jul 19, 2023

I will not add this parameter in #2217 then.

SlyEcho (Collaborator) left a comment

Seems to give a similar result.

bullno1 (Contributor) commented Jul 19, 2023

I noticed that in the original code there was another pass to log_softmax before the blending.

It's not here anymore.
Does it actually change anything?

I guess intuitively, log cancels out exponential.

Vermeille (Contributor, Author) replied:

Does it actually change anything?

This last log_softmax actually cancels out the two previous log_softmax calls. It is not strictly equivalent, but it can be removed (and the computation then boils down to the original form used in the quantitative experiments throughout the paper).
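A small sketch of why it can be dropped (again, not the llama.cpp code): log_softmax only subtracts a single per-distribution constant (the log-sum-exp) from the logits, so the softmax distribution that is sampled from stays the same, at least for samplers that are shift-invariant.

```python
# Sketch: log_softmax(z) = z - logsumexp(z) only shifts every logit by the same
# constant, so the resulting softmax distribution is identical with or without it.
import numpy as np

def log_softmax(z):
    return z - z.max() - np.log(np.sum(np.exp(z - z.max())))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

blended = np.random.default_rng(1).normal(size=10) * 3.0  # stand-in for blended CFG logits
print(np.allclose(softmax(blended), softmax(log_softmax(blended))))  # True
```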

SlyEcho (Collaborator) commented Jul 20, 2023

I guess intuitively, log cancels out exponential.

It is not exactly the same, because softmax maps everything into the range $[0, 1]$, which means that taking the logarithm makes everything negative. Not sure if it affects the sampling, though.

Applying the function twice does not change the output, so this PR would have the same effect as using the value 1.0 for the "smooth factor".
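A quick numerical check of both statements (a sketch, not the C++ code): the outputs of log_softmax are log-probabilities, so they are all negative (or zero), and applying the function a second time returns them unchanged.

```python
# Sketch: log_softmax output is non-positive (log of values in [0, 1]) and the
# function is idempotent, so a second application changes nothing.
import numpy as np

def log_softmax(z):
    return z - z.max() - np.log(np.sum(np.exp(z - z.max())))

z  = np.random.default_rng(2).normal(size=10) * 5.0
lp = log_softmax(z)

print(np.all(lp <= 0.0))                 # True: these are log-probabilities
print(np.allclose(log_softmax(lp), lp))  # True: applying it twice changes nothing
```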

ghost commented Jul 20, 2023

Hi, thanks for your work on this. It appears to work as expected:

~/nosmooth (Vermeille/master) [1]  ./main -m ~/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin --color --keep -1 -c 2048 --mirostat 2 --verbose-prompt --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude." --in-prefix "USER: " --in-suffix "ASSISTANT:" --reverse-prompt "USER:" --interactive --interactive-first --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." --cfg-scale 4 -t 3 -b 7
main: build = 853 (1e78b1b)
main: seed  = 1689859149
llama.cpp: loading model from /data/data/com.termux/files/home/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5287.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: kv self size  = 1024.00 MB

system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

main: prompt: ' A chat between a curious user and an artificial intelligence assistant. The assistant is rude.'
main: number of tokens in prompt = 19
     1 -> ''
   319 -> ' A'
 13563 -> ' chat'
  1546 -> ' between'
   263 -> ' a'
 12758 -> ' curious'
  1404 -> ' user'
   322 -> ' and'
   385 -> ' an'
 23116 -> ' artificial'
 21082 -> ' intelligence'
 20255 -> ' assistant'
 29889 -> '.'
   450 -> ' The'
 20255 -> ' assistant'
   338 -> ' is'
   364 -> ' r'
  1151 -> 'ude'
 29889 -> '.'

main: negative prompt: ' A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.'
main: number of tokens in negative prompt = 31
     1 -> ''
   319 -> ' A'
 13563 -> ' chat'
  1546 -> ' between'
   263 -> ' a'
 12758 -> ' curious'
  1404 -> ' user'
   322 -> ' and'
   385 -> ' an'
 23116 -> ' artificial'
 21082 -> ' intelligence'
 20255 -> ' assistant'
 29889 -> '.'
   450 -> ' The'
 20255 -> ' assistant'
  4076 -> ' gives'
  8444 -> ' helpful'
 29892 -> ','
 13173 -> ' detailed'
 29892 -> ','
   322 -> ' and'
  1248 -> ' pol'
   568 -> 'ite'
  6089 -> ' answers'
   304 -> ' to'
   278 -> ' the'
  1404 -> ' user'
 29915 -> '''
 29879 -> 's'
  5155 -> ' questions'
 29889 -> '.'

main: static prompt based on n_keep: ' A chat between a curious user and an artificial intelligence assistant. The assistant is rude.'

main: interactive mode on.
Reverse prompt: 'USER:'
Input prefix: 'USER: '
Input suffix: 'ASSISTANT:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 19
                                                  
== Running in interactive mode. ==                 
- Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.        
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.                                                                                        

A chat between a curious user and an artificial intelligence assistant. The assistant is rude.

USER: Hello, what's your name?
ASSISTANT: It's none of your business.
                                                   
llama_print_timings:        load time =  2981.82 ms
llama_print_timings:      sample time =   101.34 ms /     9 runs   (   11.26 ms per token,    88.81 tokens per second)
llama_print_timings: prompt eval time = 13521.08 ms /    33 tokens (  409.73 ms per token,     2.44 tokens per second)
llama_print_timings:        eval time =  4056.78 ms /    10 runs   (  405.68 ms per token,     2.47 tokens per second)
llama_print_timings:       total time = 71070.74 ms

On another note, setting --cfg-scale increases RAM usage. Is that correct? The KV cache is doubled on my device:

llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: kv self size  = 1024.00 MB

Though this may be expected behavior, since the negative prompt needs to be taken into consideration.

Thank you.

SlyEcho (Collaborator) commented Jul 20, 2023

kv cache is doubled on my device:

Yes. Because it's generating two sequences in parallel, both need their own cache; otherwise there would be too much re-evaluation for every token.
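As a back-of-envelope check (assuming the default f16 cache for K and V), the numbers in the log above add up: each context needs 2 × n_layer × n_ctx × n_embd × 2 bytes, and with a negative prompt there are two such contexts.

```python
# Rough KV cache size for the log above (n_layer=32, n_ctx=2048, n_embd=4096, f16 assumed).
n_layer, n_ctx, n_embd = 32, 2048, 4096
bytes_per_elem = 2                                         # f16
kv_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_elem   # K and V tensors
print(kv_bytes / 1024**2)        # 1024.0 MB per context
print(2 * kv_bytes / 1024**2)    # 2048.0 MB with the extra CFG context
```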

ggerganov merged commit ab0e26b into ggerganov:master on Jul 21, 2023