Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GGML model #617

Merged
merged 18 commits into from
Nov 3, 2023
Merged

Add GGML model #617

merged 18 commits into from
Nov 3, 2023

Conversation

matthoffner
Copy link

@matthoffner matthoffner commented Jun 26, 2023

This PR adds initial ggml model support based on the llama-cpp-python server. This should support the majority of models, with the potential of using a library like ctransformers useful for ggml models that do no work with llama.cpp.

@CLAassistant
Copy link

CLAassistant commented Jun 26, 2023

CLA assistant check
All committers have signed the CLA.

@matthoffner matthoffner mentioned this pull request Jun 26, 2023
@matthoffner matthoffner changed the title Draft: Add Llama model Draft: Add GGML model Jun 26, 2023
lm_eval/models/ggml.py Outdated Show resolved Hide resolved
lm_eval/models/ggml.py Outdated Show resolved Hide resolved
lm_eval/models/ggml.py Outdated Show resolved Hide resolved
@haileyschoelkopf
Copy link
Collaborator

Thanks so much for the contribution so far! Looks great so far. Left some comments which may hopefully be helpful, and also just noting that loglikelihood_rolling is still todo.

@matthoffner
Copy link
Author

matthoffner commented Jun 27, 2023

Thanks for the comments @haileyschoelkopf I'm looking into implementing them today. I will also look into loglikelihood_rolling.

Update: I've implemented a function using loglikelihood_rolling with a test.

@haileyschoelkopf
Copy link
Collaborator

Will try to review this tomorrow! Looks great, thanks so much!

From a quick skim, I think there are just a couple minor things extant re: ensuring things break more than just logging an error message, if there's no/malformed response from the server. I can probably go and add those if need be.

A few-line pointer to add to the library README on what GGML apis are supported + a pointer on (where to go in order to) spin up an inference server for models would be awesome if not too much work!

@matthoffner matthoffner changed the title Draft: Add GGML model Add GGML model Jul 4, 2023
@matthoffner
Copy link
Author

Thanks @haileyschoelkopf! I updated the README with the command I'm using locally. I'm pretty new to evaluating models so I may be missing some details still.

@haileyschoelkopf
Copy link
Collaborator

Thanks! LLaMA-CPP is surprisingly easy to install... do you have any recommendations for models + tasks to test with GGML?

Running the 8bit LLaMA from here: https://huggingface.co/TheBloke/LLaMa-7B-GGML gives me 0.2504 accuracy on Hellaswag, which is I think way worse than I'd hope from an 8-bit quantized LLaMA-7b.

@matthoffner
Copy link
Author

matthoffner commented Jul 7, 2023

Thanks @haileyschoelkopf I'm getting similar results.

Update: I took out the reordering code which seemed to prevent the script from running. Pushed changes if anyone else wants to try.

@IgnacioFDM
Copy link

Seems to be broken

I tried 85d3b8d and it fails to run

Running loglikelihood requests
  0%|                                                                                | 0/40168 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/idm/lm-evaluation-harness/main.py", line 93, in <module>
    main()
  File "/home/idm/lm-evaluation-harness/main.py", line 59, in main
    results = evaluator.simple_evaluate(
  File "/home/idm/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/home/idm/lm-evaluation-harness/lm_eval/evaluator.py", line 94, in simple_evaluate
    results = evaluate(
  File "/home/idm/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/home/idm/lm-evaluation-harness/lm_eval/evaluator.py", line 289, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/home/idm/lm-evaluation-harness/lm_eval/base.py", line 893, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File "/home/idm/lm-evaluation-harness/lm_eval/models/ggml.py", line 47, in loglikelihood
    response = self.ggml_completion(self.base_url, context=context, continuation=continuation)
TypeError: GGMLLM.ggml_completion() got multiple values for argument 'context'

I tried again with the previous commit 0fa2c7d and it runs but it returns a clearly incorrect result after doing a single completion (it finishes instantly instead of taking ~30min as you'd expect hellaswag, and returns the following results).

|  Task   |Version| Metric |Value |   |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag|      0|acc     |0.2504|±  |0.0043|
|         |       |acc_norm|0.2409|±  |0.0043|

@matthoffner matthoffner marked this pull request as draft July 9, 2023 02:58
@matthoffner
Copy link
Author

Thanks @IgnacioFDM I'll look into a regression there. The results I'm getting are even worse although I'm able to run it for longer iterations now.

@JettScythe
Copy link

just tried this out and getting the same issue as @IgnacioFDM.
Any luck getting this going @matthoffner ?

@emmatyping
Copy link
Contributor

It seems llama.cpp is moving to a new GGUF format, so it might be good to change the name to GGUFLM

@matthoffner matthoffner marked this pull request as ready for review August 29, 2023 05:02
@StellaAthena
Copy link
Member

@matthoffner Can you post some evals run with the patched implementation, ideally with non-quantized #s for comparison?

@matthoffner
Copy link
Author

@matthoffner Can you post some evals run with the patched implementation, ideally with non-quantized #s for comparison?

Will try to revisit this week, if anyone else can confirm that would be great too, thanks.

@StellaAthena
Copy link
Member

@matthoffner any luck?

@matthoffner
Copy link
Author

@StellaAthena No luck. It runs without crashing but the results don't seem accurate.

@StellaAthena
Copy link
Member

@StellaAthena No luck. It runs without crashing but the results don't seem accurate.

Hmmm. That's unfortunate, though it could be that the quantization causes performance degradation.

Try manually inspecting a few of the results (esp. the incorrect ones) and then try running the same prompts through the library's native interface. Do you get the same logits when using the model through our library?

@LorenzoMinto
Copy link

LorenzoMinto commented Oct 31, 2023

hey @matthoffner opened a PR against your fork: matthoffner#1 changing how the loglikelihood for completions is computed, following more closely what is done for the gpt3 and huggingface models.

For llama-2-7b-hf.gguf on winogrande I'm getting something close to what reported in the paper (there might some source of imprecision left). Before I was getting ~0.5 accuracy.

|   Task   |Version|Metric|Value |   |Stderr|
|----------|------:|------|-----:|---|-----:|
|winogrande|      0|acc   |0.6977|±  |0.0129|

Return score from continuation logprobs
@matthoffner
Copy link
Author

Thanks @LorenzoMinto I merged your PR

@LorenzoMinto
Copy link

LorenzoMinto commented Nov 1, 2023

This is the eval for arc-easy. Also slightly overestimated (in the paper they report 75.2), but quite close

{
  "results": {
    "arc_easy": {
      "acc": 0.7567340067340067,
      "acc_stderr": 0.008804009846865538,
      "acc_norm": 0.7373737373737373,
      "acc_norm_stderr": 0.009029861776763754
    }
  },
  "versions": {
    "arc_easy": 0
  },
  ...
}

@StellaAthena
Copy link
Member

These look good enough to merge to me, unless we want to confirm with other models and more datasets first.

@StellaAthena StellaAthena added this to the v0.3.0 milestone Nov 1, 2023
Copy link
Collaborator

@haileyschoelkopf haileyschoelkopf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you everyone for the contributions! The posted results look good enough to merge. I will test this locally myself tomorrow, also left a few nitpicks which I can incorporate prior to merge (in particular, we should avoid silently continuing when the model does not return logprobs as expected.)

lm_eval/models/ggml.py Outdated Show resolved Hide resolved
lm_eval/models/ggml.py Outdated Show resolved Hide resolved
lm_eval/models/ggml.py Outdated Show resolved Hide resolved
@haileyschoelkopf haileyschoelkopf merged commit 008fc2a into EleutherAI:master Nov 3, 2023
@StellaAthena
Copy link
Member

@haileyschoelkopf Let's update the README to reflect this too!

@haileyschoelkopf
Copy link
Collaborator

Will do! Already done for the refactor branch PR

@StellaAthena
Copy link
Member

@haileyschoelkopf are you sure? It doesn't look like it.

@haileyschoelkopf
Copy link
Collaborator

In #967 (not yet merged) there’s the updated model support table with GGUF

@StellaAthena
Copy link
Member

In #967 (not yet merged) there’s the updated model support table with GGUF

Sorry, I shouldn't try to read things at midnight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants