Bug: quantized gemma 27b output still wrong after tokenizer fix and soft capping

### What happened?

The quantized version of Gemma 27B (Q8_0) still gives wrong answers to even simple problems.
The version of Gemma on AI Studio answers all my questions correctly.

Here is an example problem that the quantized Gemma consistently fails, while the AI Studio Gemma answers it correctly:
```
Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally.
Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?
```
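The correct answer is 7 or 8: after the first discard, 10 apples and 10 oranges remain, and a quarter of those 20 fruits is 5, which cannot be split evenly, so either 2 or 3 apples are discarded.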
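
The arithmetic behind that expected answer can be checked with a short sketch. The function name `remaining_apples` is made up for illustration, and the sketch assumes "equally" means each discard is split evenly between apples and oranges:

```python
import math
from fractions import Fraction

def remaining_apples(apples=20, oranges=20):
    """Apply both discards exactly, splitting each one evenly between the two fruits."""
    # First discard: half of all fruits, shared equally by apples and oranges.
    discard = Fraction(apples + oranges, 2)
    apples_left = apples - discard / 2
    oranges_left = oranges - discard / 2
    # Second discard: a quarter of what remains, again shared equally.
    discard = (apples_left + oranges_left) / 4
    return apples_left - discard / 2

exact = remaining_apples()
print(exact)                                 # 15/2
print(math.floor(exact), math.ceil(exact))   # 7 8
```

The exact result is 7.5 apples, so a whole-fruit answer has to round to 7 or 8.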

I also tried asking the model to repeat the question by prepending "Repeat the question and then answer it: ".
The model in llama.cpp fails this simple task, while the model in AI Studio repeats the question word for word.

I noticed that the AI Studio response starts with
`Here's how to solve the...`
while the response when run in llama.cpp starts with
`Here's how to solve this...`

So I printed the top-5 token probabilities from llama.cpp for each generated token; the output is below. I would have expected a much higher probability for "the" than for "this", even after quantization:
```
<bos><start_of_turn>user
Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally.
Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?<end_of_turn>
<start_of_turn>model
[('Here', 1.0), ('This', 0.0), ('1', 0.0), ('**', 0.0), ('It', 0.0)]
[("'", 1.0), (' is', 0.0), (' are', 0.0), (' how', 0.0), (' we', 0.0)]
[('s', 1.0), ('d', 0.0), ('st', 0.0), ('ss', 0.0), ('re', 0.0)]
[(' how', 1.0), (' a', 0.0), (' the', 0.0), (' why', 0.0), (' step', 0.0)]
[(' to', 1.0), (' we', 0.0), (' this', 0.0), (' the', 0.0), (' you', 0.0)]
[(' solve', 1.0), (' break', 0.0), (' figure', 0.0), (' work', 0.0), (' breakdown', 0.0)]
[(' this', 1.0), (' the', 0.0), (' problem', 0.0), (' that', 0.0), (' these', 0.0)]
[(' problem', 1.0), (' word', 0.0), (':', 0.0), (' riddle', 0.0), (' puzzle', 0.0)]
```
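
One thing worth ruling out when reading these numbers: if the printed probabilities are computed after temperature scaling (an assumption about this setup, not something confirmed here), then a temperature of 0.01 would push nearly any top token to 1.0 and everything else to 0.0, regardless of how close the raw logits are. A minimal sketch of that effect, using made-up logits rather than measured values:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for " this" and " the"; these are NOT values from the model.
logits = [10.2, 10.0]

at_t1 = softmax_with_temperature(logits, 1.0)
at_t001 = softmax_with_temperature(logits, 0.01)

print([round(p, 3) for p in at_t1])    # [0.55, 0.45]
print([round(p, 3) for p in at_t001])  # [1.0, 0.0]
```

Under that assumption, a near tie between " this" and " the" would look identical to a confident prediction in the output above, so comparing at temperature 1.0 would be more informative.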

Here is the setup:
- model version: bartowski gemma-27b-it at Q8_0, after the tokenizer fix
- llama-server: 3264 (upstream version after merge)
- inference parameters: temperature = 0.01, seed = 0
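
For anyone trying to reproduce this, a request along these lines should be close to the setup above. This is a sketch under assumptions, not the exact client used for the output in this report: the helper names, the URL and port, the value of `n_predict`, and leaving `<bos>` out of the prompt (on the assumption that the server adds it) are all illustrative. `temperature`, `seed`, `n_probs` and `n_predict` are parameters of the llama.cpp server `/completion` endpoint.

```python
import json
import urllib.request

QUESTION = (
    "Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally.\n"
    "Then he discards a quarter of his fruits equally between apples and oranges. "
    "How many apples remain?"
)

def format_gemma_prompt(question):
    """Wrap the question in the Gemma turn markers shown in the logs (without <bos>)."""
    return f"<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"

def build_payload(question, n_probs=5, n_predict=8):
    """Build a /completion request body with the inference parameters from this report."""
    return {
        "prompt": format_gemma_prompt(question),
        "temperature": 0.01,
        "seed": 0,
        "n_probs": n_probs,      # ask for the top-N token probabilities per step
        "n_predict": n_predict,  # assumed value: only the first few tokens are needed
    }

def post_completion(payload, url="http://localhost:8080/completion"):
    """Send the payload to a running llama-server; the URL is an assumption."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

payload = build_payload(QUESTION)
print(json.dumps(payload, indent=2))
```

Calling `post_completion(payload)` against a running server returns the parsed JSON response; the sketch deliberately does not parse it, because the layout of the probability fields depends on the server build.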


### Name and Version

$ ./llama-cli --version
version: 3264 (09a5534f)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu


### What operating system are you seeing the problem on?

Linux

### Relevant log output

```shell
<bos><start_of_turn>user
Repeat the question and then answer it: Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally. Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?<end_of_turn>
<start_of_turn>model
**Question:** Matteo has 20 apples. He buys 20 oranges. Then he discards a quarter of his fruits equally between apples and oranges. How many oranges remain?

**Answer:** Here's how to solve this problem:

* **Step 1: Understand the Problem**

The problem states that Matteo discards a quarter of his total fruit, not just a quarter of the oranges.

* **Step 2: Calculate Total Fruit**

Matteo starts with 20 apples.

* **Step 3: Calculate the Split**

After buying 20 oranges, he has 20 + 20 = 40 fruits total.

* **Step 4: Calculate the Discard**

Half of his total fruit is 40 / 2 = 20 fruits.

* **Step 5: Calculate the Remaining Fruit**

Since he discards a quarter of his fruit, he has 40 / 4 = 10 fruits discarded.

* **Step 6: Calculate the Remaining Apples**

He discards 10 fruits / 2 = 5 apples.

* **Answer:**  Therefore, 5 apples are discarded.

Let me know if you'd like to know how many apples and oranges Matteo has left! 🍎🍊
```


Bug: quantized gemma 27b output still wrong after tokenizer fix and soft capping #8183
