🐛 Describe the bug
Description
I tried running Llama 3.2 3B Instruct, exported for Core ML, with the llama runner built for the Core ML backend. However, the model's output is garbled.
Results
Running Llama 3.2 3B Instruct with the following prompt:
What's PyTorch?
returns this output:
regulatorships, and following sugarsigmaMA and pi(r)del of NOLa and UINextre(S) above alligiobscop to Loss and Gainsto Loss to Loss to Reggain and Loss to Regainstomatchaugh linethat, and thatis, that is.
Okay, let's start over. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean
Logs
(executorch) jakubmroz@Jakubs-MacBook-Pro executorch % cmake-out/examples/models/llama/llama_main --model_path=llama3.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?"
I 00:00:00.003438 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003486 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003493 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003498 executorch:cpuinfo_utils.cpp:100] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003500 executorch:cpuinfo_utils.cpp:116] CPU info and manual query on # of cpus dont match.
I 00:00:00.003502 executorch:main.cpp:69] Resetting threadpool with num threads = 0
I 00:00:00.003514 executorch:runner.cpp:65] Creating LLaMa runner: model_path=llama3.pte, tokenizer_path=llama323B/tokenizer.model
I 00:00:01.222029 executorch:runner.cpp:98] Reading metadata from model
I 00:00:01.222049 executorch:runner.cpp:123] Metadata: use_sdpa_with_kv_cache = 0
I 00:00:01.222052 executorch:runner.cpp:123] Metadata: use_kv_cache = 1
I 00:00:01.222053 executorch:runner.cpp:123] Metadata: get_vocab_size = 128256
I 00:00:01.222054 executorch:runner.cpp:123] Metadata: get_bos_id = 1
I 00:00:01.222056 executorch:runner.cpp:123] Metadata: get_max_seq_len = 128
I 00:00:01.222058 executorch:runner.cpp:123] Metadata: enable_dynamic_shape = 0
I 00:00:01.222059 executorch:runner.cpp:130] eos_id = 2
I 00:00:01.222065 executorch:runner.cpp:184] RSS after loading model: 0.000000 MiB (0 if unsupported)
What's PyTorch?II 00:00:01.738157 executorch:runner.cpp:254] RSS after prompt prefill: 0.000000 MiB (0 if unsupported)
regulatorships, and following sugarsigmaMA and pi(r)del of NOLa and UINextre(S) above alligiobscop to Loss and Gainsto Loss to Loss to Reggain and Loss to Regainstomatchaugh linethat, and thatis, that is.
Okay, let's start over. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean
I 00:00:11.833666 executorch:runner.cpp:268] RSS after finishing text generation: 0.000000 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":6,"generated_tokens":121,"model_load_start_ms":1742382949879,"model_load_end_ms":1742382951098,"inference_start_ms":1742382951098,"inference_end_ms":1742382961709,"prompt_eval_end_ms":1742382951614,"first_token_ms":1742382951614,"aggregate_sampling_time_ms":84,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:11.833679 executorch:stats.h:110] Prompt Tokens: 6 Generated Tokens: 121
I 00:00:11.833680 executorch:stats.h:116] Model Load Time: 1.219000 (seconds)
I 00:00:11.833681 executorch:stats.h:126] Total inference time: 10.611000 (seconds) Rate: 11.403261 (tokens/second)
I 00:00:11.833683 executorch:stats.h:134] Prompt evaluation: 0.516000 (seconds) Rate: 11.627907 (tokens/second)
I 00:00:11.833684 executorch:stats.h:145] Generated 121 tokens: 10.095000 (seconds) Rate: 11.986132 (tokens/second)
I 00:00:11.833685 executorch:stats.h:153] Time to first generated token: 0.516000 (seconds)
I 00:00:11.833686 executorch:stats.h:160] Sampling time over 127 tokens: 0.084000 (seconds)
Minimal Example
This is my command history:
conda create -yn executorch python=3.10.0
conda activate executorch
git clone --branch viable/strict https://github.com/pytorch/executorch.git
cd executorch
git submodule sync
git submodule update --init
./install_executorch.sh --pybind coreml mps xnnpack
./backends/apple/coreml/scripts/install_requirements.sh
./backends/apple/mps/install_requirements.sh
./examples/models/llama/install_requirements.sh
python -m examples.models.llama.export_llama --checkpoint llama323B/consolidated.00.pth --params llama323B/params.json -kv --disable_dynamic_shape --coreml
cmake -DPYTHON_EXECUTABLE=python \
-DCMAKE_INSTALL_PREFIX=cmake-out \
-DEXECUTORCH_ENABLE_LOGGING=1 \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_COREML=ON \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-Bcmake-out .
cmake --build cmake-out -j16 --target install --config Release
git submodule update --init --recursive
cmake -DPYTHON_EXECUTABLE=python \
-DCMAKE_INSTALL_PREFIX=cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_COREML=ON \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-Bcmake-out/examples/models/llama \
examples/models/llama
cmake --build cmake-out/examples/models/llama -j16 --config Release
cmake-out/examples/models/llama/llama_main --model_path=llama3.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?"
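One control experiment that might help isolate the Core ML delegate (I have not run this yet): export with the same flags but without --coreml, so the model runs on the portable kernels, and feed the resulting .pte to the same runner. The sketch below reuses the --output_name flag from the XNNPACK command in the next section; llama3_portable.pte is just a placeholder name.
python -m examples.models.llama.export_llama --checkpoint llama323B/consolidated.00.pth --params llama323B/params.json -kv --disable_dynamic_shape --output_name="llama3_portable.pte"
cmake-out/examples/models/llama/llama_main --model_path=llama3_portable.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?"
If this export produces coherent text with otherwise identical flags, the regression is likely in the Core ML lowering rather than in the export configuration.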
Comparison
For comparison, I exported the same model (Llama 3.2 3B Instruct) for the XNNPACK backend with the following command:
python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint llama323B/consolidated.00.pth --params llama323B/params.json -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
When run in the llama runner with the same prompt:
What's PyTorch?
it produces the following output:
PyTorch is an open-source machine learning library developed by Facebook's AI Research Lab. It provides a dynamic computation graph, similar to TensorFlow, but with a more flexible and Pythonic API.
### Key Features
1. **Dynamic Computation Graph**: PyTorch's computation graph is dynamic, which means that it can be modified at runtime. This allows for more flexibility and easier debugging.
2. **Pythonic API**: PyTorch's API is designed to be Pythonic, which means that it's easy to use and understand. It's also compatible with Python 3.5
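Note that the two export commands differ in more than the backend flag: the Core ML command omits --model "llama3_2", -d bf16, --use_sdpa_with_kv_cache, and the --metadata override, and the Core ML runner logs accordingly report get_bos_id = 1 and eos_id = 2 instead of the Llama 3 token IDs (128000 / 128009, 128001) seen in the XNNPACK logs. A Core ML export that carries over the same model and metadata flags might look like this (untested; I am assuming these flags are backend-independent, and llama3_coreml.pte is just a placeholder name):
python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint llama323B/consolidated.00.pth --params llama323B/params.json -kv --disable_dynamic_shape --coreml --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_coreml.pte"
If the garbled output persists even with matching metadata, that would point at the Core ML lowering itself rather than at the export flags.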
Complete logs for the XNNPACK example:
(executorch) jakubmroz@Jakubs-MacBook-Pro executorch % cmake-out/examples/models/llama/llama_main --model_path=llama3_2_bf16.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?"
I 00:00:00.003284 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003325 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003333 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003336 executorch:cpuinfo_utils.cpp:100] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003338 executorch:cpuinfo_utils.cpp:116] CPU info and manual query on # of cpus dont match.
I 00:00:00.003340 executorch:main.cpp:69] Resetting threadpool with num threads = 0
I 00:00:00.003350 executorch:runner.cpp:65] Creating LLaMa runner: model_path=llama3_2_bf16.pte, tokenizer_path=llama323B/tokenizer.model
I 00:00:00.697519 executorch:runner.cpp:98] Reading metadata from model
I 00:00:00.697534 executorch:runner.cpp:123] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:00.697536 executorch:runner.cpp:123] Metadata: use_kv_cache = 1
I 00:00:00.697538 executorch:runner.cpp:123] Metadata: get_vocab_size = 128256
I 00:00:00.697539 executorch:runner.cpp:123] Metadata: get_bos_id = 128000
I 00:00:00.697541 executorch:runner.cpp:123] Metadata: get_max_seq_len = 128
I 00:00:00.697542 executorch:runner.cpp:123] Metadata: enable_dynamic_shape = 1
I 00:00:00.697544 executorch:runner.cpp:130] eos_id = 128009
I 00:00:00.697545 executorch:runner.cpp:130] eos_id = 128001
I 00:00:00.697551 executorch:runner.cpp:184] RSS after loading model: 0.000000 MiB (0 if unsupported)
What's PyTorch?I 00:00:00.822303 executorch:text_prefiller.cpp:53] Prefill token result numel(): 128256
**
I 00:00:00.823610 executorch:runner.cpp:254] RSS after prompt prefill: 0.000000 MiB (0 if unsupported)
PyTorch is an open-source machine learning library developed by Facebook's AI Research Lab. It provides a dynamic computation graph, similar to TensorFlow, but with a more flexible and Pythonic API.
### Key Features
1. **Dynamic Computation Graph**: PyTorch's computation graph is dynamic, which means that it can be modified at runtime. This allows for more flexibility and easier debugging.
2. **Pythonic API**: PyTorch's API is designed to be Pythonic, which means that it's easy to use and understand. It's also compatible with Python 3.5
I 00:00:08.556003 executorch:runner.cpp:268] RSS after finishing text generation: 0.000000 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":6,"generated_tokens":121,"model_load_start_ms":1742383796391,"model_load_end_ms":1742383797086,"inference_start_ms":1742383797086,"inference_end_ms":1742383804944,"prompt_eval_end_ms":1742383797212,"first_token_ms":1742383797212,"aggregate_sampling_time_ms":137,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:08.556034 executorch:stats.h:110] Prompt Tokens: 6 Generated Tokens: 121
I 00:00:08.556035 executorch:stats.h:116] Model Load Time: 0.695000 (seconds)
I 00:00:08.556036 executorch:stats.h:126] Total inference time: 7.858000 (seconds) Rate: 15.398320 (tokens/second)
I 00:00:08.556037 executorch:stats.h:134] Prompt evaluation: 0.126000 (seconds) Rate: 47.619048 (tokens/second)
I 00:00:08.556038 executorch:stats.h:145] Generated 121 tokens: 7.732000 (seconds) Rate: 15.649250 (tokens/second)
I 00:00:08.556040 executorch:stats.h:153] Time to first generated token: 0.126000 (seconds)
I 00:00:08.556040 executorch:stats.h:160] Sampling time over 127 tokens: 0.137000 (seconds)
Versions
Collecting environment information...
PyTorch version: 2.7.0.dev20250311
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.3.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.6
Libc version: N/A
Python version: 3.10.0 (default, Mar 3 2022, 03:54:28) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-15.3.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M4 Pro
Versions of relevant libraries:
[pip3] executorch==0.6.0a0+993b36b
[pip3] executorchcoreml==0.0.1
[pip3] numpy==2.0.0
[pip3] torch==2.7.0.dev20250311
[pip3] torchao==0.10.0+git7d879462
[pip3] torchaudio==2.6.0.dev20250311
[pip3] torchsr==1.0.4
[pip3] torchvision==0.22.0.dev20250311
[conda] executorch 0.6.0a0+993b36b pypi_0 pypi
[conda] executorchcoreml 0.0.1 pypi_0 pypi
[conda] numpy 2.0.0 pypi_0 pypi
[conda] torch 2.7.0.dev20250311 pypi_0 pypi
[conda] torchao 0.10.0+git7d879462 pypi_0 pypi
[conda] torchaudio 2.6.0.dev20250311 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.22.0.dev20250311 pypi_0 pypi