
Llama 3.2 3B Core ML poor output quality #9395

Open

@jakmro

🐛 Describe the bug

Description

I tried running Llama 3.2 3B Instruct, exported for Core ML, using the llama runner built for the Core ML backend. However, the model's output quality is poor.

Results

Running Llama 3.2 3B Instruct with the following prompt:

What's PyTorch?

returns this output:

regulatorships, and following sugarsigmaMA and pi(r)del of NOLa and UINextre(S) above alligiobscop to Loss and Gainsto Loss to Loss to Reggain and Loss to Regainstomatchaugh linethat, and thatis, that is.

Okay, let's start over. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean

Logs

(executorch) jakubmroz@Jakubs-MacBook-Pro executorch % cmake-out/examples/models/llama/llama_main --model_path=llama3.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?"
I 00:00:00.003438 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003486 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003493 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003498 executorch:cpuinfo_utils.cpp:100] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003500 executorch:cpuinfo_utils.cpp:116] CPU info and manual query on # of cpus dont match.
I 00:00:00.003502 executorch:main.cpp:69] Resetting threadpool with num threads = 0
I 00:00:00.003514 executorch:runner.cpp:65] Creating LLaMa runner: model_path=llama3.pte, tokenizer_path=llama323B/tokenizer.model
I 00:00:01.222029 executorch:runner.cpp:98] Reading metadata from model
I 00:00:01.222049 executorch:runner.cpp:123] Metadata: use_sdpa_with_kv_cache = 0
I 00:00:01.222052 executorch:runner.cpp:123] Metadata: use_kv_cache = 1
I 00:00:01.222053 executorch:runner.cpp:123] Metadata: get_vocab_size = 128256
I 00:00:01.222054 executorch:runner.cpp:123] Metadata: get_bos_id = 1
I 00:00:01.222056 executorch:runner.cpp:123] Metadata: get_max_seq_len = 128
I 00:00:01.222058 executorch:runner.cpp:123] Metadata: enable_dynamic_shape = 0
I 00:00:01.222059 executorch:runner.cpp:130] eos_id = 2
I 00:00:01.222065 executorch:runner.cpp:184] RSS after loading model: 0.000000 MiB (0 if unsupported)
What's PyTorch?II 00:00:01.738157 executorch:runner.cpp:254] RSS after prompt prefill: 0.000000 MiB (0 if unsupported)
regulatorships, and following sugarsigmaMA and pi(r)del of NOLa and UINextre(S) above alligiobscop to Loss and Gainsto Loss to Loss to Reggain and Loss to Regainstomatchaugh linethat, and thatis, that is.

Okay, let's start over. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean slate. Let's start fresh and clean. Let's start with a clean
I 00:00:11.833666 executorch:runner.cpp:268] RSS after finishing text generation: 0.000000 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":6,"generated_tokens":121,"model_load_start_ms":1742382949879,"model_load_end_ms":1742382951098,"inference_start_ms":1742382951098,"inference_end_ms":1742382961709,"prompt_eval_end_ms":1742382951614,"first_token_ms":1742382951614,"aggregate_sampling_time_ms":84,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:11.833679 executorch:stats.h:110] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:11.833680 executorch:stats.h:116] 	Model Load Time:		1.219000 (seconds)
I 00:00:11.833681 executorch:stats.h:126] 	Total inference time:		10.611000 (seconds)		 Rate: 	11.403261 (tokens/second)
I 00:00:11.833683 executorch:stats.h:134] 		Prompt evaluation:	0.516000 (seconds)		 Rate: 	11.627907 (tokens/second)
I 00:00:11.833684 executorch:stats.h:145] 		Generated 121 tokens:	10.095000 (seconds)		 Rate: 	11.986132 (tokens/second)
I 00:00:11.833685 executorch:stats.h:153] 	Time to first generated token:	0.516000 (seconds)
I 00:00:11.833686 executorch:stats.h:160] 	Sampling time over 127 tokens:	0.084000 (seconds)

Minimal Example

This is my command history:

conda create -yn executorch python=3.10.0
conda activate executorch

git clone --branch viable/strict https://github.com/pytorch/executorch.git
cd executorch

git submodule sync
git submodule update --init

./install_executorch.sh --pybind coreml mps xnnpack

./backends/apple/coreml/scripts/install_requirements.sh
./backends/apple/mps/install_requirements.sh

examples/models/llama/install_requirements.sh

python -m examples.models.llama.export_llama --checkpoint llama323B/consolidated.00.pth --params llama323B/params.json -kv --disable_dynamic_shape --coreml
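Note that, unlike the XNNPACK export in the Comparison section below, this command passes neither --model "llama3_2" nor a --metadata override, and the metadata logged above shows get_bos_id = 1 and eos_id = 2 rather than the Llama 3 token IDs. A variant I could try (assuming the Core ML path accepts the same flags as the XNNPACK export) is:

python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint llama323B/consolidated.00.pth --params llama323B/params.json -kv --disable_dynamic_shape --coreml --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'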

cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DEXECUTORCH_ENABLE_LOGGING=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_COREML=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-out .

cmake --build cmake-out -j16 --target install --config Release

git submodule update --init --recursive

cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_COREML=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release

cmake-out/examples/models/llama/llama_main --model_path=llama3.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?"
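To rule out sampling noise as a factor, I could also repeat the run with greedy decoding. If I read llama_main's flags correctly (treat this flag as an assumption to verify), a temperature of 0 makes generation deterministic:

cmake-out/examples/models/llama/llama_main --model_path=llama3.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?" --temperature=0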

Comparison

For comparison, I exported the same model (Llama 3.2 3B Instruct) for the XNNPACK backend using the following command:

python -m examples.models.llama.export_llama --model "llama3_2" --checkpoint llama323B/consolidated.00.pth --params llama323B/params.json -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"

When run in the llama runner with the same prompt:

What's PyTorch?

it produces the following output:

PyTorch is an open-source machine learning library developed by Facebook's AI Research Lab. It provides a dynamic computation graph, similar to TensorFlow, but with a more flexible and Pythonic API.

### Key Features

1.  **Dynamic Computation Graph**: PyTorch's computation graph is dynamic, which means that it can be modified at runtime. This allows for more flexibility and easier debugging.
2.  **Pythonic API**: PyTorch's API is designed to be Pythonic, which means that it's easy to use and understand. It's also compatible with Python 3.5

Complete logs for the XNNPACK example:

(executorch) jakubmroz@Jakubs-MacBook-Pro executorch % cmake-out/examples/models/llama/llama_main --model_path=llama3_2_bf16.pte --tokenizer_path=llama323B/tokenizer.model --prompt="What's PyTorch?"
I 00:00:00.003284 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003325 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003333 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003336 executorch:cpuinfo_utils.cpp:100] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003338 executorch:cpuinfo_utils.cpp:116] CPU info and manual query on # of cpus dont match.
I 00:00:00.003340 executorch:main.cpp:69] Resetting threadpool with num threads = 0
I 00:00:00.003350 executorch:runner.cpp:65] Creating LLaMa runner: model_path=llama3_2_bf16.pte, tokenizer_path=llama323B/tokenizer.model
I 00:00:00.697519 executorch:runner.cpp:98] Reading metadata from model
I 00:00:00.697534 executorch:runner.cpp:123] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:00.697536 executorch:runner.cpp:123] Metadata: use_kv_cache = 1
I 00:00:00.697538 executorch:runner.cpp:123] Metadata: get_vocab_size = 128256
I 00:00:00.697539 executorch:runner.cpp:123] Metadata: get_bos_id = 128000
I 00:00:00.697541 executorch:runner.cpp:123] Metadata: get_max_seq_len = 128
I 00:00:00.697542 executorch:runner.cpp:123] Metadata: enable_dynamic_shape = 1
I 00:00:00.697544 executorch:runner.cpp:130] eos_id = 128009
I 00:00:00.697545 executorch:runner.cpp:130] eos_id = 128001
I 00:00:00.697551 executorch:runner.cpp:184] RSS after loading model: 0.000000 MiB (0 if unsupported)
What's PyTorch?I 00:00:00.822303 executorch:text_prefiller.cpp:53] Prefill token result numel(): 128256
**

I 00:00:00.823610 executorch:runner.cpp:254] RSS after prompt prefill: 0.000000 MiB (0 if unsupported)
PyTorch is an open-source machine learning library developed by Facebook's AI Research Lab. It provides a dynamic computation graph, similar to TensorFlow, but with a more flexible and Pythonic API.

### Key Features

1.  **Dynamic Computation Graph**: PyTorch's computation graph is dynamic, which means that it can be modified at runtime. This allows for more flexibility and easier debugging.
2.  **Pythonic API**: PyTorch's API is designed to be Pythonic, which means that it's easy to use and understand. It's also compatible with Python 3.5
I 00:00:08.556003 executorch:runner.cpp:268] RSS after finishing text generation: 0.000000 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":6,"generated_tokens":121,"model_load_start_ms":1742383796391,"model_load_end_ms":1742383797086,"inference_start_ms":1742383797086,"inference_end_ms":1742383804944,"prompt_eval_end_ms":1742383797212,"first_token_ms":1742383797212,"aggregate_sampling_time_ms":137,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:08.556034 executorch:stats.h:110] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:08.556035 executorch:stats.h:116] 	Model Load Time:		0.695000 (seconds)
I 00:00:08.556036 executorch:stats.h:126] 	Total inference time:		7.858000 (seconds)		 Rate: 	15.398320 (tokens/second)
I 00:00:08.556037 executorch:stats.h:134] 		Prompt evaluation:	0.126000 (seconds)		 Rate: 	47.619048 (tokens/second)
I 00:00:08.556038 executorch:stats.h:145] 		Generated 121 tokens:	7.732000 (seconds)		 Rate: 	15.649250 (tokens/second)
I 00:00:08.556040 executorch:stats.h:153] 	Time to first generated token:	0.126000 (seconds)
I 00:00:08.556040 executorch:stats.h:160] 	Sampling time over 127 tokens:	0.137000 (seconds)

Versions

Collecting environment information...
PyTorch version: 2.7.0.dev20250311
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.3.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.6
Libc version: N/A

Python version: 3.10.0 (default, Mar 3 2022, 03:54:28) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-15.3.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M4 Pro

Versions of relevant libraries:
[pip3] executorch==0.6.0a0+993b36b
[pip3] executorchcoreml==0.0.1
[pip3] numpy==2.0.0
[pip3] torch==2.7.0.dev20250311
[pip3] torchao==0.10.0+git7d879462
[pip3] torchaudio==2.6.0.dev20250311
[pip3] torchsr==1.0.4
[pip3] torchvision==0.22.0.dev20250311
[conda] executorch 0.6.0a0+993b36b pypi_0 pypi
[conda] executorchcoreml 0.0.1 pypi_0 pypi
[conda] numpy 2.0.0 pypi_0 pypi
[conda] torch 2.7.0.dev20250311 pypi_0 pypi
[conda] torchao 0.10.0+git7d879462 pypi_0 pypi
[conda] torchaudio 2.6.0.dev20250311 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.22.0.dev20250311 pypi_0 pypi

cc @kimishpatel @YifanShenSZ @cymbalrush @metascroy
