Code for the ICML 2025 paper *Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective*. It provides state-of-the-art methods that leverage Large Language Models (LLMs) and algorithmic sampling strategies to reach a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, and includes reproducible experiments, solution evaluation tools, and detailed instructions for running the experiments from the paper efficiently on accessible hardware.
Both our models from the paper are available for download from Huggingface:
- NeMo-Minitron-8B model (71.6% on the public ARC-AGI evaluation set)
- Llama-3.2-3B model (61.4% on the public ARC-AGI evaluation set)
The primary entry points for evaluating a model on an ARC dataset are named in the format `evaluation_[model]_on_[dataset]_[sampling_strategy]_with_ttt.py`.
These scripts automatically download the model from Huggingface and then process each task in the dataset sequentially, performing test-time training followed by candidate generation and scoring. Model outputs are cached using `diskcache`, allowing quick re-runs with identical settings.
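Caching works roughly as in the following sketch; the cache directory and the `generate_candidates` function are illustrative placeholders, not the repository's actual interface:

```python
from diskcache import Cache

cache = Cache("cache/model_outputs")   # hypothetical cache directory

@cache.memoize()
def generate_candidates(task_id: str, settings: tuple) -> list:
    # Stand-in for the expensive model call; a re-run with identical
    # arguments is served from the on-disk cache instead of being recomputed.
    return [f"candidate for {task_id} with settings {settings}"]

print(generate_candidates("00576224", (("max_tokens", 512),)))
```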
Our evaluation code requires the `unsloth` and `diskcache` packages to be installed. Running the Sudoku evaluation additionally requires the Sudoku-3m dataset to be downloaded and the `pandas` package to be installed.
To re-run the initial finetuning of our NeMo-Minitron-8B model, execute the `finetune_NemoReArc1200.py` script. The training process requires 20,000 examples per task from the ReArc dataset, which must be generated in advance using Michael Hodel's ReArc code and placed under `input/re_arc`. The training code for our weaker Llama-3.2-3B model can be found in our older Kaggle ARC Prize 2024 GitHub repository.
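If you want to sanity-check the generated data before training, a sketch like the one below can help. It assumes one JSON file per task under `input/re_arc/tasks`, each holding a list of `{"input", "output"}` example pairs; verify this against the layout your ReArc run actually produces:

```python
import json
from pathlib import Path

# Assumed layout (check against your ReArc output): one JSON file per ARC
# task under input/re_arc/tasks/, each a list of {"input": ..., "output": ...}.
re_arc_dir = Path("input/re_arc/tasks")

counts = {}
for task_file in sorted(re_arc_dir.glob("*.json")):
    with open(task_file) as f:
        counts[task_file.stem] = len(json.load(f))

# The finetuning run expects 20,000 generated examples per task.
short = {key: n for key, n in counts.items() if n < 20_000}
print(f"{len(counts)} tasks found, {len(short)} with fewer than 20,000 examples")
```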
Please ensure the `unsloth` package is installed before running our training code. All our models were initially finetuned on a single Nvidia H100 GPU. If you encounter memory problems, consider reducing the batch size and/or the `max_tokens` value. Using a batch size of 2 should allow finetuning `Mistral-NeMo-Minitron-8B-Base` on GPUs with 24 GB of memory.
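Roughly speaking, these two knobs map onto the training configuration as in this illustrative snippet (the argument names here are generic; check the finetuning script for the actual names):

```python
from transformers import TrainingArguments

# Illustrative values only; the actual script may name these differently.
training_args = TrainingArguments(
    output_dir="output/finetune",
    per_device_train_batch_size=2,   # a batch size of 2 fits an 8B model on 24 GB
    gradient_accumulation_steps=8,   # keeps the effective batch size reasonable
)
max_tokens = 4096                    # shorter sequences reduce activation memory
```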
Here is a rough overview of our files and classes:
- Purpose: Handles all data formatting and loading
- Capabilities:
  - Class `ArcDataset`, which handles all dataset-related tasks, e.g.:
    - Building datasets from various sources.
    - Modifying, shuffling, and augmenting examples (see the augmentation sketch after this list).
    - Splitting, sorting, and filtering examples.
    - Handling dataset keys, challenges, and solutions.
    - Preparing the data for tokenization.
    - Creating and verifying submissions.
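Augmentation is central to the product-of-experts approach: each task is shown to the model under several transformed perspectives. The following is a minimal, self-contained sketch of typical grid augmentations (color permutation, rotation, transposition); it only illustrates the idea and is not the `ArcDataset` implementation:

```python
import random

def rotate90(grid):
    """Rotate a grid (list of lists of ints) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

def permute_colors(grid, perm):
    """Apply a color permutation; perm[c] is the new color for c."""
    return [[perm[c] for c in row] for row in grid]

def augment_example(example, rng=random):
    """Produce one augmented view of an {'input': ..., 'output': ...} pair,
    applying the same transformation to both grids."""
    perm = list(range(10))
    rng.shuffle(perm)
    n_rot = rng.randrange(4)
    do_transpose = rng.random() < 0.5
    view = {}
    for key, grid in example.items():
        grid = permute_colors(grid, perm)
        for _ in range(n_rot):
            grid = rotate90(grid)
        if do_transpose:
            grid = transpose(grid)
        view[key] = grid
    return view
```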
- Purpose: Contains code for loading, saving, and manipulating models
- Capabilities:
  - Load and save models and LoRA adapters
  - Shrink tokenizer and embedding layers
  - Data collator for masking the task inputs and the first output (a generic masking sketch follows this list)
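The collator's job is to compute the loss only on the tokens the model should produce. A minimal sketch of that mechanism (masking prompt tokens with `-100` so the loss ignores them) is shown below; the repository's collator additionally masks the first output, and its interface differs from this simplified version:

```python
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_lengths: list[int],
                       ignore_index: int = -100) -> torch.Tensor:
    """Return labels with the prompt part of each sequence masked out.

    input_ids:      (batch, seq_len) padded token ids
    prompt_lengths: number of prompt tokens per sequence; only tokens after
                    this point contribute to the loss.
    """
    labels = input_ids.clone()
    for i, n_prompt in enumerate(prompt_lengths):
        labels[i, :n_prompt] = ignore_index
    return labels
```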
- Purpose: Contains tools for inference and scoring
- Capabilities:
  - Inference code, including our custom depth-first search (DFS) sampling (sketched after this list)
  - Score calculation
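The DFS explores the token tree depth-first and keeps every completion whose cumulative probability stays above a threshold, rather than sampling a fixed number of outputs. The sketch below captures the idea against a generic `next_token_logprobs(prefix)` callable (an assumed interface, not the repository's); the real implementation operates on batched model logits and caches:

```python
import math

def dfs_candidates(next_token_logprobs, prefix, eos_id,
                   logp_so_far=0.0, min_logp=math.log(0.05), max_len=256):
    """Enumerate all completions whose total probability stays >= the threshold.

    next_token_logprobs(prefix) -> {token_id: log_prob}  (assumed interface)
    Returns a list of (token_sequence, log_probability) pairs.
    """
    if len(prefix) >= max_len:
        return []
    results = []
    for token, logp in next_token_logprobs(prefix).items():
        total = logp_so_far + logp
        if total < min_logp:   # prune branches that fell below the threshold
            continue
        if token == eos_id:
            results.append((prefix, total))
        else:
            results.extend(dfs_candidates(next_token_logprobs, prefix + [token],
                                          eos_id, total, min_logp, max_len))
    return results

# Toy usage: a 3-token vocabulary where token 2 is the EOS token.
def toy_logprobs(prefix):
    return {0: math.log(0.6), 1: math.log(0.3), 2: math.log(0.1)}

print(dfs_candidates(toy_logprobs, [], eos_id=2, max_len=3))
```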
- Purpose: Contains functions used to select the best answer from different candidates
- Capabilities:
  - Various score aggregation methods (a minimal sketch follows this list)
  - Sorting candidates by their score for later submission generation
  - Class `EvalTool` for doing the above tasks on-the-fly and printing results
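Selection follows the product-of-experts idea: a candidate's probabilities under several augmented views of the task are multiplied (i.e. its log-probabilities are summed) and candidates are sorted by the aggregate. A minimal sketch, assuming each candidate carries a dict of per-view log-probabilities:

```python
def aggregate_score(per_view_logps: dict[str, float]) -> float:
    """Product of experts: multiply probabilities, i.e. sum log-probabilities."""
    return sum(per_view_logps.values())

def rank_candidates(candidates: dict[str, dict[str, float]]) -> list[str]:
    """Sort candidate answers by their aggregated score, best first."""
    return sorted(candidates, key=lambda c: aggregate_score(candidates[c]),
                  reverse=True)

# Example: two candidates scored under two augmented views each.
cands = {
    "cand_a": {"identity": -3.2, "rotate90": -2.9},
    "cand_b": {"identity": -2.5, "rotate90": -7.1},
}
print(rank_candidates(cands))   # cand_a wins despite the worse identity-view score
```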
- Purpose: Run the initial finetuning process.
- Required packages: `unsloth`
- Steps (a condensed sketch follows this list):
  - Load the base model and reduce the embedding size.
  - Load and augment the training data.
  - Create a LoRA adapter and execute training.
  - Save the trained LoRA adapter.
  - Merge the LoRA adapter into the base model and save it as the final model.
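A condensed sketch of these steps with `unsloth` (model name, sequence length, and LoRA settings are illustrative; see `finetune_NemoReArc1200.py` for the actual configuration):

```python
from unsloth import FastLanguageModel

# Illustrative settings only; the actual script configures these differently.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/Mistral-NeMo-Minitron-8B-Base",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach a LoRA adapter; only the adapter weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ... reduce the embedding size, build the augmented ReArc dataset, and train
# (e.g. with trl's SFTTrainer; a batch size of 2 fits on a 24 GB GPU) ...

model.save_pretrained("output/lora_adapter")                 # LoRA adapter only
model.save_pretrained_merged("output/final_model", tokenizer,
                             save_method="merged_16bit")     # merged final model
```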
- Purpose: Run inference.
- Required packages: `unsloth` and `diskcache`
- Steps:
  - Load the finetuned model.
  - Run additional finetuning (test-time training) and inference for each task.
  - Write the `submission.json` and `results.pickle.bz2` files (a minimal sketch of this step follows this list).
  - Reload and verify the submission file.
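A minimal sketch of the output step; the submission structure shown is the standard ARC Prize format with two attempts per test output, and the per-task results content here is a placeholder rather than the repository's exact structure:

```python
import bz2
import json
import pickle

# Hypothetical collected data; in the real run this comes from inference.
results = {"00576224": {"candidates": ["..."], "scores": [0.0]}}
submission = {"00576224": [{"attempt_1": [[0, 0], [0, 0]],
                            "attempt_2": [[1, 1], [1, 1]]}]}

with open("submission.json", "w") as f:
    json.dump(submission, f)

with bz2.open("results.pickle.bz2", "wb") as f:
    pickle.dump(results, f)

# Reload and verify that the submission file round-trips and covers every task.
with open("submission.json") as f:
    reloaded = json.load(f)
assert reloaded.keys() == submission.keys()
```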
Our code is available under the Apache 2.0 license. See the LICENSE.txt file for more info.