
Product of Experts with LLMs:
Boosting Performance on ARC Is a Matter of Perspective


Code for the ICML 2025 paper Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective. The repository provides state-of-the-art methods that combine Large Language Models (LLMs) with algorithmic sampling strategies, reaching a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set. It includes reproducible experiments, solution evaluation tools, and detailed instructions for running the experiments from the paper efficiently on accessible hardware.

Models

Both of our models from the paper are available for download on Hugging Face.

Evaluation (with test-time training)

The primary entry points for evaluating a model on an ARC dataset are named in the format: evaluation_[model]_on_[dataset]_[sampling_strategy]_with_ttt.py.

These scripts automatically download the model from Hugging Face and then process each task in the dataset sequentially, performing test-time training followed by candidate generation and scoring. Model outputs are cached using diskcache, allowing quick re-runs with identical settings.
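
To illustrate the caching behaviour, here is a minimal sketch of how per-task generation results can be cached with diskcache. The generate_candidates function, the cache directory, and the key layout are hypothetical placeholders, not the repository's actual implementation.

```python
# Minimal sketch: cache expensive per-task model outputs with diskcache.
# `generate_candidates`, the cache directory, and the key layout are hypothetical.
from diskcache import Cache

cache = Cache("output/cache")  # persistent on-disk cache (assumed path)

def generate_candidates(task_id: str, settings: dict) -> list:
    ...  # placeholder: test-time training + sampling for one task

def cached_generate(task_id: str, settings: dict) -> list:
    key = (task_id, tuple(sorted(settings.items())))  # identical settings -> same key
    if key in cache:                                  # re-runs return instantly
        return cache[key]
    result = generate_candidates(task_id, settings)
    cache[key] = result
    return result
```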

Our evaluation code requires the unsloth and diskcache packages to be installed.

Running the Sudoku evaluation additionally requires downloading the Sudoku-3m dataset and installing the pandas package.
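
For reference, loading the Sudoku-3m data with pandas might look like the sketch below. The file path and the column names are assumptions based on the public "3 million Sudoku puzzles with ratings" dataset and may differ from what the evaluation script actually expects.

```python
# Sketch only: load the Sudoku-3m CSV with pandas.
# The file path and column names ("puzzle", "solution") are assumptions,
# not taken from this repository's evaluation script.
import pandas as pd

df = pd.read_csv("input/sudoku-3m.csv")   # assumed location of the downloaded dataset
puzzles = df["puzzle"].tolist()           # 81-character puzzle strings
solutions = df["solution"].tolist()       # 81-character solved grids
print(len(puzzles), "puzzles loaded")
```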

Initial finetuning

To re-run the initial finetuning of our NeMo-Minitron-8B model, execute the finetune_NemoReArc1200.py script. Training requires 20,000 examples per task from the ReArc dataset, which must be generated in advance using Michael Hodel's ReArc code and placed under input/re_arc. The training code for our weaker Llama-3.2-3B model can be found in our older Kaggle ARC Prize 2024 GitHub repository.

Please ensure the unsloth package is installed before running our training code. All of our models were initially finetuned on a single NVIDIA H100 GPU. If you encounter memory problems, consider reducing the batch size and/or the max_tokens value. A batch size of 2 should allow finetuning Mistral-NeMo-Minitron-8B-Base on GPUs with 24 GB of memory.
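
For orientation, here is a condensed sketch of such a finetuning run using unsloth's standard API. The model name, LoRA settings, sequence length, and paths are illustrative placeholders rather than the exact values from our scripts; the batch-size and sequence-length arguments are the knobs to lower when memory is tight.

```python
# Condensed sketch of a finetuning run with unsloth (illustrative values only).
# per_device_train_batch_size and max_seq_length are the main memory knobs.
from unsloth import FastLanguageModel
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/Mistral-NeMo-Minitron-8B-Base",  # base model on Hugging Face
    max_seq_length=4096,   # analogous to the max_tokens value; lower it to save memory
    load_in_4bit=True,     # 4-bit loading reduces the memory footprint further
)

# Create a LoRA adapter on top of the (frozen) base model.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

args = TrainingArguments(
    output_dir="output/finetune",
    per_device_train_batch_size=2,   # a batch size of 2 should fit into 24 GB
    gradient_accumulation_steps=8,   # recover a larger effective batch size
    num_train_epochs=1,
    learning_rate=1e-4,
    bf16=True,
)

# ... build the augmented training dataset and run an SFT trainer with `args` ...

# Save the adapter, then merge it into the base model as the final model.
model.save_pretrained("output/lora_adapter")
model.save_pretrained_merged("output/final_model", tokenizer, save_method="merged_16bit")
```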

Files

Here is a rough overview of our files and classes:

arc_loader.py

  • Purpose: Handles all data formatting and loading
  • Capabilities:
    • Class ArcDataset, which handles all dataset-related tasks, e.g.:
      • Building datasets from various sources.
      • Modifying, shuffling, and augmenting examples.
      • Splitting, sorting, and filtering examples.
      • Handling dataset keys, challenges, and solutions (the standard JSON layout is sketched below).
      • Preparing the data for tokenization.
      • Creating and verifying submissions.
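
For orientation, the public ARC-AGI data that such a loader consumes follows the standard challenges/solutions JSON layout shown below. The file names are those of the public ARC Prize release and are an assumption here, not hard-coded paths from arc_loader.py.

```python
# Sketch of the standard ARC-AGI challenges/solutions JSON layout.
# File names follow the public ARC Prize release and are assumptions here.
import json

with open("input/arc-agi_evaluation_challenges.json") as f:
    challenges = json.load(f)  # {task_id: {"train": [{"input": grid, "output": grid}, ...],
                               #            "test":  [{"input": grid}, ...]}}
with open("input/arc-agi_evaluation_solutions.json") as f:
    solutions = json.load(f)   # {task_id: [output_grid, ...]}  one grid per test input

task_id, task = next(iter(challenges.items()))
print(task_id, len(task["train"]), "training pairs,", len(task["test"]), "test inputs")
```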

model_tools.py

  • Purpose: Contains code for loading, saving, and manipulating models
  • Capabilities:
    • Load and save models and LoRA adapters
    • Shrink the tokenizer and embedding layers
    • Data collator for masking the task inputs and the first output (the masking idea is sketched below)
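
The label-masking idea behind the data collator can be illustrated with the generic sketch below, which uses the Hugging Face convention that labels set to -100 are ignored by the loss. It is an illustration of the idea only, not the collator class from model_tools.py.

```python
# Generic sketch of prompt masking: tokens belonging to the task input get
# label -100 so they do not contribute to the training loss.
# This illustrates the idea only; it is not the collator from model_tools.py.
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:prompt_len] = -100   # -100 is ignored by PyTorch's cross-entropy loss
    return labels

ids = torch.tensor([11, 22, 33, 44, 55, 66])
print(mask_prompt_labels(ids, prompt_len=4))  # tensor([-100, -100, -100, -100, 55, 66])
```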

inference_tools.py

  • Purpose: Contains tools for inference and scoring
  • Capabilities:
    • Inference code, including our custom DFS sampling (sketched below)
    • Score calculation
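
The idea behind DFS sampling can be sketched generically as a depth-first search over the token tree that keeps every continuation whose cumulative probability stays above a threshold. In the sketch below, next_token_logprobs is a hypothetical stand-in for a model call; the real implementation in inference_tools.py is more involved (caching, batching, stopping criteria).

```python
# Generic sketch of depth-first candidate enumeration: explore every continuation
# whose cumulative log-probability stays above a threshold.
# `next_token_logprobs` is a hypothetical stand-in for a model call.
import math
from typing import Callable, Dict, List, Tuple

EOS = 0  # assumed end-of-sequence token id

def dfs_candidates(
    next_token_logprobs: Callable[[List[int]], Dict[int, float]],
    prefix: List[int],
    logp: float = 0.0,
    min_logp: float = math.log(0.05),   # keep candidates with probability >= 5 %
    max_len: int = 64,
) -> List[Tuple[List[int], float]]:
    if prefix and prefix[-1] == EOS:
        return [(prefix, logp)]          # finished candidate and its log-probability
    if len(prefix) >= max_len:
        return []
    results = []
    for token, token_logp in next_token_logprobs(prefix).items():
        if logp + token_logp >= min_logp:        # prune low-probability branches early
            results += dfs_candidates(next_token_logprobs, prefix + [token],
                                      logp + token_logp, min_logp, max_len)
    return results
```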

selection.py

  • Purpose: Contains functions used to select the best answer from different candidates
  • Capabilities:
    • Various score aggregation methods (the core idea is sketched below)
    • Sorting candidates by their score for later submission generation
    • Class EvalTool for performing the above tasks on the fly and printing results
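
The core selection idea, aggregating a candidate's scores from several perspectives (a product of probabilities, i.e. a sum of log-probabilities) and keeping the best-scoring candidates, can be sketched as follows. The data layout is hypothetical, and the aggregation methods in selection.py differ in detail.

```python
# Sketch of product-of-experts selection: a candidate's scores from several
# perspectives (augmentations) are log-probabilities, so multiplying the experts'
# probabilities amounts to summing the log-scores. Data layout is hypothetical.
from typing import Dict, List

def select_best(candidates: Dict[str, List[float]], top_k: int = 2) -> List[str]:
    """candidates maps a candidate answer (e.g. a serialized grid) to its
    log-probability under each perspective; return the top_k answers."""
    aggregated = {answer: sum(logps) for answer, logps in candidates.items()}
    return sorted(aggregated, key=aggregated.get, reverse=True)[:top_k]

scores = {
    "grid_A": [-1.2, -0.9, -1.1],   # consistently likely under every perspective
    "grid_B": [-0.3, -6.0, -5.5],   # likely under one perspective only
}
print(select_best(scores))  # ['grid_A', 'grid_B']
```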

finetuning_[model].py

  • Purpose: Run the initial finetuning process.
  • Required packages: unsloth
  • Steps:
    • Load the base model and reduce the embedding size.
    • Load and augment the training data.
    • Create a LoRA adapter and execute training.
    • Save the trained LoRA adapter.
    • Merge the LoRA adapter into the base model and save the result as the final model.

evaluation_[model]_on_[dataset]_[sampling_strategy]_with_ttt.py

  • Purpose: Run inference
  • Required packages: unsloth and diskcache
  • Steps:
    • Load the finetuned model.
    • Run additional finetuning (test-time training) and inference for each task.
    • Write the submission.json and results.pickle.bz2 files.
    • Reload and verify the submission file (a verification sketch follows below).
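
The final verification step can be illustrated with a short sketch. The two-attempt submission layout shown below follows the public ARC Prize format, and the helper itself is a hypothetical illustration, not the verification code used by the scripts.

```python
# Sketch of verifying a submission.json against the ground-truth solutions.
# The two-attempt layout follows the ARC Prize submission format; this helper
# is a hypothetical illustration, not the repository's verification code.
import json

def score_submission(submission_path: str, solutions_path: str) -> float:
    with open(submission_path) as f:
        submission = json.load(f)  # {task_id: [{"attempt_1": grid, "attempt_2": grid}, ...]}
    with open(solutions_path) as f:
        solutions = json.load(f)   # {task_id: [correct_grid, ...]}

    solved = 0.0
    for task_id, correct_grids in solutions.items():
        attempts = submission.get(task_id, [])
        hits = 0
        for i, grid in enumerate(correct_grids):
            if i < len(attempts) and grid in (attempts[i].get("attempt_1"),
                                              attempts[i].get("attempt_2")):
                hits += 1
        solved += hits / len(correct_grids)   # partial credit per test output
    return solved

print(score_submission("submission.json", "input/arc-agi_evaluation_solutions.json"))
```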

License

Our code is available under the Apache 2.0 license. See the LICENSE.txt file for more info.

About

This repository is a mirror of https://github.com/da-fr/Product-of-Experts-ARC-Paper, the code release for the ICML 2025 paper "Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective".
