Code for the Paper "Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models".
If you have any questions or suggestions, please don't hesitate to let us know. You can email Pan Lu directly at lupantech@gmail.com, comment on Twitter, or post an issue on this repository.
[Project Page] [Paper] [Twitter] [Linkedin] [YouTube Video]
- [2023.04.22] Thrilled to announce that our work has been featured on WorldofAI's YouTube channel!
- [2023.04.21] Our work is the trending project on https://trends.vercel.app. [Link]
- [2023.04.20] Huge thanks to John Nay for sharing our work on Twitter!
- [2023.04.19] Our research is now listed on Papers with Code.
- [2023.04.19] We appreciate Aran Komatsuzaki for featuring our work on Twitter in a timely manner!
- [2023.04.19] Special thanks to @_akhaliq for promptly sharing our work on Twitter!
- [2023.04.19] Visit our project's homepage at Chameleon-LLM.
- [2023.04.19] Our paper is now accessible at https://arxiv.org/abs/2304.09842.
Chameleon is a plug-and-play compositional reasoning framework that augments LLMs with various types of tools. Chameleon synthesizes programs to compose various tools, including LLM models, off-the-shelf vision models, web search engines, Python functions, and rule-based modules tailored to user interests. Built on top of an LLM as a natural language planner, Chameleon infers the appropriate sequence of tools to compose and execute in order to generate a final response.
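The plan-then-execute idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the module bodies are stand-ins for real LLM calls, and the `INVENTORY` dictionary, the shared `cache` dict, and the `knowledge_retrieval` name are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Each module reads from and writes to a shared cache (a plain dict).
Module = Callable[[dict], dict]


def knowledge_retrieval(cache: dict) -> dict:
    # Stand-in for an LLM call that recalls background knowledge.
    cache["knowledge"] = f"(background facts about: {cache['question']})"
    return cache


def solution_generator(cache: dict) -> dict:
    # Stand-in for an LLM call that writes a step-by-step solution.
    knowledge = cache.get("knowledge", "no extra knowledge")
    cache["solution"] = f"Using {knowledge}, the answer is B."
    return cache


def answer_generator(cache: dict) -> dict:
    # Rule-based step: take the last token of the solution as the answer.
    text = cache.get("solution", "")
    cache["answer"] = text.rstrip(".").split()[-1] if text else None
    return cache


# Hypothetical module inventory: name -> callable.
INVENTORY: Dict[str, Module] = {
    "knowledge_retrieval": knowledge_retrieval,
    "solution_generator": solution_generator,
    "answer_generator": answer_generator,
}


def execute_program(program: List[str], question: str) -> dict:
    """Run the planned modules one after another over a shared cache."""
    cache = {"question": question}
    for name in program:
        cache = INVENTORY[name](cache)
    return cache


# In the real system the program would come from the LLM planner.
program = ["knowledge_retrieval", "solution_generator", "answer_generator"]
result = execute_program(program, "Which material is a conductor?")
print(result["answer"])
```

Because every module shares the same call signature, adding a tool only requires registering one more entry in the inventory.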
We showcase the adaptability and effectiveness of Chameleon on two tasks: ScienceQA and TabMWP. Notably, Chameleon with GPT-4 achieves an 86.54% accuracy on ScienceQA, significantly improving upon the best published few-shot model by 11.37%; using GPT-4 as the underlying LLM, Chameleon achieves a 17.8% increase over the state-of-the-art model, leading to a 98.78% overall accuracy on TabMWP. Further studies suggest that using GPT-4 as a planner exhibits more consistent and rational tool selection and is able to infer potential constraints given the instructions, compared to other LLMs like ChatGPT.
For more details, you can find our project page here and our paper here.
We would like to express our immense gratitude to WorldofAI for featuring and introducing our work on YouTube!
- OpenAI API key
- Bing Search API key (optional; needed only if you want to enable the Bing search module)
The required Python dependencies (generated by pipreqs) are:
python==3.8.10
huggingface-hub
numpy==1.23.2
openai==0.23.0
pandas==1.4.3
transformers==4.21.1
requests==2.28.1
Install all required Python dependencies (you can skip this step if you have already set them up; the exact versions are not strictly required):
pip install -r requirements.txt
Obtain your OpenAI API key from: https://platform.openai.com/account/api-keys.
To use an OpenAI API key with Chameleon, you need to have billing set up (i.e., a paid account).
You can set up a paid account at https://platform.openai.com/account/billing/overview.
Obtain your Bing Search API key from: https://www.microsoft.com/en-us/bing/apis/bing-web-search-api.
The Bing Search API key is optional. Failure to set up this key will lead to a slight performance drop on the ScienceQA task.
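One way to handle the required and optional keys is to read both from environment variables and fail early only when the OpenAI key is missing. The sketch below is an assumption about how this could be wired up; the variable names `OPENAI_API_KEY` and `BING_API_KEY` and the `load_api_keys` helper are hypothetical and may not match what the repository expects.

```python
import os
from typing import Mapping, Optional


def load_api_keys(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read API keys from the environment (variable names are assumptions)."""
    env = os.environ if env is None else env

    openai_key = env.get("OPENAI_API_KEY", "").strip()
    if not openai_key:
        # The OpenAI key is required for every LLM-based module.
        raise RuntimeError("OPENAI_API_KEY is not set")

    # The Bing key is optional: without it, the search module is disabled.
    bing_key = env.get("BING_API_KEY", "").strip()
    return {
        "openai_api_key": openai_key,
        "bing_api_key": bing_key or None,
        "bing_search_enabled": bool(bing_key),
    }


# Example with an explicit mapping instead of the real environment.
keys = load_api_keys({"OPENAI_API_KEY": "sk-example"})
print(keys["bing_search_enabled"])  # False
```

Passing a mapping explicitly keeps the helper easy to test without touching real credentials.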
Different types of tools in our module inventory:
Tools used on ScienceQA and TabMWP, respectively. The reusable tools in two tasks are highlighted in green:
Science Question Answering (ScienceQA) is a multi-modal question-answering benchmark covering a wide range of scientific topics over diverse contexts. The ScienceQA dataset is provided in data/scienceqa. For more details, you can explore the dataset and check out the Explore page and Visualize page.
For the current version, the results for the Image Captioner and Text Detector are precomputed with off-the-shelf models and stored in data/scienceqa/captions.json and data/scienceqa/ocrs.json, respectively. Live calls to these two modules are coming soon!
To run Chameleon (GPT-4):
cd run_scienceqa
python run.py \
--model chameleon \
--label chameleon_gpt4 \
--policy_engine gpt-4 \
--kr_engine gpt-4 \
--qg_engine gpt-4 \
--sg_engine gpt-4 \
--test_split test \
--test_number -1
It will generate the predictions and save the results at results/scienceqa/chameleon_gpt4_test.json, results/scienceqa/chameleon_gpt4_test_cache.jsonl, and results/scienceqa/chameleon_gpt4_test_cache.json.
We can get the accuracy metrics on average and across different question classes by running:
python evaluate.py \
--data_file ../data/scienceqa/problems.json \
--result_root ../results/scienceqa \
--result_files chameleon_gpt4_test_cache.jsonl
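For readers who want to see what "accuracy on average and across question classes" amounts to, here is a rough sketch over JSON Lines records. It is not the logic of evaluate.py: the field names `prediction`, `answer`, and `subject` are assumptions for illustration, and the real script reads ground truth from the separate problems file passed via --data_file.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def accuracy_by_class(lines: Iterable[str], class_key: str = "subject") -> Dict[str, float]:
    """Compute overall and per-class accuracy (in percent) from JSONL lines."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the cache file
        record = json.loads(line)
        hit = record["prediction"] == record["answer"]
        # Count the record once overall and once under its class.
        for group in ("overall", record.get(class_key, "unknown")):
            total[group] += 1
            correct[group] += int(hit)
    return {group: 100.0 * correct[group] / total[group] for group in total}


# Toy records with hypothetical field names.
sample_lines = [
    '{"subject": "NAT", "answer": "A", "prediction": "A"}',
    '{"subject": "NAT", "answer": "B", "prediction": "C"}',
    '{"subject": "SOC", "answer": "A", "prediction": "A"}',
    '{"subject": "SOC", "answer": "D", "prediction": "D"}',
]
scores = accuracy_by_class(sample_lines)
print(scores)
```

With a real cache file, the same function could be fed an open file handle, since iterating over a file yields one line at a time.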
To run Chameleon (ChatGPT):
python run.py \
--model chameleon \
--label chameleon_chatgpt \
--policy_engine gpt-3.5-turbo \
--kr_engine gpt-3.5-turbo \
--qg_engine gpt-3.5-turbo \
--sg_engine gpt-3.5-turbo \
--test_split test \
--test_number -1
Our Chameleon is a generalized form of the CoT (chain-of-thought) method, where the generated program is a sequence of Solution Generator and Answer Generator. By passing --model as cot, modules is set as ["solution_generator", "answer_generator"].
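The relationship between the fixed CoT program and the planner-generated program can be sketched as a simple dispatch. The function and variable names below (`select_modules`, `FIXED_PROGRAMS`, `planner`) are hypothetical; only the cot module list is taken from the text above.

```python
from typing import Callable, Dict, List, Optional

# Baselines use a fixed program; the list for "cot" comes from the README.
FIXED_PROGRAMS: Dict[str, List[str]] = {
    "cot": ["solution_generator", "answer_generator"],
}


def select_modules(
    model: str,
    planner: Optional[Callable[[str], List[str]]] = None,
    question: str = "",
) -> List[str]:
    """Return the module sequence to execute for the given --model value."""
    if model in FIXED_PROGRAMS:
        # Copy so that callers cannot mutate the shared default.
        return list(FIXED_PROGRAMS[model])
    if model == "chameleon":
        if planner is None:
            raise ValueError("chameleon requires a planner")
        # The planner (an LLM in the real system) chooses the program per query.
        return planner(question)
    raise ValueError(f"unknown model: {model}")


print(select_modules("cot"))

# A toy planner that always adds a retrieval step before the CoT modules.
toy_planner = lambda q: ["knowledge_retrieval", "solution_generator", "answer_generator"]
print(select_modules("chameleon", planner=toy_planner, question="Why is the sky blue?"))
```

Seen this way, CoT is the special case where the planner is replaced by a constant.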
To run CoT (chain-of-thought prompted) GPT-4:
python run.py \
--model cot \
--label cot_gpt4 \
--sg_engine gpt-4 \
--test_split test \
--test_number -1
To run CoT (chain-of-thought prompted) ChatGPT:
python run.py \
--model cot \
--label cot_chatgpt \
--sg_engine gpt-3.5-turbo \
--test_split test \
--test_number -1
The TabMWP dataset contains 38,431 tabular math word problems. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. The TabMWP dataset is provided in data/tabmwp. For more details, you can explore the dataset and check out the Explore page and Visualize page.
To run Chameleon (GPT-4):
cd run_tabmwp
python run.py \
--model chameleon \
--label chameleon_gpt4 \
--test_split test \
--policy_engine gpt-4 \
--rl_engine gpt-4 \
--cl_engine gpt-4 \
--tv_engine gpt-4 \
--kr_engine gpt-4 \
--sg_engine gpt-4 \
--pg_engine gpt-4 \
--test_number -1 \
--rl_cell_threshold 18 \
--cl_cell_threshold 18
It will generate the predictions and save the results at results/tabmwp/chameleon_gpt4_test.json, results/tabmwp/chameleon_gpt4_test_cache.jsonl, and results/tabmwp/chameleon_gpt4_test_cache.json.
We can get the accuracy metrics on average and across different question classes by running:
python evaluate.py \
--data_file ../data/tabmwp/problems_test.json \
--result_root ../results/tabmwp \
--result_files chameleon_gpt4_test_cache.jsonl
To run Chameleon (ChatGPT):
python run.py \
--model chameleon \
--label chameleon_chatgpt \
--test_split test \
--policy_engine gpt-3.5-turbo \
--rl_engine gpt-3.5-turbo \
--cl_engine gpt-3.5-turbo \
--tv_engine gpt-3.5-turbo \
--kr_engine gpt-3.5-turbo \
--sg_engine gpt-3.5-turbo \
--pg_engine gpt-3.5-turbo \
--test_number -1 \
--rl_cell_threshold 18 \
--cl_cell_threshold 18
To run CoT (chain-of-thought prompted) GPT-4:
python run.py \
--model cot \
--label cot_gpt4 \
--test_split test \
--sg_engine gpt-4 \
--test_number -1
To run CoT (chain-of-thought prompted) ChatGPT:
python run.py \
--model cot \
--label cot_chatgpt \
--test_split test \
--sg_engine gpt-3.5-turbo \
--test_number -1
Our Chameleon is a generalized form of the PoT (program-of-thought) method, where the generated program is a sequence of Program Generator, Program Executor, and Answer Generator. By passing --model as pot, modules is set as ["program_generator", "program_executor", "answer_generator"].
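A toy version of the three PoT steps might look as follows. This is only a sketch under stated assumptions: the generated program is hard-coded instead of coming from an LLM, and the convention that the program stores its result in a variable named `ans` is an assumption of this example, not something confirmed by the repository.

```python
from typing import Any, Dict


def program_generator(cache: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for an LLM call; a real generator would condition on the
    # question and table. The program is assumed to store its result in `ans`.
    cache["program"] = "prices = [4, 7, 10]\nans = sum(prices) / len(prices)"
    return cache


def program_executor(cache: Dict[str, Any]) -> Dict[str, Any]:
    # Run the generated code in a fresh namespace and read back `ans`.
    scope: Dict[str, Any] = {}
    try:
        exec(cache["program"], scope)
        cache["program_output"] = scope.get("ans")
    except Exception as err:  # generated code may be invalid
        cache["program_output"] = None
        cache["error"] = repr(err)
    return cache


def answer_generator(cache: Dict[str, Any]) -> Dict[str, Any]:
    # Turn the raw program output into the final answer string.
    output = cache.get("program_output")
    cache["answer"] = None if output is None else str(output)
    return cache


cache: Dict[str, Any] = {"question": "What is the mean price?"}
for step in (program_generator, program_executor, answer_generator):
    cache = step(cache)
print(cache["answer"])  # 7.0
```

Note that `exec` runs arbitrary code, so model-generated programs should be executed in a sandboxed or otherwise restricted environment in practice.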
To run PoT (program-of-thought prompted) GPT-4:
python run.py \
--model pot \
--label pot_gpt4 \
--test_split test \
--pg_engine gpt-4 \
--test_number -1
To run PoT (program-of-thought prompted) ChatGPT:
python run.py \
--model pot \
--label pot_chatgpt \
--test_split test \
--pg_engine gpt-3.5-turbo \
--test_number -1
Chameleon (GPT-4) is able to adapt to different input queries by generating programs that compose various tools and executing them sequentially to obtain the correct answers.
For instance, the query above asks, “Which animal’s skin is adapted for survival in cold places?”, which involves scientific terminology related to animal survival. Consequently, the planner decides to rely on the Bing search engine for domain-specific knowledge, benefiting from the numerous online resources available.
The adaptability and versatility of our Chameleon for various queries are also observed on TabMWP, as illustrated in the examples in the figure above.
The first example involves mathematical reasoning on a tax form. Chameleon (1) calls the knowledge retrieval model to recall basic knowledge that assists in understanding such domain-specific tables, (2) describes the table in a more readable natural language format, and (3) finally relies on program-aided tools to perform precise computations.
In the second example, the system generates Python code that closely aligns with the background knowledge provided by the knowledge retrieval model.
The third example requires the system to locate the cell in a large tabular context given the input query. Chameleon calls the row lookup model to help accurately locate the relevant rows and generate the language solution via an LLM model, instead of relying on program-based tools.
Significant improvements are observed for Chameleon over both fine-tuned models and few-shot prompted GPT-4/ChatGPT:
To visualize the predictions made by Chameleon, simply execute the Jupyter Notebook corresponding to your specific task: notebooks/results_viewer_[TASK].ipynb. This will provide an interactive and user-friendly way to explore the results generated by the model. Alternatively, explore our project page for more information and options.
Tools called in the generated programs from Chameleon (ChatGPT) and Chameleon (GPT-4) on ScienceQA:
Tools called in the generated programs from Chameleon (ChatGPT) and Chameleon (GPT-4) on TabMWP:
Execute notebooks/transition_[TASK]_[Model]_Engine.ipynb to visualize the module transition graph for programs generated on the test set.
Transitions between modules in programs generated by Chameleon (GPT-4) on ScienceQA. START is the start symbol, END is a terminal symbol and the others are non-terminal symbols.
Transitions between modules in programs generated by Chameleon (GPT-4) on TabMWP. START is the start symbol, END is a terminal symbol and the others are non-terminal symbols.
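The edge weights of such a transition graph can be obtained by counting consecutive module pairs in each generated program, with START and END added as boundary symbols. The snippet below is a small sketch of that counting step; it does not reproduce the notebooks, and the example programs are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_transitions(programs: Iterable[List[str]]) -> Counter:
    """Count module-to-module transitions, including START and END."""
    counts: Counter = Counter()
    for program in programs:
        path = ["START", *program, "END"]
        # Pair each symbol with its successor: (START, m1), (m1, m2), ...
        counts.update(zip(path, path[1:]))
    return counts


# Made-up programs standing in for planner outputs on a test set.
programs = [
    ["knowledge_retrieval", "solution_generator", "answer_generator"],
    ["solution_generator", "answer_generator"],
    ["solution_generator", "answer_generator"],
]
transitions = count_transitions(programs)
for (source, target), n in transitions.most_common():
    print(f"{source} -> {target}: {n}")
```

The resulting counts can be passed to any graph-drawing library as weighted directed edges.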
- Construct the module inventory: Create prompts for LLM-based models within the demos directory. Define the input, execution, and output for each module in model.py.
- Develop the LLM planner: Provide a comprehensive description of the module inventory and include a few examples that demonstrate how to map queries to the target program.
- Implement the data loader and evaluation method: Define the data loader within model.py. To modify the evaluation method, update the corresponding section in main.py.
- Enjoy the process: With the groundwork in place, it's time to have fun and dive into the task at hand!
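To make the planner step more concrete, the sketch below assembles a planner prompt from module descriptions and a few query-to-program examples, then validates the planner's reply against the inventory. Everything here is illustrative: the prompt wording, the `build_planner_prompt` and `parse_program` helpers, and the module descriptions are assumptions rather than the repository's actual prompts or functions.

```python
import ast
from typing import Dict, List, Tuple


def build_planner_prompt(
    modules: Dict[str, str],
    demos: List[Tuple[str, List[str]]],
    query: str,
) -> str:
    """Describe the module inventory, show few-shot demos, then ask for a program."""
    parts = ["You can compose the following modules to answer a query.", ""]
    for name, description in modules.items():
        parts.append(f"- {name}: {description}")
    for demo_query, demo_program in demos:
        parts += ["", f"Query: {demo_query}", f"Modules: {demo_program}"]
    parts += ["", f"Query: {query}", "Modules:"]
    return "\n".join(parts)


def parse_program(planner_output: str, modules: Dict[str, str]) -> List[str]:
    """Parse a list literal from the planner and reject unknown modules."""
    program = ast.literal_eval(planner_output.strip())
    if not isinstance(program, list):
        raise ValueError("planner output is not a list")
    unknown = [name for name in program if name not in modules]
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    return program


# Hypothetical inventory and demo for a new task.
MODULES = {
    "solution_generator": "writes a step-by-step solution",
    "answer_generator": "extracts the final answer from the solution",
}
DEMOS = [("What is 2 + 3?", ["solution_generator", "answer_generator"])]

prompt = build_planner_prompt(MODULES, DEMOS, "What is 7 * 6?")
print(prompt)
print(parse_program('["solution_generator", "answer_generator"]', MODULES))
```

Validating the parsed program before execution guards against the planner naming a module that does not exist in the inventory.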
Fantastic! I'm always open to engaging discussions, collaborations, or even just sharing a virtual coffee. To get in touch, visit Pan Lu's homepage for contact information.
If you find Chameleon useful for your research and applications, please kindly cite using this BibTeX:
@article{lu2023chameleon,
title={Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models},
author={Lu, Pan and Peng, Baolin and Cheng, Hao and Galley, Michel and Chang, Kai-Wei and Wu, Ying Nian and Zhu, Song-Chun and Gao, Jianfeng},
journal={arXiv preprint arXiv:2304.09842},
year={2023}
}