This repository contains the code base of OpenScholar.
Blog | Demo | Paper | Model checkpoints and data | ScholarQABench | Expert Evaluation | Slides
- Overview of OpenScholar
- Repository Organization
- Installation
- Run OpenScholar
- Train OpenScholar-8B
- Run Retriever
- Contact and Citation
Scientific progress hinges on our ability to find, synthesize, and build on relevant knowledge from the scientific literature. However, the exponential growth of this literature—with millions of papers now published each year—has made it increasingly difficult for scientists to find the information they need or even stay abreast of the latest findings in a single subfield.
To help scientists effectively navigate and synthesize scientific literature, we introduce OpenScholar, a retrieval-augmented language model (LM) designed to answer user queries by first searching for relevant papers in the literature and then generating responses grounded in those sources. Try the demo at open-scholar.allen.ai/ and see our paper for more details.
This repository contains code to run OpenScholar inference.
- src/: Main source code for OpenScholar.
- training/: Our training code to train Llama 3.1 8B using our processed data. We modified an earlier version of torchtune for training.
- retriever/: Code base to run retrieval offline and host retrieval servers for online retrieval.
For automatic and human evaluations, please check the following repositories.
- To run evaluations on ScholarQABench, please check the ScholarQABench repository.
- For our human evaluation interfaces as well as the results, please check the OpenScholar_ExpertEval repository.
To run OpenScholar inference, please ensure that all necessary libraries are installed.
[test environment command]
conda create -n os_env python=3.10.0
conda activate os_env
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Also please set the following API keys:
export S2_API_KEY=YOUR_S2_API_KEY
See instructions to acquire API keys at Semantic Scholar API Page.
If you also want to use a web search engine, sign up for the you.com web API and set the key.
export YOUR_API_KEY=YOUR_YOU_COM_API_KEY
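Before launching a run, it can help to confirm that the keys are visible to Python. The snippet below is a minimal sketch and is not part of the repository: the helper name `missing_api_keys` is hypothetical, and it only checks the two environment variable names used in the export commands above.

```python
import os

# Environment variable names taken from the export commands above.
REQUIRED_KEYS = ["S2_API_KEY"]    # Semantic Scholar API
OPTIONAL_KEYS = ["YOUR_API_KEY"]  # you.com web search, only needed for web search


def missing_api_keys(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report what is missing in the current shell environment.
print("Missing required keys:", missing_api_keys(REQUIRED_KEYS) or "none")
print("Missing optional keys:", missing_api_keys(OPTIONAL_KEYS) or "none")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.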
For information related to the OpenScholar training and retriever components, refer to the training/ and retriever/ directories, respectively.
By default, OpenScholar takes the offline retrieval results produced by the retrieval scripts in retriever/, and then adds results retrieved from the Semantic Scholar Paper API and the web search API. See the script src/use_search_apis.py to retrieve related passages offline using external APIs.
We released our retrieval results on Google Drive.
- Run a standard RAG pipeline using the top 10 passages
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
- Run a Retriever + Reranker pipeline
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts \
--ranking_ce \
--reranker OpenScholar/OpenScholar_Reranker \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
- Run the Open Retriever Self-reflective Generation pipeline
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts --output_file OUTPUT_FILE_NAME \
--top_n 10 --llama3 \
--ranking_ce --reranker OpenScholar/OpenScholar_Reranker \
--posthoc --feedback --ss_retriever \
--use_abstract --norm_cite --zero_shot --max_per_paper 3
You can also combine the OpenScholar pipeline with proprietary LLMs by specifying model_name, api and api_key_fp.
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name "gpt-4o" \
--api "openai" \
--api_key_fp PATH_TO_YOUR_OPEN_AI_KEY \
--use_contexts \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
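The invocations above differ only in their flags, so when scripting several runs it may be convenient to assemble the command in Python. The following is a rough sketch, not part of the repository: `build_run_command` is a hypothetical helper, and it only emits flag names that appear in the commands above.

```python
import shlex


def build_run_command(input_file, output_file, model_name,
                      top_n=10, flags=(), options=None):
    """Assemble an argv list for run.py.

    `flags` are switches without a value (e.g. "use_contexts");
    `options` maps option names to values (e.g. {"reranker": "..."}).
    """
    argv = ["python", "run.py",
            "--input_file", input_file,
            "--model_name", model_name,
            "--output_file", output_file,
            "--top_n", str(top_n)]
    for flag in flags:
        argv.append(f"--{flag}")
    for name, value in (options or {}).items():
        argv.extend([f"--{name}", str(value)])
    return argv


# Retriever + reranker pipeline, mirroring the second command above.
rerank_cmd = build_run_command(
    "YOUR_INPUT_FILE", "OUTPUT_FILE_PATH",
    "OpenScholar/Llama-3.1_OpenScholar-8B",
    flags=["use_contexts", "ranking_ce", "llama3", "zero_shot"],
    options={"reranker": "OpenScholar/OpenScholar_Reranker"},
)
print(shlex.join(rerank_cmd))
```

The resulting list can be passed to `subprocess.run(rerank_cmd, check=True)` from the repository root once the placeholders are replaced with real paths.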
Below, we describe the configuration options in detail.
- top_n: The number of passages to be fed into the underlying LM. By default, we use 10 for multi-paper tasks.
- feedback: Set to true if you want to use the self-feedback loop during generation.
- posthoc_at: Set to true if you want to run post hoc citation attribution.
- zero_shot: Set to true if you want to run inference in a zero-shot manner.
- ranking_ce: Use a reranking model to rerank the top_n passages. If not set, we take the top_n passages from the ctxs in the provided input file.
- reranker: The path to the reranker model file (local or HF hub). If you use our OpenScholar reranker, set it to OpenScholar/OpenScholar_Reranker.
- min_citation: The minimum number of citations. If an int is given, we exclude papers whose citation count is below min_citation. By default, it is set to None and all papers are considered regardless of their citation counts.
- ss_retriever: Use the Semantic Scholar API during the feedback generation loop to enhance the feedback results.
- use_abstract: Consider abstracts to enhance the reranking results.
- max_per_paper: The maximum number of passages from the same paper used at inference time.
- task_name: The task name to use when you run the single-paper tasks. For SciFact, PubMedQA and QASA, the corresponding task names are claim_full, boolean_question_full and single_qa, respectively.
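To make the interaction of top_n, min_citation and max_per_paper concrete, here is a minimal sketch of how such filters could be applied to a list of passages. It is an illustration under stated assumptions, not the logic in src/: the function `select_passages` and the field names `paper_id` and `citation_count` are hypothetical, and passages without a citation count are treated as having zero citations.

```python
from collections import Counter


def select_passages(ctxs, top_n=10, min_citation=None, max_per_paper=None):
    """Pick up to top_n passages, keeping the order in which they are given.

    "paper_id" and "citation_count" are hypothetical field names.
    """
    selected = []
    per_paper = Counter()
    for ctx in ctxs:
        if len(selected) >= top_n:
            break
        # Drop passages from papers cited fewer than min_citation times.
        if min_citation is not None and ctx.get("citation_count", 0) < min_citation:
            continue
        # Cap how many passages a single paper may contribute.
        paper = ctx.get("paper_id")
        if max_per_paper is not None and per_paper[paper] >= max_per_paper:
            continue
        per_paper[paper] += 1
        selected.append(ctx)
    return selected


ctxs = [
    {"paper_id": "A", "citation_count": 50, "text": "a1"},
    {"paper_id": "A", "citation_count": 50, "text": "a2"},
    {"paper_id": "B", "citation_count": 2, "text": "b1"},
    {"paper_id": "C", "citation_count": 10, "text": "c1"},
]
picked = select_passages(ctxs, top_n=2, min_citation=5, max_per_paper=1)
print([c["text"] for c in picked])  # ['a1', 'c1']
```

In this toy input, a2 is skipped by the per-paper cap and b1 by the citation threshold, so the second slot goes to c1.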
We train OpenScholar-8B on our OpenScholar/OS_Train_Data dataset, which consists of 13k instruction-tuning examples. We use our modified version of torchtune to train the 8B model on 8 A100 GPUs.
See more detailed instructions for setting up training in training/
Both our peS2o v2 and v3 datastores (chunked text + index) are available:
See the instructions under retriever/ to run the peS2o index locally. Note that due to the massive scale of the index (200+M embeddings based on 45 million papers), the peS2o retriever requires a lot of CPU memory. In our main experiments, we retrieved initial passages offline.
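For a rough sense of scale, the back-of-the-envelope estimate below computes the size of an uncompressed dense index. It is only a sketch: the embedding dimension (768) and float32 storage are assumptions that are not stated in this README, and the real index may be compressed or laid out differently.

```python
def flat_index_bytes(num_embeddings, dim, bytes_per_value=4):
    """Size of an uncompressed dense index: one vector of `dim` values per embedding."""
    return num_embeddings * dim * bytes_per_value


NUM_EMBEDDINGS = 200_000_000  # "200+M embeddings" from the text above
ASSUMED_DIM = 768             # assumption: dimension is not given in this README

size = flat_index_bytes(NUM_EMBEDDINGS, ASSUMED_DIM)
print(f"{size / 1e9:.1f} GB")  # 614.4 GB
```

Under these assumptions the vectors alone would occupy several hundred gigabytes, before counting the chunked text.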
We are planning to publicly release the efficient sparse-dense retriever API endpoint used for the OpenScholar Demo via the Semantic Scholar API, to accelerate research on LLMs for scientific literature synthesis. Stay tuned!
If you have any questions, please contact akari@cs.washington. Note that I am currently applying for academic jobs, so I may be slow to respond.
If you have any questions related to the demo, please submit your request via the Google Form.
@article{openscholar,
title={{OpenScholar}: Synthesizing Scientific Literature with Retrieval-Augmented Language Models},
author={Asai, Akari and He*, Jacqueline and Shao*, Rulin and Shi, Weijia and Singh, Amanpreet and Chang, Joseph Chee and Lo, Kyle and Soldaini, Luca and Feldman, Sergey and D’arcy, Mike and Wadden, David and Latzke, Matt and Tian, Minyang and Ji, Pan and Liu, Shengyan and Tong, Hao and Wu, Bohao and Xiong, Yanyu and Zettlemoyer, Luke and Weld, Dan and Neubig, Graham and Downey, Doug and Yih, Wen-tau and Koh, Pang Wei and Hajishirzi, Hannaneh},
journal={arXiv},
year={2024},
}