## Short introduction

CASPER is a sparse model for scientific search that uses keyphrases, alongside tokens, as representation units (i.e., each representation unit corresponds to a dimension in the sparse embedding space). This enables CASPER to represent queries and texts with both granular features and research concepts, which matters because scientific search is known to revolve around research concepts [1] (we search for papers about the specific research concepts we want to learn about).
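To make the representation concrete, here is a minimal sketch of what a CASPER-style sparse embedding conceptually looks like. The dimension names and weights below are invented for illustration and are not taken from the actual model.

```python
# A sparse embedding maps representation units (tokens AND keyphrases)
# to non-negative weights; all other dimensions are implicitly zero.
# Units and weights here are hypothetical, for illustration only.
sparse_embedding = {
    # token-level units: granular lexical matching
    "graph": 1.3,
    "neural": 0.9,
    "molecule": 0.7,
    # keyphrase-level units: research concepts
    "graph neural network": 2.1,
    "molecular property prediction": 1.6,
}

# Relevance is the dot product over shared dimensions,
# as in other sparse retrieval models.
def score(query_emb: dict, doc_emb: dict) -> float:
    return sum(w * doc_emb.get(u, 0.0) for u, w in query_emb.items())
```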
We also propose FRIEREN, a framework that generates training data for scientific search models. The key idea behind FRIEREN is to leverage scholarly references, which we define as signals that capture how the research concepts of a paper are expressed in different settings. We utilize four types of scholarly references, namely titles, citation contexts, author-assigned keyphrases, and co-citations, to augment user interaction data (we use the SciRepEval Search dataset in our work).
We build CASPER on top of the SPLADE repository. Shout out to the authors of SPLADE.

Note: We are still working to clean up this repository. We hope to have the clean code and full instructions ready in one or two weeks (mid-September 2025).
## Installation

The project uses Python 3.10.14. Install the dependencies with:

```bash
# install packages one at a time, so a single failing
# dependency does not abort the whole installation
cat requirements.txt | xargs -n 1 pip install
```
Please refer to `inference_casper.ipynb` for a quick test run.
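The notebook is the authoritative reference for inference. For orientation, below is a minimal sketch of SPLADE-style sparse encoding, which CASPER builds on; the checkpoint path is a placeholder, and the actual loading code for CASPER's keyphrase-extended vocabulary may differ from what the notebook does.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# placeholder path; use the actual CASPER checkpoint
ckpt = "/path/to/casper/checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)

def encode(text: str) -> torch.Tensor:
    """SPLADE-style encoding: log-saturated, max-pooled MLM logits."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    # log(1 + ReLU(logits)), masked, then max-pooled over the sequence
    weights = torch.log1p(torch.relu(logits))
    weights = weights * inputs["attention_mask"].unsqueeze(-1)
    return weights.max(dim=1).values.squeeze(0)  # (vocab_size,)

query_vec = encode("sparse retrieval for scientific papers")
print((query_vec > 0).sum().item(), "active dimensions")
```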
## The scientific corpus

We employ ERU-KG, an unsupervised keyphrase generation/extraction model, to extract keyphrases for all documents in the corpus, using concept_vocab/keyphrase_extraction/create_phrase_vocab_from_s2orc.py as follows:
```bash
# The scientific corpus is split into slices (indicated by num_slices).
# In this example, we run the script with current_slice_index = 0 to 7.
# One can run the scripts sequentially, or in parallel in different screens.
CUDA_VISIBLE_DEVICES=*** python create_phrase_vocab_from_s2orc.py \
    --num_slices 8 \
    --current_slice_index 0 \
    --output_folder /keyphrase/extraction/output/folder \
    --top_k_candidates 5  # number of keyphrases extracted for each document
```

Once this is done, one can find the extracted keyphrases in /keyphrase/extraction/output/folder:
```bash
ls /keyphrase/extraction/output/folder
>>> 0.json 1.json 2.json 3.json 4.json 5.json 6.json 7.json
```
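If you want to sanity-check a slice before building the vocabulary, something like the following works; note that the exact JSON layout of each slice file is an assumption here, so adjust the access to the real format.

```python
import json

# Peek at one slice of extracted keyphrases. The exact layout of the
# slice files is an assumption; adjust once you see the real format.
with open("/keyphrase/extraction/output/folder/0.json") as f:
    slice_data = json.load(f)

print(type(slice_data).__name__, len(slice_data))
sample = next(iter(slice_data.items() if isinstance(slice_data, dict) else slice_data))
print("sample entry:", sample)
```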
These extracted keyphrases are then used to build the keyphrase vocabulary, via concept_vocab/greedy_vocab_builder.py:

```bash
python greedy_vocab_builder.py \
    --input_folder /keyphrase/extraction/output/folder \
    --num_phrases 30000 \
    --output_file keyphrase/vocab.json
```

## Continued pretraining

Please refer to the mlm_training branch of this repo.
## Training

After continued pretraining, we train CASPER as follows:

```bash
SPLADE_CONFIG_NAME="config_phrase_splade" \
CUDA_VISIBLE_DEVICES=*** \
python -m splade.train
```

Please make sure to adjust the hyperparameters in the configuration file conf/config_phrase_splade.yaml.
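Since CASPER builds on the SPLADE codebase, which is configured through Hydra, individual settings can typically also be overridden from the command line instead of editing the YAML file. The override key below (config.checkpoint_dir) follows SPLADE's configuration convention; check conf/config_phrase_splade.yaml for the keys that actually apply here.

```bash
# Hydra-style command-line override (a SPLADE convention); the exact
# keys available depend on conf/config_phrase_splade.yaml
SPLADE_CONFIG_NAME="config_phrase_splade" \
CUDA_VISIBLE_DEVICES=*** \
python -m splade.train \
    config.checkpoint_dir=experiments/casper/checkpoint
```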
## FRIEREN

FRIEREN is a framework for generating training data for CASPER. The key idea behind FRIEREN, as mentioned above, is to leverage scholarly references as (pseudo) queries. We describe how to use it in this section.
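Each preparation script below writes training triplets to a TSV file, pairing a scholarly-reference pseudo query with a positive and a negative document. The example below is invented for illustration; the actual column layout of the raw.tsv files may differ.

```
# hypothetical citation-context triplet (tab-separated):
# pseudo_query <TAB> positive_document <TAB> negative_document
we follow the contrastive training setup of [CIT]	<title + abstract of the cited paper>	<title + abstract of an unrelated paper>
```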
### Download and preprocess the S2ORC corpus

```bash
# first go to frieren/casper/preprocessing
cd frieren/casper/preprocessing

python process_dataset.py \
    --api_key YOUR_SEMANTIC_SCHOLAR_API_KEY \
    --max_files 20 \
    --output_folder /YOUR/PREPROCESSED/S2ORC/FOLDER
```

This script will download, in this example, 20 shards (out of ~350 shards in total in the S2ORC corpus). In addition, it will also extract the necessary information from each of the shards:
```bash
# the downloaded shards
ls /YOUR/PREPROCESSED/S2ORC/FOLDER/s2orc_temp
>>> 0.gz 10.gz 11.gz 12.gz 13.gz 14.gz 15.gz 16.gz 17.gz 18.gz 19.gz 1.gz 2.gz 3.gz 4.gz 5.gz 6.gz 7.gz 8.gz 9.gz

# the preprocessed data
ls /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata
>>> 0.jsonl 10.jsonl 11.jsonl 12.jsonl 13.jsonl 14.jsonl 15.jsonl 16.jsonl 17.jsonl 18.jsonl 19.jsonl 1.jsonl 2.jsonl 3.jsonl 4.jsonl 5.jsonl 6.jsonl 7.jsonl 8.jsonl 9.jsonl
```

TODO: Update the instructions for getting paper metadata from the Semantic Scholar API, i.e. producing /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl
Next, we need to obtain metadata from the Semantic Scholar paper API. More specifically, we obtain the following fields for all paper ids:

- abstract
- title
- corpusId
- fieldsOfStudy
```bash
python get_metadata.py \
    --extracted_metadata_path /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
    --semantic_scholar_api_key YOUR_SEMANTIC_SCHOLAR_API_KEY \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
    --s2orc_raw_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/s2orc_temp/
```
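For reference, the fields above can also be fetched directly from the Semantic Scholar batch endpoint. This standalone sketch is not part of the repo's scripts, and get_metadata.py may do things differently (e.g., batching and rate-limit handling at scale):

```python
import requests

# fetch title/abstract/corpusId/fieldsOfStudy for a batch of paper ids
# (the repo's get_metadata.py handles this at scale; this is a sketch)
resp = requests.post(
    "https://api.semanticscholar.org/graph/v1/paper/batch",
    params={"fields": "title,abstract,corpusId,fieldsOfStudy"},
    headers={"x-api-key": "YOUR_SEMANTIC_SCHOLAR_API_KEY"},
    json={"ids": ["CorpusId:215416146", "ARXIV:2106.14807"]},
)
resp.raise_for_status()
for paper in resp.json():
    if paper is not None:  # unknown ids come back as null
        print(paper["corpusId"], paper["title"])
```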
### Process each data type

#### Citation contexts

```bash
cd frieren/casper/data_types/citation_contexts/s2orc

python process_dataset.py \
    --input_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/triplets_intermediate.tsv

python prepare_training_dataset.py \
    --input_file /YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/triplets_intermediate.tsv \
    --metadata_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/raw.tsv
```

#### Co-citations
```bash
cd frieren/casper/data_types/cocit/s2orc

python process_dataset.py \
    --input_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/triplets_intermediate.tsv

python prepare_training_dataset.py \
    --input_file /YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/triplets_intermediate.tsv \
    --metadata_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/raw.tsv
```

#### Author-assigned keyphrases

For author-assigned keyphrases, we utilize two keyphrase generation datasets, KP20k and KPBioMed:
```bash
cd frieren/casper/data_types/kp

mkdir /YOUR/PREPROCESSED/S2ORC/FOLDER/kp_triplets

python kp_datasets.py \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/kp_triplets/raw.tsv \
    --max_collections 1000000
```
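If you want to inspect the two source datasets themselves, hosted copies exist on the Hugging Face Hub. The dataset ids and config names below are assumptions on our part, and kp_datasets.py may load the data from elsewhere:

```python
from datasets import load_dataset

# hosted copies of the two keyphrase datasets; ids and configs are
# assumptions, and older loader scripts may need trust_remote_code=True
kp20k = load_dataset("midas/kp20k", "raw", split="train",
                     trust_remote_code=True)
kpbiomed = load_dataset("taln-ls2n/kpbiomed", "small", split="train",
                        trust_remote_code=True)
print(kp20k[0].keys())
```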
#### Titles

```bash
cd frieren/casper/data_types/title/s2orc

python process_dataset.py \
    --input_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/triplets_intermediate.tsv

python prepare_training_dataset.py \
    --input_file /YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/triplets_intermediate.tsv \
    --metadata_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/raw.tsv
```

#### User interaction data

We use SciRepEval Search:
```bash
cd frieren/casper/user_interaction/scirepeval_search

python prepare_training_dataset.py \
    --output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/query_triplets/raw.tsv
```

### Combine the triplets

Finally, we combine the triplets from these sources to form the final training dataset:

```bash
cd frieren/casper/combined_dataset
```

We first need to open frieren/casper/combined_dataset/combine_dataset.py and set:
```python
...
files = {
    "kp": "/YOUR/PREPROCESSED/S2ORC/FOLDER/kp_triplets/raw.tsv",
    "cocit": "/YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/raw.tsv",
    "title": "/YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/raw.tsv",
    "user_interaction": "/YOUR/PREPROCESSED/S2ORC/FOLDER/query_triplets/raw.tsv",
    "cc": "/YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/raw.tsv",
}
max_documents = {
    # full set
    "kp": 1500000,
    "cocit": 1500000,
    "title": 1500000,
    "user_interaction": 1500000,
    "cc": 1500000,
}
...
output_folder = f"/YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data/combined_{data_types_to_include_str}"
```

TODO: Adjust frieren/casper/combined_dataset/combine_dataset.py to use a configuration file instead of modifying the script directly.
Run the script to combine the triplets:

```bash
mkdir /YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data

python combine_dataset.py
```

The resulting training data should be at /YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data/combined_cc+cocit+kp+title+user_interaction/raw.tsv.
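To sanity-check the combined file, you can peek at a few rows. Whether the TSV has a header row, and what its columns are, is an assumption here:

```python
import pandas as pd

# peek at the combined triplets; header/column layout is an assumption
path = ("/YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data/"
        "combined_cc+cocit+kp+title+user_interaction/raw.tsv")
df = pd.read_csv(path, sep="\t", header=None, nrows=5)
print(df.shape)
print(df.head())
```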
## Evaluation

To run text retrieval evaluation:

- Download the required data from [DATA_URL] and save it in the data/ folder.
- If you want to see our CASPER++ experiments, download the BM25 models (BM25_MODEL_URL) and set the corresponding path in pyserini_evaluation/eval.sh: `BM25_MODELS_FOLDER=your_path`
- Update the absolute path of OUT_FOLDER in index_and_eval.sh: `export OUT_FOLDER="your_full_path"`
- Run:

```bash
bash index_and_eval.sh
```

To run keyphrase generation:
```bash
cd keyphrase_generation

bash infer_and_eval.sh
```

## Citation

If you find our work useful, please consider citing it as:
```bibtex
@article{do2025casper,
  title={CASPER: Concept-integrated Sparse Representation for Scientific Retrieval},
  author={Do, Lam Thanh and Van Nguyen, Linh and Fu, David and Chang, Kevin Chen-Chuan},
  journal={arXiv preprint arXiv:2508.13394},
  year={2025}
}
```
[1] Bramer, Wichor M., et al. "A systematic approach to searching: an efficient and complete method to develop literature searches." Journal of the Medical Library Association: JMLA 106.4 (2018): 531.

