CASPER: Concept-integrated Sparse Representation for Scientific Retrieval

CASPER is a sparse model for scientific search that uses keyphrases alongside tokens as representation units (i.e., each representation unit corresponds to a dimension in the sparse embedding space). This enables CASPER to represent queries and texts with both granular features and research concepts, which is important because scientific search is known to revolve around research concepts [1] (we search for papers about the specific research concepts we want to learn about).
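
To make this concrete, here is a toy illustration of a CASPER-style sparse vector; the dimension names and weights below are invented for exposition, not actual model output.

# toy example (invented values): a CASPER-style sparse embedding mixes
# token dimensions with keyphrase dimensions from the keyphrase vocabulary
query_embedding = {
    # token dimensions (BERT wordpiece vocabulary)
    "sparse": 1.3,
    "retrieval": 1.7,
    # keyphrase dimensions (keyphrase vocabulary V_k)
    "sparse representation": 2.4,
    "scientific retrieval": 2.1,
}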

CASPER Overview

We also propose FRIEREN, a framework that generates training data for scientific search models. The key idea behind FRIEREN is to leverage scholarly references, which we define as signals that capture how the research concepts of papers are expressed in different settings. We use four types of scholarly references, namely titles, citation contexts, author-assigned keyphrases, and co-citations, to augment user interaction data (the SciRepEval Search dataset in our work).

Scholarly References

We build CASPER on top of the SPLADE repository. Shout out to the authors of SPLADE.

Note: We are still cleaning up this repository. We hope to have the clean code and full instructions ready in 1 or 2 weeks (mid-September 2025).

Installation

The project uses Python 3.10.14.

cat requirements.txt | xargs -n 1 pip install

CASPER

Quick start

Please refer to inference_casper.ipynb for a quick test run.
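
For a rough idea of what inference looks like, below is a minimal SPLADE-style encoding sketch (CASPER builds on SPLADE); the checkpoint path is a placeholder, and the authoritative loading code is in inference_casper.ipynb.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = "/path/to/casper/checkpoint"  # placeholder: point this at your CASPER weights
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path)
model.eval()

inputs = tokenizer("sparse representations for scientific retrieval", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                          # (1, seq_len, vocab_size)

# SPLADE-style aggregation: log-saturated ReLU, max-pooled over token positions
mask = inputs["attention_mask"].unsqueeze(-1)                # (1, seq_len, 1)
weights = (torch.log1p(torch.relu(logits)) * mask).max(dim=1).values.squeeze(0)

# inspect the strongest dimensions (tokens and, in CASPER, keyphrases)
top = torch.topk(weights, k=10)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(tokenizer.convert_ids_to_tokens(idx), round(score, 3))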

Scientific Corpus $\mathcal{D}$

The scientific corpus $\mathcal{D}$ is used to construct the keyphrase vocabulary $V_k$ and to perform continuous pretraining of BERT (after the keyphrases from $V_k$ are added to a BERT model as new "tokens").

$\mathcal{D}$ contains concatenated titles and abstracts of scientific articles from the S2ORC dataset. We use the preprocessed version provided by sentence-transformers. This version contains titles and abstracts of over 40M articles, from which we randomly sampled 10M (with random seed 0) to form $\mathcal{D}$.
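
The sampling itself is simple; a minimal sketch is below, assuming all_docs holds the ~40M concatenated "title. abstract" strings (a tiny placeholder list here).

import random

all_docs = ["title one. abstract one", "title two. abstract two"]  # placeholder for ~40M docs
random.seed(0)                                     # the seed used in our work
sample_size = min(10_000_000, len(all_docs))
D = random.sample(all_docs, sample_size)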

Extracting Keyphrases

We employ ERU-KG, an unsupervised keyphrase generation/extraction model, to extract keyphrases for all documents in $\mathcal{D}$. To do so, run concept_vocab/keyphrase_extraction/create_phrase_vocab_from_s2orc.py as follows:

# The scientific corpus is split into slices (indicated by num_slices)
# In this example, we run the script with current_slice_index = 0 to 7
# One can run the script sequentially over the slices, or in parallel in different screens
CUDA_VISIBLE_DEVICES=*** python create_phrase_vocab_from_s2orc.py \
--num_slices 8 \
--current_slice_index 0 \
--output_folder /keyphrase/extraction/output/folder \
--top_k_candidates 5        # number of keyphrases extracted for each document

Once this is done, the extracted keyphrases can be found in /keyphrase/extraction/output/folder; they are then used to build the keyphrase vocabulary $V_k$:

ls /keyphrase/extraction/output/folder
>>> 0.json  1.json  2.json  3.json  4.json  5.json  6.json  7.json

Keyphrase Vocabulary

To build the keyphrase vocabulary $V_k$, run the script concept_vocab/greedy_vocab_builder.py as follows:

python greedy_vocab_builder.py \
--input_folder /keyphrase/extraction/output/folder \
--num_phrases 30000 \
--output_file keyphrase/vocab.json
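
Conceptually, the builder keeps the most useful phrases up to the --num_phrases budget. Below is a hedged sketch that greedily keeps the most frequent extracted keyphrases; the slice file format and the actual greedy criterion in greedy_vocab_builder.py are assumptions.

import glob
import json
from collections import Counter

counts = Counter()
for path in glob.glob("/keyphrase/extraction/output/folder/*.json"):
    with open(path) as f:
        slice_data = json.load(f)                  # assumed format: {doc_id: [keyphrases]}
    for phrases in slice_data.values():
        counts.update(phrases)

# greedy selection by frequency (an assumption), capped at --num_phrases
vocab = [phrase for phrase, _ in counts.most_common(30000)]
with open("keyphrase/vocab.json", "w") as f:
    json.dump(vocab, f)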

Continuous Pretraining

Please refer to the mlm_training branch of this repository.
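
That branch covers the training itself; the vocabulary-merging step it relies on can be sketched with Hugging Face transformers as below (the vocab file is assumed to be a JSON list of phrases).

import json
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

with open("keyphrase/vocab.json") as f:            # output of greedy_vocab_builder.py
    keyphrases = json.load(f)

num_added = tokenizer.add_tokens(list(keyphrases)) # each keyphrase becomes a single "token"
model.resize_token_embeddings(len(tokenizer))      # new embedding rows to be pretrained
print(f"added {num_added} keyphrase tokens")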

Training CASPER

After continuous pretraining, we train CASPER as follows:

SPLADE_CONFIG_NAME="config_phrase_splade" \
CUDA_VISIBLE_DEVICES=*** \
python -m splade.train

Please make sure to adjust the hyperparameters in the configuration file conf/config_phrase_splade.yaml.

FRIEREN

FRIEREN is a framework for generating training data for CASPER. The key idea behind FRIEREN, as mentioned above, is to leverage scholarly references as (pseudo) queries. This section describes how to use it.

[Download and preprocess S2ORC corpus]

# first go to frieren/casper/preprocessing
cd frieren/casper/preprocessing

python process_dataset.py \
--api_key YOUR_SEMANTIC_SCHOLAR_API_KEY \
--max_files 20 \
--output_folder /YOUR/PREPROCESSED/S2ORC/FOLDER

This script downloads, in this example, 20 shards (out of ~350 shards in total in the S2ORC corpus) and extracts the necessary information from each shard:

# the downloaded shards
ls /YOUR/PREPROCESSED/S2ORC/FOLDER/s2orc_temp
>>> 0.gz  10.gz  11.gz  12.gz  13.gz  14.gz  15.gz  16.gz  17.gz  18.gz  19.gz  1.gz  2.gz  3.gz  4.gz  5.gz  6.gz  7.gz  8.gz  9.gz

# the preprocessed data
ls /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata
>>> 0.jsonl  10.jsonl  11.jsonl  12.jsonl  13.jsonl  14.jsonl  15.jsonl  16.jsonl  17.jsonl  18.jsonl  19.jsonl  1.jsonl  2.jsonl  3.jsonl  4.jsonl  5.jsonl  6.jsonl  7.jsonl  8.jsonl  9.jsonl

TODO: Need to update the instructions for getting paper metadata from the Semantic Scholar API, i.e. producing /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl

Next, we need to obtain metadata from the Semantic Scholar paper API. More specifically, we obtain the following fields for all paper ids (a sketch of the API call follows the command below):

  • abstract
  • title
  • corpusId
  • fieldsOfStudy
python get_metadata.py --extracted_metadata_path /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
--semantic_scholar_api_key YOUR_SEMANTIC_SCHOLAR_API_KEY \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
--s2orc_raw_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/s2orc_temp/
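
For reference, here is a hedged sketch of fetching these fields via the Semantic Scholar Graph API batch endpoint; get_metadata.py may batch, paginate, and rate-limit differently, and the paper ids below are placeholders.

import requests

api_key = "YOUR_SEMANTIC_SCHOLAR_API_KEY"
paper_ids = ["CorpusId:12345678", "CorpusId:23456789"]    # placeholders; up to 500 ids per call

resp = requests.post(
    "https://api.semanticscholar.org/graph/v1/paper/batch",
    params={"fields": "title,abstract,corpusId,fieldsOfStudy"},
    headers={"x-api-key": api_key},
    json={"ids": paper_ids},
)
resp.raise_for_status()
for paper in resp.json():
    if paper is not None:                                  # unresolved ids come back as null
        print(paper["corpusId"], paper["title"])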

[Process each data type]

  • Citation contexts
cd frieren/casper/data_types/citation_contexts/s2orc


python process_dataset.py \
--input_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/triplets_intermediate.tsv

python prepare_training_dataset.py \
--input_file /YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/triplets_intermediate.tsv \
--metadata_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/raw.tsv
  • Co-citations
cd frieren/casper/data_types/cocit/s2orc

python process_dataset.py \
--input_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/triplets_intermediate.tsv

python prepare_training_dataset.py \
--input_file /YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/triplets_intermediate.tsv \
--metadata_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/raw.tsv
  • Author-assigned keyphrases
For author-assigned keyphrases, we utilize two keyphrase generation datasets, KP20k and KPBioMed:
cd frieren/casper/data_types/kp

mkdir /YOUR/PREPROCESSED/S2ORC/FOLDER/kp_triplets
python kp_datasets.py \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/kp_triplets/raw.tsv \
--max_collections 1000000
  • Titles
cd frieren/casper/data_types/title/s2orc

python process_dataset.py \
--input_folder /YOUR/PREPROCESSED/S2ORC/FOLDER/extracted_metadata \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/triplets_intermediate.tsv

python prepare_training_dataset.py \
--input_file /YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/triplets_intermediate.tsv \
--metadata_file /YOUR/PREPROCESSED/S2ORC/FOLDER/metadata_from_api/metadata_from_api.jsonl \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/raw.tsv
  • User interactions (SciRepEval Search)
cd frieren/casper/user_interaction/scirepeval_search

python prepare_training_dataset.py \
--output_file /YOUR/PREPROCESSED/S2ORC/FOLDER/query_triplets/raw.tsv

Finally, we combine the triplets from these sources to form the final training dataset:

cd frieren/casper/combined_dataset

Then, edit frieren/casper/combined_dataset/combine_dataset.py and set:

...
files = {
        "kp": "/YOUR/PREPROCESSED/S2ORC/FOLDER/kp_triplets/raw.tsv",
        "cocit": "/YOUR/PREPROCESSED/S2ORC/FOLDER/cocit_triplets/raw.tsv",
        "title": "/YOUR/PREPROCESSED/S2ORC/FOLDER/title_abstract_triplets/raw.tsv", 
        "user_interaction": "/YOUR/PREPROCESSED/S2ORC/FOLDER/query_triplets/raw.tsv",
        "cc": "/YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/raw.tsv",
}

max_documents = {
  # full set
  "kp": 1500000,
  "cocit": 1500000,
  "title": 1500000,
  "user_interaction": 1500000,
  "cc": 1500000
}

...

output_folder = f"/YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data/combined_{data_types_to_include_str}"

TODO: Adjust frieren/casper/combined_dataset/combine_dataset.py to use a configuration file instead of modifying the script directly.

Run the script to combine the triplets

mkdir /YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data
python combine_dataset.py

The training data will be located at /YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data/combined_cc+cocit+kp+title+user_interaction/raw.tsv.
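
Conceptually, the combination step caps each source at its max_documents budget and concatenates the triplet files; a hedged two-source sketch is below (the real combine_dataset.py may also shuffle or deduplicate).

import os

files = {
    "kp": "/YOUR/PREPROCESSED/S2ORC/FOLDER/kp_triplets/raw.tsv",
    "cc": "/YOUR/PREPROCESSED/S2ORC/FOLDER/citation_contexts_triplets/raw.tsv",
}
max_documents = {"kp": 1_500_000, "cc": 1_500_000}
output_file = "/YOUR/PREPROCESSED/S2ORC/FOLDER/combined_training_data/combined_cc+kp/raw.tsv"

os.makedirs(os.path.dirname(output_file), exist_ok=True)
with open(output_file, "w") as out:
    for name, path in files.items():
        with open(path) as f:
            for i, line in enumerate(f):
                if i >= max_documents[name]:               # per-source cap
                    break
                out.write(line)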

Evaluation

Text Retrieval

To run text retrieval evaluation:

  1. Download the required data from [DATA_URL] and save it in the data/ folder.

  2. If you want to reproduce our CASPER++ experiments, download the BM25 models from BM25_MODEL_URL and update the path in pyserini_evaluation/eval.sh: BM25_MODELS_FOLDER=your_path

  3. Update the absolute output path in index_and_eval.sh: export OUT_FOLDER="your_full_path"

bash index_and_eval.sh

Keyphrase Generation

To run keyphrase generation:

cd keyphrase_generation
bash infer_and_eval.sh

Cite our paper

If you find our work useful, please consider citing it as:

@article{do2025casper,
  title={CASPER: Concept-integrated Sparse Representation for Scientific Retrieval},
  author={Do, Lam Thanh and Van Nguyen, Linh and Fu, David and Chang, Kevin Chen-Chuan},
  journal={arXiv preprint arXiv:2508.13394},
  year={2025}
}

References

[1] Bramer, Wichor M., et al. "A systematic approach to searching: an efficient and complete method to develop literature searches." Journal of the Medical Library Association: JMLA 106.4 (2018): 531.
