Our implementation is based on the official ULTRA codebase, which we extended to integrate the proposed dual-stream architecture, including the LLM-based relation enrichment, GrTEXT construction, and the structural-textual fusion module.
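For intuition, the following is a minimal, illustrative PyTorch sketch of what the structural-textual fusion step can look like when the `embedding_combiner` flag (described below) is set to `mlp`. The class and argument names are hypothetical and do not mirror the actual code; see the source for the real implementation.

```python
import torch
import torch.nn as nn


class RelationFusion(nn.Module):
    """Illustrative fusion of structural and textual relation embeddings.

    A simplified sketch, not the actual SEMMA module: it concatenates the two
    streams and projects them back to the model dimension with a small MLP,
    mirroring the `mlp` option of the `embedding_combiner` flag.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, struct_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # struct_emb: relation embeddings from the structural stream (ULTRA-style)
        # text_emb:   relation embeddings from the textual stream (GrTEXT)
        return self.mlp(torch.cat([struct_emb, text_emb], dim=-1))


# Example: fuse 64-dimensional embeddings for 10 relations.
fusion = RelationFusion(dim=64)
fused = fusion(torch.randn(10, 64), torch.randn(10, 64))
print(fused.shape)  # torch.Size([10, 64])
```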
This repository is built on PyTorch 2.1 and PyTorch Geometric (PyG) 2.4. SEMMA is implemented with Python 3.9 and requires CUDA 11.8 or later when running on GPUs. You may install the dependencies via either conda or pip.
- Create and activate a conda environment (recommended):

  ```bash
  conda create -n semma python=3.9
  conda activate semma
  ```
- Run the setup script. It installs all dependencies from `requirements.txt` and downloads the `fb_mid2name.tsv` file, which is used for mapping Freebase MIDs to names:

  ```bash
  chmod +x setup.sh  # make sure the script is executable
  bash setup.sh
  ```
The primary scripts for running experiments are located in the `script/` directory, with example usage provided below.

To pretrain a model, use `script/pretrain.py` with a corresponding configuration file. For example:

```bash
python script/pretrain.py -c config/transductive/pretrain_3g.yaml --gpus [0]
```

Refer to `config/` for various pretraining setups.
For running inference (evaluation) on pretrained models:
- Transductive setting: use `script/run.py` or `script/run_many.py` with a transductive config file. Example:

  ```bash
  python script/run.py -c config/transductive/inference-fb.yaml --dataset FB15k237_10 --epochs 0 --bpe null --ckpt <path_to_your_checkpoint.pth> --gpus [0]
  ```

- Inductive setting: similarly, use `script/run.py` or `script/run_many.py` with an inductive config file. Example:

  ```bash
  python script/run.py -c config/inductive/inference.yaml --dataset Metafam --version 1 --ckpt <path_to_your_checkpoint.pth> --gpus [0]
  ```
The configuration files for different experiments (e.g., transductive, inductive, specific datasets) are located in the `config/` directory.
We provide pretrained model checkpoints in the `ckpts/` directory. Notably, `semma.pth` is the checkpoint for our proposed SEMMA model. You can use these checkpoints directly for inference or fine-tuning.
The `openrouter/` directory contains scripts and resources for generating and utilizing LLM-based relation descriptions:

- `prompt.py` and `prompt_async.py`: scripts to query LLMs for relation descriptions.
- `descriptions/`: generated descriptions from the different LLMs used in our study.
- `relations/`: the lists of relations for all datasets involved in this study.
If you intend to use the scripts in `openrouter/` to query OpenAI (or other providers via OpenRouter) for generating new relation descriptions, you will need to set up API access. Create a `.env` file in the root of the project with your API key:

```
OPENAI_API_KEY="your_openai_api_key_here"
# or other relevant keys if using different providers through OpenRouter
```
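As a rough illustration of how such a query could be issued against OpenRouter's OpenAI-compatible endpoint (this is a hedged sketch, not the contents of `prompt.py`; the model id, prompt, and example relation are placeholders):

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load the API key from the .env file created above.
load_dotenv()

# OpenRouter exposes an OpenAI-compatible API at the base URL below.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)

relation = "/film/film/genre"  # illustrative relation
response = client.chat.completions.create(
    model="openai/gpt-4o",  # illustrative model id
    messages=[
        {
            "role": "user",
            "content": f"In one sentence, describe the knowledge graph relation '{relation}'.",
        }
    ],
)
print(response.choices[0].message.content)
```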
The `takeaway3/` directory is dedicated to experiments under a "harder" evaluation setting, involving unseen relations in test queries.
- It contains the dataset used for this harder setting (within the `mtdea/` subdirectory).
- `gen_split-1.py` and `gen_split-2.py`: scripts for generating the specific data splits required for this setting.
The `flags.yaml` file controls various aspects of the SEMMA model and experimental runs. Here's a breakdown of the key flags:

- `run`: Specifies the model to run. Can be `ultra` (baseline) or `semma` (our proposed model). If `semma` is chosen, the subsequent flags related to the SEMMA architecture are used.
- `LLM`: The Large Language Model used for relation enrichment.
  - Options: `gpt4o`, `qwen3-32b`, `deepseekv3`.
- `rg2_embedding`: Defines how the textual relation embeddings (GrTEXT) are constructed.
  - Options:
    - `combined`: takes the average of the embeddings obtained from `no llm`, `llm name`, and `llm description`.
    - `combined-sum`: takes the sum of the embeddings obtained from `no llm`, `llm name`, and `llm description`.
    - `no llm`: excludes LLM-generated features.
    - `llm name`: uses only the relation name embedding.
    - `llm description`: uses only the relation description embedding.
- `model_embed`: The embedding model used to encode relation names/descriptions.
  - Options: `sentbert` (Sentence-BERT), `jinaai` (Jina AI embeddings).
- `topx`: A float between 0 and 1 indicating the top x% of all relation pairs (ranked by textual similarity) for which to add an edge in GrTEXT; `0` might imply using a threshold instead (see the sketch below).
- `threshold`: A float (e.g., 0.8). The cosine-similarity threshold for constructing GrTEXT.
- `embedding_combiner`: Method used to combine structural and textual embeddings in the fusion module.
  - Options: `mlp` (Multi-Layer Perceptron), `concat` (concatenation), `attention`.
- `eval_on_valid`: Boolean (True/False). If `True`, evaluation is also performed on the validation set during training or an inference run.
- `use_cos_sim_weights`: Boolean (True/False). If `True`, the fifth edge type (textual-similarity edges) is weighted by its cosine-similarity scores.
- `gpus`: Specifies the GPU ID(s) to use for training/inference (e.g., `0`, `[0, 1]`).
- `harder_setting`: Boolean (True/False). If `True`, the model is configured for the "harder" evaluation setting, using data from `takeaway3/`, which may involve relations not seen during pretraining.
Adjust these flags in `flags.yaml` to configure your experiments according to your needs.
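To make the `topx`, `threshold`, and `use_cos_sim_weights` flags concrete, here is a small, hypothetical sketch of how textual-similarity edges for GrTEXT could be selected from relation-text embeddings. The embedding checkpoint, example relations, and selection logic are illustrative assumptions, not the repository's exact implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative relation texts (names or LLM-generated descriptions).
relation_texts = [
    "place of birth",
    "place of death",
    "film genre",
    "music genre",
]

# `sentbert`-style encoding with a Sentence-BERT model (the checkpoint below is
# an assumption, not necessarily the one used by SEMMA).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(relation_texts, normalize_embeddings=True)

# Cosine similarity of all relation pairs (embeddings are already normalized).
sim = emb @ emb.T
iu = np.triu_indices(len(relation_texts), k=1)
pair_sims = sim[iu]

topx, threshold = 0.25, 0.8
if topx > 0:
    # Keep the top x% most similar relation pairs.
    k = max(1, int(round(topx * len(pair_sims))))
    keep = np.argsort(pair_sims)[::-1][:k]
else:
    # Fall back to an absolute cosine-similarity threshold.
    keep = np.where(pair_sims >= threshold)[0]

# Each kept pair becomes a textual-similarity edge; with `use_cos_sim_weights`
# enabled, the cosine similarity would serve as the edge weight.
edges = [(int(iu[0][i]), int(iu[1][i]), float(pair_sims[i])) for i in keep]
print(edges)
```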
- Add SEMMA Hybrid code
```bibtex
@misc{arun2025semmasemanticawareknowledge,
  title={SEMMA: A Semantic Aware Knowledge Graph Foundation Model},
  author={Arvindh Arun and Sumit Kumar and Mojtaba Nayyeri and Bo Xiong and Ponnurangam Kumaraguru and Antonio Vergari and Steffen Staab},
  year={2025},
  eprint={2505.20422},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20422},
}
```