Our implementation is based on the official ULTRA codebase, which we extended to integrate the proposed dual-stream architecture, including the LLM-based relation enrichment, GrTEXT construction, and the structural-textual fusion module.
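For intuition, the following is a minimal, illustrative PyTorch sketch of what the structural-textual fusion step can look like when the `embedding_combiner` flag (described below) is set to `mlp`. The class and argument names are hypothetical and do not mirror the actual code; see the source for the real implementation.

```python
import torch
import torch.nn as nn


class RelationFusion(nn.Module):
    """Illustrative fusion of structural and textual relation embeddings.

    A simplified sketch, not the actual SEMMA module: it concatenates the two
    streams and projects them back to the model dimension with a small MLP,
    mirroring the `mlp` option of the `embedding_combiner` flag.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, struct_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # struct_emb: relation embeddings from the structural stream (ULTRA-style)
        # text_emb:   relation embeddings from the textual stream (GrTEXT)
        return self.mlp(torch.cat([struct_emb, text_emb], dim=-1))


# Example: fuse 64-dimensional embeddings for 10 relations.
fusion = RelationFusion(dim=64)
fused = fusion(torch.randn(10, 64), torch.randn(10, 64))
print(fused.shape)  # torch.Size([10, 64])
```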
This repository is built on PyTorch 2.1 and PyTorch Geometric (PyG) 2.4. SEMMA is implemented with Python 3.9 and requires CUDA 11.8 or later when running on GPUs. You may install the dependencies via either conda or pip.
- Create and activate a conda environment (recommended):

  ```bash
  conda create -n semma python=3.9
  conda activate semma
  ```
- Run the setup script. It installs all dependencies from `requirements.txt` and downloads the `fb_mid2name.tsv` file, which is used for mapping Freebase MIDs to names:

  ```bash
  chmod +x setup.sh  # make sure the script is executable
  bash setup.sh
  ```
The primary scripts for running experiments are located in the `script/` directory, with example usage provided below.

To pretrain a model, use `script/pretrain.py` with a corresponding configuration file. For example:

```bash
python script/pretrain.py -c config/transductive/pretrain_3g.yaml --gpus [0]
```

Refer to `config/` for various pretraining setups.
For running inference (evaluation) on pretrained models:
- Transductive setting: use `script/run.py` or `script/run_many.py` with a transductive config file. Example:

  ```bash
  python script/run.py -c config/transductive/inference-fb.yaml --dataset FB15k237_10 --epochs 0 --bpe null --ckpt <path_to_your_checkpoint.pth> --gpus [0]
  ```

- Inductive setting: similarly, use `script/run.py` or `script/run_many.py` with an inductive config file. Example:

  ```bash
  python script/run.py -c config/inductive/inference.yaml --dataset Metafam --version 1 --ckpt <path_to_your_checkpoint.pth> --gpus [0]
  ```
The configuration files for different experiments (e.g., transductive, inductive, specific datasets) are located in the `config/` directory.
We provide pretrained model checkpoints in the `ckpts/` directory. Notably, `semma.pth` is the checkpoint for our proposed SEMMA model. You can use these checkpoints directly for inference or fine-tuning.
The `openrouter/` directory contains scripts and resources for generating and utilizing LLM-based relation descriptions:

- `prompt.py` and `prompt_async.py`: scripts to query LLMs for relation descriptions.
- `descriptions/`: generated descriptions from the different LLMs used in our study.
- `relations/`: the lists of relations for all datasets involved in this study.
If you intend to use the scripts in `openrouter/` to query OpenAI (or other providers via OpenRouter) for generating new relation descriptions, you will need to set up API access. Create a `.env` file in the root of the project with your API key:

```
OPENAI_API_KEY="your_openai_api_key_here"
# or other relevant keys if using different providers through OpenRouter
```
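As a rough illustration of how such a query could be issued against OpenRouter's OpenAI-compatible endpoint (this is a hedged sketch, not the contents of `prompt.py`; the model id, prompt, and example relation are placeholders):

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load the API key from the .env file created above.
load_dotenv()

# OpenRouter exposes an OpenAI-compatible API at the base URL below.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)

relation = "/film/film/genre"  # illustrative relation
response = client.chat.completions.create(
    model="openai/gpt-4o",  # illustrative model id
    messages=[
        {
            "role": "user",
            "content": f"In one sentence, describe the knowledge graph relation '{relation}'.",
        }
    ],
)
print(response.choices[0].message.content)
```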
The `takeaway3/` directory is dedicated to experiments under a "harder" evaluation setting, involving unseen relations in test queries.
- It contains the dataset used for this harder setting (within the `mtdea/` subdirectory).
- `gen_split-1.py` and `gen_split-2.py`: scripts for generating the specific data splits required for this setting.
The `flags.yaml` file controls various aspects of the SEMMA model and experimental runs. Here's a breakdown of the key flags:

- `run`: Specifies the model to run. Can be `ultra` (baseline) or `semma` (our proposed model). If `semma` is chosen, the subsequent flags related to the SEMMA architecture are used.
- `LLM`: The Large Language Model used for relation enrichment.
  - Options: `gpt4o`, `qwen3-32b`, `deepseekv3`.
- `rg2_embedding`: Defines how the textual relation embeddings (GrTEXT) are constructed.
  - Options:
    - `combined`: takes the average of the embeddings obtained from `no llm`, `llm name`, and `llm description`.
    - `combined-sum`: takes the sum of the embeddings obtained from `no llm`, `llm name`, and `llm description`.
    - `no llm`: excludes LLM-generated features.
    - `llm name`: uses only the relation name embedding.
    - `llm description`: uses only the relation description embedding.
- `model_embed`: The embedding model used to encode relation names/descriptions.
  - Options: `sentbert` (Sentence-BERT), `jinaai` (Jina AI embeddings).
- `topx`: A float between 0 and 1 indicating the top x% of all relation pairs (ranked by textual similarity) for which to add an edge in GrTEXT; `0` might imply using a threshold instead (see the sketch below).
- `threshold`: A float (e.g., 0.8). The cosine-similarity threshold for constructing GrTEXT.
- `embedding_combiner`: Method used to combine structural and textual embeddings in the fusion module.
  - Options: `mlp` (Multi-Layer Perceptron), `concat` (concatenation), `attention`.
- `eval_on_valid`: Boolean (True/False). If `True`, evaluation is also performed on the validation set during training or an inference run.
- `use_cos_sim_weights`: Boolean (True/False). If `True`, the fifth edge type (textual-similarity edges) is weighted by its cosine-similarity scores.
- `gpus`: Specifies the GPU ID(s) to use for training/inference (e.g., `0`, `[0, 1]`).
- `harder_setting`: Boolean (True/False). If `True`, the model is configured for the "harder" evaluation setting, using data from `takeaway3/`, which may involve relations not seen during pretraining.
Adjust these flags in `flags.yaml` to configure your experiments according to your needs.
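To make the `topx`, `threshold`, and `use_cos_sim_weights` flags concrete, here is a small, hypothetical sketch of how textual-similarity edges for GrTEXT could be selected from relation-text embeddings. The embedding checkpoint, example relations, and selection logic are illustrative assumptions, not the repository's exact implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative relation texts (names or LLM-generated descriptions).
relation_texts = [
    "place of birth",
    "place of death",
    "film genre",
    "music genre",
]

# `sentbert`-style encoding with a Sentence-BERT model (the checkpoint below is
# an assumption, not necessarily the one used by SEMMA).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(relation_texts, normalize_embeddings=True)

# Cosine similarity of all relation pairs (embeddings are already normalized).
sim = emb @ emb.T
iu = np.triu_indices(len(relation_texts), k=1)
pair_sims = sim[iu]

topx, threshold = 0.25, 0.8
if topx > 0:
    # Keep the top x% most similar relation pairs.
    k = max(1, int(round(topx * len(pair_sims))))
    keep = np.argsort(pair_sims)[::-1][:k]
else:
    # Fall back to an absolute cosine-similarity threshold.
    keep = np.where(pair_sims >= threshold)[0]

# Each kept pair becomes a textual-similarity edge; with `use_cos_sim_weights`
# enabled, the cosine similarity would serve as the edge weight.
edges = [(int(iu[0][i]), int(iu[1][i]), float(pair_sims[i])) for i in keep]
print(edges)
```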
- Add SEMMA Hybrid code
```bibtex
@misc{arun2025semmasemanticawareknowledge,
  title={SEMMA: A Semantic Aware Knowledge Graph Foundation Model},
  author={Arvindh Arun and Sumit Kumar and Mojtaba Nayyeri and Bo Xiong and Ponnurangam Kumaraguru and Antonio Vergari and Steffen Staab},
  year={2025},
  eprint={2505.20422},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20422},
}
```