DatasetSAGE (LLM-Agent Open-Source Implementation)

This repository contains a concise, reviewer-friendly implementation of: DatasetSAGE: A Structure-aware Multi-Agent Framework for Scientific Dataset Discovery with Grounded Evidence.

This version is a agent pipeline and keeps the paper's key mechanisms while remaining readable.

What This Implementation Preserves

Usage-record evidence cards as the retrieval and reasoning unit.
Four agents:
- Retrieval Agent
- Reason Agent
- Review Agent
- Re-ranking Agent
Three retrieval channels:
- Dense RAG retrieval over evidence-card embeddings
- Structure-aware graph-contrastive retrieval from paper-dataset usage graph
- Open-world LLM generation channel
Two execution modes:
- Open-loop: Retrieval -> Reason -> Re-ranking
- Closed-loop: Retrieval -> Reason -> Review -> Retrieval ... -> Re-ranking

Repository Layout

datasetsage/models.py: core data contracts.
datasetsage/retrieval.py: multi-channel retrieval logic.
datasetsage/agents.py: LLM-driven reason/review/re-rank agents.
datasetsage/prompts.py: paper-appendix prompt templates.
datasetsage/llm.py: OpenAI-compatible client wrapper.
datasetsage/pipeline.py: open-loop and closed-loop orchestration.
scripts/run_demo.py: runnable demo entrypoint.
data/examples/recommendation_samples_20.json: 20-paper sampled result showcase with ground truth and final recommendations.

Quickstart

Run from repository root:

export OPENAI_API_KEY="your_key"
python scripts/run_demo.py \
  --records /path/to/usage_records.json \
  --query /path/to/query_paper.json

Optional arguments:

python scripts/run_demo.py \
  --rounds 2 \
  --top-k 5 \
  --chat-model gpt-4o-mini \
  --embedding-model text-embedding-3-small \
  --output demo_result.json

Input Formats

usage_records.json (list of usage edges):

{
  "record_id": "rec_001",
  "source_paper_id": "p_001",
  "source_title": "Paper title",
  "dataset_entity": "coco",
  "dataset_name": "COCO",
  "task": "image captioning",
  "modality": "image",
  "evidence": "Paper usage evidence sentence"
}

query_paper.json:

{
  "paper_id": "q_001",
  "title": "Your query paper title",
  "abstract": "Your query paper abstract",
  "task_goal": "Optional explicit goal"
}

Real Agent Mechanics

RAG (Dense): embeds query/evidence-card texts and retrieves by cosine similarity.
Structure-aware retrieval (Graph): uses LightGCN-style graph propagation to build dataset structural embeddings, then trains a query projection with InfoNCE-style contrastive learning and retrieves top-$L$ structural matches.
Open-world channel: uses LLM JSON output to propose extra candidates and maps them to canonical dataset entities.
Reason/Review/Re-ranking: all three are LLM calls that return structured JSON decisions.
Closed-loop behavior: round 0 uses Dense + Graph + Open-World; round 1+ re-invokes only Dense with review-refined query (paper-aligned).

Paper-Aligned Defaults

Dense Top-$K$: 50
Graph Top-$K$: 30
Open-World Top-$K$: 20
Reason threshold: 0.2
Closed-loop rounds: 2
Final output Top-$K$: 30
Embedding model: text-embedding-3-small
LLM model: qwen-plus
LLM temperature: 0.3

Notes

In the future, we also consider packaging this into skills and releasing it to the agent community.

Requires valid OpenAI-compatible API credentials.
This implementation is designed for reproducibility and readability in peer review while preserving real agent behavior.
This is the current open-source implementation for the paper; we plan to release a more complete production-grade system in the future.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data/examples		data/examples
datasetsage		datasetsage
pic		pic
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DatasetSAGE (LLM-Agent Open-Source Implementation)

What This Implementation Preserves

Repository Layout

Quickstart

Input Formats

Real Agent Mechanics

Paper-Aligned Defaults

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DatasetSAGE (LLM-Agent Open-Source Implementation)

What This Implementation Preserves

Repository Layout

Quickstart

Input Formats

Real Agent Mechanics

Paper-Aligned Defaults

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages