Commit 27d20bb

lqdev and Copilot committed
feat: complete project scaffold, extraction pipeline, tests, API, and CI
Phase 1: Repo scaffold
- ontology/ (context.jsonld, kb.ttl, shapes.ttl)
- Sample article (content/2026/04/what-is-a-knowledge-graph.md)
- Project config (pyproject.toml, requirements.txt files)
- Azure Functions + SWA config

Phase 2: Extraction pipeline
- tools/chunker.py: Deterministic Markdown chunking with sha256 IDs
- tools/llm_client.py: GitHub Models client (OpenAI SDK, caching, backoff)
- tools/postprocess.py: Entity canonicalization, dedup, JSON-LD/Turtle output
- tools/kg_build.py: Build orchestrator (git diff detection, batching)
- tools/prompts/extract_rdf_v1.txt: System prompt with ontology + few-shot examples

Phase 3: Tests (30/30 passing)
- test_chunker.py: Frontmatter, determinism, token targets
- test_postprocess.py: Slugify, canonicalize, dedup, serialization
- test_golden.py: End-to-end sample article chunking
- test_shacl.py: SHACL validation (valid + invalid graphs)

Phase 4: Azure deployment
- api/function_app.py: SPARQL endpoint (RDFLib, module-level caching)

Phase 5: CI/CD workflows
- kg-build.yml: Extract KG on push (models:read, contents:write)
- deploy-swa.yml: Deploy to Azure SWA

Phase 6: Documentation
- README.md with quickstart, SPARQL examples, architecture

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 02688b6 commit 27d20bb

27 files changed: 1837 additions & 0 deletions

.github/workflows/deploy-swa.yml

Lines changed: 31 additions & 0 deletions
```yaml
name: Deploy to Azure Static Web Apps
on:
  push:
    branches: [main]
    paths:
      - 'app/**'
      - 'api/**'
      - 'graph/**'
  workflow_dispatch:

permissions:
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v5

      - name: Deploy to Azure SWA
        uses: Azure/static-web-apps-deploy@v1
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN }}
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          action: upload
          app_location: /app
          api_location: /api
          output_location: ""
          skip_app_build: true
          skip_api_build: false
```

.github/workflows/kg-build.yml

Lines changed: 60 additions & 0 deletions
```yaml
name: KG Build — Extract Knowledge Graph
on:
  push:
    branches: [main]
    paths:
      - 'content/**'
      - 'ontology/**'
      - 'tools/**'
      - 'tests/**'
  workflow_dispatch:

permissions:
  contents: write
  models: read

jobs:
  build-graph:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v5
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r tools/requirements.txt

      - name: Run tests
        run: python -m pytest tests/ -v --tb=short

      - name: Build knowledge graph
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          LLM_MODEL: openai/gpt-4o-mini
        run: python -m tools.kg_build --repo-root . --base-url https://example.com

      - name: Check for changes
        id: changes
        run: |
          git add graph/
          if git diff --cached --quiet; then
            echo "changed=false" >> $GITHUB_OUTPUT
          else
            echo "changed=true" >> $GITHUB_OUTPUT
          fi

      - name: Commit graph artifacts
        if: steps.changes.outputs.changed == 'true'
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add graph/
          git commit -m "chore: update knowledge graph [skip ci]"
          git push
```
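The "Build knowledge graph" step delegates batching to `tools/kg_build.py`, which (per the README) groups 3-5 chunks per LLM request to stay under the input token limit. A minimal sketch of that grouping, assuming a simple fixed batch size; `batch_chunks` is a hypothetical helper, not the actual kg_build API, and the real orchestrator may also weigh token counts:

```python
# Sketch: group chunks into consecutive batches of at most 4 (within the
# 3-5 range the README cites) so each LLM request stays under the limit.
from typing import List


def batch_chunks(chunks: List[str], batch_size: int = 4) -> List[List[str]]:
    """Split chunks into consecutive groups of at most `batch_size`."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


batches = batch_chunks([f"chunk-{n}" for n in range(10)])
# 10 chunks split into groups of 4, 4, and 2
```

With 150 requests/day on the free tier, batching at 4 chunks per request lets the pipeline cover roughly 600 chunks daily before the cache has to absorb the rest.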

README.md

Lines changed: 131 additions & 0 deletions
````markdown
# Markdown-LD Knowledge Bank

A Git-based knowledge bank where human-authored Markdown articles are processed by an LLM CI pipeline to extract Linked Data (RDF/JSON-LD), served as static content on Azure Static Web Apps, with a serverless SPARQL endpoint.

## Architecture

```
content/*.md → GitHub Actions → LLM (GitHub Models) → graph/*.jsonld + *.ttl
                                      ↓
                          Azure Static Web Apps
                          ├── Static site
                          ├── Graph files
                          └── SPARQL API (RDFLib)
```

## Quick Start

### Prerequisites

- Python 3.11+
- Git
- Azure CLI (for deployment)

### Local Development

```bash
# Install dependencies
pip install -r tools/requirements.txt

# Run tests
python -m pytest tests/ -v

# Dry run (chunk only, no LLM)
python -m tools.kg_build --dry-run

# Full build (requires GITHUB_TOKEN)
export GITHUB_TOKEN=your_token
python -m tools.kg_build --repo-root . --base-url https://example.com
```

### Writing Articles

Create Markdown files in `content/` with YAML frontmatter:

```markdown
---
title: "Your Article Title"
date_published: "2026-04-15"
tags:
  - knowledge-graphs
  - rdf
entity_hints:
  - label: "RDF"
    type: "schema:Thing"
    sameAs: "https://www.wikidata.org/entity/Q54872"
---

# Your Content Here

Write naturally. The LLM pipeline extracts entities and relationships.
Use [[wikilinks]] to link between articles.
```

### Example SPARQL Queries

**Find all entities mentioned in an article:**

```sparql
PREFIX schema: <https://schema.org/>
SELECT ?entity ?name WHERE {
  <https://example.com/2026/04/what-is-a-knowledge-graph/> schema:mentions ?entity .
  ?entity schema:name ?name .
}
```

**Find all articles about a topic:**

```sparql
PREFIX schema: <https://schema.org/>
SELECT ?article ?title WHERE {
  ?article a schema:Article ;
           schema:mentions <https://example.com/id/knowledge-graph> ;
           schema:name ?title .
}
```

**Find connections between entities:**

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>
SELECT ?subject ?predicate ?object WHERE {
  ?subject ?predicate ?object .
  FILTER(?predicate != rdf:type)
}
LIMIT 50
```

## Project Structure

```
├── content/          # Markdown articles (human-authored)
├── ontology/         # JSON-LD context, vocabulary, SHACL shapes
├── tools/            # Extraction pipeline (chunker, LLM client, post-processor)
├── graph/            # Generated artifacts (committed by CI)
│   ├── articles/     # Per-article JSON-LD and Turtle
│   ├── views/        # Precomputed JSON views
│   ├── cache/        # Per-chunk extraction cache
│   └── manifest.json # Build metadata
├── api/              # Azure Function (SPARQL endpoint)
├── app/              # Static web app
├── tests/            # Test suite
└── .github/workflows/
    ├── kg-build.yml  # KG extraction pipeline
    └── deploy-swa.yml # Azure SWA deployment
```

## Key Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| LLM Provider | GitHub Models (free) | Zero cost, GITHUB_TOKEN auth |
| LLM Model | `openai/gpt-4o-mini` | Best quality/limit ratio (150 req/day) |
| SPARQL Engine | RDFLib | Pure Python, small footprint, built-in JSON-LD |
| Validation | pySHACL | Standard W3C SHACL, works with RDFLib |
| Batching | 3-5 chunks/request | Stay under the 8K input token limit |

## Rate Limits

GitHub Models free tier (GPT-4o-mini): 150 requests/day, 8K input tokens.
The pipeline batches 3-5 chunks per request and caches results to stay within limits.

## License

MIT
````
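The commit message describes `tools/chunker.py` as producing deterministic chunks with sha256 IDs (the chunker itself is not shown in this diff). A sketch of how such an ID might be derived; `chunk_id` and its inputs are hypothetical, since the real chunker may hash different fields:

```python
import hashlib


def chunk_id(source_path: str, chunk_text: str) -> str:
    """Derive a stable chunk ID from the file path and chunk content.

    Unchanged chunks hash to the same ID across builds, which is what
    makes a per-chunk extraction cache (graph/cache/) possible.
    """
    digest = hashlib.sha256(f"{source_path}\n{chunk_text}".encode("utf-8"))
    return digest.hexdigest()[:16]


cid = chunk_id("content/2026/04/what-is-a-knowledge-graph.md", "# Intro\nSome text.")
# Same inputs always yield the same 16-hex-character ID.
```

Keying the cache on content rather than line numbers means reordering sections in an article only re-extracts the chunks whose text actually changed.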

api/function_app.py

Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
"""Azure Function: SPARQL endpoint using RDFLib.
2+
3+
Loads all .ttl files from the graph/articles/ directory into a combined
4+
RDFLib Dataset, then serves SPARQL queries via HTTP GET/POST.
5+
"""
6+
7+
import json
8+
import os
9+
import logging
10+
from pathlib import Path
11+
12+
import azure.functions as func
13+
import rdflib
14+
from rdflib import Dataset, Graph
15+
16+
app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)
17+
18+
# Module-level cache: load graph once per cold start
19+
_dataset: Dataset | None = None
20+
21+
22+
def _load_dataset() -> Dataset:
23+
"""Load all Turtle files into an RDFLib Dataset."""
24+
global _dataset
25+
if _dataset is not None:
26+
return _dataset
27+
28+
ds = Dataset()
29+
graph_dir = Path(__file__).parent.parent / "graph" / "articles"
30+
31+
if graph_dir.exists():
32+
for ttl_file in graph_dir.glob("*.ttl"):
33+
try:
34+
g = Graph()
35+
g.parse(str(ttl_file), format="turtle")
36+
for triple in g:
37+
ds.add(triple)
38+
logging.info(f"Loaded {len(g)} triples from {ttl_file.name}")
39+
except Exception as e:
40+
logging.error(f"Failed to parse {ttl_file.name}: {e}")
41+
42+
logging.info(f"Total triples loaded: {len(ds)}")
43+
_dataset = ds
44+
return _dataset
45+
46+
47+
@app.route(route="sparql", methods=["GET", "POST"])
48+
def sparql_endpoint(req: func.HttpRequest) -> func.HttpResponse:
49+
"""Handle SPARQL queries per W3C SPARQL 1.1 Protocol."""
50+
# Extract query
51+
query = None
52+
if req.method == "GET":
53+
query = req.params.get("query")
54+
elif req.method == "POST":
55+
content_type = req.headers.get("Content-Type", "")
56+
if "application/sparql-query" in content_type:
57+
query = req.get_body().decode("utf-8")
58+
elif "application/x-www-form-urlencoded" in content_type:
59+
query = req.params.get("query") or req.form.get("query")
60+
else:
61+
# Try body as raw query
62+
query = req.get_body().decode("utf-8")
63+
64+
if not query:
65+
return func.HttpResponse(
66+
json.dumps({"error": "Missing 'query' parameter"}),
67+
status_code=400,
68+
mimetype="application/json",
69+
)
70+
71+
# Safety: block mutating queries
72+
query_upper = query.strip().upper()
73+
if any(kw in query_upper for kw in ["INSERT", "DELETE", "LOAD", "CLEAR", "DROP", "CREATE"]):
74+
return func.HttpResponse(
75+
json.dumps({"error": "Only SELECT and ASK queries are allowed"}),
76+
status_code=403,
77+
mimetype="application/json",
78+
)
79+
80+
# Execute query
81+
try:
82+
ds = _load_dataset()
83+
result = ds.query(query)
84+
serialized = result.serialize(format="json")
85+
if isinstance(serialized, bytes):
86+
serialized = serialized.decode("utf-8")
87+
88+
return func.HttpResponse(
89+
serialized,
90+
mimetype="application/sparql-results+json",
91+
headers={
92+
"Access-Control-Allow-Origin": "*",
93+
"Cache-Control": "public, max-age=300",
94+
},
95+
)
96+
except Exception as e:
97+
logging.error(f"SPARQL query error: {e}")
98+
return func.HttpResponse(
99+
json.dumps({"error": f"Query execution failed: {str(e)}"}),
100+
status_code=400,
101+
mimetype="application/json",
102+
)

api/host.json

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
{
2+
"version": "2.0",
3+
"extensionBundle": {
4+
"id": "Microsoft.Azure.Functions.ExtensionBundle",
5+
"version": "[4.*, 5.0.0)"
6+
}
7+
}

api/requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
rdflib>=7.1.1,<8.0

app/index.html

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
<!DOCTYPE html>
2+
<html lang="en">
3+
<head>
4+
<meta charset="UTF-8">
5+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
6+
<title>Knowledge Bank</title>
7+
<style>
8+
body { font-family: system-ui, sans-serif; max-width: 720px; margin: 2rem auto; padding: 0 1rem; line-height: 1.6; color: #333; }
9+
h1 { border-bottom: 2px solid #0366d6; padding-bottom: 0.3rem; }
10+
a { color: #0366d6; }
11+
code { background: #f6f8fa; padding: 0.2em 0.4em; border-radius: 3px; }
12+
pre { background: #f6f8fa; padding: 1rem; border-radius: 6px; overflow-x: auto; }
13+
</style>
14+
</head>
15+
<body>
16+
<h1>📚 Knowledge Bank</h1>
17+
<p>A Git-based knowledge bank powered by Markdown articles and Linked Data.</p>
18+
19+
<h2>Resources</h2>
20+
<ul>
21+
<li><a href="/graph/views/entities.json">Entity Index</a> (JSON)</li>
22+
<li><a href="/graph/views/articles_by_tag.json">Articles by Tag</a> (JSON)</li>
23+
<li><a href="/graph/dataset.trig">Full Dataset</a> (TriG/RDF)</li>
24+
</ul>
25+
26+
<h2>SPARQL Endpoint</h2>
27+
<p>Query the knowledge graph at <code>/sparql?query=...</code></p>
28+
<pre>PREFIX schema: &lt;https://schema.org/&gt;
29+
SELECT ?article ?title WHERE {
30+
?article a schema:Article ;
31+
schema:name ?title .
32+
} LIMIT 50</pre>
33+
34+
<h2>About</h2>
35+
<p>
36+
Articles are written in Markdown under <code>/content</code>.
37+
A CI pipeline extracts entities and relations using an LLM,
38+
producing JSON-LD and Turtle files under <code>/graph</code>.
39+
This site is hosted on Azure Static Web Apps (Free plan).
40+
</p>
41+
</body>
42+
</html>

app/staticwebapp.config.json

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
{
2+
"routes": [
3+
{ "route": "/sparql", "rewrite": "/api/sparql" }
4+
],
5+
"navigationFallback": {
6+
"rewrite": "/index.html"
7+
},
8+
"globalHeaders": {
9+
"Cache-Control": "public, max-age=300"
10+
},
11+
"platform": {
12+
"apiRuntime": "python:3.11"
13+
}
14+
}
