|
| 1 | +# Markdown-LD Knowledge Bank |
| 2 | + |
| 3 | +A Git-based knowledge bank: human-authored Markdown articles are processed by an LLM-driven CI pipeline that extracts Linked Data (RDF/JSON-LD). The site and graph files are served as static content on Azure Static Web Apps, alongside a serverless SPARQL endpoint. |
| 4 | + |
| 5 | +## Architecture |
| 6 | + |
| 7 | +``` |
| 8 | +content/*.md → GitHub Actions → LLM (GitHub Models) → graph/*.jsonld + *.ttl |
| 9 | + ↓ |
| 10 | + Azure Static Web Apps |
| 11 | + ├── Static site |
| 12 | + ├── Graph files |
| 13 | + └── SPARQL API (RDFLib) |
| 14 | +``` |
| 15 | + |
| 16 | +## Quick Start |
| 17 | + |
| 18 | +### Prerequisites |
| 19 | + |
| 20 | +- Python 3.11+ |
| 21 | +- Git |
| 22 | +- Azure CLI (for deployment) |
| 23 | + |
| 24 | +### Local Development |
| 25 | + |
| 26 | +```bash |
| 27 | +# Install dependencies |
| 28 | +pip install -r tools/requirements.txt |
| 29 | + |
| 30 | +# Run tests |
| 31 | +python -m pytest tests/ -v |
| 32 | + |
| 33 | +# Dry run (chunk only, no LLM) |
| 34 | +python -m tools.kg_build --dry-run |
| 35 | + |
| 36 | +# Full build (requires GITHUB_TOKEN) |
| 37 | +export GITHUB_TOKEN=your_token |
| 38 | +python -m tools.kg_build --repo-root . --base-url https://example.com |
| 39 | +``` |
| 40 | + |
| 41 | +### Writing Articles |
| 42 | + |
| 43 | +Create Markdown files in `content/` with YAML frontmatter: |
| 44 | + |
| 45 | +```markdown |
| 46 | +--- |
| 47 | +title: "Your Article Title" |
| 48 | +date_published: "2026-04-15" |
| 49 | +tags: |
| 50 | + - knowledge-graphs |
| 51 | + - rdf |
| 52 | +entity_hints: |
| 53 | + - label: "RDF" |
| 54 | + type: "schema:Thing" |
| 55 | + sameAs: "https://www.wikidata.org/entity/Q54872" |
| 56 | +--- |
| 57 | + |
| 58 | +# Your Content Here |
| 59 | + |
| 60 | +Write naturally. The LLM pipeline extracts entities and relationships. |
| 61 | +Use [[wikilinks]] to link between articles. |
| 62 | +``` |
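As an illustration of the `[[wikilinks]]` convention above, the following is a minimal sketch of how such links could be detected in an article body. It is not the pipeline's actual parser: the `[[Target|label]]` alias form, the `extract_wikilinks` helper, and the `slugify` rule are all assumptions made for the example.

```python
import re

# Matches [[Target]] and [[Target|display text]].
# The pipe-alias form is an assumption; the README only shows [[wikilinks]].
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_wikilinks(markdown_text):
    """Return (target, label) pairs in document order."""
    links = []
    for match in WIKILINK_RE.finditer(markdown_text):
        target = match.group(1).strip()
        # Fall back to the target when no explicit label is given.
        label = (match.group(2) or target).strip()
        links.append((target, label))
    return links


def slugify(target):
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics become hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", target.lower()).strip("-")


if __name__ == "__main__":
    sample = "See [[Knowledge Graph]] and [[RDF|the RDF model]]."
    for target, label in extract_wikilinks(sample):
        print(slugify(target), "<-", label)
```

How a link target maps to an article URL or entity IRI depends on the pipeline's own rules, which this sketch does not attempt to reproduce.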
| 63 | + |
| 64 | +### Example SPARQL Queries |
| 65 | + |
| 66 | +**Find all entities mentioned in an article:** |
| 67 | +```sparql |
| 68 | +PREFIX schema: <https://schema.org/> |
| 69 | +SELECT ?entity ?name WHERE { |
| 70 | + <https://example.com/2026/04/what-is-a-knowledge-graph/> schema:mentions ?entity . |
| 71 | + ?entity schema:name ?name . |
| 72 | +} |
| 73 | +``` |
| 74 | + |
| 75 | +**Find all articles about a topic:** |
| 76 | +```sparql |
| 77 | +PREFIX schema: <https://schema.org/> |
| 78 | +SELECT ?article ?title WHERE { |
| 79 | + ?article a schema:Article ; |
| 80 | + schema:mentions <https://example.com/id/knowledge-graph> ; |
| 81 | + schema:name ?title . |
| 82 | +} |
| 83 | +``` |
| 84 | + |
| 85 | +**Find connections between entities:** |
| 86 | +```sparql |
| 87 | +PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> |
| 88 | +SELECT ?subject ?predicate ?object WHERE { |
| 89 | + ?subject ?predicate ?object . |
| 90 | + FILTER(?predicate != rdf:type) |
| 91 | +} |
| 92 | +LIMIT 50 |
| 93 | +``` |
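Queries like the ones above could be sent from a script along the following lines. This is a sketch under stated assumptions: the endpoint path `https://example.com/api/sparql` is hypothetical, and it assumes the endpoint accepts the standard SPARQL 1.1 Protocol GET form (a `query` parameter) and returns SPARQL JSON results. Check the deployed API before relying on either.

```python
import json
import urllib.parse
import urllib.request


def build_sparql_url(endpoint, query):
    """Encode a query using the SPARQL 1.1 Protocol GET form (?query=...)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_select(endpoint, query, timeout=30):
    """Send a SELECT query and return the result bindings as a list of dicts.

    Assumes the endpoint answers with the SPARQL 1.1 JSON results format.
    """
    request = urllib.request.Request(
        build_sparql_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


if __name__ == "__main__":
    # Hypothetical endpoint URL; replace with the deployed API route.
    endpoint = "https://example.com/api/sparql"
    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
    # Only the URL is printed here; call run_select() against a live endpoint.
    print(build_sparql_url(endpoint, query))
```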
| 94 | + |
| 95 | +## Project Structure |
| 96 | + |
| 97 | +``` |
| 98 | +├── content/ # Markdown articles (human-authored) |
| 99 | +├── ontology/ # JSON-LD context, vocabulary, SHACL shapes |
| 100 | +├── tools/ # Extraction pipeline (chunker, LLM client, post-processor) |
| 101 | +├── graph/ # Generated artifacts (committed by CI) |
| 102 | +│ ├── articles/ # Per-article JSON-LD and Turtle |
| 103 | +│ ├── views/ # Precomputed JSON views |
| 104 | +│ ├── cache/ # Per-chunk extraction cache |
| 105 | +│ └── manifest.json # Build metadata |
| 106 | +├── api/ # Azure Function (SPARQL endpoint) |
| 107 | +├── app/ # Static web app |
| 108 | +├── tests/ # Test suite |
| 109 | +└── .github/workflows/ |
| 110 | + ├── kg-build.yml # KG extraction pipeline |
| 111 | + └── deploy-swa.yml # Azure SWA deployment |
| 112 | +``` |
| 113 | + |
| 114 | +## Key Design Decisions |
| 115 | + |
| 116 | +| Decision | Choice | Rationale | |
| 117 | +|----------|--------|-----------| |
| 118 | +| LLM Provider | GitHub Models (free) | Zero cost, GITHUB_TOKEN auth | |
| 119 | +| LLM Model | `openai/gpt-4o-mini` | Best quality/limit ratio (150 req/day) | |
| 120 | +| SPARQL Engine | RDFLib | Pure Python, small footprint, built-in JSON-LD | |
| 121 | +| Validation | pySHACL | Standard W3C SHACL, works with RDFLib | |
| 122 | +| Batching | 3-5 chunks/request | Stay under 8K input token limit | |
| 123 | + |
| 124 | +## Rate Limits |
| 125 | + |
| 126 | +GitHub Models free tier (GPT-4o-mini): 150 requests/day and 8K input tokens per request. |
| 127 | +The pipeline batches 3-5 chunks per request and caches results to stay within limits. |
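The batching-plus-caching idea can be sketched as follows. This is an illustration only, not the code in `tools/`: keying the cache on a SHA-256 hash of the chunk text, and the helper names `chunk_key`, `plan_batches`, and `requests_needed`, are assumptions made for the example.

```python
import hashlib
import math


def chunk_key(chunk_text):
    """Hypothetical cache key: SHA-256 of the chunk text."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def plan_batches(chunks, cache, batch_size=5):
    """Drop chunks already in the cache, then group the rest into batches."""
    pending = [chunk for chunk in chunks if chunk_key(chunk) not in cache]
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]


def requests_needed(num_uncached_chunks, batch_size=5):
    """One LLM request per batch, so round up."""
    return math.ceil(num_uncached_chunks / batch_size)


if __name__ == "__main__":
    chunks = [f"chunk {i}" for i in range(12)]
    # Pretend the first two chunks were extracted in an earlier build.
    cache = {chunk_key(chunks[0]): {}, chunk_key(chunks[1]): {}}
    batches = plan_batches(chunks, cache)
    print([len(batch) for batch in batches])
    print(requests_needed(10))
```

With a batch size of 5, the 150 requests/day budget covers at most 750 uncached chunks per day. A real batcher would also need to check each batch against the 8K input token limit, which this sketch omits.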
| 128 | + |
| 129 | +## License |
| 130 | + |
| 131 | +MIT |