This document describes how to import python-github-backup output into the SQLite schema.
Use this backup for development and smoke tests:
C:\CodeBlocks\ggml-org-backup\backup
Git Bash path:
/c/CodeBlocks/ggml-org-backup/backup
Useful sample files:
C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\issues\1.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\llama.cpp\pulls\10001.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\discussions\32.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\llama.cpp\releases\b1046.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\issues\attachments\1010\manifest.json
Backup tool source, if needed:
C:\CodeBlocks\python-github-backup
Current backup shape:
backup/
backup.log
repositories/
ggml/
discussions/*.json
discussions/attachments/*/manifest.json
issues/*.json
issues/attachments/*/manifest.json
labels/labels.json
milestones/
pulls/*.json
pulls/attachments/*/manifest.json
releases/*.json
llama.cpp/
... same shape ...
Important: attachment directories can contain downloaded files with .json extensions. Do not scan every *.json below attachments/. Only read manifest.json in attachment directories.
Do not require --org. Derive owner/repo from JSON URL fields.
Implement helper functions like:
def repo_from_api_url(url: str) -> tuple[str, str] | None:
# Handles https://api.github.com/repos/ggml-org/ggml/issues/1
# Handles GitHub Enterprise hosts too by looking at path parts.
def repo_from_web_url(url: str) -> tuple[str, str] | None:
# Handles https://github.com/ggml-org/ggml/issues/1
# Handles https://github.com/ggml-org/llama.cpp/discussions/32
# Handles GitHub Enterprise hosts by taking first two path segments.Use urllib.parse.urlparse; do not hardcode only github.com.
For API URLs, look for the path segment repos and take the next two segments:
/repos/{owner}/{repo}/...
For web URLs, take the first two path segments:
/{owner}/{repo}/...
Issues:
repository_urlhtml_urlurlcomments_url
Pull requests:
base.repo.full_namehtml_urlurlissue_urlbase.repo.html_url
Discussions:
urlresourcePath, e.g./ggml-org/ggml/discussions/32- comment URLs if needed
Releases:
html_url, e.g.https://github.com/ggml-org/llama.cpp/releases/tag/b1046urlassets_urltarball_url/zipball_url
Labels:
- label
url, e.g.https://api.github.com/repos/ggml-org/llama.cpp/labels/bug
Fallback:
- Use the repository directory name as repo name.
- Reuse an owner already discovered for the same repo directory.
- If still unknown, fail the import with a helpful message rather than silently importing
unknown/repo.
The current test data should provide enough URLs to infer ggml-org/ggml and ggml-org/llama.cpp.
Warning: do not trust the repository field in attachment manifests for owner/repo derivation. In the current backup, at least one manifest has a suspicious value like ggml-org/ggml-org/ggml.
Given a backup root, normalize it so repositories_dir = backup_root / "repositories".
For each repo_dir under repositories_dir:
- import
labels/labels.jsonif present; - import
issues/[0-9]*.json; - import
pulls/[0-9]*.json; - import
discussions/[0-9]*.json; - import
releases/*.json; - import attachment manifests:
issues/attachments/*/manifest.jsonpulls/attachments/*/manifest.jsondiscussions/attachments/*/manifest.json
Do not recurse blindly under item-type directories because of downloaded attachments.
GitHub JSON uses different user shapes.
def user_login(obj):
if not obj:
return None
if isinstance(obj, dict):
return obj.get("login")
return NoneFor GraphQL discussion authors:
author = data.get("author") or {}
login = author.get("login")REST JSON often has:
"reactions": { "total_count": 2 }GraphQL discussion JSON often has:
"reactionGroups": [{"content": "THUMBS_UP", "reactors": {"totalCount": 1}}]MVP can store only total count:
def reaction_count(data):
if isinstance(data.get("reactions"), dict):
return data["reactions"].get("total_count")
groups = data.get("reactionGroups") or []
return sum((g.get("reactors") or {}).get("totalCount") or 0 for g in groups)def clean_text(value):
return value if isinstance(value, str) else ""Do not aggressively transform Markdown during import. Store Markdown-ish text and plain text where available.
Source pattern:
{repo_dir}/issues/[0-9]*.json
Expected fields:
numbertitlebodystatestate_reasonuser.loginauthor_associationcreated_atupdated_atclosed_athtml_urllabels[]commentscomment_data[]event_data[](ignore for MVP)
Normalize to items:
kind = "issue"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or ""
body_text = body
state = data.get("state")
state_reason = data.get("state_reason")
author = user_login(data.get("user"))
url = data.get("html_url") or data.get("url")
labels = [label["name"] for label in data.get("labels") or [] if label.get("name")]Normalize comment_data[] to comments:
kind = "issue_comment"
external_id = str(comment.get("id") or comment.get("node_id") or comment.get("html_url"))
sequence = original order starting at 1
author = user_login(comment.get("user"))
body = comment.get("body") or ""
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")Source pattern:
{repo_dir}/pulls/[0-9]*.json
Expected fields:
numbertitlebodystateuser.logincreated_atupdated_atclosed_atmerged_atdraftbase.refhead.refmerge_commit_shalabels[]comment_regular_data[]: regular issue-style comments on PRscomment_data[]: review comments/diff commentsreview_data[]: reviewscommit_data[]: commits; optional/stretch for MVP
Normalize PR to items:
kind = "pull"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or ""
body_text = body
state = data.get("state")
author = user_login(data.get("user"))
url = data.get("html_url") or data.get("url")
merged_at = data.get("merged_at")
is_draft = bool(data.get("draft"))
base_ref = (data.get("base") or {}).get("ref")
head_ref = (data.get("head") or {}).get("ref")
merge_commit_sha = data.get("merge_commit_sha")Normalize PR regular comments:
source = data.get("comment_regular_data") or []
kind = "pull_comment"
body = comment.get("body") or ""
author = user_login(comment.get("user"))
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")Normalize PR review comments:
source = data.get("comment_data") or []
kind = "pull_review_comment"
body = comment.get("body") or ""
author = user_login(comment.get("user"))
path = comment.get("path")
diff_hunk = comment.get("diff_hunk")
commit_id = comment.get("commit_id")
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")Normalize PR reviews:
source = data.get("review_data") or []
kind = "pull_review"
body = review.get("body") or ""
author = user_login(review.get("user"))
state = review.get("state")
commit_id = review.get("commit_id")
submitted_at = review.get("submitted_at")
url = review.get("html_url") or review.get("pull_request_url")For MVP, PR commits may be ignored. If easy, add a future pull_commits table or add commit messages as search_docs with doc_type='pull_commit'.
Source pattern:
{repo_dir}/discussions/[0-9]*.json
Discussion JSON is GraphQL-shaped and uses camelCase fields.
Expected fields:
numbertitlebodybodyTextstateReasonauthor.loginauthorAssociationcreatedAtupdatedAtclosedAtpublishedAturllabels.nodes[]comment_data[]- comment
reply_data[]
Normalize discussion to items:
kind = "discussion"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or data.get("bodyText") or ""
body_text = data.get("bodyText") or data.get("body") or ""
state = "closed" if data.get("closed") else "open"
state_reason = data.get("stateReason")
author = user_login(data.get("author"))
created_at = data.get("createdAt")
updated_at = data.get("updatedAt")
closed_at = data.get("closedAt")
published_at = data.get("publishedAt")
url = data.get("url")
labels = [node["name"] for node in (data.get("labels") or {}).get("nodes") or []]Normalize discussion comments:
kind = "discussion_comment"
external_id = str(comment.get("databaseId") or comment.get("id") or comment.get("url"))
body = comment.get("body") or comment.get("bodyText") or ""
body_text = comment.get("bodyText") or comment.get("body") or ""
author = user_login(comment.get("author"))
created_at = comment.get("createdAt")
updated_at = comment.get("updatedAt")
published_at = comment.get("publishedAt")
url = comment.get("url")Normalize nested discussion replies from each comment's reply_data[]:
kind = "discussion_reply"
parent_external_id = parent_comment_external_id
external_id = str(reply.get("databaseId") or reply.get("id") or reply.get("url"))
body = reply.get("body") or reply.get("bodyText") or ""
body_text = reply.get("bodyText") or reply.get("body") or ""
author = user_login(reply.get("author"))
created_at = reply.get("createdAt")
updated_at = reply.get("updatedAt")
published_at = reply.get("publishedAt")
url = reply.get("url")Source pattern:
{repo_dir}/releases/*.json
Expected fields:
tag_namenamebodydraftprereleaseauthor.logincreated_atpublished_atupdated_athtml_urltarget_commitishassets[]
Normalize release to items:
kind = "release"
tag_name = data.get("tag_name") or path.stem
item_key = tag_name
title = data.get("name") or tag_name
body = data.get("body") or ""
body_text = body
state = "draft" if data.get("draft") else ("prerelease" if data.get("prerelease") else "published")
author = user_login(data.get("author"))
created_at = data.get("created_at")
updated_at = data.get("updated_at")
published_at = data.get("published_at")
url = data.get("html_url") or data.get("url")
is_prerelease = bool(data.get("prerelease"))
is_draft_release = bool(data.get("draft"))
target_commitish = data.get("target_commitish")Normalize assets[] to release_assets:
name = asset.get("name")
label = asset.get("label")
content_type = asset.get("content_type")
size_bytes = asset.get("size")
download_count = asset.get("download_count")
state = asset.get("state")
browser_download_url = asset.get("browser_download_url")
created_at = asset.get("created_at")
updated_at = asset.get("updated_at")Do not download or import asset contents.
Source:
{repo_dir}/labels/labels.json
The file contains a JSON list.
Normalize each label to labels:
name = label.get("name")
color = label.get("color")
description = label.get("description")
is_default = bool(label.get("default"))
url = label.get("url")Also insert labels found directly on items into labels if missing, because an item may reference labels even if labels/labels.json is absent.
Source patterns:
{repo_dir}/issues/attachments/{number}/manifest.json
{repo_dir}/pulls/attachments/{number}/manifest.json
{repo_dir}/discussions/attachments/{number}/manifest.json
Manifest shape example:
{
"item_number": 1010,
"item_type": "issue",
"attachments": [
{
"url": "https://github.com/user-attachments/assets/...",
"success": true,
"http_status": 200,
"content_type": "image/png",
"original_filename": "...png",
"size_bytes": 7326,
"downloaded_at": "2026-04-26T14:39:04.570373+00:00",
"error": null,
"saved_as": "...png"
}
]
}Map directory name to item kind:
issues/attachments->issuepulls/attachments->pulldiscussions/attachments->discussion
Map item number using:
- manifest
item_number, if present; - attachment directory name, if numeric.
Find parent item by (repo_id, kind, item_key=str(number)). If missing, skip with a warning count, do not fail the whole import.
Normalize each attachment entry to attachments:
manifest_path = path relative to backup root, or absolute if simpler
local_path = manifest.parent / saved_as, stored relative to backup root when possible
original_url = attachment.get("url")
original_filename = attachment.get("original_filename")
saved_as = attachment.get("saved_as")
content_type = attachment.get("content_type")
size_bytes = attachment.get("size_bytes")
success = bool(attachment.get("success"))
http_status = attachment.get("http_status")
downloaded_at = attachment.get("downloaded_at")
error = attachment.get("error")Do not open/read the actual downloaded attachment file. The import should only parse manifest.json.
The importer should be tolerant of individual malformed files but visible about problems.
Recommended behavior:
- Fail fast for missing backup root or missing
repositories/. - For malformed item JSON, log a warning and continue.
- For missing required identifiers (
numberfor issues/PRs/discussions,tag_namefor releases), log a warning and skip that item. - For owner/repo derivation failure, fail for that repo directory unless another item in the same dir established the repo identity.
- For attachment manifests whose parent item is missing, count and warn.
At the end of import, print:
- repositories imported;
- items by kind;
- comments by kind;
- labels;
- attachments metadata rows;
- release assets;
- skipped malformed files;
- skipped attachment manifests/rows.
@dataclass
class NormalizedRepo:
owner: str
name: str
@dataclass
class NormalizedItem:
kind: str
item_key: str
number: int | None
tag_name: str | None
title: str
body: str
body_text: str
# ... metadata fields ...
labels: list[str]
comments: list[NormalizedComment]
release_assets: list[NormalizedReleaseAsset]
@dataclass
class NormalizedComment:
kind: str
external_id: str | None
parent_external_id: str | None
sequence: int
author: str | None
body: str
body_text: str
# ... metadata fields ...Keep JSON parsing/normalization separate from DB insertion. This makes tests easy.
Use the sample files listed at the top of this document.
Test cases:
ggml/issues/1.jsonnormalizes to repoggml-org/ggml, kindissue, number1, with comments.llama.cpp/pulls/10001.jsonnormalizes to repoggml-org/llama.cpp, kindpull, number10001, with review comments/reviews.ggml/discussions/32.jsonnormalizes to repoggml-org/ggml, kinddiscussion, number32.llama.cpp/releases/b1046.jsonnormalizes to repoggml-org/llama.cpp, kindrelease, item keyb1046, with release assets.ggml/issues/attachments/1010/manifest.jsonimports metadata without reading PNG files.