Importer Plan

This document describes how to import python-github-backup output into the SQLite schema.

Test backup location

Use this backup for development and smoke tests:

C:\CodeBlocks\ggml-org-backup\backup

Git Bash path:

/c/CodeBlocks/ggml-org-backup/backup

Useful sample files:

C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\issues\1.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\llama.cpp\pulls\10001.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\discussions\32.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\llama.cpp\releases\b1046.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\issues\attachments\1010\manifest.json

Backup tool source, if needed:

C:\CodeBlocks\python-github-backup

Backup layout

Current backup shape:

backup/
  backup.log
  repositories/
    ggml/
      discussions/*.json
      discussions/attachments/*/manifest.json
      issues/*.json
      issues/attachments/*/manifest.json
      labels/labels.json
      milestones/
      pulls/*.json
      pulls/attachments/*/manifest.json
      releases/*.json
    llama.cpp/
      ... same shape ...

Important: attachment directories can contain downloaded files with .json extensions. Do not scan every *.json below attachments/. Only read manifest.json in attachment directories.

Repository detection

Do not require --org. Derive owner/repo from JSON URL fields.

URL parsing helpers

Implement helper functions like:

def repo_from_api_url(url: str) -> tuple[str, str] | None:
    # Handles https://api.github.com/repos/ggml-org/ggml/issues/1
    # Handles GitHub Enterprise hosts too by looking at path parts.


def repo_from_web_url(url: str) -> tuple[str, str] | None:
    # Handles https://github.com/ggml-org/ggml/issues/1
    # Handles https://github.com/ggml-org/llama.cpp/discussions/32
    # Handles GitHub Enterprise hosts by taking first two path segments.

Use urllib.parse.urlparse; do not hardcode only github.com.

For API URLs, look for the path segment repos and take the next two segments:

/repos/{owner}/{repo}/...

For web URLs, take the first two path segments:

/{owner}/{repo}/...

Priority order by item type

Issues:

repository_url
html_url
url
comments_url

Pull requests:

base.repo.full_name
html_url
url
issue_url
base.repo.html_url

Discussions:

url
resourcePath, e.g. /ggml-org/ggml/discussions/32
comment URLs if needed

Releases:

html_url, e.g. https://github.com/ggml-org/llama.cpp/releases/tag/b1046
url
assets_url
tarball_url / zipball_url

Labels:

label url, e.g. https://api.github.com/repos/ggml-org/llama.cpp/labels/bug

Fallback:

Use the repository directory name as repo name.
Reuse an owner already discovered for the same repo directory.
If still unknown, fail the import with a helpful message rather than silently importing unknown/repo.

The current test data should provide enough URLs to infer ggml-org/ggml and ggml-org/llama.cpp.

Warning: do not trust the repository field in attachment manifests for owner/repo derivation. In the current backup, at least one manifest has a suspicious value like ggml-org/ggml-org/ggml.

File scanning

Given a backup root, normalize it so repositories_dir = backup_root / "repositories".

For each repo_dir under repositories_dir:

import labels/labels.json if present;
import issues/[0-9]*.json;
import pulls/[0-9]*.json;
import discussions/[0-9]*.json;
import releases/*.json;
import attachment manifests:
- issues/attachments/*/manifest.json
- pulls/attachments/*/manifest.json
- discussions/attachments/*/manifest.json

Do not recurse blindly under item-type directories because of downloaded attachments.

Common helpers

User login

GitHub JSON uses different user shapes.

def user_login(obj):
    if not obj:
        return None
    if isinstance(obj, dict):
        return obj.get("login")
    return None

For GraphQL discussion authors:

author = data.get("author") or {}
login = author.get("login")

Reactions count

REST JSON often has:

"reactions": { "total_count": 2 }

GraphQL discussion JSON often has:

"reactionGroups": [{"content": "THUMBS_UP", "reactors": {"totalCount": 1}}]

MVP can store only total count:

def reaction_count(data):
    if isinstance(data.get("reactions"), dict):
        return data["reactions"].get("total_count")
    groups = data.get("reactionGroups") or []
    return sum((g.get("reactors") or {}).get("totalCount") or 0 for g in groups)

Text fields

def clean_text(value):
    return value if isinstance(value, str) else ""

Do not aggressively transform Markdown during import. Store Markdown-ish text and plain text where available.

Issues

Source pattern:

{repo_dir}/issues/[0-9]*.json

Expected fields:

number
title
body
state
state_reason
user.login
author_association
created_at
updated_at
closed_at
html_url
labels[]
comments
comment_data[]
event_data[] (ignore for MVP)

Normalize to items:

kind = "issue"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or ""
body_text = body
state = data.get("state")
state_reason = data.get("state_reason")
author = user_login(data.get("user"))
url = data.get("html_url") or data.get("url")
labels = [label["name"] for label in data.get("labels") or [] if label.get("name")]

Normalize comment_data[] to comments:

kind = "issue_comment"
external_id = str(comment.get("id") or comment.get("node_id") or comment.get("html_url"))
sequence = original order starting at 1
author = user_login(comment.get("user"))
body = comment.get("body") or ""
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")

Pull requests

Source pattern:

{repo_dir}/pulls/[0-9]*.json

Expected fields:

number
title
body
state
user.login
created_at
updated_at
closed_at
merged_at
draft
base.ref
head.ref
merge_commit_sha
labels[]
comment_regular_data[]: regular issue-style comments on PRs
comment_data[]: review comments/diff comments
review_data[]: reviews
commit_data[]: commits; optional/stretch for MVP

Normalize PR to items:

kind = "pull"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or ""
body_text = body
state = data.get("state")
author = user_login(data.get("user"))
url = data.get("html_url") or data.get("url")
merged_at = data.get("merged_at")
is_draft = bool(data.get("draft"))
base_ref = (data.get("base") or {}).get("ref")
head_ref = (data.get("head") or {}).get("ref")
merge_commit_sha = data.get("merge_commit_sha")

Normalize PR regular comments:

source = data.get("comment_regular_data") or []
kind = "pull_comment"
body = comment.get("body") or ""
author = user_login(comment.get("user"))
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")

Normalize PR review comments:

source = data.get("comment_data") or []
kind = "pull_review_comment"
body = comment.get("body") or ""
author = user_login(comment.get("user"))
path = comment.get("path")
diff_hunk = comment.get("diff_hunk")
commit_id = comment.get("commit_id")
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")

Normalize PR reviews:

source = data.get("review_data") or []
kind = "pull_review"
body = review.get("body") or ""
author = user_login(review.get("user"))
state = review.get("state")
commit_id = review.get("commit_id")
submitted_at = review.get("submitted_at")
url = review.get("html_url") or review.get("pull_request_url")

For MVP, PR commits may be ignored. If easy, add a future pull_commits table or add commit messages as search_docs with doc_type='pull_commit'.

Discussions

Source pattern:

{repo_dir}/discussions/[0-9]*.json

Discussion JSON is GraphQL-shaped and uses camelCase fields.

Expected fields:

number
title
body
bodyText
stateReason
author.login
authorAssociation
createdAt
updatedAt
closedAt
publishedAt
url
labels.nodes[]
comment_data[]
comment reply_data[]

Normalize discussion to items:

kind = "discussion"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or data.get("bodyText") or ""
body_text = data.get("bodyText") or data.get("body") or ""
state = "closed" if data.get("closed") else "open"
state_reason = data.get("stateReason")
author = user_login(data.get("author"))
created_at = data.get("createdAt")
updated_at = data.get("updatedAt")
closed_at = data.get("closedAt")
published_at = data.get("publishedAt")
url = data.get("url")
labels = [node["name"] for node in (data.get("labels") or {}).get("nodes") or []]

Normalize discussion comments:

kind = "discussion_comment"
external_id = str(comment.get("databaseId") or comment.get("id") or comment.get("url"))
body = comment.get("body") or comment.get("bodyText") or ""
body_text = comment.get("bodyText") or comment.get("body") or ""
author = user_login(comment.get("author"))
created_at = comment.get("createdAt")
updated_at = comment.get("updatedAt")
published_at = comment.get("publishedAt")
url = comment.get("url")

Normalize nested discussion replies from each comment's reply_data[]:

kind = "discussion_reply"
parent_external_id = parent_comment_external_id
external_id = str(reply.get("databaseId") or reply.get("id") or reply.get("url"))
body = reply.get("body") or reply.get("bodyText") or ""
body_text = reply.get("bodyText") or reply.get("body") or ""
author = user_login(reply.get("author"))
created_at = reply.get("createdAt")
updated_at = reply.get("updatedAt")
published_at = reply.get("publishedAt")
url = reply.get("url")

Releases

Source pattern:

{repo_dir}/releases/*.json

Expected fields:

tag_name
name
body
draft
prerelease
author.login
created_at
published_at
updated_at
html_url
target_commitish
assets[]

Normalize release to items:

kind = "release"
tag_name = data.get("tag_name") or path.stem
item_key = tag_name
title = data.get("name") or tag_name
body = data.get("body") or ""
body_text = body
state = "draft" if data.get("draft") else ("prerelease" if data.get("prerelease") else "published")
author = user_login(data.get("author"))
created_at = data.get("created_at")
updated_at = data.get("updated_at")
published_at = data.get("published_at")
url = data.get("html_url") or data.get("url")
is_prerelease = bool(data.get("prerelease"))
is_draft_release = bool(data.get("draft"))
target_commitish = data.get("target_commitish")

Normalize assets[] to release_assets:

name = asset.get("name")
label = asset.get("label")
content_type = asset.get("content_type")
size_bytes = asset.get("size")
download_count = asset.get("download_count")
state = asset.get("state")
browser_download_url = asset.get("browser_download_url")
created_at = asset.get("created_at")
updated_at = asset.get("updated_at")

Do not download or import asset contents.

Labels

Source:

{repo_dir}/labels/labels.json

The file contains a JSON list.

Normalize each label to labels:

name = label.get("name")
color = label.get("color")
description = label.get("description")
is_default = bool(label.get("default"))
url = label.get("url")

Also insert labels found directly on items into labels if missing, because an item may reference labels even if labels/labels.json is absent.

Attachment metadata

Source patterns:

{repo_dir}/issues/attachments/{number}/manifest.json
{repo_dir}/pulls/attachments/{number}/manifest.json
{repo_dir}/discussions/attachments/{number}/manifest.json

Manifest shape example:

{
  "item_number": 1010,
  "item_type": "issue",
  "attachments": [
    {
      "url": "https://github.com/user-attachments/assets/...",
      "success": true,
      "http_status": 200,
      "content_type": "image/png",
      "original_filename": "...png",
      "size_bytes": 7326,
      "downloaded_at": "2026-04-26T14:39:04.570373+00:00",
      "error": null,
      "saved_as": "...png"
    }
  ]
}

Map directory name to item kind:

issues/attachments -> issue
pulls/attachments -> pull
discussions/attachments -> discussion

Map item number using:

manifest item_number, if present;
attachment directory name, if numeric.

Find parent item by (repo_id, kind, item_key=str(number)). If missing, skip with a warning count, do not fail the whole import.

Normalize each attachment entry to attachments:

manifest_path = path relative to backup root, or absolute if simpler
local_path = manifest.parent / saved_as, stored relative to backup root when possible
original_url = attachment.get("url")
original_filename = attachment.get("original_filename")
saved_as = attachment.get("saved_as")
content_type = attachment.get("content_type")
size_bytes = attachment.get("size_bytes")
success = bool(attachment.get("success"))
http_status = attachment.get("http_status")
downloaded_at = attachment.get("downloaded_at")
error = attachment.get("error")

Do not open/read the actual downloaded attachment file. The import should only parse manifest.json.

Error handling

The importer should be tolerant of individual malformed files but visible about problems.

Recommended behavior:

Fail fast for missing backup root or missing repositories/.
For malformed item JSON, log a warning and continue.
For missing required identifiers (number for issues/PRs/discussions, tag_name for releases), log a warning and skip that item.
For owner/repo derivation failure, fail for that repo directory unless another item in the same dir established the repo identity.
For attachment manifests whose parent item is missing, count and warn.

At the end of import, print:

repositories imported;
items by kind;
comments by kind;
labels;
attachments metadata rows;
release assets;
skipped malformed files;
skipped attachment manifests/rows.

Suggested importer structure

@dataclass
class NormalizedRepo:
    owner: str
    name: str

@dataclass
class NormalizedItem:
    kind: str
    item_key: str
    number: int | None
    tag_name: str | None
    title: str
    body: str
    body_text: str
    # ... metadata fields ...
    labels: list[str]
    comments: list[NormalizedComment]
    release_assets: list[NormalizedReleaseAsset]

@dataclass
class NormalizedComment:
    kind: str
    external_id: str | None
    parent_external_id: str | None
    sequence: int
    author: str | None
    body: str
    body_text: str
    # ... metadata fields ...

Keep JSON parsing/normalization separate from DB insertion. This makes tests easy.

Minimal normalizer tests

Use the sample files listed at the top of this document.

Test cases:

ggml/issues/1.json normalizes to repo ggml-org/ggml, kind issue, number 1, with comments.
llama.cpp/pulls/10001.json normalizes to repo ggml-org/llama.cpp, kind pull, number 10001, with review comments/reviews.
ggml/discussions/32.json normalizes to repo ggml-org/ggml, kind discussion, number 32.
llama.cpp/releases/b1046.json normalizes to repo ggml-org/llama.cpp, kind release, item key b1046, with release assets.
ggml/issues/attachments/1010/manifest.json imports metadata without reading PNG files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Importer Plan

Test backup location

Backup layout

Repository detection

URL parsing helpers

Priority order by item type

File scanning

Common helpers

User login

Reactions count

Text fields

Issues

Pull requests

Discussions

Releases

Labels

Attachment metadata

Error handling

Suggested importer structure

Minimal normalizer tests

FilesExpand file tree

importer.md

Latest commit

History

importer.md

File metadata and controls

Importer Plan

Test backup location

Backup layout

Repository detection

URL parsing helpers

Priority order by item type

File scanning

Common helpers

User login

Reactions count

Text fields

Issues

Pull requests

Discussions

Releases

Labels

Attachment metadata

Error handling

Suggested importer structure

Minimal normalizer tests