Skip to content

Latest commit

 

History

History
645 lines (503 loc) · 15.7 KB

File metadata and controls

645 lines (503 loc) · 15.7 KB

Importer Plan

This document describes how to import python-github-backup output into the SQLite schema.

Test backup location

Use this backup for development and smoke tests:

C:\CodeBlocks\ggml-org-backup\backup

Git Bash path:

/c/CodeBlocks/ggml-org-backup/backup

Useful sample files:

C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\issues\1.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\llama.cpp\pulls\10001.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\discussions\32.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\llama.cpp\releases\b1046.json
C:\CodeBlocks\ggml-org-backup\backup\repositories\ggml\issues\attachments\1010\manifest.json

Backup tool source, if needed:

C:\CodeBlocks\python-github-backup

Backup layout

Current backup shape:

backup/
  backup.log
  repositories/
    ggml/
      discussions/*.json
      discussions/attachments/*/manifest.json
      issues/*.json
      issues/attachments/*/manifest.json
      labels/labels.json
      milestones/
      pulls/*.json
      pulls/attachments/*/manifest.json
      releases/*.json
    llama.cpp/
      ... same shape ...

Important: attachment directories can contain downloaded files with .json extensions. Do not scan every *.json below attachments/. Only read manifest.json in attachment directories.

Repository detection

Do not require --org. Derive owner/repo from JSON URL fields.

URL parsing helpers

Implement helper functions like:

def repo_from_api_url(url: str) -> tuple[str, str] | None:
    # Handles https://api.github.com/repos/ggml-org/ggml/issues/1
    # Handles GitHub Enterprise hosts too by looking at path parts.


def repo_from_web_url(url: str) -> tuple[str, str] | None:
    # Handles https://github.com/ggml-org/ggml/issues/1
    # Handles https://github.com/ggml-org/llama.cpp/discussions/32
    # Handles GitHub Enterprise hosts by taking first two path segments.

Use urllib.parse.urlparse; do not hardcode only github.com.

For API URLs, look for the path segment repos and take the next two segments:

/repos/{owner}/{repo}/...

For web URLs, take the first two path segments:

/{owner}/{repo}/...

Priority order by item type

Issues:

  1. repository_url
  2. html_url
  3. url
  4. comments_url

Pull requests:

  1. base.repo.full_name
  2. html_url
  3. url
  4. issue_url
  5. base.repo.html_url

Discussions:

  1. url
  2. resourcePath, e.g. /ggml-org/ggml/discussions/32
  3. comment URLs if needed

Releases:

  1. html_url, e.g. https://github.com/ggml-org/llama.cpp/releases/tag/b1046
  2. url
  3. assets_url
  4. tarball_url / zipball_url

Labels:

  1. label url, e.g. https://api.github.com/repos/ggml-org/llama.cpp/labels/bug

Fallback:

  • Use the repository directory name as repo name.
  • Reuse an owner already discovered for the same repo directory.
  • If still unknown, fail the import with a helpful message rather than silently importing unknown/repo.

The current test data should provide enough URLs to infer ggml-org/ggml and ggml-org/llama.cpp.

Warning: do not trust the repository field in attachment manifests for owner/repo derivation. In the current backup, at least one manifest has a suspicious value like ggml-org/ggml-org/ggml.

File scanning

Given a backup root, normalize it so repositories_dir = backup_root / "repositories".

For each repo_dir under repositories_dir:

  • import labels/labels.json if present;
  • import issues/[0-9]*.json;
  • import pulls/[0-9]*.json;
  • import discussions/[0-9]*.json;
  • import releases/*.json;
  • import attachment manifests:
    • issues/attachments/*/manifest.json
    • pulls/attachments/*/manifest.json
    • discussions/attachments/*/manifest.json

Do not recurse blindly under item-type directories because of downloaded attachments.

Common helpers

User login

GitHub JSON uses different user shapes.

def user_login(obj):
    if not obj:
        return None
    if isinstance(obj, dict):
        return obj.get("login")
    return None

For GraphQL discussion authors:

author = data.get("author") or {}
login = author.get("login")

Reactions count

REST JSON often has:

"reactions": { "total_count": 2 }

GraphQL discussion JSON often has:

"reactionGroups": [{"content": "THUMBS_UP", "reactors": {"totalCount": 1}}]

MVP can store only total count:

def reaction_count(data):
    if isinstance(data.get("reactions"), dict):
        return data["reactions"].get("total_count")
    groups = data.get("reactionGroups") or []
    return sum((g.get("reactors") or {}).get("totalCount") or 0 for g in groups)

Text fields

def clean_text(value):
    return value if isinstance(value, str) else ""

Do not aggressively transform Markdown during import. Store Markdown-ish text and plain text where available.

Issues

Source pattern:

{repo_dir}/issues/[0-9]*.json

Expected fields:

  • number
  • title
  • body
  • state
  • state_reason
  • user.login
  • author_association
  • created_at
  • updated_at
  • closed_at
  • html_url
  • labels[]
  • comments
  • comment_data[]
  • event_data[] (ignore for MVP)

Normalize to items:

kind = "issue"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or ""
body_text = body
state = data.get("state")
state_reason = data.get("state_reason")
author = user_login(data.get("user"))
url = data.get("html_url") or data.get("url")
labels = [label["name"] for label in data.get("labels") or [] if label.get("name")]

Normalize comment_data[] to comments:

kind = "issue_comment"
external_id = str(comment.get("id") or comment.get("node_id") or comment.get("html_url"))
sequence = original order starting at 1
author = user_login(comment.get("user"))
body = comment.get("body") or ""
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")

Pull requests

Source pattern:

{repo_dir}/pulls/[0-9]*.json

Expected fields:

  • number
  • title
  • body
  • state
  • user.login
  • created_at
  • updated_at
  • closed_at
  • merged_at
  • draft
  • base.ref
  • head.ref
  • merge_commit_sha
  • labels[]
  • comment_regular_data[]: regular issue-style comments on PRs
  • comment_data[]: review comments/diff comments
  • review_data[]: reviews
  • commit_data[]: commits; optional/stretch for MVP

Normalize PR to items:

kind = "pull"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or ""
body_text = body
state = data.get("state")
author = user_login(data.get("user"))
url = data.get("html_url") or data.get("url")
merged_at = data.get("merged_at")
is_draft = bool(data.get("draft"))
base_ref = (data.get("base") or {}).get("ref")
head_ref = (data.get("head") or {}).get("ref")
merge_commit_sha = data.get("merge_commit_sha")

Normalize PR regular comments:

source = data.get("comment_regular_data") or []
kind = "pull_comment"
body = comment.get("body") or ""
author = user_login(comment.get("user"))
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")

Normalize PR review comments:

source = data.get("comment_data") or []
kind = "pull_review_comment"
body = comment.get("body") or ""
author = user_login(comment.get("user"))
path = comment.get("path")
diff_hunk = comment.get("diff_hunk")
commit_id = comment.get("commit_id")
created_at = comment.get("created_at")
updated_at = comment.get("updated_at")
url = comment.get("html_url") or comment.get("url")

Normalize PR reviews:

source = data.get("review_data") or []
kind = "pull_review"
body = review.get("body") or ""
author = user_login(review.get("user"))
state = review.get("state")
commit_id = review.get("commit_id")
submitted_at = review.get("submitted_at")
url = review.get("html_url") or review.get("pull_request_url")

For MVP, PR commits may be ignored. If easy, add a future pull_commits table or add commit messages as search_docs with doc_type='pull_commit'.

Discussions

Source pattern:

{repo_dir}/discussions/[0-9]*.json

Discussion JSON is GraphQL-shaped and uses camelCase fields.

Expected fields:

  • number
  • title
  • body
  • bodyText
  • stateReason
  • author.login
  • authorAssociation
  • createdAt
  • updatedAt
  • closedAt
  • publishedAt
  • url
  • labels.nodes[]
  • comment_data[]
  • comment reply_data[]

Normalize discussion to items:

kind = "discussion"
item_key = str(data["number"])
number = int(data["number"])
title = data.get("title") or ""
body = data.get("body") or data.get("bodyText") or ""
body_text = data.get("bodyText") or data.get("body") or ""
state = "closed" if data.get("closed") else "open"
state_reason = data.get("stateReason")
author = user_login(data.get("author"))
created_at = data.get("createdAt")
updated_at = data.get("updatedAt")
closed_at = data.get("closedAt")
published_at = data.get("publishedAt")
url = data.get("url")
labels = [node["name"] for node in (data.get("labels") or {}).get("nodes") or []]

Normalize discussion comments:

kind = "discussion_comment"
external_id = str(comment.get("databaseId") or comment.get("id") or comment.get("url"))
body = comment.get("body") or comment.get("bodyText") or ""
body_text = comment.get("bodyText") or comment.get("body") or ""
author = user_login(comment.get("author"))
created_at = comment.get("createdAt")
updated_at = comment.get("updatedAt")
published_at = comment.get("publishedAt")
url = comment.get("url")

Normalize nested discussion replies from each comment's reply_data[]:

kind = "discussion_reply"
parent_external_id = parent_comment_external_id
external_id = str(reply.get("databaseId") or reply.get("id") or reply.get("url"))
body = reply.get("body") or reply.get("bodyText") or ""
body_text = reply.get("bodyText") or reply.get("body") or ""
author = user_login(reply.get("author"))
created_at = reply.get("createdAt")
updated_at = reply.get("updatedAt")
published_at = reply.get("publishedAt")
url = reply.get("url")

Releases

Source pattern:

{repo_dir}/releases/*.json

Expected fields:

  • tag_name
  • name
  • body
  • draft
  • prerelease
  • author.login
  • created_at
  • published_at
  • updated_at
  • html_url
  • target_commitish
  • assets[]

Normalize release to items:

kind = "release"
tag_name = data.get("tag_name") or path.stem
item_key = tag_name
title = data.get("name") or tag_name
body = data.get("body") or ""
body_text = body
state = "draft" if data.get("draft") else ("prerelease" if data.get("prerelease") else "published")
author = user_login(data.get("author"))
created_at = data.get("created_at")
updated_at = data.get("updated_at")
published_at = data.get("published_at")
url = data.get("html_url") or data.get("url")
is_prerelease = bool(data.get("prerelease"))
is_draft_release = bool(data.get("draft"))
target_commitish = data.get("target_commitish")

Normalize assets[] to release_assets:

name = asset.get("name")
label = asset.get("label")
content_type = asset.get("content_type")
size_bytes = asset.get("size")
download_count = asset.get("download_count")
state = asset.get("state")
browser_download_url = asset.get("browser_download_url")
created_at = asset.get("created_at")
updated_at = asset.get("updated_at")

Do not download or import asset contents.

Labels

Source:

{repo_dir}/labels/labels.json

The file contains a JSON list.

Normalize each label to labels:

name = label.get("name")
color = label.get("color")
description = label.get("description")
is_default = bool(label.get("default"))
url = label.get("url")

Also insert labels found directly on items into labels if missing, because an item may reference labels even if labels/labels.json is absent.

Attachment metadata

Source patterns:

{repo_dir}/issues/attachments/{number}/manifest.json
{repo_dir}/pulls/attachments/{number}/manifest.json
{repo_dir}/discussions/attachments/{number}/manifest.json

Manifest shape example:

{
  "item_number": 1010,
  "item_type": "issue",
  "attachments": [
    {
      "url": "https://github.com/user-attachments/assets/...",
      "success": true,
      "http_status": 200,
      "content_type": "image/png",
      "original_filename": "...png",
      "size_bytes": 7326,
      "downloaded_at": "2026-04-26T14:39:04.570373+00:00",
      "error": null,
      "saved_as": "...png"
    }
  ]
}

Map directory name to item kind:

  • issues/attachments -> issue
  • pulls/attachments -> pull
  • discussions/attachments -> discussion

Map item number using:

  1. manifest item_number, if present;
  2. attachment directory name, if numeric.

Find parent item by (repo_id, kind, item_key=str(number)). If missing, skip with a warning count, do not fail the whole import.

Normalize each attachment entry to attachments:

manifest_path = path relative to backup root, or absolute if simpler
local_path = manifest.parent / saved_as, stored relative to backup root when possible
original_url = attachment.get("url")
original_filename = attachment.get("original_filename")
saved_as = attachment.get("saved_as")
content_type = attachment.get("content_type")
size_bytes = attachment.get("size_bytes")
success = bool(attachment.get("success"))
http_status = attachment.get("http_status")
downloaded_at = attachment.get("downloaded_at")
error = attachment.get("error")

Do not open/read the actual downloaded attachment file. The import should only parse manifest.json.

Error handling

The importer should be tolerant of individual malformed files but visible about problems.

Recommended behavior:

  • Fail fast for missing backup root or missing repositories/.
  • For malformed item JSON, log a warning and continue.
  • For missing required identifiers (number for issues/PRs/discussions, tag_name for releases), log a warning and skip that item.
  • For owner/repo derivation failure, fail for that repo directory unless another item in the same dir established the repo identity.
  • For attachment manifests whose parent item is missing, count and warn.

At the end of import, print:

  • repositories imported;
  • items by kind;
  • comments by kind;
  • labels;
  • attachments metadata rows;
  • release assets;
  • skipped malformed files;
  • skipped attachment manifests/rows.

Suggested importer structure

@dataclass
class NormalizedRepo:
    owner: str
    name: str

@dataclass
class NormalizedItem:
    kind: str
    item_key: str
    number: int | None
    tag_name: str | None
    title: str
    body: str
    body_text: str
    # ... metadata fields ...
    labels: list[str]
    comments: list[NormalizedComment]
    release_assets: list[NormalizedReleaseAsset]

@dataclass
class NormalizedComment:
    kind: str
    external_id: str | None
    parent_external_id: str | None
    sequence: int
    author: str | None
    body: str
    body_text: str
    # ... metadata fields ...

Keep JSON parsing/normalization separate from DB insertion. This makes tests easy.

Minimal normalizer tests

Use the sample files listed at the top of this document.

Test cases:

  • ggml/issues/1.json normalizes to repo ggml-org/ggml, kind issue, number 1, with comments.
  • llama.cpp/pulls/10001.json normalizes to repo ggml-org/llama.cpp, kind pull, number 10001, with review comments/reviews.
  • ggml/discussions/32.json normalizes to repo ggml-org/ggml, kind discussion, number 32.
  • llama.cpp/releases/b1046.json normalizes to repo ggml-org/llama.cpp, kind release, item key b1046, with release assets.
  • ggml/issues/attachments/1010/manifest.json imports metadata without reading PNG files.