feat: add support for private GitHub repository cloning with OAuth authentication #193

Draft: wants to merge 5 commits into `main`
3 changes: 3 additions & 0 deletions .gitignore
@@ -173,3 +173,6 @@ Caddyfile

# ignore default output directory
tmp/*

#Qodo
.qodo/
30 changes: 23 additions & 7 deletions README.md
@@ -25,23 +25,28 @@ You can also replace `hub` with `ingest` in any GitHub URL to access the corresp
- Token count
- **CLI tool**: Run it as a shell command
- **Python package**: Import it in your code
- **Private repositories**: supported via GitHub OAuth
  - Log in with GitHub to ingest your private repositories.
  - Once logged in, Gitingest uses your GitHub token (stored in your session) to clone and process the repository.

## 📚 Requirements

- Python 3.7+

## 📦 Installation

``` bash
```bash
pip install gitingest
```

## 🧩 Browser Extension Usage

<!-- markdownlint-disable MD033 -->

<a href="https://chromewebstore.google.com/detail/adfjahbijlkjfoicpjkhjicpjpjfaood" target="_blank" title="Get Gitingest Extension from Chrome Web Store"><img height="48" src="https://github.com/user-attachments/assets/20a6e44b-fd46-4e6c-8ea6-aad436035753" alt="Available in the Chrome Web Store" /></a>
<a href="https://addons.mozilla.org/firefox/addon/gitingest" target="_blank" title="Get Gitingest Extension from Firefox Add-ons"><img height="48" src="https://github.com/user-attachments/assets/c0e99e6b-97cf-4af2-9737-099db7d3538b" alt="Get The Add-on for Firefox" /></a>
<a href="https://microsoftedge.microsoft.com/addons/detail/nfobhllgcekbmpifkjlopfdfdmljmipf" target="_blank" title="Get Gitingest Extension from Edge Add-ons"><img height="48" src="https://github.com/user-attachments/assets/204157eb-4cae-4c0e-b2cb-db514419fd9e" alt="Get from the Edge Add-ons" /></a>

<!-- markdownlint-enable MD033 -->

The extension is open source at [lcandy2/gitingest-extension](https://github.com/lcandy2/gitingest-extension).
@@ -103,24 +108,35 @@ This is because Jupyter notebooks are asynchronous by default.

1. Build the image:

``` bash
```bash
docker build -t gitingest .
```

2. Run the container:

``` bash
```bash
docker run -d --name gitingest -p 8000:8000 gitingest
```

The application will be available at `http://localhost:8000`.

If you are hosting it on a domain, you can specify the allowed hostnames via env variable `ALLOWED_HOSTS`.

```bash
# Default: "gitingest.com, *.gitingest.com, localhost, 127.0.0.1".
ALLOWED_HOSTS="example.com, localhost, 127.0.0.1"
```
```bash
# Default: "gitingest.com, *.gitingest.com, localhost, 127.0.0.1".
ALLOWED_HOSTS="example.com, localhost, 127.0.0.1"
```

## 🔐 Important for Private Repos

In **production**, the OAuth credentials (`GITHUB_CLIENT_ID` and `GITHUB_CLIENT_SECRET`) are **configured securely on the server**, allowing end users to simply click **"Login with GitHub"** to access their private repositories.

When **running locally** (for testing), you must provide these credentials via environment variables:

```bash
export GITHUB_CLIENT_ID=your_client_id
export GITHUB_CLIENT_SECRET=your_client_secret
```
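The local-testing requirement above can be checked once at startup instead of surfacing later as a failed login. A minimal sketch, assuming a hypothetical helper `load_oauth_credentials` that is not part of this PR (the PR reads both variables directly with `os.getenv`):

```python
import os
from typing import Mapping, Optional, Tuple


def load_oauth_credentials(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (client_id, client_secret), or raise if either variable is unset.

    Hypothetical helper for illustration only; the variable names are the
    ones documented in the README section above.
    """
    env = os.environ if env is None else env
    required = ("GITHUB_CLIENT_ID", "GITHUB_CLIENT_SECRET")
    # Treat empty strings the same as missing variables.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing OAuth environment variables: " + ", ".join(missing))
    return env["GITHUB_CLIENT_ID"], env["GITHUB_CLIENT_SECRET"]
```

Passing a plain dict as `env` keeps the check easy to exercise without touching the real environment.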

## 🤝 Contributing

2 changes: 2 additions & 0 deletions requirements.txt
@@ -5,3 +5,5 @@ slowapi
starlette
tiktoken
uvicorn
Authlib
itsdangerous
134 changes: 73 additions & 61 deletions src/gitingest/repository_clone.py
@@ -37,71 +37,63 @@ class CloneConfig:
branch: Optional[str] = None


@async_timeout(TIMEOUT)
async def clone_repo(config: CloneConfig) -> Tuple[bytes, bytes]:
"""
Clone a repository to a local path based on the provided configuration.

This function handles the process of cloning a Git repository to the local file system.
It can clone a specific branch or commit if provided, and it raises exceptions if
any errors occur during the cloning process.

Parameters
----------
config : CloneConfig
A dictionary containing the following keys:
- url (str): The URL of the repository.
- local_path (str): The local path to clone the repository to.
- commit (str, optional): The specific commit hash to checkout.
- branch (str, optional): The branch to clone. Defaults to 'main' or 'master' if not provided.

Returns
-------
Tuple[bytes, bytes]
A tuple containing the stdout and stderr of the Git commands executed.

Raises
------
ValueError
If the 'url' or 'local_path' parameters are missing, or if the repository is not found.
OSError
If there is an error creating the parent directory structure.
"""
# Extract and validate query parameters
async def clone_repo(config: CloneConfig, token: Optional[dict] = None) -> Tuple[bytes, bytes]:
url: str = config.url
local_path: str = config.local_path
commit: Optional[str] = config.commit
branch: Optional[str] = config.branch

if not url:
raise ValueError("The 'url' parameter is required.")

if not local_path:
raise ValueError("The 'local_path' parameter is required.")

# Create parent directory if it doesn't exist
    # 1) Resolve the access token: prefer the user's GitHub OAuth token from the session
    if token:
        auth_token = token.get("access_token", "")
    else:
        # Fallback: environment variable for local testing
        auth_token = os.getenv("GIT_AUTH_TOKEN", "")

    is_github = "github.com" in url.lower()

    # 2) A missing token alone is not an error: public GitHub repositories
    #    must remain cloneable without logging in.

    # 3) Check repo existence using the resolved token
    if not await _check_repo_exists(url, token=auth_token):
        if is_github and not auth_token:
            raise ValueError(
                "We could not find this repository. If it is private, "
                "please log in with GitHub to access it."
            )
        raise ValueError(
            "We could not find or access this repository. "
            "Either it doesn't exist, or you don't have permission, or your token is invalid."
        )

    # 4) Construct a token-embedded URL for GitHub HTTPS remotes
    if auth_token and is_github and url.startswith("https://"):
        remainder = url[len("https://"):]
        token_url = f"https://x-access-token:{auth_token}@{remainder}"
    else:
        token_url = url

# Make sure parent directories exist
parent_dir = Path(local_path).parent

try:
os.makedirs(parent_dir, exist_ok=True)

except OSError as e:
raise OSError(f"Failed to create parent directory {parent_dir}: {e}") from e

# Check if the repository exists
if not await _check_repo_exists(url):
raise ValueError("Repository not found, make sure it is public")

# 5) Actually clone + checkout
if commit:
# Scenario 1: Clone and checkout a specific commit
# Clone the repository without depth to ensure full history for checkout
clone_cmd = ["git", "clone", "--recurse-submodules", "--single-branch", url, local_path]
clone_cmd = ["git", "clone", "--recurse-submodules", "--single-branch", token_url, local_path]
await _run_git_command(*clone_cmd)

# Checkout the specific commit
checkout_cmd = ["git", "-C", local_path, "checkout", commit]
return await _run_git_command(*checkout_cmd)

if branch and branch.lower() not in ("main", "master"):
# Scenario 2: Clone a specific branch with shallow depth
clone_cmd = [
"git",
"clone",
@@ -110,38 +102,56 @@ async def clone_repo(config: CloneConfig) -> Tuple[bytes, bytes]:
"--single-branch",
"--branch",
branch,
url,
token_url,
local_path,
]
return await _run_git_command(*clone_cmd)

# Scenario 3: Clone the default branch with shallow depth
clone_cmd = ["git", "clone", "--recurse-submodules", "--depth=1", "--single-branch", url, local_path]
clone_cmd = ["git", "clone", "--recurse-submodules", "--depth=1", "--single-branch", token_url, local_path]
return await _run_git_command(*clone_cmd)
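Review note on the token-embedded clone URL: `git clone` records the remote URL, credentials included, in the clone's `.git/config`, and the URL can also appear in git error output. A minimal sketch of building the URL and redacting it before logging, assuming two hypothetical helpers (`build_token_url`, `redact_token`) that are not part of this PR. It matches the hostname exactly, which is stricter than the substring check in the diff:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_token_url(url: str, access_token: str) -> str:
    """Embed a token in an HTTPS github.com URL; return any other URL unchanged."""
    parts = urlsplit(url)
    if not access_token or parts.scheme != "https" or parts.hostname != "github.com":
        return url
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    # Percent-encode the token so it cannot break the URL structure.
    netloc = f"x-access-token:{quote(access_token, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def redact_token(text: str, access_token: str) -> str:
    """Replace the token with a placeholder before logging or raising."""
    return text.replace(access_token, "***") if access_token else text
```

Running git's stderr through `redact_token` before it reaches an exception message would keep the token out of user-visible errors.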


async def _check_repo_exists(url: str) -> bool:
async def _check_repo_exists(url: str, token: Optional[str] = None) -> bool:
    """
    Check if a Git repository exists at the provided URL.

    Uses the GitHub API for github.com URLs and a HEAD request on the URL itself otherwise.
    Falls back to the GIT_AUTH_TOKEN environment variable when no token is passed.
    """

Parameters
----------
url : str
The URL of the Git repository to check.
Returns
-------
bool
True if the repository exists, False otherwise.
headers = ["-H", "User-Agent: Gitingest"]

# If we got a token from the user's session, use it
if token:
headers += ["-H", f"Authorization: token {token}"]

else:
# fallback to environment variable
env_token = os.getenv("GIT_AUTH_TOKEN", "")

if env_token:
headers += ["-H", f"Authorization: token {env_token}"]

# If it's a GitHub URL, transform it to the GitHub API URL:
if "github.com" in url:
parts = url.split("/")

if len(parts) >= 5:
owner = parts[3]
repo = parts[4].replace(".git", "")
url_to_check = f"https://api.github.com/repos/{owner}/{repo}"

else:
url_to_check = url

else:
url_to_check = url

Raises
------
RuntimeError
If the curl command returns an unexpected status code.
"""
proc = await asyncio.create_subprocess_exec(
"curl",
"-I",
url,
"-L",
*headers,
url_to_check,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
@@ -162,6 +172,8 @@ async def _check_repo_exists(url: str) -> bool:
raise RuntimeError(f"Unexpected status code: {status_code}")
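Review note on the URL-to-API mapping in `_check_repo_exists`: splitting on `/` and calling `.replace(".git", "")` also rewrites repository names that contain `.git` in the middle (for example `my.github.io` becomes `myhub.io`). A minimal sketch of a stricter mapping, assuming a hypothetical helper `github_api_url` that is not part of this PR:

```python
from typing import Optional
from urllib.parse import urlsplit


def github_api_url(url: str) -> Optional[str]:
    """Map https://github.com/<owner>/<repo>[.git] to the REST "repos" endpoint.

    Returns None when the URL is not a github.com repository URL.
    """
    parts = urlsplit(url)
    if parts.hostname != "github.com":
        return None
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) < 2:
        return None
    owner, repo = segments[0], segments[1]
    # Strip only a trailing ".git", never an occurrence inside the name.
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"https://api.github.com/repos/{owner}/{repo}"
```

Callers would fall back to checking the original URL when the helper returns `None`, which mirrors the `else` branches in the diff.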




@async_timeout(TIMEOUT)
async def fetch_remote_branch_list(url: str) -> List[str]:
"""
7 changes: 7 additions & 0 deletions src/server/main.py
@@ -10,10 +10,12 @@
from fastapi.staticfiles import StaticFiles
from slowapi.errors import RateLimitExceeded
from starlette.middleware.trustedhost import TrustedHostMiddleware
from starlette.middleware.sessions import SessionMiddleware

from server.routers import download, dynamic, index
from server.server_config import templates
from server.server_utils import lifespan, limiter, rate_limit_exception_handler
from server.oauth import router as oauth_router

# Load environment variables from .env file
load_dotenv()
@@ -22,9 +24,14 @@
app = FastAPI(lifespan=lifespan)
app.state.limiter = limiter

# Add session middleware for cookie-based sessions with a secret key
app.add_middleware(SessionMiddleware, secret_key=os.getenv("SESSION_SECRET_KEY", "your-default-secret"))

# Register the custom exception handler for rate limits
app.add_exception_handler(RateLimitExceeded, rate_limit_exception_handler)

# Include the OAuth route
app.include_router(oauth_router, prefix="/oauth")

# Mount static files dynamically to serve CSS, JS, and other static assets
static_dir = Path(__file__).parent.parent / "static"
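Review note on the session secret: the middleware above falls back to the literal `"your-default-secret"` when `SESSION_SECRET_KEY` is unset, so anyone who knows that string can forge session cookies. A minimal sketch of failing fast instead, assuming a hypothetical helper `resolve_session_secret` that is not part of this PR:

```python
import os
from typing import Mapping, Optional

# The placeholder used as a fallback in this PR; never valid as a real key.
PLACEHOLDER_SECRET = "your-default-secret"


def resolve_session_secret(env: Optional[Mapping[str, str]] = None) -> str:
    """Return SESSION_SECRET_KEY, refusing empty or placeholder values."""
    env = os.environ if env is None else env
    secret = env.get("SESSION_SECRET_KEY", "")
    if not secret or secret == PLACEHOLDER_SECRET:
        raise RuntimeError("SESSION_SECRET_KEY must be set to a private random value")
    return secret
```

A suitable value can be generated with `python -c "import secrets; print(secrets.token_urlsafe(32))"`.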
41 changes: 41 additions & 0 deletions src/server/oauth.py
@@ -0,0 +1,41 @@
import os
from fastapi import APIRouter, Request, HTTPException
from fastapi.responses import RedirectResponse
from authlib.integrations.starlette_client import OAuth, OAuthError

router = APIRouter()

# Register GitHub as an OAuth provider; credentials come from the environment.
oauth = OAuth()
oauth.register(
    name="github",
    client_id=os.getenv("GITHUB_CLIENT_ID"),
    client_secret=os.getenv("GITHUB_CLIENT_SECRET"),
    access_token_url="https://github.com/login/oauth/access_token",
    authorize_url="https://github.com/login/oauth/authorize",
    api_base_url="https://api.github.com/",
    client_kwargs={"scope": "read:user repo"},
)


@router.get("/login")
async def login(request: Request):
    # Send the user to GitHub; GitHub redirects back to the /auth route below.
    redirect_uri = request.url_for("auth")
    return await oauth.github.authorize_redirect(request, redirect_uri)


@router.get("/auth")
async def auth(request: Request):
    try:
        token = await oauth.github.authorize_access_token(request)
    except OAuthError as error:
        raise HTTPException(status_code=400, detail=str(error)) from error
    # Get the user's GitHub profile (fetched but not used yet)
    user_resp = await oauth.github.get("user", token=token)
    profile = user_resp.json()
    # Store the token in the session so later endpoints can use it
    request.session["github_token"] = token
    # Redirect back to the home page
    return RedirectResponse(url="/")


@router.get("/logout")
async def logout(request: Request):
    # Drop the stored token; a missing key is not an error.
    request.session.pop("github_token", None)
    return RedirectResponse(url="/")
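Review note on what goes into the session: Starlette's `SessionMiddleware` keeps session data in a signed cookie, so the whole token dict stored by `/auth` travels with every request. A minimal sketch of trimming it to the one field `clone_repo` reads, assuming a hypothetical helper `minimal_session_token` that is not part of this PR:

```python
from typing import Any, Dict, Mapping


def minimal_session_token(token: Mapping[str, Any]) -> Dict[str, str]:
    """Keep only the access token from an OAuth token response.

    The returned dict still works with clone_repo, which only calls
    token.get("access_token", "").
    """
    access_token = token.get("access_token")
    if not access_token:
        raise ValueError("OAuth response did not include an access_token")
    return {"access_token": str(access_token)}
```

With this in place, `/auth` would store `minimal_session_token(token)` rather than `token` itself. Whether the token belongs in a cookie at all, rather than in a server-side store, is a separate design question.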
8 changes: 7 additions & 1 deletion src/server/query_processor.py
@@ -90,10 +90,16 @@ async def process_query(
commit=parsed_query.commit,
branch=parsed_query.branch,
)
await clone_repo(clone_config)
# Retrieve the user's GitHub token from the session (set via OAuth)
token = request.session.get("github_token")

# Pass the token to clone_repo so private repos can be cloned on the user's behalf
await clone_repo(clone_config, token=token)
summary, tree, content = run_ingest_query(parsed_query)

with open(f"{clone_config.local_path}.txt", "w", encoding="utf-8") as f:
f.write(tree + "\n" + content)

except Exception as e:
# hack to print error message when query is not defined
if "query" in locals() and parsed_query is not None and isinstance(parsed_query, dict):
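Review note on token precedence: the session token passed here takes priority inside `clone_repo`, and `GIT_AUTH_TOKEN` is consulted only when no session token exists. A minimal sketch of that rule as a standalone function, assuming a hypothetical helper `resolve_access_token` that is not part of this PR:

```python
import os
from typing import Any, Mapping, Optional


def resolve_access_token(
    session_token: Optional[Mapping[str, Any]],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the token to use for cloning, or "" when none is available."""
    env = os.environ if env is None else env
    if session_token:
        # A session token dict wins, even if it lacks an access_token field;
        # this mirrors the branch structure in clone_repo.
        return str(session_token.get("access_token", ""))
    # No session: fall back to the environment variable used for local testing.
    return env.get("GIT_AUTH_TOKEN", "")
```

Note the edge case this makes visible: a session dict without `access_token` yields an empty string and does not fall back to the environment variable.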
2 changes: 1 addition & 1 deletion src/server/server_config.py
@@ -16,4 +16,4 @@
{"name": "ApiAnalytics", "url": "https://github.com/tom-draper/api-analytics"},
]

templates = Jinja2Templates(directory="server/templates")
templates = Jinja2Templates(directory="src/server/templates")