A powerful tool to scrape all files from a GitHub repository and convert them into JSON, Token-Oriented Object Notation (TOON), or RepoScript (an LLM-first transcript format).
Install the package globally using npm:
npm install -g git-repo-parser
Or add it to your project as a dependency:
npm install git-repo-parser
This package provides three CLI commands:
- git-repo-to-json: Scrapes a GitHub repository and saves the result as a JSON file.
- git-repo-to-toon: Scrapes a GitHub repository and saves the result as a TOON file.
- git-repo-to-text: Scrapes a GitHub repository and saves the result as a RepoScript transcript (formerly the “plain text” output).
# JSON and TOON exports (existing behaviour)
git-repo-to-json https://github.com/username/repo-name.git
git-repo-to-toon https://github.com/username/repo-name.git
# RepoScript transcript without metadata (legacy plain-text behaviour)
git-repo-to-text https://github.com/username/repo-name.git --format=transcript
# RepoScript transcript with metadata lines and token count
git-repo-to-text https://github.com/username/repo-name.git --format=transcript --meta --tokens
# Alternate syntaxes
git-repo-to-text https://github.com/username/repo-name.git --format=json
git-repo-to-text https://github.com/username/repo-name.git --format=toon
The scraped data is saved as files.json, files.toon, or files.txt in your current directory. When --tokens (or --token, --token-count, -t) is supplied, the CLI also prints the token count, measured with the CL100K vocabulary, for any export format. Use --meta / --no-meta to toggle RepoScript metadata lines (default is no metadata).
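As a rough illustration of how the flags above fit together, the sketch below normalises the token-count aliases, the --meta / --no-meta toggle, and the --format value into one options object. The names parseCliArgs, CliOptions, and TOKEN_FLAGS are hypothetical and exist only for this example; they are not the package's actual internals. Only the flag spellings and the json/toon/transcript format values come from the text above, and the "last flag wins" behaviour is an assumption.

```typescript
// Hypothetical option shape; the package's real internal types may differ.
type OutputFormat = "json" | "toon" | "transcript";

interface CliOptions {
  repoUrl: string | undefined;
  format: OutputFormat;
  includeMeta: boolean; // --meta / --no-meta, off by default
  printTokens: boolean; // --tokens and its aliases
}

// All spellings of the token-count flag listed in the README.
const TOKEN_FLAGS = new Set(["--tokens", "--token", "--token-count", "-t"]);

function parseCliArgs(argv: string[], defaultFormat: OutputFormat): CliOptions {
  const options: CliOptions = {
    repoUrl: undefined,
    format: defaultFormat,
    includeMeta: false,
    printTokens: false,
  };
  for (const arg of argv) {
    if (TOKEN_FLAGS.has(arg)) {
      options.printTokens = true;
    } else if (arg === "--meta") {
      options.includeMeta = true;
    } else if (arg === "--no-meta") {
      options.includeMeta = false;
    } else if (arg.startsWith("--format=")) {
      const value = arg.slice("--format=".length);
      if (value === "json" || value === "toon" || value === "transcript") {
        options.format = value;
      } else {
        throw new Error(`Unsupported format: ${value}`);
      }
    } else if (!arg.startsWith("-")) {
      options.repoUrl = arg; // positional argument: the repository URL
    }
  }
  return options;
}

const parsed = parseCliArgs(
  ["https://github.com/username/repo-name.git", "--format=transcript", "--meta", "-t"],
  "transcript",
);
console.log(parsed);
```

Keeping the aliases in one set means a new spelling only has to be added in a single place.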
Run the bundled benchmark to evaluate scrape runtime and token usage across multiple public repositories:
npm run build
npm run benchmark
Results are saved under benchmark/:
- benchmark/results.json: machine-readable summary (durations, token counts, output sizes)
- benchmark/results.md: markdown report per repository/format
- benchmark/*.preview.txt: first 100 lines of each export for spot-checking
You can also use the package in your Node.js projects:
import {
  scrapeRepositoryToJson,
  scrapeRepositoryToToon,
  scrapeRepositoryToTranscript,
  scrapeRepositoryToJsonWithTokenCount,
  scrapeRepositoryToToonWithTokenCount,
  scrapeRepositoryToPlainTextWithTokenCount,
  type TranscriptFormatOptions,
  countTokens,
} from 'git-repo-parser';
const repoUrl = 'https://github.com/username/repo-name.git';
// JSON output
const jsonResult = await scrapeRepositoryToJson(repoUrl);
// TOON output
const toonResult = await scrapeRepositoryToToon(repoUrl);
// RepoScript transcript (no metadata; equivalent to legacy plain text)
const transcript = await scrapeRepositoryToTranscript(repoUrl);
// RepoScript with metadata lines
const transcriptOptions: TranscriptFormatOptions = { includeMeta: true };
const richTranscript = await scrapeRepositoryToTranscript(repoUrl, transcriptOptions);
// Token-aware helpers
const { json, tokenCount: jsonTokens } = await scrapeRepositoryToJsonWithTokenCount(repoUrl);
const { toon, tokenCount: toonTokens } = await scrapeRepositoryToToonWithTokenCount(repoUrl);
const { text, tokenCount: transcriptTokens } = await scrapeRepositoryToPlainTextWithTokenCount(
  repoUrl,
  undefined, // tokenOptions: use the defaults
  transcriptOptions
);
// Standalone token counting helper (uses gpt-tokenizer + cl100k_base)
const tokens = countTokens(toon);
scrapeRepositoryToJson scrapes the given GitHub repository and returns a promise that resolves to an array of FileData objects.
scrapeRepositoryToJsonWithTokenCount(repoUrl: string, indent = 2, tokenOptions?: TokenCountOptions): Promise<{ files: FileData[]; json: string; tokenCount: number }>
Scrapes the repository and returns the raw FileData[], a pretty-printed JSON string, and the corresponding CL100K token count.
scrapeRepositoryToToon scrapes the given GitHub repository and returns a promise that resolves to a TOON-formatted string. You can pass EncodeOptions directly to customise indentation, delimiter, or length markers.
scrapeRepositoryToToonWithTokenCount(repoUrl: string, encodeOptions?: EncodeOptions, tokenOptions?: TokenCountOptions): Promise<{ toon: string; tokenCount: number }>
Generates the TOON-formatted output and returns both the encoded string and its token count as measured by gpt-tokenizer using the default CL100K vocabulary.
scrapeRepositoryToTranscript scrapes the given GitHub repository and returns a RepoScript v1 transcript string. TranscriptFormatOptions currently supports { includeMeta?: boolean } (default: false). The legacy scrapeRepositoryToPlainText export delegates to this helper with metadata disabled.
scrapeRepositoryToPlainTextWithTokenCount(repoUrl: string, tokenOptions?: TokenCountOptions, transcriptOptions?: TranscriptFormatOptions): Promise<{ text: string; tokenCount: number }>
Scrapes the repository to RepoScript while reporting the token footprint of the generated transcript. Supply transcriptOptions to mirror CLI behaviour (e.g. { includeMeta: true }).
RepoScript is a deterministic, LLM-friendly transcript of a repository (formerly the “plain text” output).
- Deterministic ordering
  - Directories are emitted in lexical order of their full POSIX paths.
  - Within a directory, files are listed in lexical order by filename.
- Marker grammar
  - Markers always begin at column 0 and follow [TAG] <path> (single space).
  - Tags in use: [DIR_START], [DIR_END], [FILE_START], [FILE_END].
  - Paths are POSIX (e.g. src/index.ts) and never contain newlines.
- Optional metadata
  - When includeMeta is enabled, files receive lines like meta: lang=ts size=1234.
  - Metadata lines appear immediately after [FILE_START] <path> and before file contents.
- Reserved tags
  - [COMMENT], [CHUNK], and [META] are reserved for future use and MUST NOT appear unless escaped or emitted intentionally once semantics are defined.
REPOSCRIPT version=1
repo: https://github.com/user/project
commit: abc123
[FILE_START] src/index.ts
meta: lang=ts size=123
import { foo } from './foo';
[FILE_END] src/index.ts
Note: The current CLI/API emit the [..._START]/[..._END] markers (with optional metadata). The header lines shown above are illustrative and may be added via tooling or future options.
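To make the marker grammar and ordering rules concrete, here is a minimal sketch of an emitter that turns a FileData-shaped tree into marker lines. It is not the package's implementation: toRepoScript, emitNodes, and lexical are hypothetical names, and several details are assumptions that the rules above do not settle, namely that files are emitted before subdirectories at each level, that lang is taken from the file extension, and that size is the content's string length.

```typescript
interface FileData {
  name: string;
  path: string;
  type: "file" | "directory";
  children?: FileData[];
  content?: string;
}

// Code-unit comparison: locale-independent, so the order is the same everywhere.
const lexical = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

function emitNodes(nodes: FileData[], includeMeta: boolean, out: string[]): void {
  // filter() copies, so sorting never mutates the caller's tree.
  const files = nodes
    .filter((n) => n.type === "file")
    .sort((a, b) => lexical(a.name, b.name));
  const dirs = nodes
    .filter((n) => n.type === "directory")
    .sort((a, b) => lexical(a.path, b.path));

  // Assumption: files are emitted before subdirectories at each level.
  for (const file of files) {
    const content = file.content ?? "";
    out.push(`[FILE_START] ${file.path}`);
    if (includeMeta) {
      // Assumption: lang is the file extension, size is the string length.
      const dot = file.name.lastIndexOf(".");
      const lang = dot > 0 ? file.name.slice(dot + 1) : "unknown";
      out.push(`meta: lang=${lang} size=${content.length}`);
    }
    out.push(content, `[FILE_END] ${file.path}`);
  }
  for (const dir of dirs) {
    out.push(`[DIR_START] ${dir.path}`);
    emitNodes(dir.children ?? [], includeMeta, out);
    out.push(`[DIR_END] ${dir.path}`);
  }
}

function toRepoScript(roots: FileData[], includeMeta = false): string {
  const out: string[] = [];
  emitNodes(roots, includeMeta, out);
  return out.join("\n");
}

// Children are deliberately out of order to show the sorting.
const sample = toRepoScript(
  [
    {
      name: "src",
      path: "src",
      type: "directory",
      children: [
        { name: "util.ts", path: "src/util.ts", type: "file", content: "export const x = 1;" },
        { name: "index.ts", path: "src/index.ts", type: "file", content: "import { x } from './util';" },
      ],
    },
  ],
  true,
);
console.log(sample);
```

Sorting with a plain code-unit comparison rather than localeCompare keeps the output byte-identical across machines, which is what makes the transcript deterministic.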
The FileData interface represents the structure of files and directories in the JSON output:
interface FileData {
name: string;
path: string;
type: 'file' | 'directory';
children?: FileData[];
content?: string;
}
- Clones the repository locally (temporary)
- Ignores binary files and common non-source files
- Supports nested directory structures
- Provides JSON, TOON, and RepoScript (plain text) output formats
- Cleans up cloned repository after scraping
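Because directories nest through the children field of FileData, a recursive walk is a natural way to consume the JSON output. The sketch below flattens a tree into its file nodes; collectFiles is a hypothetical helper written for this example and is not exported by the package, and the sample tree is made up.

```typescript
interface FileData {
  name: string;
  path: string;
  type: "file" | "directory";
  children?: FileData[];
  content?: string;
}

// Depth-first walk that returns every file node, skipping directory nodes.
function collectFiles(nodes: FileData[]): FileData[] {
  const files: FileData[] = [];
  for (const node of nodes) {
    if (node.type === "file") {
      files.push(node);
    } else {
      files.push(...collectFiles(node.children ?? []));
    }
  }
  return files;
}

// Made-up tree in the same shape as the JSON output.
const exampleTree: FileData[] = [
  { name: "README.md", path: "README.md", type: "file", content: "# Demo" },
  {
    name: "src",
    path: "src",
    type: "directory",
    children: [
      { name: "index.ts", path: "src/index.ts", type: "file", content: "export {};" },
    ],
  },
];

const filePaths = collectFiles(exampleTree).map((file) => file.path);
console.log(filePaths);
```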
The following file types and patterns are ignored during scraping:
- package-lock.json
- Binary files (pdf, png, jpg, jpeg, gif, ico, svg, woff, woff2, eot, ttf, otf)
- Media files (mp4, avi, webm, mov, mp3, wav, flac, ogg, webp)
- Debug and error logs (npm-debug, yarn-debug, yarn-error)
- Configuration files (tsconfig, jest.config)
- The .git directory
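The ignore list above can be approximated with a small path predicate. This is a sketch only: shouldIgnore and the two constants are hypothetical names, and matching log and configuration files by filename prefix is an assumption, since the list does not say how the package matches them.

```typescript
// Extensions from the binary and media entries in the list above.
const IGNORED_EXTENSIONS = new Set([
  "pdf", "png", "jpg", "jpeg", "gif", "ico", "svg", "woff", "woff2", "eot", "ttf", "otf",
  "mp4", "avi", "webm", "mov", "mp3", "wav", "flac", "ogg", "webp",
]);

// Assumption: logs and config files are matched by filename prefix.
const IGNORED_NAME_PREFIXES = ["npm-debug", "yarn-debug", "yarn-error", "tsconfig", "jest.config"];

function shouldIgnore(posixPath: string): boolean {
  const segments = posixPath.split("/");
  // Anything inside a .git directory is skipped.
  if (segments.includes(".git")) return true;

  const name = segments[segments.length - 1] ?? "";
  if (name === "package-lock.json") return true;
  if (IGNORED_NAME_PREFIXES.some((prefix) => name.startsWith(prefix))) return true;

  // Compare extensions case-insensitively so LOGO.PNG is treated like logo.png.
  const dot = name.lastIndexOf(".");
  const extension = dot >= 0 ? name.slice(dot + 1).toLowerCase() : "";
  return IGNORED_EXTENSIONS.has(extension);
}

const candidates = [".git/config", "assets/logo.PNG", "tsconfig.json", "src/index.ts", "package.json"];
const kept = candidates.filter((path) => !shouldIgnore(path));
console.log(kept);
```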
This project is licensed under the MIT License.
arnab2001
Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute, and see the Contribution Guide.
Open Source Community Conduct
We are committed to fostering a welcoming and inclusive open-source community. We expect all contributors to adhere to our Code of Conduct to create a respectful and collaborative environment.
Give a ⭐️ if this project helped you!