
Enhance configuration options #385


Open
wants to merge 7 commits into
base: main
57 changes: 57 additions & 0 deletions README.md
@@ -135,12 +135,60 @@ By default, the digest is written to a text file (`digest.txt`) in your current
- Use `--output/-o <filename>` to write to a specific file.
- Use `--output/-o -` to output directly to `STDOUT` (useful for piping to other tools).

### 🔧 Configure processing limits

```bash
# Set higher limits for large repositories
gitingest https://github.com/torvalds/linux \
--max-files 100000 \
--max-total-size 2147483648 \
--max-directory-depth 25

# Process only Python files up to 1MB each
gitingest /path/to/project \
--include-pattern "*.py" \
--max-size 1048576 \
--max-files 1000
```
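
The raw byte counts passed to `--max-size` and `--max-total-size` above are easy to mistype. The following is a minimal Python sketch for deriving them from human-readable sizes. `to_bytes` is a hypothetical helper, not part of gitingest, and it assumes binary units (1 MB = 1024 × 1024 bytes), which matches the values used in this README.

```python
import re

# Binary multipliers (assumption: 1 KB = 1024 bytes, as in the README's values).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def to_bytes(size: str) -> int:
    """Convert a string such as "20MB" or "2 GB" to a byte count (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"Unrecognized size: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(to_bytes("1MB"))  # value used for --max-size in the second example
print(to_bytes("2GB"))  # value used for --max-total-size in the first example
```

The printed numbers can be pasted directly into the command line or into the corresponding `GITINGEST_` environment variables.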

See more options and usage details with:

```bash
gitingest --help
```

### Configuration via Environment Variables

You can configure various limits and settings using environment variables. All configuration environment variables start with the `GITINGEST_` prefix:

#### File Processing Configuration

- `GITINGEST_MAX_FILE_SIZE` - Maximum size of a single file to process *(default: 10485760 bytes, 10 MB)*
- `GITINGEST_MAX_FILES` - Maximum number of files to process *(default: 10000)*
- `GITINGEST_MAX_TOTAL_SIZE_BYTES` - Maximum size of output file *(default: 524288000 bytes, 500 MB)*
- `GITINGEST_MAX_DIRECTORY_DEPTH` - Maximum depth of directory traversal *(default: 20)*
- `GITINGEST_DEFAULT_TIMEOUT` - Default operation timeout in seconds *(default: 60)*
- `GITINGEST_OUTPUT_FILE_NAME` - Default output filename *(default: "digest.txt")*
- `GITINGEST_TMP_BASE_PATH` - Base path for temporary files *(default: system temp directory)*

#### Server Configuration (for self-hosting)

- `GITINGEST_MAX_DISPLAY_SIZE` - Maximum size of content to display in UI *(default: 300000 bytes)*
- `GITINGEST_DELETE_REPO_AFTER` - Repository cleanup timeout in seconds *(default: 3600, 1 hour)*
- `GITINGEST_MAX_FILE_SIZE_KB` - Maximum file size for UI slider in kB *(default: 102400, 100 MB)*
- `GITINGEST_MAX_SLIDER_POSITION` - Maximum slider position in UI *(default: 500)*

#### Example usage

```bash
# Configure for large scientific repositories
export GITINGEST_MAX_FILES=50000
export GITINGEST_MAX_FILE_SIZE=20971520 # 20 MB
export GITINGEST_MAX_TOTAL_SIZE_BYTES=1073741824 # 1 GB

gitingest https://github.com/some/large-repo
```
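
As a rough illustration of how a `GITINGEST_`-prefixed integer setting can be resolved with a fallback, here is a minimal, self-contained sketch. `read_int_setting` is a hypothetical name used only for this example; the project's own helper may differ in naming and in how it reports invalid values.

```python
import os


def read_int_setting(name: str, default: int) -> int:
    """Read GITINGEST_<name> as an int, falling back to default (hypothetical helper)."""
    raw = os.environ.get(f"GITINGEST_{name}")
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        # An unparsable value is reported and ignored rather than treated as fatal.
        print(f"Warning: invalid value for GITINGEST_{name}: {raw!r}. Using default: {default}")
        return default


os.environ["GITINGEST_MAX_FILES"] = "50000"
print(read_int_setting("MAX_FILES", 10_000))
```

In a design like this, a value that is resolved once when the configuration module is imported will not pick up later changes, so it is safest to export the variables before starting `gitingest`.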

## 🐍 Python package usage

```python
@@ -169,6 +217,15 @@ summary, tree, content = ingest("https://github.com/username/private-repo")

# Include repository submodules
summary, tree, content = ingest("https://github.com/username/repo-with-submodules", include_submodules=True)

# Configure limits programmatically
summary, tree, content = ingest(
    "https://github.com/username/large-repo",
    max_file_size=20 * 1024 * 1024,  # 20 MB per file
    max_files=50000,  # 50k files max
    max_total_size_bytes=1024**3,  # 1 GB total
    max_directory_depth=30  # 30 levels deep
)
```

By default, this doesn't write a file; pass the `output` argument to write one.
6 changes: 3 additions & 3 deletions pyproject.toml
@@ -98,9 +98,9 @@ per-file-ignores = { "tests/**/*.py" = ["S101"] } # Skip the "assert used" warni
[tool.ruff.lint.pylint]
max-returns = 10

[tool.ruff.lint.isort]
order-by-type = true
case-sensitive = true
# [tool.ruff.lint.isort]
# order-by-type = true
# case-sensitive = true

[tool.pycln]
all = true
51 changes: 45 additions & 6 deletions src/gitingest/__main__.py
@@ -9,16 +9,20 @@
import click
from typing_extensions import Unpack

from gitingest.config import MAX_FILE_SIZE, OUTPUT_FILE_NAME
from gitingest.config import MAX_DIRECTORY_DEPTH, MAX_FILE_SIZE, MAX_FILES, MAX_TOTAL_SIZE_BYTES, OUTPUT_FILE_NAME
from gitingest.entrypoint import ingest_async


class _CLIArgs(TypedDict):
    source: str
    max_size: int
    max_files: int
    max_total_size: int
    max_directory_depth: int
    exclude_pattern: tuple[str, ...]
    include_pattern: tuple[str, ...]
    branch: str | None
    tag: str | None
    include_gitignored: bool
    include_submodules: bool
    token: str | None
@@ -34,6 +38,24 @@ class _CLIArgs(TypedDict):
show_default=True,
help="Maximum file size to process in bytes",
)
@click.option(
    "--max-files",
    default=MAX_FILES,
    show_default=True,
    help="Maximum number of files to process",
)
@click.option(
    "--max-total-size",
    default=MAX_TOTAL_SIZE_BYTES,
    show_default=True,
    help="Maximum total size of all files in bytes",
)
@click.option(
    "--max-directory-depth",
    default=MAX_DIRECTORY_DEPTH,
    show_default=True,
    help="Maximum depth of directory traversal",
)
@click.option("--exclude-pattern", "-e", multiple=True, help="Shell-style patterns to exclude.")
@click.option(
"--include-pattern",
@@ -42,6 +64,7 @@ class _CLIArgs(TypedDict):
help="Shell-style patterns to include.",
)
@click.option("--branch", "-b", default=None, help="Branch to clone and ingest")
@click.option("--tag", default=None, help="Tag to clone and ingest")
@click.option(
"--include-gitignored",
is_flag=True,
@@ -98,7 +121,7 @@ def main(**cli_kwargs: Unpack[_CLIArgs]) -> None:
$ gitingest --include-pattern "*.js" --exclude-pattern "node_modules/*"

Private repositories:
$ gitingest https://github.com/user/private-repo -t ghp_token
$ gitingest https://github.com/user/private-repo --token ghp_token
$ GITHUB_TOKEN=ghp_token gitingest https://github.com/user/private-repo

Include submodules:
@@ -112,9 +135,13 @@ async def _async_main(
source: str,
*,
max_size: int = MAX_FILE_SIZE,
max_files: int = MAX_FILES,
max_total_size: int = MAX_TOTAL_SIZE_BYTES,
max_directory_depth: int = MAX_DIRECTORY_DEPTH,
exclude_pattern: tuple[str, ...] | None = None,
include_pattern: tuple[str, ...] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
include_submodules: bool = False,
token: str | None = None,
@@ -132,21 +159,29 @@ async def _async_main(
A directory path or a Git repository URL.
max_size : int
Maximum file size in bytes to ingest (default: 10 MB).
max_files : int
Maximum number of files to ingest (default: 10,000).
max_total_size : int
Maximum total size of output file in bytes (default: 500 MB).
max_directory_depth : int
Maximum depth of directory traversal (default: 20).
exclude_pattern : tuple[str, ...] | None
Glob patterns for pruning the file set.
include_pattern : tuple[str, ...] | None
Glob patterns for including files in the output.
branch : str | None
Git branch to ingest. If ``None``, the repository's default branch is used.
Git branch to clone and ingest (default: the default branch).
tag : str | None
Git tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, also ingest files matched by ``.gitignore`` or ``.gitingestignore`` (default: ``False``).
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
If ``True``, recursively include all Git submodules within the repository (default: ``False``).
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
The path where the output file will be written (default: ``digest.txt`` in current directory).
The path where the output file is written (default: ``digest.txt`` in current directory).
Use ``"-"`` to write to ``stdout``.

Raises
@@ -170,9 +205,13 @@ async def _async_main(
summary, _, _ = await ingest_async(
source,
max_file_size=max_size,
include_patterns=include_patterns,
max_files=max_files,
max_total_size_bytes=max_total_size,
max_directory_depth=max_directory_depth,
exclude_patterns=exclude_patterns,
include_patterns=include_patterns,
branch=branch,
tag=tag,
include_gitignored=include_gitignored,
include_submodules=include_submodules,
token=token,
24 changes: 17 additions & 7 deletions src/gitingest/config.py
@@ -3,12 +3,22 @@
import tempfile
from pathlib import Path

MAX_FILE_SIZE = 10 * 1024 * 1024 # Maximum size of a single file to process (10 MB)
MAX_DIRECTORY_DEPTH = 20 # Maximum depth of directory traversal
MAX_FILES = 10_000 # Maximum number of files to process
MAX_TOTAL_SIZE_BYTES = 500 * 1024 * 1024 # Maximum size of output file (500 MB)
DEFAULT_TIMEOUT = 60 # seconds
from gitingest.utils.config_utils import _get_env_var

OUTPUT_FILE_NAME = "digest.txt"
def _get_int_env_var(key: str, default: int) -> int:
    """Get environment variable as integer with fallback to default."""
    try:
        return int(_get_env_var(key, str(default)))
    except ValueError:
        print(f"Warning: Invalid value for GITINGEST_{key}. Using default: {default}")
        return default

TMP_BASE_PATH = Path(tempfile.gettempdir()) / "gitingest"
MAX_FILE_SIZE = _get_int_env_var("MAX_FILE_SIZE", 10 * 1024 * 1024) # Max file size to process in bytes (10 MB)
MAX_FILES = _get_int_env_var("MAX_FILES", 10_000) # Max number of files to process
MAX_TOTAL_SIZE_BYTES = _get_int_env_var("MAX_TOTAL_SIZE_BYTES", 500 * 1024 * 1024) # Max output file size (500 MB)
MAX_DIRECTORY_DEPTH = _get_int_env_var("MAX_DIRECTORY_DEPTH", 20) # Max depth of directory traversal

DEFAULT_TIMEOUT = _get_int_env_var("DEFAULT_TIMEOUT", 60) # Default timeout for git operations in seconds

OUTPUT_FILE_NAME = _get_env_var("OUTPUT_FILE_NAME", "digest.txt")
TMP_BASE_PATH = Path(_get_env_var("TMP_BASE_PATH", tempfile.gettempdir())) / "gitingest"
64 changes: 44 additions & 20 deletions src/gitingest/entrypoint.py
@@ -22,8 +22,11 @@ async def ingest_async(
source: str,
*,
max_file_size: int = MAX_FILE_SIZE,
include_patterns: str | set[str] | None = None,
max_files: int | None = None,
max_total_size_bytes: int | None = None,
max_directory_depth: int | None = None,
exclude_patterns: str | set[str] | None = None,
include_patterns: str | set[str] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
@@ -40,17 +43,23 @@ async def ingest_async(
Parameters
----------
source : str
The source to analyze, which can be a URL (for a Git repository) or a local directory path.
A directory path or a Git repository URL.
max_file_size : int
Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
include_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
Maximum file size in bytes to ingest (default: 10 MB).
max_files : int | None
Maximum number of files to ingest (default: 10,000).
max_total_size_bytes : int | None
Maximum total size of output file in bytes (default: 500 MB).
max_directory_depth : int | None
Maximum depth of directory traversal (default: 20).
exclude_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
Glob patterns for pruning the file set.
include_patterns : str | set[str] | None
Glob patterns for including files in the output.
branch : str | None
The branch to clone and ingest (default: the default branch).
Git branch to clone and ingest (default: the default branch).
tag : str | None
The tag to clone and ingest. If ``None``, no tag is used.
Git tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
@@ -59,7 +68,7 @@ async def ingest_async(
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
File path where the summary and content should be written.
File path where the summary and content are written.
If ``"-"`` (dash), the results are written to ``stdout``.
If ``None``, the results are not written to a file.

@@ -77,9 +86,12 @@ async def ingest_async(
query: IngestionQuery = await parse_query(
source=source,
max_file_size=max_file_size,
max_files=max_files,
max_total_size_bytes=max_total_size_bytes,
max_directory_depth=max_directory_depth,
from_web=False,
include_patterns=include_patterns,
ignore_patterns=exclude_patterns,
include_patterns=include_patterns,
token=token,
)

@@ -101,8 +113,11 @@ def ingest(
source: str,
*,
max_file_size: int = MAX_FILE_SIZE,
include_patterns: str | set[str] | None = None,
max_files: int | None = None,
max_total_size_bytes: int | None = None,
max_directory_depth: int | None = None,
exclude_patterns: str | set[str] | None = None,
include_patterns: str | set[str] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
@@ -119,17 +134,23 @@ def ingest(
Parameters
----------
source : str
The source to analyze, which can be a URL (for a Git repository) or a local directory path.
A directory path or a Git repository URL.
max_file_size : int
Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
include_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
Maximum file size in bytes to ingest (default: 10 MB).
max_files : int | None
Maximum number of files to ingest (default: 10,000).
max_total_size_bytes : int | None
Maximum total size of output file in bytes (default: 500 MB).
max_directory_depth : int | None
Maximum depth of directory traversal (default: 20).
exclude_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
Glob patterns for pruning the file set.
include_patterns : str | set[str] | None
Glob patterns for including files in the output.
branch : str | None
The branch to clone and ingest (default: the default branch).
Git branch to clone and ingest (default: the default branch).
tag : str | None
The tag to clone and ingest. If ``None``, no tag is used.
Git tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
@@ -138,7 +159,7 @@ def ingest(
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
File path where the summary and content should be written.
File path where the summary and content are written.
If ``"-"`` (dash), the results are written to ``stdout``.
If ``None``, the results are not written to a file.

@@ -159,8 +180,11 @@ def ingest(
ingest_async(
source=source,
max_file_size=max_file_size,
include_patterns=include_patterns,
max_files=max_files,
max_total_size_bytes=max_total_size_bytes,
max_directory_depth=max_directory_depth,
exclude_patterns=exclude_patterns,
include_patterns=include_patterns,
branch=branch,
tag=tag,
include_gitignored=include_gitignored,