Merged
26 commits
532107d
📝 Add docstrings to `feature/github-integration`
coderabbitai[bot] Aug 28, 2025
4076f48
Merge pull request #7 from r3d91ll/coderabbitai/docstrings/67e8628
r3d91ll Aug 28, 2025
8b48ffd
feat: Complete GitHub repository integration with Tree-sitter (Issue #5)
rd91ll Aug 29, 2025
5308b95
feat: Enhance ArXiv paper collection and processing scripts
rd91ll Aug 29, 2025
accd4a6
feat: Implement word2vec evolution experiment with theory-practice br…
rd91ll Sep 2, 2025
7c1ea1f
feat: Add PDF size update script and orchestration module
rd91ll Sep 2, 2025
9b82f1c
fix: Remove embedding files from git and update .gitignore
rd91ll Sep 2, 2025
d9243e8
fix: Critical security and import path fixes from CodeRabbit review
rd91ll Sep 2, 2025
2ccbfe7
fix: Additional CodeRabbit issues - DB manager and extractors
rd91ll Sep 2, 2025
aaa97aa
Merge branch 'main' into feature/arxiv-enhancements
r3d91ll Sep 2, 2025
144f6ef
feat: Update .gitignore to include logs and new experiment directories
rd91ll Sep 2, 2025
80b0875
Merge branch 'feature/github_arxiv_tooling' into feature/arxiv-enhanc…
r3d91ll Sep 2, 2025
bc144e0
Merge pull request #15 from r3d91ll/feature/arxiv-enhancements
r3d91ll Sep 2, 2025
8146def
Update core/framework/extractors/code_extractor.py
r3d91ll Sep 3, 2025
db0931a
Update core/framework/extractors/code_extractor.py
r3d91ll Sep 3, 2025
ebe5277
Update docs/implementation/arxiv_metadata_service_implementation.md
r3d91ll Sep 3, 2025
a2f29bb
Update experiments/word2vec_evolution/process_papers.py
r3d91ll Sep 3, 2025
4d96755
Update experiments/word2vec_evolution/process_papers.py
r3d91ll Sep 3, 2025
e756110
fix: Remove result files from git and update .gitignore
rd91ll Sep 3, 2025
8736363
fix: Add deduplication for existing collection papers
rd91ll Sep 3, 2025
4fbc448
fix: Enhance tokenizer interface validation in TokenBasedChunking
rd91ll Sep 3, 2025
80596db
fix: Format YAML configuration for improved readability
rd91ll Sep 3, 2025
2af1332
fix: Refactor repository processing to use GitHubPipelineManager and …
rd91ll Sep 3, 2025
4d99984
fix: Add newline at the end of experiment_config.yaml for consistency
rd91ll Sep 3, 2025
12d75b0
fix: Address CodeRabbit review comments from PR #17
rd91ll Sep 3, 2025
638dabd
Merge pull request #18 from r3d91ll/pr-17
r3d91ll Sep 3, 2025
16 changes: 16 additions & 0 deletions .gitignore
@@ -143,6 +143,7 @@ cython_debug/
# VS Code
.vscode/
*.code-workspace
.codex/

# Local History for Visual Studio Code
.history/
@@ -361,9 +362,24 @@ experiments/*/logs/
experiments/*/.cache/
experiments/*/checkpoints/
experiments/*/*.checkpoint
experiments/*/embeddings/*_embeddings.json

Acheron/
Acheron/*
logs/
experiments/word2vec_evolution/embeddings/
experiments/word2vec_evolution/extracted_papers/
experiments/word2vec_evolution/*_results.json
experiments/word2vec_evolution/*_analysis.json
# Test artifacts only, not test source files
tests/*.log
tests/.cache/
tests/__pycache__/
tests/*.pyc
tools/arxiv/scripts/data/*
tools/arxiv/scripts/data/arxiv_collections/
# Pipeline result files
*_results_*.json
tools/arxiv/pipelines/*_results_*.json
tools/github/*_results_*.json
arxiv_pipeline_v2_results_*.json
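
Review note on the pipeline result patterns above: a gitignore pattern that contains no slash is matched against the file name at any depth, so the bare `*_results_*.json` entry appears to already cover the three more specific entries that follow it. The sketch below is a rough check of that, using Python's `fnmatch` on the base name as an approximation of the rule. It is not a full gitignore implementation, and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

PATTERN = "*_results_*.json"  # the slash-free pattern added in this diff


def ignored_by_basename(path: str, pattern: str = PATTERN) -> bool:
    """Approximate a slash-free gitignore pattern by matching the file name only."""
    return fnmatchcase(PurePosixPath(path).name, pattern)


# Hypothetical sample paths, chosen to mirror the directories named in the diff.
samples = [
    "arxiv_pipeline_v2_results_20250903.json",
    "tools/arxiv/pipelines/acid_results_run1.json",
    "tools/github/repo_results_run1.json",
    "experiments/word2vec_evolution/w2v_results.json",
]

for sample in samples:
    print(sample, ignored_by_basename(sample))
```

The last sample has no second underscore after `results`, so it falls outside this pattern and depends on the separate `experiments/word2vec_evolution/*_results.json` entry.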
66 changes: 50 additions & 16 deletions README.md
@@ -21,22 +21,30 @@ HADES-Lab/
│ ├── mcp_server/ # MCP interface for Claude integration
│ ├── framework/ # Shared framework
│ │ ├── embedders.py # Jina v4 embeddings
│ │ ├── extractors/ # Docling PDF extraction
│ │ ├── extractors/ # Content extraction
│ │ │ ├── docling_extractor.py # PDF extraction
│ │ │ ├── code_extractor.py # Code file extraction
│ │ │ └── tree_sitter_extractor.py # Symbol extraction
│ │ └── storage.py # ArangoDB management
│ ├── processors/ # Base processor classes
│ ├── utils/ # Core utilities
│ └── logs/ # Core system logs
├── tools/ # Processing tools (data sources)
│ └── arxiv/ # ArXiv paper processing
│ ├── pipelines/ # ACID-compliant pipelines
│ ├── monitoring/ # Real-time monitoring
│ ├── database/ # Database utilities
│ ├── scripts/ # Utility scripts
│ ├── utils/ # Database utilities
│ ├── tests/ # Integration tests
│ ├── configs/ # Pipeline configurations
│ └── logs/ # ArXiv processing logs
│ ├── arxiv/ # ArXiv paper processing
│ │ ├── pipelines/ # ACID-compliant pipelines
│ │ ├── monitoring/ # Real-time monitoring
│ │ ├── database/ # Database utilities
│ │ ├── scripts/ # Utility scripts
│ │ ├── utils/ # Database utilities
│ │ ├── tests/ # Integration tests
│ │ ├── configs/ # Pipeline configurations
│ │ └── logs/ # ArXiv processing logs
│ └── github/ # GitHub repository processing
│ ├── configs/ # GitHub pipeline configurations
│ ├── github_pipeline_manager.py # Graph-based processing
│ ├── setup_github_graph.py # Graph collection setup
│ └── test_treesitter_simple.py # Tree-sitter testing
├── experiments/ # Research and experimentation
│ ├── README.md # Experiment guidelines
@@ -76,6 +84,9 @@ HADES-Lab/
- **ACID Pipeline**: 11.3 papers/minute with 100% success rate (validated on 1000+ papers)
- **Phase-Separated Architecture**: Extract (GPU-accelerated Docling) → Embed (Jina v4)
- **Direct PDF Processing**: No database dependencies, processes from local filesystem
- **GitHub Repository Processing**: Clone, extract, embed code with Tree-sitter symbol extraction
- **Graph-Based Storage**: Repository → File → Chunk → Embedding relationships in ArangoDB
- **Tree-sitter Integration**: Symbol extraction for Python, JavaScript, TypeScript, Java, C/C++, Go, Rust
- **Jina v4 Late Chunking**: Context-aware embeddings preserving document structure (32k tokens)
- **Multi-collection Storage**: Separate ArangoDB collections for embeddings, equations, tables, images
- **SQLite Caching**: Optional local cache for PDF indexing and metadata
@@ -84,9 +95,9 @@ HADES-Lab/

### In Development

- **Additional Data Sources**: GitHub repositories, web documentation
- **Cross-source Graphs**: Theory-practice bridges across multiple sources
- **Enhanced Embeddings**: Domain-specific fine-tuning
- **Cross-Repository Analysis**: Theory-practice bridge detection across repositories
- **Enhanced Config Understanding**: Leveraging Jina v4's coding LoRA for config semantics
- **Incremental Repository Updates**: Only process changed files
- **Active Monitoring**: Real-time pipeline monitoring system

## Installation
@@ -125,6 +136,27 @@ python arxiv_pipeline.py \
tail -f tools/arxiv/logs/acid_phased.log
```

### GitHub Repository Processing

```bash
# Setup graph collections (first time only)
cd tools/github/
python setup_github_graph.py

# Process a single repository
python github_pipeline_manager.py --repo "owner/repo"

# Example: Process word2vec repository
python github_pipeline_manager.py --repo "dav/word2vec"

# Query processed repositories (in ArangoDB)
# Find all embeddings for a repository:
# FOR v, e, p IN 1..3 OUTBOUND 'github_repositories/owner_repo'
# GRAPH 'github_graph'
# FILTER IS_SAME_COLLECTION('github_embeddings', v)
# RETURN v
```
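
The same traversal could also be run from Python. The sketch below is not part of this repository. It assumes a `python-arango` database handle is passed in as `db`, and it assumes the repository document key is the `owner/repo` name with `/` replaced by `_`, as the `owner_repo` placeholder above suggests. The helper names `repo_document_id` and `fetch_embeddings` are hypothetical.

```python
from typing import Any, Dict, List

# Same traversal as the AQL comment above, with the start vertex as a bind parameter.
EMBEDDINGS_QUERY = """
FOR v, e, p IN 1..3 OUTBOUND @start
    GRAPH 'github_graph'
    FILTER IS_SAME_COLLECTION('github_embeddings', v)
    RETURN v
"""


def repo_document_id(full_name: str) -> str:
    """Map 'owner/repo' to 'github_repositories/owner_repo' (assumed key scheme)."""
    owner, name = full_name.split("/", 1)
    return f"github_repositories/{owner}_{name}"


def fetch_embeddings(db: Any, full_name: str) -> List[Dict[str, Any]]:
    """Run the traversal and collect every embedding vertex for one repository.

    `db` is expected to behave like a python-arango database object,
    i.e. to expose `db.aql.execute(query, bind_vars=...)`.
    """
    cursor = db.aql.execute(
        EMBEDDINGS_QUERY,
        bind_vars={"start": repo_document_id(full_name)},
    )
    return list(cursor)
```

With `python-arango`, a handle of this kind is typically obtained through `ArangoClient(hosts=...).db(...)`; the host, database name, and credentials depend on the local setup.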

### Creating Experiments

```bash
@@ -191,9 +223,11 @@ python -c "import json; papers = json.load(open('../datasets/cs_papers.json'))"
1. **Streamlined Architecture**: Direct PDF processing with ArangoDB storage, optional SQLite caching
2. **Late Chunking**: Process full documents (32k tokens) before chunking for context preservation
3. **Multi-source Integration**: Unified framework for ArXiv, GitHub, and Web data
4. **Information Reconstructionism**: Implementing WHERE × WHAT × CONVEYANCE × TIME theory
5. **Experiments Framework**: Structured research environment with infrastructure separation
6. **Archaeological Code Preservation**: Acheron protocol maintains complete development history
4. **Graph-Based Code Storage**: Repository → File → Chunk → Embedding relationships enable cross-repo analysis
5. **Tree-sitter Symbol Extraction**: AST-based symbol extraction without semantic interpretation (let Jina handle that)
6. **Information Reconstructionism**: Implementing WHERE × WHAT × CONVEYANCE × TIME theory
7. **Experiments Framework**: Structured research environment with infrastructure separation
8. **Archaeological Code Preservation**: Acheron protocol maintains complete development history
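
As a rough illustration of the graph-based storage in point 4, the sketch below builds the edge documents for one Repository → File → Chunk → Embedding chain. ArangoDB edge documents link vertices through `_from` and `_to`. The collection names `github_repositories` and `github_embeddings` come from the query example in this README; `github_files`, `github_chunks`, the document keys, and the helper names are hypothetical, and the actual schema may differ.

```python
from typing import Dict, List

# First and last names appear in this README; the two in the middle are assumptions.
COLLECTIONS = [
    "github_repositories",
    "github_files",       # hypothetical
    "github_chunks",      # hypothetical
    "github_embeddings",
]


def edge(from_id: str, to_id: str) -> Dict[str, str]:
    """Build a minimal ArangoDB-style edge document."""
    return {"_from": from_id, "_to": to_id}


def chain_edges(repo_key: str, file_key: str, chunk_key: str, embedding_key: str) -> List[Dict[str, str]]:
    """Return the three edges linking one repository down to one embedding."""
    keys = [repo_key, file_key, chunk_key, embedding_key]
    ids = [f"{collection}/{key}" for collection, key in zip(COLLECTIONS, keys)]
    return [edge(a, b) for a, b in zip(ids, ids[1:])]


# Hypothetical keys for a single chunk of a single file.
edges = chain_edges("dav_word2vec", "file_001", "chunk_001", "emb_001")
for e in edges:
    print(e["_from"], "->", e["_to"])
```

Under this layout an embedding sits three hops from its repository, which is consistent with the `1..3 OUTBOUND` depth used in the traversal query.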

## License
