# Project Yggdrasil

A comprehensive knowledge graph built from digimon.net/reference to analyze relationships between Digimon based on their characteristics, evolution patterns, and shared attributes.
## What It Does

This project creates a searchable, analyzable network of all Digimon and their relationships by:

- **Collecting**: Scraping comprehensive data from the official Japanese Digimon reference
- **Translating**: Converting Japanese content to English for accessibility
- **Structuring**: Parsing unstructured HTML into organized data
- **Connecting**: Building a graph database of relationships
- **Analyzing**: Discovering patterns and insights through network analysis
### Goals

- **Comprehensive Data Collection**: Capture all 1,249+ Digimon with their complete profiles
- **Relationship Mapping**: Identify evolution chains, type similarities, and shared attributes
- **Pattern Discovery**: Uncover hidden connections and clustering in the Digimon universe
- **Research Platform**: Provide a queryable database for fans and researchers
- **Technical Demonstration**: Showcase modern data engineering practices
### Deliverables

- **Complete Digimon Database**: Neo4j graph with all Digimon as nodes
- **Relationship Network**: Edges representing evolutions, shared types, attributes, and moves
- **Analytical Insights**: Statistics on type distributions, evolution patterns, and network centrality
- **Visual Reports**: Network visualizations and analysis charts
- **Query Interface**: Cypher queries for exploring specific relationships
### Documentation

- **Analysis Specification**: Comprehensive specification for the 8-notebook analysis suite
- **Methodology Guide**: Detailed statistical methods, algorithms, and ML approaches
- **Visualization Guide**: Complete specifications for 30+ visualizations
- **Insights Summary**: Expected findings, metrics, and practical applications
### Analysis Notebooks

The analysis suite consists of eight notebooks:

- **Data Exploration & Profiling**: Dataset statistics and quality assessment
- **Evolution Network Analysis**: Evolution chains and branching patterns
- **Type-Attribute Correlation**: Statistical relationships and pattern mining
- **Move Network Analysis**: Move-based connections and clustering
- **Community Detection**: Graph clustering and natural groupings
- **Centrality & Influence**: Network importance metrics
- **Machine Learning**: Predictive models targeting 85%+ accuracy
- **Recommendation System**: Similarity metrics and team optimization
## Architecture

The system follows a modular pipeline architecture where each component has a specific responsibility in the data processing flow.
```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│   digimon.net   │────▶│     Scraper     │────▶│    Raw HTML     │
│  (Data Source)  │     │     (Async)     │     │     Storage     │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│   Translation   │◀────│     Parser      │◀────│   Structured    │
│  (Google API)   │     │      (BS4)      │     │    JSON Data    │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│   Neo4j Graph   │◀────│     Loader      │     │    Analysis     │
│    Database     │     │    (py2neo)     │────▶│   (NetworkX)    │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
### Data Flow

This diagram shows how data flows through the system from source to analysis, including all intermediate storage layers.
```mermaid
flowchart LR
    subgraph DS["Data Sources"]
        A[digimon.net/reference]
    end

    subgraph DP["Data Pipeline"]
        B[Scraper<br/>BeautifulSoup4]
        C[Parser<br/>HTML → JSON]
        D[Translator<br/>JP → EN]
        E[Loader<br/>JSON → Neo4j]
    end

    subgraph ST["Storage"]
        F[(Raw HTML<br/>Files)]
        G[(Parsed JSON<br/>Files)]
        H[(Translated<br/>JSON)]
        I[(Neo4j<br/>Graph DB)]
    end

    subgraph AN["Analysis"]
        J[NetworkX<br/>Analyzer]
        K[Notebooks<br/>& Visualizations]
    end

    A -->|HTTP Requests| B
    B -->|Save| F
    F -->|Read| C
    C -->|Save| G
    G -->|Read| D
    D -->|Cache| H
    H -->|Read| E
    E -->|Import| I
    I -->|Query| J
    J -->|Generate| K

    style DS fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style DP fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style ST fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style AN fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style A fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style B fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style C fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style D fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style E fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style F fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style G fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style H fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style I fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style J fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style K fill:#444,stroke:#666,stroke-width:1px,color:#ccc
```
### Component Architecture

This diagram illustrates the modular architecture, showing how the CLI interface connects to core modules and infrastructure.
```mermaid
graph TB
    subgraph CI["CLI Interface"]
        CLI[ygg CLI<br/>Click Framework]
    end

    subgraph CM["Core Modules"]
        SCR[Scraper Module<br/>• Rate Limiting<br/>• Async Support<br/>• Error Handling]
        PRS[Parser Module<br/>• BeautifulSoup4<br/>• CSS Selectors<br/>• Data Extraction]
        TRN[Translator Module<br/>• Google Translate<br/>• Caching System<br/>• Batch Processing]
        LDR[Loader Module<br/>• Neo4j Driver<br/>• Schema Creation<br/>• Relationship Building]
        ANL[Analyzer Module<br/>• NetworkX<br/>• Graph Algorithms<br/>• Statistics]
    end

    subgraph IN["Infrastructure"]
        NEO[Neo4j Database<br/>Community Edition]
        FS[File System<br/>• HTML Storage<br/>• JSON Storage<br/>• Cache Files]
    end

    CLI --> SCR
    CLI --> PRS
    CLI --> TRN
    CLI --> LDR
    CLI --> ANL
    SCR --> FS
    PRS --> FS
    TRN --> FS
    LDR --> NEO
    ANL --> NEO

    style CI fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style CM fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style IN fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style CLI fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style SCR fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style PRS fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style TRN fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style LDR fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style ANL fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style NEO fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style FS fill:#444,stroke:#666,stroke-width:1px,color:#ccc
```
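As a rough sketch of how a Click-based CLI like `ygg` might be wired (a sketch only: the command names mirror the ones documented below, and the bodies are placeholders, not the real `yggdrasil_cli.py`):

```python
# Hedged sketch of a Click command group; bodies are placeholders, not the
# actual pipeline logic behind the documented ygg commands.
import click


@click.group()
def cli() -> None:
    """Project Yggdrasil pipeline commands."""


@cli.command()
def scrape() -> None:
    """Scrape digimon.net/reference into data/raw/html/."""
    click.echo("scraping...")


@cli.command()
def parse() -> None:
    """Parse raw HTML into structured JSON."""
    click.echo("parsing...")


if __name__ == "__main__":
    cli()
```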
### Pipeline Phases

1. **Data Collection Phase**
   - API fetcher retrieves the list of all Digimon URLs
   - Async scraper downloads HTML pages with rate limiting (see the sketch after this list)
   - Raw HTML and images are stored locally

2. **Processing Phase**
   - Parser extracts structured data from the HTML
   - Identifies Japanese/English names, types, attributes, and moves
   - Saves as JSON with a consistent schema

3. **Translation Phase**
   - Translates Japanese profile text to English
   - Uses caching to avoid duplicate API calls
   - Preserves the original Japanese for reference

4. **Graph Construction Phase**
   - Creates nodes for Digimon, Types, Attributes, and Moves
   - Establishes relationships between entities
   - Builds indexes for efficient querying

5. **Analysis Phase**
   - Network analysis identifies central Digimon
   - Community detection finds clusters
   - Evolution chain analysis traces lineages
   - Statistical reports are generated
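The collection step can be pictured as a small rate-limited fetch loop. This is a minimal sketch assuming aiohttp and an illustrative page map; the real scraper (`src/scraper/fetcher.py`) also handles retries, robots.txt checks, and images.

```python
# Hedged sketch: rate-limited async scraping with aiohttp (an assumed client).
import asyncio
from pathlib import Path

import aiohttp

SCRAPE_DELAY = 1.0  # seconds between requests, mirroring the SCRAPE_DELAY setting


async def fetch_all(pages: dict[str, str], out_dir: Path) -> None:
    """Download each URL and save it as <slug>.html under out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    async with aiohttp.ClientSession() as session:
        for slug, url in pages.items():
            async with session.get(url) as resp:
                html = await resp.text()
            (out_dir / f"{slug}.html").write_text(html, encoding="utf-8")
            await asyncio.sleep(SCRAPE_DELAY)  # be respectful: throttle requests


# Illustrative page map; the real URL list comes from the API fetcher.
asyncio.run(fetch_all({"agumon": "https://digimon.net/reference/"},
                      Path("data/raw/html")))
```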
### Core Components

- **Scraper** (`src/scraper/`): Async web scraping with robots.txt compliance
- **Parser** (`src/parser/`): BeautifulSoup-based HTML parsing
- **Translator** (`src/processor/`): Google Translate API integration with caching (see the sketch below)
- **Graph Loader** (`src/graph/`): Neo4j database population
- **Analyzer** (`src/analysis/`): NetworkX-based graph analysis
- **CLI** (`yggdrasil_cli.py`): Unified command-line interface
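The translator's caching can be sketched as a check-before-call wrapper. A minimal sketch, assuming the deep-translator package as the client (an assumption; only the cache location mirrors the documented `data/cache/translations.json`):

```python
# Hedged sketch of cache-first translation; deep-translator is an assumed
# client, not necessarily what src/processor/translator.py uses.
import json
from pathlib import Path

from deep_translator import GoogleTranslator

CACHE_PATH = Path("data/cache/translations.json")


def translate_cached(text_jp: str) -> str:
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if text_jp in cache:  # cache hit: no API call, so reruns are cheap
        return cache[text_jp]
    text_en = GoogleTranslator(source="ja", target="en").translate(text_jp)
    cache[text_jp] = text_en
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_text(json.dumps(cache, ensure_ascii=False, indent=2))
    return text_en
```

This is also why interrupting `ygg translate` is safe: cached items are never retranslated.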
## Quick Start

```bash
# Clone the repository
git clone https://github.com/yourusername/project-yggdrasil.git
cd project-yggdrasil

# Enter Nix development environment
nix develop

# Install the CLI
pip install -e .

# Start Neo4j and run the full pipeline
ygg start
ygg run
```

That's it! These commands start Neo4j and run the entire pipeline.
### Prerequisites

- Docker & Docker Compose
- Python 3.11+
- One of: Nix (recommended), Poetry, or standard pip/venv
### Environment Setup

**Option 1: Nix (recommended)**

```bash
# Install Nix if you haven't already
curl -L https://nixos.org/nix/install | sh

# Enable flakes (add to ~/.config/nix/nix.conf)
experimental-features = nix-command flakes

# Enter development shell
nix develop

# Or with direnv
direnv allow
```

**Option 2: Poetry**

```bash
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies
poetry install

# Activate shell
poetry shell
```

**Option 3: pyenv + venv**

```bash
# Install Python 3.11 with pyenv
pyenv install 3.11.8
pyenv local 3.11.8

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

**Option 4: Standard venv**

```bash
# Create virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Running the Pipeline

```bash
# Run everything at once
ygg run
# Or run individual steps
ygg scrape # Scrape data
ygg parse # Parse HTML to JSON
ygg translate # Translate to English
ygg load # Load into Neo4j
ygg analyze    # Run analysis
```

### Common Workflows

**First-time setup**

```bash
# 1. Clone and enter project
git clone https://github.com/yourusername/project-yggdrasil.git
cd project-yggdrasil
# 2. Enter Nix environment (installs Python, dependencies, etc.)
nix develop
# 3. Install the CLI tool
pip install -e .
# 4. Start Neo4j
ygg start
# 5. Run the full pipeline
ygg run
```

**Resuming work**

```bash
# 1. Enter project and Nix environment
cd project-yggdrasil
nix develop # or use direnv
# 2. Check current status
ygg status
# 3. Start Neo4j if needed
ygg start
# 4. Continue where you left off
ygg run   # or a specific step like 'ygg translate'
```

**Recovering from a failed scrape**

```bash
# Check what was scraped
ygg status
# Clean up partial data
ygg prune --keep-cache
# Restart scraping
ygg scrape --fetch-api
```

**Testing with a small sample**

```bash
# Scrape just a few pages for testing
python -m src.scraper.main --limit 10
# Then run the rest of the pipeline
ygg parse
ygg translate
ygg load
ygg analyze
```

**Starting fresh**

```bash
# Stop Neo4j
ygg stop
# Clean everything including Neo4j database
ygg prune --include-neo4j
# Start fresh
ygg start
ygg run
```

**Exploring the graph in Neo4j Browser**

```bash
# Make sure Neo4j is running
ygg start
# Open Neo4j Browser
# Go to: http://localhost:7474
# Login: neo4j / digimon123
# Example queries:
# - MATCH (d:Digimon) RETURN d LIMIT 25
# - MATCH (d:Digimon {name_en: "Agumon"})-[r]->(other) RETURN d, r, other
```

### Troubleshooting

**Issue: "command not found: ygg"**

```bash
# Make sure you're in Nix environment
nix develop
# Reinstall the CLI
pip install -e .
```

**Issue: Scraping shows "success=0"**

```bash
# The save_html fix might not be applied
pip install -e . --force-reinstall --no-deps
# Clean and restart
ygg prune --keep-cache
ygg scrape --fetch-api
```

**Issue: Neo4j won't start**

```bash
# Check if Docker is running
docker ps
# Check logs
ygg logs
# Try manual start
docker-compose up -d
```

**Issue: Translation taking too long**

```bash
# Translation uses caching, so you can safely interrupt (Ctrl+C)
# and resume later - it won't retranslate cached items
ygg translate
```

### Expected Runtimes

- Scraping: ~40-50 minutes for all 1,249 Digimon
- Parsing: ~5 minutes
- Translation: ~60-90 minutes (first time, much faster with cache)
- Loading: ~5 minutes
- Analysis: ~1 minute
- Total: ~2-3 hours for complete pipeline
## Methodology

### Statistical Methods

- **Chi-Square Tests**: Testing independence between type and attribute distributions (sketched below)
- **Cramér's V**: Measuring association strength in categorical variables
- **Markov Chains**: Modeling evolution transition probabilities
- **Permutation Tests**: Validating network properties against random models
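A sketch of how the type/attribute independence test and Cramér's V might look in the notebooks; the contingency table here is toy data, not real counts from the graph:

```python
# Hedged sketch: chi-square independence test plus Cramér's V on toy counts.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: types, columns: attributes (illustrative counts only)
table = np.array([
    [30, 12,  8],   # Dragon:  Vaccine, Virus, Data
    [10, 25, 15],   # Machine
    [18,  9, 20],   # Beast
])

chi2, p, dof, _expected = chi2_contingency(table)

# Cramér's V: association strength for categorical variables, in [0, 1]
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}, Cramér's V={v:.3f}")
```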
### Graph Algorithms

- **Centrality Measures**: Degree, Betweenness, Closeness, Eigenvector, PageRank (sketched below)
- **Community Detection**: Louvain, Label Propagation, Spectral Clustering
- **Path Analysis**: Shortest paths, evolution chains, cycle detection
- **Graph Embeddings**: Node2Vec, DeepWalk for similarity computation
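A minimal sketch of the centrality and Louvain community-detection steps on a toy graph; the real analyzer builds its graph from the Neo4j export instead:

```python
# Hedged sketch: centrality and community detection with NetworkX on toy edges.
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.Graph()
G.add_edges_from([
    ("Agumon", "Greymon"), ("Greymon", "MetalGreymon"),
    ("Agumon", "Gabumon"), ("Gabumon", "Garurumon"),
])

pagerank = nx.pagerank(G)                   # influence of each node
betweenness = nx.betweenness_centrality(G)  # bridging importance
communities = louvain_communities(G, seed=42)

print(max(pagerank, key=pagerank.get), communities)
```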
### Machine Learning

- **Classification**: Random Forest, XGBoost, Neural Networks for type/attribute prediction (sketched below)
- **Link Prediction**: Graph Neural Networks for evolution prediction
- **Feature Engineering**: Graph features, text embeddings, move similarity
- **Model Validation**: Cross-validation, learning curves, SHAP interpretability
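A sketch of the classification setup, assuming scikit-learn's Random Forest with cross-validation; the features and labels here are random stand-ins for the engineered graph features:

```python
# Hedged sketch: type prediction from graph features with random toy data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))     # e.g. degree, PageRank, embedding dimensions
y = rng.integers(0, 3, size=200)  # e.g. encoded Type labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # cross-validation per the methodology
print(f"mean accuracy: {scores.mean():.2f}")
```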
### Expected Findings

- **Network Properties**: Small-world network with diameter 6-10, scale-free degree distribution
- **Evolution Patterns**: 2-4 paths per Digimon, 72% type stability through evolution
- **Community Structure**: 8-12 natural communities aligned with thematic groups
- **Predictive Power**: 85%+ accuracy in type prediction using graph features
## Project Structure

```text
project-yggdrasil/
├── src/                          # Source code
│   ├── scraper/                  # Web scraping & API integration
│   │   ├── fetcher.py            # Async HTML scraper
│   │   ├── api_fetcher.py        # API endpoint discovery
│   │   └── robots_checker.py     # Robots.txt compliance
│   ├── parser/                   # HTML parsing & data extraction
│   │   ├── html_parser.py        # BeautifulSoup parser
│   │   └── main.py               # Parser orchestration
│   ├── processor/                # Data processing & translation
│   │   ├── translator.py         # Google Translate integration
│   │   └── main.py               # Processing pipeline
│   ├── graph/                    # Neo4j database layer
│   │   ├── loader.py             # Graph construction
│   │   └── main.py               # Database operations
│   ├── analysis/                 # Network analysis & insights
│   │   └── main.py               # NetworkX analysis
│   └── utils/                    # Shared utilities
│       ├── config.py             # Configuration management
│       ├── cache.py              # Translation caching
│       └── logger.py             # Logging setup
│
├── data/                         # Data storage
│   ├── raw/                      # Original scraped content
│   │   ├── html/                 # HTML pages
│   │   └── images/               # Digimon images
│   ├── processed/                # Parsed JSON data
│   ├── translated/               # English translations
│   └── cache/                    # Translation cache
│
├── notebooks/                    # Analysis notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_evolution_analysis.ipynb
│   ├── 03_type_correlation.ipynb
│   ├── 04_move_network.ipynb
│   ├── 05_community_detection.ipynb
│   ├── 06_centrality_analysis.ipynb
│   ├── 07_machine_learning.ipynb
│   └── 08_recommendations.ipynb
│
├── docs/                         # Documentation
│   ├── analysis-specification.md
│   ├── methodology.md
│   ├── visualization-guide.md
│   └── insights-summary.md
│
├── yggdrasil_cli.py              # CLI interface (ygg command)
├── docker-compose.yml            # Neo4j container setup
├── config.yaml                   # Application configuration
├── requirements.txt              # Python dependencies
├── pyproject.toml                # Poetry/packaging config
└── flake.nix                     # Nix development environment
```
## Configuration

Edit the `.env` file:

```bash
# Scraping settings
SCRAPE_DELAY=1.0   # Be respectful!
MAX_RETRIES=3

# Neo4j connection
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=digimon123
```

## Graph Schema

```mermaid
graph TD
    subgraph NT["Node Types"]
        D[Digimon<br/>• name_jp<br/>• name_en<br/>• profile<br/>• image_url]
        L[Level<br/>• name<br/>• order]
        T[Type<br/>• name]
        A[Attribute<br/>• name]
        M[Move<br/>• name<br/>• description]
    end

    D -->|HAS_LEVEL| L
    D -->|HAS_TYPE| T
    D -->|HAS_ATTRIBUTE| A
    D -->|CAN_USE| M
    D -->|RELATED_TO| D

    subgraph SR["Similarity Relationships"]
        D2[Digimon] -.->|SHARES_TYPE| D3[Digimon]
        D2 -.->|SHARES_LEVEL| D3
        D2 -.->|SHARES_ATTRIBUTE| D3
        D2 -.->|SHARES_MOVE| D3
    end

    style NT fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style SR fill:#666,stroke:#333,stroke-width:2px,color:#fff
    style D fill:#2a2a2a,stroke:#888,stroke-width:2px,color:#fff
    style L fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style T fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style A fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style M fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style D2 fill:#444,stroke:#666,stroke-width:1px,color:#ccc
    style D3 fill:#444,stroke:#666,stroke-width:1px,color:#ccc
```
```text
Nodes:
├── Digimon (Primary Entity)
│   ├── name_jp: Japanese name
│   ├── name_en: English name
│   ├── profile_jp: Original description
│   ├── profile_en: Translated description
│   └── image_url: Character image
│
├── Level (Evolution Stage)
│   └── name: Baby, Rookie, Champion, Ultimate, Mega, etc.
│
├── Type (Species Classification)
│   └── name: Dragon, Machine, Beast, Angel, Demon, etc.
│
├── Attribute (Alignment)
│   └── name: Vaccine, Virus, Data, Free, Variable
│
└── Move (Special Attacks)
    └── name: Attack/technique name

Relationships:
├── (Digimon)-[:HAS_LEVEL]->(Level)
├── (Digimon)-[:HAS_TYPE]->(Type)
├── (Digimon)-[:HAS_ATTRIBUTE]->(Attribute)
├── (Digimon)-[:CAN_USE]->(Move)
├── (Digimon)-[:EVOLVES_FROM]->(Digimon)
├── (Digimon)-[:RELATED_TO]->(Digimon)
├── (Digimon)-[:SHARES_TYPE]->(Digimon)
├── (Digimon)-[:SHARES_LEVEL]->(Digimon)
├── (Digimon)-[:SHARES_ATTRIBUTE]->(Digimon)
└── (Digimon)-[:SHARES_MOVE]->(Digimon)
```
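A minimal sketch of how the loader might merge this schema into Neo4j via py2neo (the driver named in the architecture diagram); the node values are illustrative, and the credentials match the documented docker-compose defaults:

```python
# Hedged sketch: merging one Digimon, its Type, and a HAS_TYPE edge with py2neo.
from py2neo import Graph, Node, Relationship

graph = Graph("bolt://localhost:7687", auth=("neo4j", "digimon123"))

agumon = Node("Digimon", name_en="Agumon", name_jp="アグモン")
dragon = Node("Type", name="Dragon Type")

# merge() is idempotent: nodes are matched on the given label and key property,
# so reloading the data does not create duplicates
graph.merge(agumon, "Digimon", "name_en")
graph.merge(dragon, "Type", "name")
graph.merge(Relationship(agumon, "HAS_TYPE", dragon))
```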
## Example Insights

After analyzing the complete graph, the system discovers:

- **Most Connected Digimon**: Network hubs that share many relationships
- **Evolution Chains**: Complete paths from Baby to Mega level
- **Type Clusters**: Groups of similar Digimon based on shared characteristics
- **Rare Combinations**: Unique type/attribute pairings
- **Move Popularity**: Most common special attacks across species
## Example Queries

```cypher
// Find all Dragon-type Mega level Digimon
MATCH (d:Digimon)-[:HAS_TYPE]->(t:Type {name: "Dragon Type"})
MATCH (d)-[:HAS_LEVEL]->(l:Level {name: "Mega"})
RETURN d.name_en, d.name_jp
ORDER BY d.name_en;
// Discover evolution paths to a specific Digimon
MATCH path = (start:Digimon)-[:EVOLVES_FROM*]->(target:Digimon {name_en: "Omegamon"})
RETURN path;
// Find Digimon that share the most moves with Agumon
MATCH (agumon:Digimon {name_en: "Agumon"})-[:CAN_USE]->(m:Move)
MATCH (other:Digimon)-[:CAN_USE]->(m)
WHERE other <> agumon
RETURN other.name_en, COUNT(m) as shared_moves
ORDER BY shared_moves DESC
LIMIT 10;
// Identify type distribution by level
MATCH (d:Digimon)-[:HAS_LEVEL]->(l:Level)
MATCH (d)-[:HAS_TYPE]->(t:Type)
RETURN l.name as Level, t.name as Type, COUNT(d) as Count
ORDER BY Level, Count DESC;
// Find the shortest path between two Digimon
MATCH path = shortestPath(
(d1:Digimon {name_en: "Agumon"})-[*]-(d2:Digimon {name_en: "Gabumon"})
)
RETURN path;
```

## Command Reference

```bash
ygg start                   # Start Neo4j database
ygg stop # Stop Neo4j database
ygg status # Check pipeline progress
ygg run # Run complete pipeline
ygg prune # Clean up data files
ygg prune --include-neo4j # Clean data AND Neo4j
ygg --help                  # Show all commands
```

## Development

Run the test suite and code-quality checks:

```bash
pytest tests/
black src/
ruff check src/
mypy src/
```

### Notebooks

```bash
# Run locally after activating your Python environment
jupyter notebook
# Or with JupyterLab
jupyter lab
```

### Docker Services

- **Neo4j**: Graph database (ports 7474, 7687)
- **Neo4j Browser**: Web UI at http://localhost:7474
### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `NEO4J_URI` | Neo4j connection string | `bolt://localhost:7687` |
| `SCRAPE_DELAY` | Seconds between requests | `1.0` |
| `LOG_LEVEL` | Logging verbosity | `INFO` |
| `DEBUG` | Enable debug mode | `false` |
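A small sketch of how a module might read these settings, assuming python-dotenv loads `.env` into the environment; the names and defaults mirror the table above:

```python
# Hedged sketch: reading the documented settings; python-dotenv is an assumption.
import os

from dotenv import load_dotenv

load_dotenv()  # copy .env values into the process environment

NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
SCRAPE_DELAY = float(os.getenv("SCRAPE_DELAY", "1.0"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
DEBUG = os.getenv("DEBUG", "false").lower() == "true"
```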
## Quick Reference

### Essential Commands

```bash
ygg start     # Start Neo4j
ygg stop      # Stop Neo4j
ygg status    # Check progress
ygg run       # Run full pipeline
ygg prune     # Clean data files
```

### Pipeline Steps

```bash
ygg scrape --fetch-api   # 1. Scrape (40-50 min)
ygg parse # 2. Parse (5 min)
ygg translate # 3. Translate (60-90 min)
ygg load # 4. Load to Neo4j (5 min)
ygg analyze              # 5. Analyze (1 min)
```

### Maintenance

```bash
ygg prune                   # Clean all data files
ygg prune --keep-cache # Keep translations
ygg prune --include-neo4j # Clean everything
ygg logs # View Neo4j logs
ygg db-status # Check database.env- Configurationdata/raw/html/- Scraped HTMLdata/processed/- Parsed JSONdata/translated/- English datadata/cache/translations.json- Translation cache
## License

MIT License; see the LICENSE file.

## Author

Ricardo Ledan (ricardoledan@proton.me)

## Acknowledgments

- Data source: digimon.net/reference
- Built with Claude, Neo4j, Python, and Cafe Bustelo coffee