MATILDA is a distributed system for collecting and analyzing software projects, their libraries, and their migration history between technologies. It crawls repositories, extracts design decisions, and provides recommendations based on migration and dependency analyses.
The system is built on a microservice architecture using Spring Boot, Kafka as a message broker, and MongoDB/PostgreSQL as databases. It processes GitHub projects automatically through four stages: Crawling, Extraction, Analysis, and Recommendation.
| Service | Port | Description |
|---|---|---|
| matilda-gateway | 8080 | API Gateway for all services |
| matilda-crawler | 8082 | Crawling GitHub repositories |
| matilda-dataextractor | 8083 | Extraction of dependencies and design decisions |
| matilda-analyzer | 8084 | Analysis and statistics on migrations |
| matilda-runner | - | Batch processing and runner tasks |
| matilda-auth | - | Authentication and authorization |
| matilda-state | - | State management |
| matilda-discovery | - | Service discovery (Eureka) |
| matilda-lib-manager | - | Library management |
- matilda-base: Common base classes and utilities
- matilda-persistence-jpa: JPA-based persistence (PostgreSQL)
- matilda-persistence-mongo: MongoDB-based persistence
- matilda-korpus-dependencies: Corpus for dependency analyses
- matilda-korpus-projects: Project corpus with analysis data
- matilda-korpus-libsim-ki: AI-based library similarity analysis (Python)
data_crawling_system/
├── pom.xml # Maven Parent POM
├── docker-compose.yml # Docker orchestration
├── docker-up.technologies.sh # Start infrastructure (MongoDB, Kafka, PostgreSQL)
├── docker-up.services.sh # Start core services
├── docker-up.processors.sh # Start processing services
├── docker-down.services.sh # Shutdown script
│
├── matilda-auth/ # Authentication service
├── matilda-analyzer/ # Analysis service with SpringBootTests
├── matilda-base/ # Common base library
├── matilda-crawler/ # GitHub crawler service
├── matilda-dataextractor/ # Data extraction service
├── matilda-discovery/ # Eureka service discovery
├── matilda-gateway/ # API Gateway
├── matilda-lib-manager/ # Library management service
├── matilda-runner/ # Batch runner service
├── matilda-state/ # State management service
│
├── matilda-persistence-jpa/ # JPA persistence layer (PostgreSQL)
├── matilda-persistence-mongo/ # MongoDB persistence layer
│
├── matilda-korpus-dependencies/ # Dependency corpus
├── matilda-korpus-projects/ # Project corpus with CSV exports
└── matilda-korpus-libsim-ki/ # Python: AI library similarity
- Docker & Docker Compose
- Java 11+
- Maven 3.5.4+
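The minimum versions above can be checked before building. The following is a minimal sketch, assuming GNU `sort` with the `-V` (version sort) option is available; the helper name `version_ge` is hypothetical and not part of the repository.

```bash
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on GNU sort's -V option to order dotted version strings.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge 3.6.3 3.5.4` succeeds while `version_ge 3.5.3 3.5.4` fails. The installed versions can be read from `java -version` and `mvn -v`.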
IMPORTANT: Before deploying to production, you MUST configure the following:
- Change default passwords in application.properties files:
  - Default admin password: changeme
  - Default user password: userPass

  Set environment variables:

      export ADMIN_PASSWORD=your-secure-password
      export ADMIN_USER=your-admin-username
      export REGULAR_PASSWORD=your-user-password
      export REGULAR_USER=your-username

- Update hardcoded credentials in:
  - matilda-gateway/src/main/java/edu/hm/ccwi/matilda/gateway/WebSecurityConfig.java
  - All application.properties files with ${ADMIN_PASSWORD:changeme}

- Security findings from code review:
  - 9 deprecated classes should be removed
  - Debug code (System.out, printStackTrace) should be replaced with proper logging
  - Consider implementing proper authentication (OAuth2, JWT, database-backed)
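A deployment script could refuse to start while the credentials are unset or still at the documented defaults. The following is a minimal sketch under that assumption; `check_credentials` is a hypothetical helper, and it only inspects the four environment variables and the two default passwords named in this README.

```bash
# Hypothetical pre-deployment check (not part of the repository).
# Reports every credential variable that is unset/empty or still equal to
# one of the default passwords documented in this README.
check_credentials() {
  local rc=0 name value
  for name in ADMIN_USER ADMIN_PASSWORD REGULAR_USER REGULAR_PASSWORD; do
    # Indirect lookup of the variable named in $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      rc=1
    elif [ "$value" = "changeme" ] || [ "$value" = "userPass" ]; then
      echo "insecure default: $name"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `check_credentials || exit 1` ahead of the Docker start scripts would stop the deployment on the first run with unsafe settings. It does not detect credentials hardcoded in Java sources such as WebSecurityConfig.java.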
- Start infrastructure (MongoDB, Kafka, PostgreSQL):

      sh docker-up.technologies.sh

  ⏳ Wait until all services are ready (~2-3 minutes)

- Start core services:

      sh docker-up.services.sh

- Start processing services (optional):

      docker-compose up --force-recreate --no-deps matilda-crawler matilda-crawler2
      docker-compose up --force-recreate --no-deps matilda-dataextractor matilda-dataextractor2
      docker-compose up --force-recreate --no-deps matilda-analyzer
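The wait after starting the infrastructure can be scripted rather than timed by hand. The sketch below polls an arbitrary readiness command until it succeeds or a timeout expires. `wait_until` is a hypothetical helper, not something shipped with the start scripts.

```bash
# Hypothetical helper: retry a command once per second until it succeeds.
# Usage: wait_until <timeout_seconds> <command> [args...]
# The timeout is approximate: it counts the one-second sleeps between
# attempts, not the time the command itself takes.
wait_until() {
  local timeout=$1 elapsed=0
  shift
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

For instance, `wait_until 180 nc -z localhost 9092` would block until a Kafka broker accepts connections, assuming `nc` is installed and the broker is exposed on localhost:9092 as in the local topic-creation commands in this README. The port mapping used by docker-compose.yml may differ.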
To rebuild the processing service images without the Docker build cache:

    docker-compose build --no-cache matilda-crawler matilda-dataextractor matilda-analyzer

| Service | Endpoint | Description |
|---|---|---|
| MatildaAnalyzer | GET /libraries?categoryId={id} | Retrieve libraries by category |
| MatildaAnalyzer | POST /technology/import?categoryId={id} | Import technology |
| Gateway | /swagger-ui.html | API documentation |
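The analyzer endpoints in the table can be exercised from the command line. This is a minimal sketch, assuming the analyzer is reachable directly at http://localhost:8084 (the port from the service table) and that no additional authentication headers are needed; whether the same paths are routed through the gateway, and what request body the import expects, is not stated here. The helper names and the category id `1` are hypothetical.

```bash
# Assumed base URL of matilda-analyzer; override via the environment.
ANALYZER_URL="${ANALYZER_URL:-http://localhost:8084}"

# Hypothetical helpers that build the endpoint URLs for a category id.
libraries_url() {
  printf '%s/libraries?categoryId=%s\n' "$ANALYZER_URL" "$1"
}

technology_import_url() {
  printf '%s/technology/import?categoryId=%s\n' "$ANALYZER_URL" "$1"
}

# Example calls (commented out because they need a running analyzer):
# curl -fsS "$(libraries_url 1)"
# curl -fsS -X POST "$(technology_import_url 1)"
```

The Swagger UI on the gateway remains the authoritative description of parameters and payloads.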
The following runners are executed as SpringBootTests under edu.hm.ccwi.matilda.analyzer.service.runner:
Note: All runners are marked with @Disabled and must be enabled manually for execution.
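To see which runner classes are still disabled before enabling one, the test sources can be searched for the annotation. This is a minimal sketch, assuming a `grep` that supports `--include` (GNU or BSD); `list_disabled_runners` is a hypothetical helper, and the default path is inferred from the package name above under the standard Maven test layout, not confirmed against the repository.

```bash
# Assumed location of the runner tests (standard Maven layout; not verified).
RUNNER_DIR="${RUNNER_DIR:-matilda-analyzer/src/test/java/edu/hm/ccwi/matilda/analyzer/service/runner}"

# Hypothetical helper: print the Java files under $1 (default: RUNNER_DIR)
# that contain the @Disabled annotation, sorted by path.
list_disabled_runners() {
  grep -rl --include='*.java' '@Disabled' "${1:-$RUNNER_DIR}" | sort
}
```

Running `list_disabled_runners` from the repository root prints one file path per disabled runner class. Removing the annotation from a runner is still a manual edit.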
Cleans up orphaned data in MongoDB collections that have no references.
- Mode: WRITE_MODE = false (analysis only) or true (cleanup)
- Prerequisite: Docker-Technologies and Docker-Services running
Creates CSV datasets about documents and used libraries for all revisions.
- Output: target/ folder
- Prerequisite: Docker-Technologies and Docker-Services running
Cleans up inconsistent dependency categories in ExtractedDesignDecisions (PostgreSQL).
- Prerequisite: Docker-Technologies and Docker-Services running
- analyzeGeneralStatsRunner(): General statistics
- analyzeCategoriesOfMigrationsAndCommitsRunner(): Migration analyses
- analyzeCategoriesOfMigrationsAndProjectAgeRunner(): Project age analyses
- analyzeProjectCommitAgeAmountOfProjectMapRunner(): Commit age analyses
- analyzeProjectCommitAgeDesignDecisionMapRunner(): Design decision analyses
Prerequisite: Docker-Technologies and Docker-Services running
Note: Integration tests (AnalyzerSpringIT, RecommenderSpringIT) are disabled as they require full infrastructure (MongoDB, Kafka). Enable them only for integration testing with live services.
Under edu.hm.ccwi.matilda.analyzer.service.library:
OneShot service for initial data migration:
- Persists all LibCategories from enum
- Persists characteristic types
- Links ExtractedDesignDecision entries with categories
🔬 Prototyping phase
- GACategoryTagManualTagsToTotalEnricherRunner: Enrichment of category tags
- LibSimClassificationRunner: ML classification for library similarity
- Install Java 11
- Install Maven 3.5.4+
- Install and start MongoDB (required for persistence)
- Install Confluent Platform (Kafka + Zookeeper) or Apache Kafka
- Install PostgreSQL (required for JPA persistence)
Before first start, configure authentication:
export ADMIN_PASSWORD=your-secure-admin-password
export ADMIN_USER=admin
export REGULAR_PASSWORD=your-secure-user-password
export REGULAR_USER=user

# Clean build all modules
mvn clean install
# Start individual services
cd matilda-gateway
mvn spring-boot:run -Drun.jvmArguments="-Xms2048m -Xmx4096m"

confluent start
# Create topics for crawled, extracted and analyzed projects
kafka-topics --create --topic matildaAnalyzerTopic --bootstrap-server localhost:9092
kafka-topics --create --topic matildaRecommenderTopic --bootstrap-server localhost:9092

- Integration Tests: AnalyzerSpringIT and RecommenderSpringIT are disabled by default as they require full infrastructure
- Security: Default passwords are hardcoded and must be changed before production deployment
- Test Data: Some test classes reference non-existent status codes (RECOMMENDATION_FOUND → use FINISHED_ANALYZING_PROJECT instead)
- Deprecated Code: 9 classes marked @Deprecated should be reviewed for removal
- Swagger UI: http://localhost:8080/swagger-ui.html
- Admin Dashboard: http://localhost:8081
- API Gateway: http://localhost:8080
- Backend: Java 11, Spring Boot, Spring Cloud
- ML/AI: Python (scikit-learn, pandas)
- Messaging: Apache Kafka
- Databases: MongoDB, PostgreSQL
- Service Discovery: Eureka
- Containerization: Docker, Docker Compose