Name	Name	Last commit message	Last commit date
parent directory ..
matilda-analyzer	matilda-analyzer
matilda-auth	matilda-auth
matilda-base	matilda-base
matilda-crawler	matilda-crawler
matilda-dataextractor	matilda-dataextractor
matilda-discovery	matilda-discovery
matilda-gateway	matilda-gateway
matilda-korpus-dependencies	matilda-korpus-dependencies
matilda-korpus-libsim-ki	matilda-korpus-libsim-ki
matilda-korpus-projects	matilda-korpus-projects
matilda-lib-manager	matilda-lib-manager
matilda-persistence-jpa	matilda-persistence-jpa
matilda-persistence-mongo	matilda-persistence-mongo
matilda-runner	matilda-runner
matilda-state	matilda-state
README.md	README.md
docker-compose.yml	docker-compose.yml
docker-down.services.sh	docker-down.services.sh
docker-up.processors.sh	docker-up.processors.sh
docker-up.services.sh	docker-up.services.sh
docker-up.technologies.sh	docker-up.technologies.sh
pom.xml	pom.xml

MATILDA Data Crawling System

MATILDA is a distributed system for collecting, extracting, and analyzing software projects, libraries, and their migration history between technologies (extraction of design decisions). The system crawls repositories, extracts design decisions, and provides recommendations based on migration and dependency analyses.

Overview

The system is built on a microservice architecture using Spring Boot, Kafka as a message broker, and MongoDB/PostgreSQL as databases. It processes GitHub projects automatically through various stages: Crawling, Extraction, Analysis, and Recommendation.

System Components

Service	Port	Description
matilda-gateway	8080	API Gateway for all services
matilda-crawler	8082	Crawling GitHub repositories
matilda-dataextractor	8083	Extraction of dependencies and design decisions
matilda-analyzer	8084	Analysis and statistics on migrations
matilda-runner	-	Batch processing and runner tasks
matilda-auth	-	Authentication and authorization
matilda-state	-	State management
matilda-discovery	-	Service discovery (Eureka)
matilda-lib-manager	-	Library management

Additional Components

matilda-base: Common base classes and utilities
matilda-persistence-jpa: JPA-based persistence (PostgreSQL)
matilda-persistence-mongo: MongoDB-based persistence
matilda-korpus-dependencies: Corpus for dependency analyses
matilda-korpus-projects: Project corpus with analysis data
matilda-korpus-libsim-ki: AI-based library similarity analysis (Python)

Project Structure

data_crawling_system/
├── pom.xml                           # Maven Parent POM
├── docker-compose.yml                # Docker orchestration
├── docker-up.technologies.sh         # Start infrastructure (MongoDB, Kafka, PostgreSQL)
├── docker-up.services.sh             # Start core services
├── docker-up.processors.sh           # Start processing services
├── docker-down.services.sh           # Shutdown script
│
├── matilda-auth/                     # Authentication service
├── matilda-analyzer/                 # Analysis service with SpringBootTests
├── matilda-base/                     # Common base library
├── matilda-crawler/                  # GitHub crawler service
├── matilda-dataextractor/            # Data extraction service
├── matilda-discovery/                # Eureka service discovery
├── matilda-gateway/                  # API Gateway
├── matilda-lib-manager/              # Library management service
├── matilda-runner/                   # Batch runner service
├── matilda-state/                    # State management service
│
├── matilda-persistence-jpa/          # JPA persistence layer (PostgreSQL)
├── matilda-persistence-mongo/        # MongoDB persistence layer
│
├── matilda-korpus-dependencies/      # Dependency corpus
├── matilda-korpus-projects/          # Project corpus with CSV exports
└── matilda-korpus-libsim-ki/         # Python: AI library similarity

Quick Start (Docker)

Prerequisites

Docker & Docker Compose
Java 11+
Maven 3.5.4+

⚠️ Security Configuration

IMPORTANT: Before deploying to production, you MUST configure the following:

Change default passwords in application.properties files:

Default admin password: changeme
Default user password: userPass

Set environment variables:

export ADMIN_PASSWORD=your-secure-password
export ADMIN_USER=your-admin-username
export REGULAR_PASSWORD=your-user-password
export REGULAR_USER=your-username

Update hardcoded credentials in:
- matilda-gateway/src/main/java/edu/hm/ccwi/matilda/gateway/WebSecurityConfig.java
- All application.properties files with ${ADMIN_PASSWORD:changeme}
Security findings from code review:
- 9 deprecated classes should be removed
- Debug code (System.out, printStackTrace) should be replaced with proper logging
- Consider implementing proper authentication (OAuth2, JWT, database-backed)

Starting the System

Start infrastructure (MongoDB, Kafka, PostgreSQL):
```
sh docker-up.technologies.sh
```
⏳ Wait until all services are ready (~2-3 minutes)
Start core services:
```
sh docker-up.services.sh
```

Start processing services (optional):

docker-compose up --force-recreate --no-deps matilda-crawler matilda-crawler2
docker-compose up --force-recreate --no-deps matilda-dataextractor matilda-dataextractor2
docker-compose up --force-recreate --no-deps matilda-analyzer

Build (optional)

docker-compose build --no-cache matilda-crawler matilda-dataextractor matilda-analyzer

API Interfaces

REST Endpoints

Service	Endpoint	Description
MatildaAnalyzer	`GET /libraries?categoryId={id}`	Retrieve libraries by category
MatildaAnalyzer	`POST /technology/import?categoryId={id}`	Import technology
Gateway	`/swagger-ui.html`	API documentation

SpringBootTest Runners (MatildaAnalyzer)

The following runners are executed as SpringBootTests under edu.hm.ccwi.matilda.analyzer.service.runner:

Note: All runners are marked with @Disabled and must be enabled manually for execution.

InconsistentMongoRepositoriesAndRevisionsAndDocsHandler

Cleans up orphaned data in MongoDB collections that have no references.

Mode: WRITE_MODE = false (analysis only) or true (cleanup)
Prerequisite: Docker-Technologies and Docker-Services running

AssambleAndExportRevSimDataset

Creates CSV datasets about documents and used libraries for all revisions.

Output: target/ folder
Prerequisite: Docker-Technologies and Docker-Services running

CleanupInconsistentDependencyCategoriesInEDDs

Cleans up inconsistent dependency categories in ExtractedDesignDecisions (PostgreSQL).

Prerequisite: Docker-Technologies and Docker-Services running

AnalyzeStatsServiceImplRunner

⚠️ Disabled for automated builds - Manual runner for statistical analysis Performs various statistical analyses and outputs results to console:

analyzeGeneralStatsRunner(): General statistics
analyzeCategoriesOfMigrationsAndCommitsRunner(): Migration analyses
analyzeCategoriesOfMigrationsAndProjectAgeRunner(): Project age analyses
analyzeProjectCommitAgeAmountOfProjectMapRunner(): Commit age analyses
analyzeProjectCommitAgeDesignDecisionMapRunner(): Design decision analyses

Prerequisite: Docker-Technologies and Docker-Services running

Note: Integration tests (AnalyzerSpringIT, RecommenderSpringIT) are disabled as they require full infrastructure (MongoDB, Kafka). Enable them only for integration testing with live services.

Service Runners (MatildaAnalyzer)

Under edu.hm.ccwi.matilda.analyzer.service.library:

TechnologyAndCategoryMigrator

OneShot service for initial data migration:

Persists all LibCategories from enum
Persists characteristic types
Links ExtractedDesignDecision entries with categories

CharacteristicMigrationUsageRankCalculator

🔬 Prototyping phase

MatildaRunner Main Classes

GACategoryTagManualTagsToTotalEnricherRunner: Enrichment of category tags
LibSimClassificationRunner: ML classification for library similarity

Manual Installation

Environment Setup

Install Java 11
Install Maven 3.5.4+
Install and start MongoDB (required for persistence)
Install Confluent Platform (Kafka + Zookeeper) or Apache Kafka
Install PostgreSQL (required for JPA persistence)

Security Configuration

Before first start, configure authentication:

export ADMIN_PASSWORD=your-secure-admin-password
export ADMIN_USER=admin
export REGULAR_PASSWORD=your-secure-user-password
export REGULAR_USER=user

Build & Start

# Clean build all modules
mvn clean install

# Start individual services
cd matilda-gateway
mvn spring-boot:run -Drun.jvmArguments="-Xms2048m -Xmx4096m"

Create Kafka Topics

confluent start
# Create topics for crawled, extracted and analyzed projects
kafka-topics --create --topic matildaAnalyzerTopic --bootstrap-server localhost:9092
kafka-topics --create --topic matildaRecommenderTopic --bootstrap-server localhost:9092

Known Issues & Limitations

Integration Tests: AnalyzerSpringIT and RecommenderSpringIT are disabled by default as they require full infrastructure
Security: Default passwords are hardcoded and must be changed before production deployment
Test Data: Some test classes reference non-existent status codes (RECOMMENDATION_FOUND → use FINISHED_ANALYZING_PROJECT instead)
Deprecated Code: 9 classes marked @Deprecated should be reviewed for removal

Access

Swagger UI: http://localhost:8080/swagger-ui.html
Admin Dashboard: http://localhost:8081
API Gateway: http://localhost:8080

Technology Stack

Backend: Java 11, Spring Boot, Spring Cloud
ML/AI: Python (scikit-learn, pandas)
Messaging: Apache Kafka
Databases: MongoDB, PostgreSQL
Service Discovery: Eureka
Containerization: Docker, Docker Compose

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

MATILDA Data Crawling System

Overview

System Components

Additional Components

Project Structure

Quick Start (Docker)

Prerequisites

⚠️ Security Configuration

Starting the System

Build (optional)

API Interfaces

REST Endpoints

SpringBootTest Runners (MatildaAnalyzer)

InconsistentMongoRepositoriesAndRevisionsAndDocsHandler

AssambleAndExportRevSimDataset

CleanupInconsistentDependencyCategoriesInEDDs

AnalyzeStatsServiceImplRunner

Service Runners (MatildaAnalyzer)

TechnologyAndCategoryMigrator

CharacteristicMigrationUsageRankCalculator

MatildaRunner Main Classes

Manual Installation

Environment Setup

Security Configuration

Build & Start

Create Kafka Topics

Known Issues & Limitations

Access

Technology Stack

FilesExpand file tree

data_crawling_system

Directory actions

More options

Directory actions

More options

Latest commit

History

data_crawling_system

Folders and files

parent directory

README.md

MATILDA Data Crawling System

Overview

System Components

Additional Components

Project Structure

Quick Start (Docker)

Prerequisites

⚠️ Security Configuration

Starting the System

Build (optional)

API Interfaces

REST Endpoints

SpringBootTest Runners (MatildaAnalyzer)

InconsistentMongoRepositoriesAndRevisionsAndDocsHandler

AssambleAndExportRevSimDataset

CleanupInconsistentDependencyCategoriesInEDDs

AnalyzeStatsServiceImplRunner

Service Runners (MatildaAnalyzer)

TechnologyAndCategoryMigrator

CharacteristicMigrationUsageRankCalculator

MatildaRunner Main Classes

Manual Installation

Environment Setup

Security Configuration

Build & Start

Create Kafka Topics

Known Issues & Limitations

Access

Technology Stack