Functional Requirements Clustering Pipeline

The goal of this project is to derive initial architecture proposals by automatically identifying cohesive groups of functional requirements. Each group (or cluster) represents a potential software component or bounded context, enabling a behavior-driven approach to architectural insight and system decomposition.

The pipeline automates:

Embedding requirements into semantic vectors
Clustering them to reveal cohesive functional groups
Storing results in Qdrant for fast semantic search
Visualizing clusters in 2D for exploration
Serving results via a lightweight HTTP server

📦 Features

Flexible input: Load functional requirements from .txt or .json files
Modern embeddings: Uses all-MiniLM-L6-v2 (384-dim sentence transformers) for semantic representation
Controllable clustering: Agglomerative Clustering with cosine distance; cluster granularity tuned via --cluster-distance
Semantic search: Store vectors and metadata in Qdrant for fast querying and filtering
Interactive visualization: 2D scatter plots with hover details using Plotly
HTTP API:
- /clusters.html – Interactive cluster map
- /clusters.json – Machine-readable cluster assignments
- /embeddings?limit=N&vector_len=M – Inspect raw vectors
  - N = number of embedding vectors to return
  - M = length (dimensionality) of each embedding vector
Docker-ready: Runs in containers alongside Qdrant for easy setup

🚀 Quick Start

1. Prepare Requirements File

Create functional_requirements.txt in the app directory (one requirement per line):

The system must authenticate a customer using a customer number and a password.
The system must reject any order-related action for a customer who is blacklisted.
...

Or use JSON format (requirements.json):

[
  "The system must authenticate...",
  "The system must reject..."
]
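The two input formats above can be read with a few lines of Python. This is a minimal sketch, not the project's actual loader: the function name `load_requirements` is hypothetical, and it assumes that a `.json` file holds a flat list of strings and that any other extension is treated as one requirement per line.

```python
import json
from pathlib import Path


def load_requirements(path):
    """Load requirements from a .json (list of strings) or plain-text file.

    Hypothetical helper for illustration; the real pipeline may differ.
    """
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")

    if file_path.suffix.lower() == ".json":
        data = json.loads(text)
        # Assumption: the JSON document is a flat list of strings.
        if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
            raise ValueError("JSON input must be a list of strings")
        entries = data
    else:
        # Assumption: every non-empty line is one requirement.
        entries = text.splitlines()

    # Strip surrounding whitespace and drop blank entries.
    return [entry.strip() for entry in entries if entry.strip()]
```

Calling `load_requirements("functional_requirements.txt")` or `load_requirements("requirements.json")` returns the same kind of list in both cases, so later stages do not need to know which format was used.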

2. Run with Docker Compose

Build and run:

docker-compose up --build

3. Explore Results

# Cluster visualization
http://localhost:8000/clusters.html

# Raw cluster data
http://localhost:8000/clusters.json

# Embedding samples
http://localhost:8000/embeddings?limit=5
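The same endpoints can be queried from a script. The sketch below uses only the Python standard library; the helper names `embeddings_url` and `fetch_json` are hypothetical, and it assumes that /embeddings, like /clusters.json, responds with a JSON body.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and port as used in the URLs above.
BASE_URL = "http://localhost:8000"


def embeddings_url(limit, vector_len=None, base_url=BASE_URL):
    """Build the /embeddings URL with its limit and vector_len parameters."""
    params = {"limit": limit}
    if vector_len is not None:
        params["vector_len"] = vector_len
    return f"{base_url}/embeddings?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Send a GET request and decode the response body as JSON.

    Assumption: the endpoint returns JSON; this is not confirmed for /embeddings.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `fetch_json(f"{BASE_URL}/clusters.json")` loads the cluster assignments and `fetch_json(embeddings_url(limit=5, vector_len=8))` requests five sample vectors.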

🧠 Interpreting Results

Cluster Quality

Trust clusters.json over the 2D plot: Clustering happens in the full embedding space (384 dimensions with the default model); the plot is a lossy 2D projection and only a visualization aid
Small, tight clusters indicate specialized subdomains (e.g., SMS parsing, blueprint ordering)
Larger, more diverse clusters, ...

Architectural Mapping

After clustering, you can manually assign meaningful names to each cluster to reflect candidate components or subdomains. This helps turn the automated clustering output into an actionable architecture blueprint.

Example mapping:

{
  "CustomerIdentity": ["FR-1", "FR-2", "FR-56"],
  "ProductCatalog": ["FR-4", "FR-5", "FR-6"],
  "OrderIntake": ["FR-8", "FR-16", "FR-25-32"],
  "OtherClusters": []
}

How to Manually Name Clusters

Open the clusters.json file generated by the pipeline.
Review the functional requirements in each cluster.
Assign a descriptive component name for each cluster, for example: CustomerIdentity, OrderIntake, PaymentProcessing.
Replace the automatically generated cluster keys with your chosen names.
Save this mapping. It can now be used as a reference for designing bounded contexts or modules.
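The renaming steps above can also be scripted. The following is a sketch under one assumption: that clusters.json is a JSON object whose keys are the generated cluster identifiers and whose values are lists of requirements. The function names and the keys "0" and "1" in the usage note are hypothetical.

```python
import json


def rename_clusters(clusters, names):
    """Return a copy of `clusters` with keys replaced according to `names`.

    Keys that have no entry in `names` are kept unchanged.
    """
    renamed = {}
    for key, requirements in clusters.items():
        new_key = names.get(key, key)
        if new_key in renamed:
            # Two clusters must not end up under the same component name.
            raise ValueError(f"duplicate component name: {new_key}")
        renamed[new_key] = requirements
    return renamed


def rename_clusters_file(source, target, names):
    """Read a cluster file, rename its keys, and write the result to `target`."""
    with open(source, encoding="utf-8") as handle:
        clusters = json.load(handle)
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(rename_clusters(clusters, names), handle, indent=2)
```

For example, `rename_clusters_file("clusters.json", "architecture.json", {"0": "CustomerIdentity", "1": "OrderIntake"})` writes a named copy and leaves the generated file untouched, so the pipeline can be rerun without losing the manual mapping.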

CLI Arguments

CLI arguments are passed by editing the app service in docker-compose.yaml: use the command: field to override the default script execution. To list all available arguments, run:

python fr_clustering.py --help

- --print-fr                       : Print loaded requirements and exit
- --fr-file           PATH         : Override requirements file path
- --projection        [umap|tsne]  : Choose 2D projection method (default: umap)
- --perplexity        FLOAT        : t-SNE perplexity (default: 30.0)
- --cluster-distance  FLOAT        : Cosine distance threshold for clustering (default: 0.65)
- --embedding-model   [all-MiniLM-L6-v2|all-mpnet-base-v2] : Choose SentenceTransformer model for embeddings (default: all-MiniLM-L6-v2)

Usage in docker-compose.yaml:

# command: overrides the default CMD in the Dockerfile, allowing you to specify CLI arguments.
# for example:
command: python fr_clustering.py --projection tsne --perplexity 5 --cluster-distance 0.2 --embedding-model all-mpnet-base-v2

💡 Tip for small datasets (< 30 items): Use --projection tsne --perplexity 5 --cluster-distance 0.2 for finer-grained clusters.
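To see why a lower --cluster-distance yields finer-grained clusters, the toy example below runs agglomerative clustering with cosine distance on four hand-written vectors. It is a pure-Python illustration, not the pipeline's code: the linkage method is not stated in this README, so average linkage is an assumption, and the vectors are made up.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise cosine distance between the members of two clusters."""
    total = sum(
        cosine_distance(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(vectors, threshold):
    """Merge the closest pair of clusters until none is closer than `threshold`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        (a, b), distance = min(
            (
                ((a, b), average_linkage(clusters[a], clusters[b], vectors))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if distance >= threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Made-up 3D "embeddings": two near-duplicates, one related item, one outlier.
VECTORS = [(1.0, 0.0, 0.0), (0.95, 0.05, 0.0), (0.7, 0.7, 0.0), (0.0, 0.0, 1.0)]

print(agglomerate(VECTORS, 0.2))   # [[0, 1], [2], [3]]
print(agglomerate(VECTORS, 0.65))  # [[0, 1, 2], [3]]
```

At 0.2 only the two near-duplicates merge, giving three clusters. At 0.65 the related item joins them, leaving two. The same trade-off applies to real embeddings: a small threshold separates closely related requirements, a large one folds them into broader groups.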

⚙️ Configuration

Environment Variables

Variable	Default	Description
QDRANT_HOST	localhost	Qdrant service hostname
QDRANT_HTTP_PORT	6333	Qdrant HTTP API port
FR_FILE	functional_requirements.txt	Path to requirements file
LOG_LEVEL	INFO	Logging verbosity (DEBUG, INFO, WARNING)
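A script that wants to honor the same variables could read them as follows. This is a sketch, not the project's configuration code: the `Settings` class and `load_settings` function are hypothetical, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""

    qdrant_host: str
    qdrant_http_port: int
    fr_file: str
    log_level: str


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_http_port=int(env.get("QDRANT_HTTP_PORT", "6333")),
        fr_file=env.get("FR_FILE", "functional_requirements.txt"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Passing a plain dictionary, for example `load_settings({"QDRANT_HOST": "qdrant"})`, makes the function easy to exercise without touching the real environment.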

Dependencies

See requirements.txt for the full list.

📜 License

Released under the MIT License; see LICENSE.md.
