Author: Beomsu Koh
Institution: University of Sheffield
Project Type: MSc Computer Science Dissertation
Dataset: Berom0227/Detecting-Semantic-Concerns-in-Tangled-Code-Changes-Using-SLMs
This repository contains the complete implementation and analysis for detecting semantic concerns in tangled code changes using Small Language Models (SLMs). The project investigates how fine-tuned SLMs can identify and separate different types of concerns (e.g., fixes, features, refactoring) in multi-concern commits.
├── datasets/ # Dataset creation and processing
│ ├── data/ # Raw and processed datasets
│ │ ├── CCS Dataset.csv
│ │ ├── sampled_ccs_dataset.csv
│ │ ├── tangled_ccs_dataset_train.csv
│ │ └── tangled_ccs_dataset_test.csv
│ └── scripts/ # Dataset generation scripts
│ ├── sample_atomic_commites.py
│ ├── generate_tangled_commites.py
│ ├── analyze_token_distribution.py
│ ├── concern_token_boxplot.py
│ └── upload_to_huggingface.py
│
├── RQ/ # Research Questions - Models and Analysis
│ ├── GPT/ # GPT-4.1 inference pipeline
│ ├── SLM/ # Small Language Models (Qwen3-14B)
│ │ ├── configs/ # Model and training configurations
│ │ ├── train.py # LoRA fine-tuning script
│ │ ├── infer.py # Inference script
│ │ └── convert_to_gguf.py # GGUF conversion for deployment
│ ├── analysis/ # Unified analysis scripts
│ │ ├── config.yaml # Single source of truth for all RQs
│ │ ├── run.py # Main analysis runner
│ │ ├── RQ1/ # Impact of Concern Count
│ │ ├── RQ2/ # Impact of Commit Message
│ │ ├── RQ3/ # Token-Budget Robustness
│ │ └── RQ4/ # Inference Efficiency
│ └── main.py
│
├── results/ # Generated outputs
│ ├── analysis/ # Analysis results by RQ
│ │ ├── RQ1/
│ │ ├── RQ2/
│ │ ├── RQ3/
│ │ └── RQ4/
│ ├── gpt/ # GPT-4.1 inference results
│ ├── Qwen/ # Qwen3-14B inference results
│ └── Qwen3-14B-LoRA/ # Fine-tuned model results
│
├── visual_eval/ # Interactive Streamlit dashboard
│ ├── components.py
│ ├── dataset.py
│ ├── session.py
│ └── setup.py
│
├── scripts/ # HPC deployment scripts
│ ├── setup_env.sh
│ ├── run_training.sh
│ ├── run_lora_pipeline.sh
│ ├── run_infer_huggingface.sh
│ └── run_gguf_conversion.sh
│
├── utils/ # Shared utilities
│ ├── eval.py # Evaluation metrics
│ ├── prompt.py # Prompt templates
│ ├── model.py # Data models
│ └── llms/ # LLM API connectors
│ ├── openai.py
│ ├── hugging_face.py
│ ├── lmstudio.py
│ └── constant.py
│
├── __test__/ # Test suite
│ ├── test_api.py
│ └── test_eval.py
│
└── app.py # Main Streamlit application
Evaluates model performance as semantic complexity increases:

- `performance_summary.py`: Performance comparison across models (GPT-4.1, Qwen, fine-tuned Qwen)
- `concern_count_boxplot.py`: Box plot visualization by concern count
- `concerncount-by-model.py`: Performance comparison by model
- `model_comparison_analysis.py`: Head-to-head model comparison with failure analysis
- `concern_count_pairwise_pvalue.py`: Statistical significance testing
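The pairwise significance testing can be illustrated with a paired permutation test over per-commit scores; this is a minimal sketch, and the repo's `*_pairwise_pvalue.py` scripts may well use a different test (e.g. Wilcoxon signed-rank), so the function name and approach here are assumptions:

```python
import random

def paired_permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided p-value for mean(a - b) != 0 via random sign-flipping.

    a, b: paired per-sample scores (e.g. per-commit Hamming Loss of two models).
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped)) / len(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

Identical score lists give p = 1.0; a consistent gap between two models drives p toward the smallest resolvable value.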
Investigates whether commit messages provide additional semantic cues:

- `msg_impact_analysis.py`: Analyzes performance with and without commit messages
- `msg_impact_pairwise_pvalue.py`: Pairwise statistical comparison
Examines model reliability when the token budget is reduced (1024-12288 tokens):

- `context_length_performance.py`: Performance across context lengths
- `context_length_boxplot.py`: Box plot visualization by context length
- `context_length_pairwise_pvalue.py`: Statistical significance testing
Analyzes how different factors influence inference latency:

- `efficiency_commit_message.py`: Correlation with commit message presence
- `efficiency_concern_count.py`: Correlation with concern count
- `efficiency_input_tokens.py`: Correlation with input tokens
- `efficiency_concern_count_input_token.py`: Multiple regression analysis
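The multiple-regression step can be sketched as ordinary least squares with `numpy.linalg.lstsq`. The variable names and synthetic data below are illustrative assumptions, not the repo's actual measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
concern_count = rng.integers(1, 6, size=n).astype(float)  # 1-5 concerns per commit
input_tokens = rng.uniform(1024, 12288, size=n)           # prompt length in tokens

# Synthetic latency (seconds): intercept + both predictor effects + noise.
latency = 0.5 + 0.2 * concern_count + 0.001 * input_tokens + rng.normal(0, 0.1, n)

# OLS fit of latency ~ 1 + concern_count + input_tokens
X = np.column_stack([np.ones(n), concern_count, input_tokens])
beta, *_ = np.linalg.lstsq(X, latency, rcond=None)
intercept, b_concerns, b_tokens = beta
```

With enough samples, the fitted coefficients recover the per-concern and per-token latency effects used to generate the data.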
- GPT-4.1: OpenAI API baseline (zero-shot)
- Qwen3-14B: Base SLM for comparison
- Qwen3-14B-LoRA: Fine-tuned SLM with LoRA (rank=32, alpha=48)
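The reported LoRA setup (rank=32, alpha=48) implies a scaling factor of alpha/rank = 1.5 on the low-rank update. A minimal sketch of such a configuration; the `target_modules` and dropout values are illustrative guesses, not taken from the repo's `configs/`:

```python
# Hypothetical LoRA hyperparameters mirroring the reported setup (rank=32, alpha=48).
# target_modules and lora_dropout are assumptions for illustration only.
lora_config = {
    "r": 32,                # LoRA rank: dimension of the low-rank update matrices
    "lora_alpha": 48,       # scaling numerator; effective scale = alpha / r
    "lora_dropout": 0.05,   # assumed dropout on the LoRA branch
    "target_modules": ["q_proj", "v_proj"],  # assumed attention projections
}

# Effective scaling applied to the low-rank update
scaling = lora_config["lora_alpha"] / lora_config["r"]  # 48 / 32 = 1.5
```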
- Train: `tangled_ccs_dataset_train.csv` (80% split)
- Test: `tangled_ccs_dataset_test.csv` (20% split)
- Based on the Conventional Commits Specification (CCS)
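An 80/20 split like the one above can be reproduced with a seeded shuffle; a minimal pure-Python sketch (the function name, seed, and split mechanics are illustrative, not the repo's actual procedure):

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle rows deterministically, then hold out test_frac as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    return rows[:-n_test], rows[-n_test:]

train, test = train_test_split(range(100))
# len(train) == 80, len(test) == 20
```

Fixing the seed makes the split reproducible across runs, which matters when the train/test CSVs are regenerated.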
- `eval.py`: Evaluation metrics (Hamming Loss, F1, Precision, Recall)
- `prompt.py`: Prompt templates for zero-shot and few-shot learning
- `llms/`: Unified API connectors for OpenAI, HuggingFace, and LM Studio
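The metrics named above can be sketched in a few lines for multi-label binary vectors; the function names here are illustrative, not `eval.py`'s actual API:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots that disagree, over all samples and labels."""
    total = sum(len(t) for t in y_true)
    wrong = sum(1 for t, p in zip(y_true, y_pred)
                for a, b in zip(t, p) if a != b)
    return wrong / total

def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all label slots."""
    pairs = [(a, b) for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    tp = sum(1 for a, b in pairs if a == 1 and b == 1)
    fp = sum(1 for a, b in pairs if a == 0 and b == 1)
    fn = sum(1 for a, b in pairs if a == 1 and b == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: two commits, three concern labels (e.g. fix, feat, refactor)
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
hl = hamming_loss(y_true, y_pred)      # 1 wrong slot out of 6
p, r, f1 = micro_prf(y_true, y_pred)   # precision 1.0, recall 2/3, F1 0.8
```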
Extended result tables from the paper are available in the supplementary/ directory.
| Concern Count | GPT-4.1 | Qwen3 | Qwen3-FT |
|---|---|---|---|
| 1 | 0.07 | 0.11 | 0.04 |
| 2 | 0.09 | 0.23 | 0.13 |
| 3 | 0.09 | 0.33 | 0.15 |
| 4 | 0.10 | 0.33 | 0.20 |
| 5 | 0.12 | 0.27 | 0.17 |
| Condition | GPT-4.1 | Qwen3 | Qwen3-FT |
|---|---|---|---|
| Without Msg | 0.11 | 0.28 | 0.25 |
| With Msg | 0.09 | 0.25 | 0.14 |
| Delta | 0.02 | 0.03 | 0.11 |
Note: For Qwen3-FT, commit message inclusion reduces Hamming Loss by 44% ((0.25 - 0.14) / 0.25 = 0.44).
| Token Budget | GPT-4.1 | Qwen3 | Qwen3-FT |
|---|---|---|---|
| 1024 | 0.10 | 0.26 | 0.15 |
| 2048 | 0.10 | 0.26 | 0.15 |
| 4096 | 0.10 | 0.25 | 0.15 |
| 8192 | 0.10 | 0.25 | 0.14 |
| 12288 | 0.09 | 0.26 | 0.14 |
This project is licensed under the MIT License - see the LICENSE file for details.