Author: Beomsu Koh
Institution: University of Sheffield
Project Type: MSc Computer Science Dissertation
Dataset: Berom0227/Detecting-Semantic-Concerns-in-Tangled-Code-Changes-Using-SLMs
This repository contains the complete implementation and analysis for detecting semantic concerns in tangled code changes using Small Language Models (SLMs). The project investigates how fine-tuned SLMs can identify and separate different types of concerns (e.g., fixes, features, refactoring) in multi-concern commits.
├── datasets/ # Dataset creation and processing
│ ├── data/ # Raw and processed datasets
│ │ ├── CCS Dataset.csv
│ │ ├── sampled_ccs_dataset.csv
│ │ ├── tangled_ccs_dataset_train.csv
│ │ └── tangled_ccs_dataset_test.csv
│ └── scripts/ # Dataset generation scripts
│ ├── sample_atomic_commites.py
│ ├── generate_tangled_commites.py
│ ├── analyze_token_distribution.py
│ ├── concern_token_boxplot.py
│ └── upload_to_huggingface.py
│
├── RQ/ # Research Questions - Models and Analysis
│ ├── GPT/ # GPT-4.1 inference pipeline
│ ├── SLM/ # Small Language Models (Qwen3-14B)
│ │ ├── configs/ # Model and training configurations
│ │ ├── train.py # LoRA fine-tuning script
│ │ ├── infer.py # Inference script
│ │ └── convert_to_gguf.py # GGUF conversion for deployment
│ ├── analysis/ # Unified analysis scripts
│ │ ├── config.yaml # Single source of truth for all RQs
│ │ ├── run.py # Main analysis runner
│ │ ├── RQ1/ # Impact of Concern Count
│ │ ├── RQ2/ # Impact of Commit Message
│ │ ├── RQ3/ # Token-Budget Robustness
│ │ └── RQ4/ # Inference Efficiency
│ └── main.py
│
├── results/ # Generated outputs
│ ├── analysis/ # Analysis results by RQ
│ │ ├── RQ1/
│ │ ├── RQ2/
│ │ ├── RQ3/
│ │ └── RQ4/
│ ├── gpt/ # GPT-4.1 inference results
│ ├── Qwen/ # Qwen3-14B inference results
│ └── Qwen3-14B-LoRA/ # Fine-tuned model results
│
├── visual_eval/ # Interactive Streamlit dashboard
│ ├── components.py
│ ├── dataset.py
│ ├── session.py
│ └── setup.py
│
├── scripts/ # HPC deployment scripts
│ ├── setup_env.sh
│ ├── run_training.sh
│ ├── run_lora_pipeline.sh
│ ├── run_infer_huggingface.sh
│ └── run_gguf_conversion.sh
│
├── utils/ # Shared utilities
│ ├── eval.py # Evaluation metrics
│ ├── prompt.py # Prompt templates
│ ├── model.py # Data models
│ └── llms/ # LLM API connectors
│ ├── openai.py
│ ├── hugging_face.py
│ ├── lmstudio.py
│ └── constant.py
│
├── __test__/ # Test suite
│ ├── test_api.py
│ └── test_eval.py
│
└── app.py # Main Streamlit application
Evaluates model performance as semantic complexity increases:

- `performance_summary.py`: Performance comparison across models (GPT-4.1, Qwen, fine-tuned Qwen)
- `concern_count_boxplot.py`: Box plot visualization by concern count
- `concerncount-by-model.py`: Performance comparison by model
- `model_comparison_analysis.py`: Head-to-head model comparison with failure analysis
- `concern_count_pairwise_pvalue.py`: Statistical significance testing
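The pairwise significance testing can be illustrated with a paired permutation test over per-commit scores; this is a minimal sketch, and the repo's `*_pairwise_pvalue.py` scripts may well use a different test (e.g. Wilcoxon signed-rank), so the function name and approach here are assumptions:

```python
import random

def paired_permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided p-value for mean(a - b) != 0 via random sign-flipping.

    a, b: paired per-sample scores (e.g. per-commit Hamming Loss of two models).
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped)) / len(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

Identical score lists give p = 1.0; a consistent gap between two models drives p toward the smallest resolvable value.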
Investigates whether commit messages provide additional semantic cues:

- `msg_impact_analysis.py`: Analyzes performance with and without commit messages
- `msg_impact_pairwise_pvalue.py`: Pairwise statistical comparison
Examines model reliability when the token budget is reduced (1024-12288 tokens):

- `context_length_performance.py`: Performance across context lengths
- `context_length_boxplot.py`: Box plot visualization by context length
- `context_length_pairwise_pvalue.py`: Statistical significance testing
Analyzes how different factors influence inference latency:

- `efficiency_commit_message.py`: Correlation with commit message presence
- `efficiency_concern_count.py`: Correlation with concern count
- `efficiency_input_tokens.py`: Correlation with input tokens
- `efficiency_concern_count_input_token.py`: Multiple regression analysis
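The multiple-regression step can be sketched as ordinary least squares with `numpy.linalg.lstsq`. The variable names and synthetic data below are illustrative assumptions, not the repo's actual measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
concern_count = rng.integers(1, 6, size=n).astype(float)  # 1-5 concerns per commit
input_tokens = rng.uniform(1024, 12288, size=n)           # prompt length in tokens

# Synthetic latency (seconds): intercept + both predictor effects + noise.
latency = 0.5 + 0.2 * concern_count + 0.001 * input_tokens + rng.normal(0, 0.1, n)

# OLS fit of latency ~ 1 + concern_count + input_tokens
X = np.column_stack([np.ones(n), concern_count, input_tokens])
beta, *_ = np.linalg.lstsq(X, latency, rcond=None)
intercept, b_concerns, b_tokens = beta
```

With enough samples, the fitted coefficients recover the per-concern and per-token latency effects used to generate the data.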
- GPT-4.1: OpenAI API baseline (zero-shot)
- Qwen3-14B: Base SLM for comparison
- Qwen3-14B-LoRA: Fine-tuned SLM with LoRA (rank=32, alpha=48)
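The reported LoRA setup (rank=32, alpha=48) implies a scaling factor of alpha/rank = 1.5 on the low-rank update. A minimal sketch of such a configuration; the `target_modules` and dropout values are illustrative guesses, not taken from the repo's `configs/`:

```python
# Hypothetical LoRA hyperparameters mirroring the reported setup (rank=32, alpha=48).
# target_modules and lora_dropout are assumptions for illustration only.
lora_config = {
    "r": 32,                # LoRA rank: dimension of the low-rank update matrices
    "lora_alpha": 48,       # scaling numerator; effective scale = alpha / r
    "lora_dropout": 0.05,   # assumed dropout on the LoRA branch
    "target_modules": ["q_proj", "v_proj"],  # assumed attention projections
}

# Effective scaling applied to the low-rank update
scaling = lora_config["lora_alpha"] / lora_config["r"]  # 48 / 32 = 1.5
```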
- Train: `tangled_ccs_dataset_train.csv` (80% split)
- Test: `tangled_ccs_dataset_test.csv` (20% split)
- Based on the Conventional Commits Specification (CCS)
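An 80/20 split like the one above can be reproduced with a seeded shuffle; a minimal pure-Python sketch (the function name, seed, and split mechanics are illustrative, not the repo's actual procedure):

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle rows deterministically, then hold out test_frac as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    return rows[:-n_test], rows[-n_test:]

train, test = train_test_split(range(100))
# len(train) == 80, len(test) == 20
```

Fixing the seed makes the split reproducible across runs, which matters when the train/test CSVs are regenerated.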
- `eval.py`: Evaluation metrics (Hamming Loss, F1, Precision, Recall)
- `prompt.py`: Prompt templates for zero-shot and few-shot learning
- `llms/`: Unified API connectors for OpenAI, HuggingFace, and LM Studio
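The metrics named above can be sketched in a few lines for multi-label binary vectors; the function names here are illustrative, not `eval.py`'s actual API:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots that disagree, over all samples and labels."""
    total = sum(len(t) for t in y_true)
    wrong = sum(1 for t, p in zip(y_true, y_pred)
                for a, b in zip(t, p) if a != b)
    return wrong / total

def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all label slots."""
    pairs = [(a, b) for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    tp = sum(1 for a, b in pairs if a == 1 and b == 1)
    fp = sum(1 for a, b in pairs if a == 0 and b == 1)
    fn = sum(1 for a, b in pairs if a == 1 and b == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: two commits, three concern labels (e.g. fix, feat, refactor)
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
hl = hamming_loss(y_true, y_pred)      # 1 wrong slot out of 6
p, r, f1 = micro_prf(y_true, y_pred)   # precision 1.0, recall 2/3, F1 0.8
```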
Extended result tables from the paper are available in the supplementary/ directory.
| Concern Count | GPT-4.1 | Qwen3 | Qwen3-FT |
|---|---|---|---|
| 1 | 0.07 | 0.11 | 0.04 |
| 2 | 0.09 | 0.23 | 0.13 |
| 3 | 0.09 | 0.33 | 0.15 |
| 4 | 0.10 | 0.33 | 0.20 |
| 5 | 0.12 | 0.27 | 0.17 |
| Condition | GPT-4.1 | Qwen3 | Qwen3-FT |
|---|---|---|---|
| Without Msg | 0.11 | 0.28 | 0.25 |
| With Msg | 0.09 | 0.25 | 0.14 |
| Delta | 0.02 | 0.03 | 0.11 |
Note: For Qwen3-FT, commit message inclusion reduces Hamming Loss by 44% ((0.25 - 0.14) / 0.25 = 0.44).
| Token Budget | GPT-4.1 | Qwen3 | Qwen3-FT |
|---|---|---|---|
| 1024 | 0.10 | 0.26 | 0.15 |
| 2048 | 0.10 | 0.26 | 0.15 |
| 4096 | 0.10 | 0.25 | 0.15 |
| 8192 | 0.10 | 0.25 | 0.14 |
| 12288 | 0.09 | 0.26 | 0.14 |
This project is licensed under the MIT License - see the LICENSE file for details.