feat: code for sft purpcode and baselines#2

Merged
ganler merged 1 commit into main from sft on Aug 5, 2025

@ganler ganler commented Aug 5, 2025

No description provided.

Copilot AI review requested due to automatic review settings August 5, 2025 00:42

Copilot AI left a comment

Pull Request Overview

This PR introduces training configurations and scripts for the SFT (Supervised Fine-Tuning) component of the PurpCode project, including baseline implementations and controlled experiments for data ablation studies.

  • Adds training configurations for context distillation fine-tuning with Qwen 14B and 32B models
  • Implements data splitting script for safety ratio ablation experiments with different ratios (1/3, 1/2, 2/3)
  • Provides baseline training configurations for SafeCoder and ProSec-SIMPO methods
  • Includes utility scripts for pushing models and datasets to Hugging Face Hub
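
As a rough illustration of what a safety-ratio split could look like, the sketch below mixes two pools of samples so that a chosen fraction of the result is safety data. It is not the code in sft/controlled/data_split_for_data_ablation.py, which this page does not show. The function name `split_by_safety_ratio`, the fixed `total` size, and the reading of "safety ratio" as safety samples divided by all samples are assumptions made for the example.

```python
import random
from fractions import Fraction


def split_by_safety_ratio(safety, utility, ratio, total, seed=0):
    """Build a mixture of `total` samples in which `ratio` of them are safety samples.

    Hypothetical helper: assumes the ratio is defined as
    (number of safety samples) / (number of all samples) at a fixed total size.
    """
    n_safety = int(total * ratio)  # truncates if total * ratio is not an integer
    n_utility = total - n_safety
    if n_safety > len(safety) or n_utility > len(utility):
        raise ValueError("not enough samples to satisfy the requested ratio")

    rng = random.Random(seed)  # seeded so the split is reproducible
    mixed = rng.sample(safety, n_safety) + rng.sample(utility, n_utility)
    rng.shuffle(mixed)
    return mixed


# The three ratios named in the overview, kept as exact fractions.
SAFETY_RATIOS = (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3))
```

Holding the total fixed keeps the training budget the same across the three runs, so differences between them can be attributed to the mix rather than to dataset size. Whether the PR's script makes the same choice is not stated here.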

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| sft/ctxdistill_qwen32b.yaml | Configuration for 32B model context distillation training |
| sft/ctxdistill_qwen14b.yaml | Configuration for 14B model context distillation training |
| sft/controlled/data_split_for_data_ablation.py | Script to split data for safety ratio ablation experiments |
| sft/controlled/ctxdistill_qwen14b_safety_ratio_*.yaml | Training configs for different safety data ratios |
| sft/baselines/safecoder.yaml | SafeCoder baseline training configuration |
| sft/baselines/prosec-simpo.yaml | ProSec-SIMPO baseline training configuration |
| script/push_model_hub.py | Fixed example command comment |
| script/push_data_hub.py | Utility script for pushing datasets to Hugging Face Hub |

Comments suppressed due to low confidence (3)

script/push_data_hub.py:9

  • The function name 'push_model' is misleading as this function pushes datasets, not models. It should be renamed to 'push_dataset' or 'push_data' for clarity.
def push_model(path: str, split: str, dataset: str = None):
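
One way to apply the suggested rename without breaking existing callers is to keep the old name as a thin, deprecated alias. The sketch below shows only that pattern. The body of `push_dataset` is a placeholder that echoes its arguments, because the real upload logic in script/push_data_hub.py is not shown on this page. The `Optional[str]` annotation is a small tidy-up of `dataset: str = None`, not something the review asked for.

```python
import warnings
from typing import Optional


def push_dataset(path: str, split: str, dataset: Optional[str] = None) -> dict:
    """Renamed entry point; same parameters as the original `push_model`.

    Placeholder body (assumption): the actual upload code is not visible in
    this PR page, so this sketch only returns the arguments it received.
    """
    return {"path": path, "split": split, "dataset": dataset}


def push_model(*args, **kwargs):
    """Deprecated alias kept so that existing callers keep working."""
    warnings.warn(
        "push_model is deprecated; use push_dataset instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return push_dataset(*args, **kwargs)
```

For a one-off utility script, a plain rename without the alias may be enough.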

@ganler ganler merged commit 5ea30b7 into main Aug 5, 2025
2 checks passed
@ganler ganler deleted the sft branch August 7, 2025 09:01