- Generating Unsupervised & Supervised Training Data for a Text Embedding Model
- Text Embedding Model Fine-Tuning Methods: SimCSE, Triplet Loss, etc.
Data Generation Methods
- Data Generation Scripts
- This repository contains scripts for generating training datasets.
- Each script creates both unsupervised samples and supervised samples (anchor, positive, negative) from an input CSV file.
- They differ in how they construct anchor–positive–negative pairs, how the data is split, and how negatives are sampled.
- Unsupervised learning data: each row is converted to a single text string built from all of its columns.
- Mapping method of supervised learning data:
- For each anchor, create one anchor–positive pair per positive column, so the number of pairs per anchor equals the number of positive columns.
- Attach one random negative to each pair.
python gen_data/text_embedder_fine_tuning_data_gen_basic.py --data_path {csv data path} --encoding {encoding} --desc_col {anchor column} --category_col {hard negative column} --positive_cols {positive column1, ...} --output_unsupervised {unsupervised train data save path} --output_supervised {supervised train data save path}
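The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the function names `row_to_text` and `build_triplets`, the `"column: value"` text format, and the choice to draw each negative from category values that differ from the anchor's own are all assumptions made for the example.

```python
import random

def row_to_text(row):
    """Join every column of a row into one string (unsupervised sample).
    The "column: value" format is an assumption for this sketch."""
    return ", ".join(f"{col}: {val}" for col, val in row.items())

def build_triplets(rows, desc_col, category_col, positive_cols, seed=0):
    """Create one (anchor, positive, negative) triplet per positive column per row."""
    rng = random.Random(seed)
    all_categories = {r[category_col] for r in rows}
    triplets = []
    for row in rows:
        # Assumption: a negative is a category value other than the row's own.
        candidates = sorted(all_categories - {row[category_col]})
        if not candidates:
            continue  # no negative available for this anchor
        for col in positive_cols:
            triplets.append({
                "anchor": row[desc_col],
                "positive": row[col],
                "negative": rng.choice(candidates),
            })
    return triplets

# Toy rows standing in for the input CSV (made-up values).
rows = [
    {"desc": "tax-free investing", "category": "invest", "product": "NISA", "channel": "app"},
    {"desc": "buying a house", "category": "loan", "product": "home loan", "channel": "branch"},
]
triplets = build_triplets(rows, "desc", "category", ["product", "channel"])
unsupervised = [row_to_text(r) for r in rows]
```

With two rows and two positive columns, the sketch yields four triplets, two per anchor.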
- Same as the baseline (basic) script above, but with domain-based data separation: each domain gets its own dataset file.
python gen_data/text_embedder_fine_tuning_data_gen_domainwise.py --data_path {csv data path} --encoding {encoding} --desc_col {anchor column} --category_col {hard negative column} --positive_cols {positive column1, ...} --domain_col {domain column} --output_dir {unsupervised, supervised train data save folder path}
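Domain-based separation amounts to grouping rows by the domain column and writing one file per group. The sketch below shows that idea with the standard library only; the helper names and the `{domain}_{suffix}.csv` file naming are hypothetical and may not match what the script writes.

```python
import csv
import os
import tempfile
from collections import defaultdict

def split_by_domain(rows, domain_col):
    """Group rows by the value of the domain column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[domain_col]].append(row)
    return dict(groups)

def write_domain_files(groups, output_dir, suffix="supervised"):
    """Write one CSV per domain; the file naming here is hypothetical."""
    os.makedirs(output_dir, exist_ok=True)
    paths = {}
    for domain, domain_rows in groups.items():
        path = os.path.join(output_dir, f"{domain}_{suffix}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(domain_rows[0].keys()))
            writer.writeheader()
            writer.writerows(domain_rows)
        paths[domain] = path
    return paths

# Toy supervised rows with a made-up domain column.
rows = [
    {"domain": "banking", "anchor": "open an account", "positive": "savings", "negative": "loan"},
    {"domain": "banking", "anchor": "borrow money", "positive": "loan", "negative": "savings"},
    {"domain": "insurance", "anchor": "cover my car", "positive": "auto policy", "negative": "life policy"},
]
groups = split_by_domain(rows, "domain")
paths = write_domain_files(groups, tempfile.mkdtemp())
```

Each domain can then be trained or evaluated on its own file without filtering at load time.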
- Unsupervised learning data: each row is converted to a single text string built from all of its columns.
- Mapping method of supervised learning data:
- All positive values are fused into a single string
- E.g. "pos column1: xxx, pos column2: yyy, ..."
- For each anchor, attach multiple hard negatives (default: 5).
- One row per anchor
python gen_data/text_embedder_fine_tuning_data_gen_fusion_multineg.py --data_path {csv data path} --encoding {encoding} --desc_col {anchor column} --category_col {hard negative column} --positive_cols {positive column1, ...} --domain_col {domain column} --output_dir {unsupervised, supervised train data save folder path} --num_negatives {num of negatives}
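The fusion and multi-negative mapping can be sketched as follows. The fused string follows the "pos column1: xxx, pos column2: yyy" pattern given above; everything else (the function names, drawing negatives from other category values, and capping the count when fewer candidates exist) is an assumption for illustration rather than the script's confirmed behavior.

```python
import random

def fuse_positives(row, positive_cols):
    """Fuse all positive columns into one "column: value, ..." string."""
    return ", ".join(f"{col}: {row[col]}" for col in positive_cols)

def build_fused_rows(rows, desc_col, category_col, positive_cols,
                     num_negatives=5, seed=0):
    """Emit one record per anchor with a fused positive and several negatives."""
    rng = random.Random(seed)
    all_categories = {r[category_col] for r in rows}
    out = []
    for row in rows:
        # Assumption: negatives are category values other than the row's own.
        pool = sorted(all_categories - {row[category_col]})
        k = min(num_negatives, len(pool))  # cap when the pool is small
        out.append({
            "anchor": row[desc_col],
            "positive": fuse_positives(row, positive_cols),
            "negatives": rng.sample(pool, k),
        })
    return out

# Toy rows standing in for the input CSV (made-up values).
rows = [
    {"desc": "tax-free investing", "category": "invest", "product": "NISA", "channel": "app"},
    {"desc": "buying a house", "category": "loan", "product": "home loan", "channel": "branch"},
    {"desc": "saving monthly", "category": "savings", "product": "regular savings", "channel": "app"},
]
fused = build_fused_rows(rows, "desc", "category", ["product", "channel"], num_negatives=2)
```

Compared with the basic mapping, this produces fewer but richer rows: one per anchor instead of one per positive column.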
Fine-Tuning Process
- Set up the environment
conda create --name gemma-embedding python=3.10 -y
conda info --envs
conda activate gemma-embedding
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -r embedding_gemma_requirements.txt
pip install --upgrade accelerate transformers
- Fine-tuning
- Enter your Hugging Face token (huggingface_token) in the '.env' file.
export CUDA_VISIBLE_DEVICES=0
python embedding_gemma_fine_tuning_test.py
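The Results section compares runs by `triplet_margin` and `scale`. To make those two hyperparameters concrete, here is a small pure-Python sketch of a triplet loss and an in-batch ranking loss. It assumes cosine distance for the triplet loss and cosine similarity for the ranking loss; the training script most likely relies on a library implementation whose distance metric and defaults may differ, and AnglELoss is not covered here.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin), with d = 1 - cosine similarity (assumed)."""
    d_pos = 1.0 - cos_sim(anchor, positive)
    d_neg = 1.0 - cos_sim(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled similarities; positives[i] is the target for
    anchors[i] and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cos_sim(a, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-d embeddings (made-up values).
a, p, n = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
loss_small_margin = triplet_loss(a, p, n, margin=0.5)   # already separated
loss_large_margin = triplet_loss(a, p, n, margin=1.5)   # margin not yet met
loss_aligned = in_batch_ranking_loss([a, n], [a, n], scale=20.0)
loss_swapped = in_batch_ranking_loss([a, n], [n, a], scale=20.0)
```

A larger margin keeps penalizing triplets that a smaller margin already treats as solved, while a larger scale sharpens the softmax over in-batch candidates. With the ranking loss, the batch size also sets how many negatives each anchor sees.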
Results
- Before Fine-Tuning
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.281624
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: 0.291065
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: 0.178349
====================================
- TripletLoss
## triplet_margin=0.5, batch=3
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.652784
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: 0.032676
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: 0.183048
====================================
## triplet_margin=0.7, batch=3
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.812057
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: -0.170041
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: 0.091807
====================================
- MultipleNegativesRankingLoss
## scale = 20, batch = 3
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.904604
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: 0.878559
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: -0.050656
====================================
## scale = 30, batch = 3
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.705811
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: 0.612812
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: -0.024527
====================================
## scale = 30, batch = 2
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.880872
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: 0.751842
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: 0.516426
====================================
- AnglELoss
## scale = 20, batch = 3
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.624790
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: 0.304153
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: 0.247993
====================================
## scale = 30, batch = 3
============ get_scores ============
- Query: task: search result | query: I want to start a tax-free installment investment, what should I do?
Document: title: none | text: Opening a NISA Account -> 🤖 Score: 0.536980
Document: title: none | text: Opening a Regular Savings Account -> 🤖 Score: 0.285615
Document: title: none | text: Home Loan Application Guide -> 🤖 Score: 0.160317
====================================
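The scores above fall between -1 and 1, which is consistent with a cosine similarity between the query embedding and each document embedding, although the scoring code is not shown here. A minimal sketch of that kind of scoring, assuming cosine similarity and using made-up 2-d vectors in place of real model embeddings (the name `score_documents` is hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_documents(query_vec, doc_vecs):
    """Return {document text: similarity to the query}."""
    return {text: cosine_similarity(query_vec, vec) for text, vec in doc_vecs.items()}

# Toy vectors only; real embeddings come from the fine-tuned model.
query_vec = [1.0, 0.0]
doc_vecs = {
    "Opening a NISA Account": [0.9, 0.1],
    "Opening a Regular Savings Account": [0.5, 0.5],
    "Home Loan Application Guide": [-0.2, 1.0],
}
scores = score_documents(query_vec, doc_vecs)
best = max(scores, key=scores.get)
for text, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{text} -> Score: {score:.6f}")
```

When reading the results, the gap between the relevant document and the distractors matters as much as the top score itself.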