Welcome to my NLP assignments repository! Here, you'll find three projects exploring cutting-edge NLP techniques—from word embeddings to LLM fine-tuning and medical agent frameworks. Each assignment tackles real-world challenges with code, experiments, and detailed reports. Let’s dive in! 🔍
| Assignment | Topics Covered | Key Techniques | Dataset/Model | Highlights |
|---|---|---|---|---|
| 1: Word Embeddings | • Word2Vec/CBoW • Embedding visualization • Polysemy & analogies | • Skip-gram vs CBoW • Cosine similarity • t-SNE projections | Quora Question Pairs | Trained embeddings from scratch; achieved semantic clustering for words like *oil* and *banking*! |
| 2: LLM Medical Agent | • Prompt engineering • Multi-agent validation • Regulatory compliance | • Chain-of-Thought • Self-consistency checks • Tavily API integration | National Pharmacist Exam Qs | Boosted accuracy by 15% using agent debates and real-time guideline checks! 🩺✅ |
| 3: LLM Fine-Tuning | • QLoRA/PEFT • Chinese medical QA • 4-bit quantization | • LoRA adapters • Instruction tuning • Perplexity evaluation | Huatuo26M-Lite / Qwen-7B-Instruct | Achieved 80% accuracy on medical MCQs with just 12.76GB VRAM! 🌟 |
- Core Idea: "You shall know a word by the company it keeps!"
- Cool Finds:
  - `oil` clustered with `ecuador` and `industry` but not `venezuela` 🤔
  - Bias alert: `rich` was closer to `poor` than to `affluent` in cosine space!
- Tools: PyTorch, NLTK, Gensim, Bokeh.
- Code Snippets
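The bias finding above boils down to comparing cosine similarities in embedding space. A minimal NumPy sketch with toy 4-d vectors (stand-ins for illustration, not the embeddings trained on Quora Question Pairs):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for trained Word2Vec embeddings
emb = {
    "rich":     np.array([0.9, 0.1, 0.3, 0.2]),
    "poor":     np.array([0.8, 0.2, 0.4, 0.1]),
    "affluent": np.array([0.3, 0.9, 0.1, 0.5]),
}

print(cosine_similarity(emb["rich"], emb["poor"]))      # higher
print(cosine_similarity(emb["rich"], emb["affluent"]))  # lower
```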
- Innovation: Multi-agent debate framework:
- Drug Interaction Agent 🧪 + Regulatory Agent 📜 → Consensus-driven answers.
- Result: 74.3% accuracy on best-choice questions with <5s latency.
- Try This Prompt:
"As a pharmacist, cross-reference [Drug X] with [2024 formulary] before answering."
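The debate framework can be sketched in a few lines. This is a toy orchestrator with hypothetical stand-in agents (the real agents call an LLM and the Tavily API); here a majority vote plays the role of consensus:

```python
from collections import Counter

def debate(question, agents, rounds=2):
    """Toy consensus-by-debate: each agent answers, then revises after
    seeing its peers' answers; a majority vote picks the winner.
    Each agent is a callable: (question, peer_answers) -> answer."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [agent(question, answers) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-ins for the Drug Interaction and Regulatory agents
drug_agent  = lambda q, peers: "B"                           # confident in B
reg_agent   = lambda q, peers: "B" if "B" in peers else "A"  # defers to peers
tie_breaker = lambda q, peers: "B"

print(debate("Can Drug X be co-prescribed?", [drug_agent, reg_agent, tie_breaker]))  # B
```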
- Breakthrough: Supercharged Qwen-7B for Chinese medical QA with 4-bit quantization 🧠💡
- Tech Stack: HuggingFace Transformers, PEFT, Accelerate.
- Key Innovation:
  - Trained on Huatuo26M-Lite (doctor-patient dialogues) using PEFT and LoRA adapters.
  - Achieved 80% accuracy on medical MCQs while sipping just 12.76GB VRAM!
- Sample Output:
  - Patient: "My throat is irritated after eating too many sunflower seeds. What should I do?"
  - Model: drink plenty of water to keep the throat moist; stop eating the seeds; take a throat lozenge; see a doctor if symptoms persist → aligned with clinical guidelines ✅
- QLoRA Config:

  ```python
  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig

  # 4-bit NF4 quantization with bfloat16 compute for QLoRA
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "Qwen/Qwen2.5-7B-Instruct",
      quantization_config=bnb_config,
  )
  ```
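The LoRA adapters trained on top of this quantized base add a low-rank update to each frozen weight. A NumPy sketch of the math (for intuition only, not the PEFT implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 16   # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))          # frozen base weight (quantized in QLoRA)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_linear(x):
    """Base layer plus low-rank update: W x + (alpha/r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_linear(x), W @ x)
```

Only `A` and `B` are trained (2·r·d parameters instead of d², here 128 vs 256), which is why the full fine-tune fits in 12.76GB of VRAM.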
- Evaluation:
  - Perplexity ↓ to 5.31 (vs baseline >10)
  - Peak GPU usage: 12.76GB 🚀
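The perplexity metric above is the exponential of the mean per-token negative log-likelihood. A minimal sketch of the computation:

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return float(np.exp(-np.mean(token_log_probs)))

# Toy per-token probabilities for a 4-token sequence
log_probs = np.log([0.5, 0.25, 0.5, 0.25])
print(perplexity(log_probs))  # ~2.83, the geometric mean of 1/p
```

Lower is better: a perplexity of 5.31 means the fine-tuned model is, on average, about as uncertain as a uniform choice over ~5 tokens.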
| Metric | Assignment 1 | Assignment 2 | Assignment 3 |
|---|---|---|---|
| Accuracy | N/A | 74.3% | 80% |
| Training Time | 2.5 hours | API-based | 8 hours |
| GPU Memory Usage | 8GB (T4) | N/A | 12.76GB |
| Key Visualization | t-SNE embeddings | Compliance heatmaps | Perplexity = 5.31 📉 |
- Setup Environment:

  ```shell
  conda create -n nlp_course python=3.9
  conda activate nlp_course
  conda install pytorch=2.0 pytorch-cuda=11.7 -c pytorch -c nvidia
  pip install -r requirements.txt
  ```
- Run Experiments:
  - Assignment 1 (Word2Vec): `python train_cbow.py --window_size 2 --batch_size 128`
  - Assignment 2 (Medical Agent): `python agent_orchestrator.py --use_debate True --api_key YOUR_API_KEY`
  - Assignment 3 (QLoRA Fine-Tuning): `accelerate launch finetune_medical.py --model Qwen-7B --dataset Huatuo26M-Lite`
- Evaluate & Visualize:
  - Generate t-SNE plots for embeddings (Assignment 1).
  - Check compliance logs in `data/wrong_answers.json` (Assignment 2).
  - Monitor training metrics with `tensorboard --logdir ./logs` (Assignment 3).
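The t-SNE projection step for Assignment 1 can be sketched with scikit-learn (the repo plots with Bokeh; this only shows the dimensionality reduction, with random vectors standing in for trained embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Toy stand-ins for trained word embeddings: 20 words, 50 dims
embeddings = rng.normal(size=(20, 50))
words = [f"word{i}" for i in range(20)]

# Project to 2-D for plotting; perplexity must be < number of points
coords = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(embeddings)
print(coords.shape)  # (20, 2)
```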
- Guides & Tools:
- Datasets & Models:
- Code Repos:
Made with ❤️ by Zijin CAI
"From embeddings to life-saving diagnostics—code that cares!" 🌟