This repository contains three projects exploring cutting-edge NLP techniques—from word embeddings to LLM fine-tuning and medical agent frameworks. Each assignment tackles real-world challenges with code, experiments, and detailed reports

CAI991108/Natural-Language-Processing-Utils

Natural Language Processing Utils Repository 🚀

Welcome to my NLP assignments repository! Here, you'll find three projects exploring cutting-edge NLP techniques—from word embeddings to LLM fine-tuning and medical agent frameworks. Each assignment tackles real-world challenges with code, experiments, and detailed reports. Let’s dive in! 🔍


📂 Assignment Overview

| Assignment | Topics Covered | Key Techniques | Dataset/Model | Highlights |
|---|---|---|---|---|
| 1: Word Embeddings | Word2Vec/CBoW; embedding visualization; polysemy & analogies | Skip-gram vs CBoW; cosine similarity; t-SNE projections | Quora Question Pairs | Trained embeddings from scratch, achieved semantic clustering for words like oil and banking! |
| 2: LLM Medical Agent | Prompt engineering; multi-agent validation; regulatory compliance | Chain-of-Thought; self-consistency checks; Tavily API integration | National Pharmacist Exam Qs | Boosted accuracy by 15% using agent debates and real-time guideline checks! 🩺✅ |
| 3: LLM Fine-Tuning | QLoRA/PEFT; Chinese medical QA; 4-bit quantization | LoRA adapters; instruction tuning; perplexity evaluation | Huatuo26M-Lite, Qwen-7B-Instruct | Achieved 80% accuracy on medical MCQs with just 12.76GB VRAM! 🌟 |

🛠️ Technical Deep-Dive

Assignment 1: Word Embeddings

  • Core Idea: "You shall know a word by the company it keeps!"
  • Cool Finds:
    • oil clustered with ecuador and industry but not venezuela 🤔
    • Bias alert: rich was closer to poor than affluent in cosine space!
  • Tools: PyTorch, NLTK, Gensim, Bokeh.
  • Code Snippets
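
The two ideas above (context windows and cosine similarity) can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's training code: `cbow_pairs` and `cosine_similarity` are hypothetical helper names, and the `toy` vectors are made-up 3-d values chosen only to show the comparison, not the trained embeddings.

```python
import math


def cbow_pairs(tokens, window_size=2):
    """Build (context_words, target_word) pairs as used for CBoW training.

    For each position, the context is up to `window_size` words on each side.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        context = left + right
        if context:  # skip single-word sentences with no context
            pairs.append((context, target))
    return pairs


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only (NOT the trained embeddings).
toy = {
    "rich": [0.9, 0.1, 0.3],
    "poor": [0.8, 0.2, 0.4],
    "affluent": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    for context, target in cbow_pairs("how do oil prices change".split()):
        print(context, "->", target)
    print(cosine_similarity(toy["rich"], toy["poor"]))
    print(cosine_similarity(toy["rich"], toy["affluent"]))
```

Antonyms such as rich and poor tend to appear in similar contexts, which is one plausible reason they can end up close together in cosine space.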

Assignment 2: Medical LLM Agent

  • Innovation: Multi-agent debate framework:
    • Drug Interaction Agent 🧪 + Regulatory Agent 📜 → Consensus-driven answers.
  • Result: 74.3% accuracy on best-choice questions with <5s latency.
  • Try This Prompt:
    "As a pharmacist, cross-reference [Drug X] with [2024 formulary] before answering."

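The consensus step of a debate framework like the one above might look roughly like the sketch below. Everything here is an assumption for illustration: `run_debate` is a hypothetical function, and the two agents are deterministic stubs standing in for real LLM calls, so this shows only the control flow (answer, compare, revise, then majority vote), not the repository's actual orchestration.

```python
from collections import Counter


def run_debate(question, agents, max_rounds=3):
    """Return (final_answer, rounds_used).

    `agents` maps a name to a callable(question, peer_answers) -> answer.
    Agents answer independently first, then revise after seeing their peers
    until they agree or `max_rounds` is reached; a majority vote decides.
    """
    answers = {name: agent(question, {}) for name, agent in agents.items()}
    rounds = 1
    while len(set(answers.values())) > 1 and rounds < max_rounds:
        answers = {
            name: agent(question, {k: v for k, v in answers.items() if k != name})
            for name, agent in agents.items()
        }
        rounds += 1
    winner, _ = Counter(answers.values()).most_common(1)[0]
    return winner, rounds


def drug_interaction_agent(question, peer_answers):
    # Hypothetical stub: a real agent would query an LLM about interactions.
    return "B"


def regulatory_agent(question, peer_answers):
    # Hypothetical stub: starts with "C", then defers to the first peer answer.
    if peer_answers:
        return next(iter(peer_answers.values()))
    return "C"


if __name__ == "__main__":
    agents = {
        "drug_interaction": drug_interaction_agent,
        "regulatory": regulatory_agent,
    }
    print(run_debate("Which option is the best choice?", agents))
```

Capping the number of rounds keeps latency bounded, at the cost of sometimes falling back to a vote instead of true agreement.
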
Assignment 3: QLoRA Fine-Tuning

  • Breakthrough: Supercharged Qwen-7B for Chinese medical QA with 4-bit quantization 🧠💡

  • Tech Stack: HuggingFace Transformers, PEFT, Accelerate.

  • Key Innovation:

    • Trained on Huatuo26M-Lite (doctor-patient dialogues) using PEFT and LoRA adapters.
    • Achieved 80% accuracy on medical MCQs while sipping just 12.76GB VRAM!
  • Sample Output:

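    Patient: "喉咙不舒服,瓜子吃多了怎么办?" (My throat feels uncomfortable after eating too many roasted seeds. What should I do?)
    Model:

    1. 多喝水保持咽喉湿润 (Drink plenty of water to keep the throat moist)
    2. 停止食用瓜子 (Stop eating the seeds)
    3. 含服润喉片 (Take throat lozenges)
    4. 症状持续需就医 (See a doctor if symptoms persist) → 临床指南对齐 (aligned with clinical guidelines)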
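  • QLoRA Config:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # QLoRA config: load base weights in 4-bit NF4, run compute in bfloat16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-7B-Instruct",
        quantization_config=bnb_config,
    )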
  • Evaluation:

    • Perplexity reduced to 5.31 (vs baseline >10)

    • Peak GPU usage: 12.76GB 🚀

  • Colab Demo | HuggingFace Model
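
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below is a minimal illustration assuming you already have per-token natural-log probabilities from the model; `perplexity` is a hypothetical helper, not the evaluation script used here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p(token_i | previous tokens)))
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token is as uncertain
    # as a uniform choice among four options.
    print(perplexity([math.log(0.25)] * 4))
```

Under this definition, a perplexity of 5.31 corresponds to a mean negative log-likelihood of about 1.67 nats per token.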


📊 Results at a Glance

| Metric | Assignment 1 | Assignment 2 | Assignment 3 |
|---|---|---|---|
| Accuracy | N/A | 74.3% | 80% |
| Training Time | 2.5 hours | API-based | 8 hours |
| GPU Memory Usage | 8GB (T4) | N/A | 12.76GB |
| Key Visualization | t-SNE embeddings | Compliance heatmaps | Perplexity=5.31 📉 |

🚀 How to Replicate

  1. Setup Environment:

    conda create -n nlp_course python=3.9
    conda activate nlp_course
    conda install pytorch=2.0 pytorch-cuda=11.7 -c pytorch -c nvidia
    pip install -r requirements.txt
  2. Run Experiments:

    • Assignment 1 (Word2Vec):
    python train_cbow.py --window_size 2 --batch_size 128
    • Assignment 2 (Medical Agent):
    python agent_orchestrator.py --use_debate True --api_key YOUR_API_KEY
    • Assignment 3 (QLoRA Fine-Tuning):
    accelerate launch finetune_medical.py --model Qwen-7B --dataset Huatuo26M-Lite
  3. Evaluate & Visualize:

    • Generate t-SNE plots for embeddings (Assignment 1).

    • Check compliance logs in data/wrong_answers.json (Assignment 2).

    • Monitor training metrics with tensorboard --logdir ./logs (Assignment 3).


📚 Resources & References


Made with ❤️ by Zijin CAI

"From embeddings to life-saving diagnostics—code that cares!" 🌟
