Chat data cleaning, filtering and deduplication pipeline.
-
Updated
Jul 25, 2023 - Python
Chat data cleaning, filtering and deduplication pipeline.
About working Propmting in OpenAI models, it is also used with deffrent pettren Alpaca prompt, INST prompt
Dolphin 3.0 🐬: Versatile AI for coding, math, and more
Standardized spec and vendor-specific transforms for ChatML
A Python-based interactive CLI interface for chatting with Hugging Face language models, optimized for casual, Discord-style conversation using ChatML. Supports both quantized and full-precision models, live token streaming with color formatting, and dynamic generation parameter adjustment.
A dataset toolbox for preparing and analyzing conversational datasets, including CSV splitting, CSV → Parquet conversion, dataset statistics, Parquet cleaning and sorting, HuggingFace–style metadata generation, and batched chain insertion into PostgreSQL — with Rich progress, multiprocessing, and 32 GB-RAM-friendly batching.
SmolLM2 🤗: Family of lightweight language models, performs diverse tasks on-device
Deepseek-Dataset-Generator creates conversational datasets for LLM fine-tuning via DeepSeek API. Supports various formats (ChatML, ShareGPT, Alpaca, JSON, CSV), easy configuration via YAML and detailed logs. Ideal for generating realistic and customized data quickly.
Week 5 project: build a hybrid retriever that fuses FAISS dense vectors with SQLite FTS5/BM25 keyword search (RRF/weighted fusion), plus a Supervised Fine-Tuning (SFT) pipeline (Full FT vs LoRA/QLoRA) using TRL/PEFT/DeepSpeed.
Qwen2.5-Coder: Family of LLMs excels in code, debugging, etc
Add a description, image, and links to the chatml topic page so that developers can more easily learn about it.
To associate your repository with the chatml topic, visit your repo's landing page and select "manage topics."