Fluxentropy is an open-source engine designed to enhance curriculum learning for language models. By leveraging entropy as a metric to organize training data, Fluxentropy aims to improve the efficiency and performance of model training. Built with modularity in mind, the project centers on using pretrained language models (like Hugging Face's SmolLM) to assign entropy-based characteristics to dataset chunks, potentially speeding convergence and optimizing training. Fluxentropy stems from the open-source community's work on entropix, spearheaded by xjdr and doomslide (aka shrek and frog).
- Entropy Characterization: Fluxentropy’s core module, built on top of Hugging Face tools, enables entropy assessment and tagging of data chunks. The setup is customizable, handling tokenization, encoding, and entropy measurement in a flexible pipeline.
- Curriculum Learning via Entropy: For curriculum learning, dataset chunks are ordered by entropy instead of randomly, optimizing the learning progression (a minimal sketch of both steps follows this list).
- Potential for Enhanced RAG: Though initially focused on training, Fluxentropy’s entropy-based chunking could later enhance retrieval-augmented generation (RAG) tasks by prioritizing high-value chunks.
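Below is a minimal sketch of the two core steps, characterization and ordering, using standard Hugging Face APIs. The checkpoint (`HuggingFaceTB/SmolLM-135M`) and the exact signature of `entropy_characterize` are illustrative assumptions; only the function name comes from the roadmap below.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM-135M"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

@torch.no_grad()
def entropy_characterize(chunk: str) -> float:
    """Tag a text chunk with its mean next-token entropy (in nats)."""
    ids = tokenizer(chunk, return_tensors="pt", truncation=True).input_ids
    logits = model(ids).logits                    # (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of each next-token distribution, averaged over positions.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return entropy.mean().item()

# Curriculum ordering: sort chunks by entropy instead of shuffling them.
chunks = ["The cat sat on the mat.", "Entropy-guided curricula order data by difficulty."]
ordered = sorted(chunks, key=entropy_characterize)
```

Sorting by mean next-token entropy is the simplest curriculum; per-chunk statistics like max or variance of entropy are natural alternatives to experiment with.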
- Clone the repository:
git clone https://github.com/SinastrasC/fluxentropy.git
cd fluxentropy
- Install dependencies:
pip install -r requirements.txt
- Core Functionality: Develop an entropy characterization function based on Hugging Face tools, capable of tagging entropy levels for dataset chunks.
- Testing with nanoGPT: Use entropy-ordered data chunks to test training speed and convergence improvements in nanoGPT (a batching sketch follows this list).
- Visualization: Integrate visualization for entropy distribution and training efficiency metrics to optimize curriculum learning.
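For the nanoGPT test, one low-friction approach is to swap nanoGPT's random offset sampling for a cursor over entropy-sorted chunk offsets. The file names (`train.bin`, `entropy_sorted_offsets.npy`) and the drop-in shape of `get_batch` are assumptions for illustration, not an interface Fluxentropy defines.

```python
import numpy as np
import torch

block_size, batch_size = 1024, 12

# nanoGPT-style token file: flat uint16 token ids on disk.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")

# Chunk start offsets, pre-sorted low-to-high entropy by the
# characterization step (the .npy file name is an assumption).
sorted_offsets = np.load("entropy_sorted_offsets.npy")
cursor = 0

def get_batch():
    """Yield the next batch in curriculum (entropy) order instead of randomly."""
    global cursor
    if cursor + batch_size > len(sorted_offsets):
        cursor = 0  # start the next epoch from the lowest-entropy chunks
    ix = sorted_offsets[cursor:cursor + batch_size]
    cursor += batch_size
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```

Comparing loss curves against nanoGPT's default random sampling gives the speed and convergence signal this roadmap item is after.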
- Milestone 1: Build and validate the `entropy_characterize` function to tag entropy levels and output results to a file.
- Milestone 2: Implement visualization for entropy-based data preparation and assess improvements in training efficiency (a plotting sketch follows this list).
- Milestone 3: Connect Fluxentropy to a data import pipeline for data scheduling during training.
- Sidequest 1: Implement statistical analysis to gauge entropy-based ordering across models.
- Sidequest 2: Correlate benchmark Q&A performance with assigned entropy.
- Sidequest 3: Create a llama3-tokenized fineweb10B dataset for sorting.
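For Milestone 2, a first-pass visualization can be as simple as a histogram of the per-chunk entropies written out by Milestone 1. The input file name, and the assumption that it holds one entropy value per line, are both illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed output of the Milestone 1 step: one entropy value per line.
entropies = np.loadtxt("chunk_entropies.txt")

plt.hist(entropies, bins=50)
plt.xlabel("Mean next-token entropy (nats)")
plt.ylabel("Chunk count")
plt.title("Entropy distribution across dataset chunks")
plt.savefig("entropy_distribution.png", dpi=150)
```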
Collaboration is central to Fluxentropy! We’re focused on core features initially, but plan to bring on new contributors to expand functionality and test the engine across diverse tasks.
Join us in making model training smarter, one entropy-ordered chunk at a time!