Fluxentropy is an open-source engine designed to enhance curriculum learning for language models. By leveraging entropy as a metric to organize training data, Fluxentropy aims to improve the efficiency and performance of model training. Built with modularity in mind, the project centers on using pretrained language models (like Hugging Face's SmolLM) to assign entropy-based characteristics to dataset chunks, potentially speeding convergence and optimizing training. Fluxentropy stems from the open-source community's work on entropix, spearheaded by xjdr and doomslide (aka shrek and frog).
- Entropy Characterization: Fluxentropy’s core module, built on top of Hugging Face tools, enables entropy assessment and tagging of data chunks. The setup is customizable, handling tokenization, encoding, and entropy measurement in a flexible pipeline.
- Curriculum Learning via Entropy: For curriculum learning, dataset chunks are ordered by entropy instead of randomly, optimizing the learning progression (a minimal sketch of both steps follows this list).
- Potential for Enhanced RAG: Though initially focused on training, Fluxentropy’s entropy-based chunking could later enhance retrieval-augmented generation (RAG) tasks by prioritizing high-value chunks.
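Below is a minimal sketch of the two core steps, characterization and ordering, using standard Hugging Face APIs. The checkpoint (`HuggingFaceTB/SmolLM-135M`) and the exact signature of `entropy_characterize` are illustrative assumptions; only the function name comes from the roadmap below.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM-135M"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

@torch.no_grad()
def entropy_characterize(chunk: str) -> float:
    """Tag a text chunk with its mean next-token entropy (in nats)."""
    ids = tokenizer(chunk, return_tensors="pt", truncation=True).input_ids
    logits = model(ids).logits                    # (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of each next-token distribution, averaged over positions.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return entropy.mean().item()

# Curriculum ordering: sort chunks by entropy instead of shuffling them.
chunks = ["The cat sat on the mat.", "Entropy-guided curricula order data by difficulty."]
ordered = sorted(chunks, key=entropy_characterize)
```

Sorting by mean next-token entropy is the simplest curriculum; per-chunk statistics like max or variance of entropy are natural alternatives to experiment with.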
- Clone the repository:
git clone https://github.com/SinastrasC/fluxentropy.git
cd fluxentropy
- Install dependencies:
pip install -r requirements.txt
- Core Functionality: Develop an entropy characterization function based on Hugging Face tools, capable of tagging entropy levels for dataset chunks.
- Testing with nanoGPT: Use entropy-ordered data chunks to test training speed and convergence improvements in nanoGPT (a batching sketch follows this list).
- Visualization: Integrate visualization for entropy distribution and training efficiency metrics to optimize curriculum learning.
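For the nanoGPT test, one low-friction approach is to swap nanoGPT's random offset sampling for a cursor over entropy-sorted chunk offsets. The file names (`train.bin`, `entropy_sorted_offsets.npy`) and the drop-in shape of `get_batch` are assumptions for illustration, not an interface Fluxentropy defines.

```python
import numpy as np
import torch

block_size, batch_size = 1024, 12

# nanoGPT-style token file: flat uint16 token ids on disk.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")

# Chunk start offsets, pre-sorted low-to-high entropy by the
# characterization step (the .npy file name is an assumption).
sorted_offsets = np.load("entropy_sorted_offsets.npy")
cursor = 0

def get_batch():
    """Yield the next batch in curriculum (entropy) order instead of randomly."""
    global cursor
    if cursor + batch_size > len(sorted_offsets):
        cursor = 0  # start the next epoch from the lowest-entropy chunks
    ix = sorted_offsets[cursor:cursor + batch_size]
    cursor += batch_size
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```

Comparing loss curves against nanoGPT's default random sampling gives the speed and convergence signal this roadmap item is after.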
- Milestone 1: Build and validate the `entropy_characterize` function to tag entropy levels and output results to a file.
- Milestone 2: Implement visualization for entropy-based data preparation and assess improvements in training efficiency (a plotting sketch follows this list).
- Milestone 3: Connect Fluxentropy to a data import pipeline for data scheduling during training.
- Sidequest 1: Implement statistical analysis to gauge entropy-based ordering across models.
- Sidequest 2: Correlate benchmark Q&A performance with assigned entropy.
- Sidequest 3: Create a llama3-tokenized fineweb10B dataset for sorting.
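For Milestone 2, a first-pass visualization can be as simple as a histogram of the per-chunk entropies written out by Milestone 1. The input file name, and the assumption that it holds one entropy value per line, are both illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed output of the Milestone 1 step: one entropy value per line.
entropies = np.loadtxt("chunk_entropies.txt")

plt.hist(entropies, bins=50)
plt.xlabel("Mean next-token entropy (nats)")
plt.ylabel("Chunk count")
plt.title("Entropy distribution across dataset chunks")
plt.savefig("entropy_distribution.png", dpi=150)
```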
Collaboration is central to Fluxentropy! We’re focused on core features initially, but plan to bring on new contributors to expand functionality and test the engine across diverse tasks.
Join us in making model training smarter, one entropy-ordered chunk at a time!