
[feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25

Closed

@tscholak

🧐 Problem Description

Currently, creating a training dataset with Fast-LLM involves a multi-step, cumbersome process:

  1. Organizing Datasets: Start with a collection of memory-mapped Megatron dataset files in different folders, each typically corresponding to a dataset (e.g., Fineweb-edu or The Stack).
  2. Generating JSON Manifests: Use a separate script (concatenate_dataset.py) to create a JSON manifest for each folder. Each file entry is weighted by its token count to ensure uniform sampling in tokens (a hypothetical example follows this list).
  3. Defining the Dataset Mix: Use fml-ops' mix_datasets.py to combine these manifests into a final weighted dataset mix (e.g., 30% from The Stack and 70% from Fineweb-edu).
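
For illustration, the manifest produced in step 2 might look roughly like the following (the field names here are hypothetical; the actual schema is whatever concatenate_dataset.py emits). Note how each file's weight is proportional to its token count:

{
  "datasets": [
    {"prefix": "fineweb-edu/shard_000", "num_tokens": 1048576, "weight": 0.52},
    {"prefix": "fineweb-edu/shard_001", "num_tokens": 967680, "weight": 0.48}
  ]
}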

This workflow is inefficient, error-prone (e.g., issue #71), and less user-friendly than what other LLM training frameworks offer through simpler, more integrated data-loading mechanisms:

  • Mosaic Composer: Allows composing datasets with minimal code:
    from streaming import Stream, StreamingDataset

    stream_A = Stream(remote='s3://stream_A_remote', local='/tmp/stream_A', proportion=0.25)
    stream_B = Stream(remote='s3://stream_B_remote', local='/tmp/stream_B', proportion=0.75)
    dataset = StreamingDataset(streams=[stream_A, stream_B])
  • Megatron's BlendedDataset: Supports combining datasets with different weights directly:
    BlendedDataset(
        (get_train_dataset('/my/dataset/image_dataset', ...), 0.6),
        (get_train_dataset('/my/dataset/captioning_dataset', ...), 0.4),
    )

The additional steps required by Fast-LLM add complexity and make it less competitive in data handling and preparation.

💡 Proposed Solution

  1. Integrate the Preprocessing Step into Fast-LLM:

    • Embed the current preprocessing capabilities directly into the Fast-LLM framework, allowing it to load complex dataset mixtures without requiring a separate preprocessing step.
  2. Revamp the Dataset Configuration Format:

    • Update the format to specify a list of data paths, each with a target proportion representing the fraction of the final dataset's tokens that will come from that path.
    • For directories, distribute the target proportion across datasets in the folder based on their token counts. For individual files, apply the proportion directly.
    • Extend support to additional formats, such as Parquet files, in the future.

Example Configuration

datasets:
  - path: /data/datasets/folder1
    target_proportion: 0.6
  - path: /data/datasets/single_file.idx
    target_proportion: 0.4

With this setup, Fast-LLM will automatically distribute the proportions among datasets within the specified paths.
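
A minimal sketch of that resolution logic in Python (resolve_weights, token_counts, and the .idx filter are illustrative assumptions for this issue, not part of Fast-LLM's current API):

from pathlib import Path

def resolve_weights(entries, token_counts):
    """Expand configured paths into per-file sampling weights.

    entries: list of {"path": ..., "target_proportion": ...} dicts, as in the
    example configuration above.
    token_counts: mapping from dataset file (Path) to its token count; how the
    counts are obtained is left open here.
    """
    weights = []
    for entry in entries:
        path = Path(entry["path"])
        if path.is_dir():
            # Split the folder's proportion across its dataset files in
            # proportion to their token counts, keeping sampling uniform in
            # tokens within the folder.
            files = sorted(p for p in path.iterdir() if p.suffix == ".idx")
            total = sum(token_counts[f] for f in files)
            for f in files:
                weights.append((f, entry["target_proportion"] * token_counts[f] / total))
        else:
            # A single file receives its target proportion directly.
            weights.append((path, entry["target_proportion"]))
    return weights

The resulting (file, weight) pairs could then feed Fast-LLM's existing weighted sampling directly, removing the need for intermediate JSON manifests.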

🔄 Alternatives Considered

  1. Keep the Existing Script-Based Workflow:

    • This option would retain complexity and dependencies (e.g., fml-ops), requiring users to manage intermediate files and multiple steps—issues we've already faced.
  2. Provide a Standalone Utility for Merging and Weighting:

    • While combining the tooling into one utility would reduce some complexity, it would still keep preprocessing separate from the main training workflow and would make the utility depend on Fast-LLM's dataset implementation.

📈 Potential Benefits

  1. Improved Usability:

    • Loading datasets directly from structured folders simplifies usage, saving time and effort for existing users and new adopters, who may otherwise find the current process daunting.
  2. Enhanced Competitiveness:

    • Bringing data handling into Fast-LLM will align it with alternative frameworks that offer more seamless data-loading capabilities.
  3. Streamlined Workflow:

    • Reducing the number of steps between data preparation and training will improve efficiency and reduce the potential for user error.

📝 Additional Context

Integrating preprocessing directly into Fast-LLM would bring it closer to modern LLM frameworks that offer unified dataset preparation. This approach will facilitate future support for custom dataset implementations, such as streaming Parquet files from cloud storage (e.g., S3). For reference, frameworks like Mosaic's Composer already provide flexible data-loading options, making integration smoother.
