Torchtune dataset cache memory error despite providing dataset file path #2775

Open
@zabir110

Hello,

I am trying to use torchtune to fine-tune the Llama-3.1-8B model on our local data, so I am setting up a custom config file. However, whenever I launch the run with

tune run full_finetune_single_device --config custom_config.yaml

I get the following error:

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

This is how I updated our config file:

dataset:
  _component_: torchtune.datasets.instruct_dataset
  data_files: /gpfs/u/home/project_name/project_user/scratch/datasets/custom_instruct_data_file.json
  source: json
  split: train

From the error logs, it appears that even though I specified the data file location, it still downloads and prepares a dataset into a cache directory:

Downloading and preparing dataset json/default to file:///gpfs/u/home/project_name/project_user/scratch/Llama-3.1/huggingface_cache/datasets/json/default-04bb5956f8665e2d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7244.05it/s]
Extracting data files: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 105.73it/s]
Dataset json downloaded and prepared to file:///gpfs/u/home/project_name/project_user/scratch/Llama-3.1/huggingface_cache/datasets/json/default-04bb5956f8665e2d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.

Initially, I thought the problem was the default .cache location, so I changed the cache path using

export FT_ROOT=/gpfs/u/home/project_name/project_user/scratch/Llama-3.1

# inside it, make a folder for ALL HF caches
mkdir -p $FT_ROOT/huggingface_cache
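
The commands above create the folder but do not show how the Hugging Face libraries were told to use it. Below is a minimal sketch, assuming the cache is redirected through the `HF_HOME` and `HF_DATASETS_CACHE` environment variables before `datasets` is imported. The helper name `point_hf_caches_at` is hypothetical, and this only changes where the cache lives, so it may not affect the `NotImplementedError` itself:

```python
import os
from pathlib import Path


def point_hf_caches_at(root: str) -> dict:
    """Create <root>/huggingface_cache and point the HF cache env vars at it.

    Hypothetical helper. It should run before `datasets` is imported,
    because the library reads these variables when it is first imported.
    """
    cache_root = Path(root) / "huggingface_cache"
    datasets_cache = cache_root / "datasets"
    datasets_cache.mkdir(parents=True, exist_ok=True)

    settings = {
        "HF_HOME": str(cache_root),
        "HF_DATASETS_CACHE": str(datasets_cache),
    }
    os.environ.update(settings)
    return settings
```

With `FT_ROOT` exported as above, this would be called as `point_hf_caches_at(os.environ["FT_ROOT"])` at the very top of a launcher script.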

However, I still get the same "Loading a dataset cached in a LocalFileSystem is not supported" error. Either way, I do not see why torchtune prepares data in a cache directory when the file location is already specified. Am I doing something wrong? Does "downloading" here simply mean copying the local data file to a separate location for processing, or is it fetching data from the Hugging Face Hub? Why is it caching the files to a LocalFileSystem at all? Why can it not read directly from the provided file location? Is there a way to load the data directly into RAM?
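
To separate a problem with the file from a problem with the caching layer, the data file can be read straight into memory with the standard library. This is only a sketch: the helper `load_records` is hypothetical, it assumes the file is either a single JSON document or JSON Lines, and it bypasses `datasets` entirely rather than reproducing anything torchtune does internally:

```python
import json
from pathlib import Path


def load_records(path: str) -> list:
    """Read a .json or .jsonl file fully into memory as a list of records."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat it as JSON Lines,
        # i.e. one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A top-level list is already a list of records;
    # any other top-level value is wrapped as a single record.
    return data if isinstance(data, list) else [data]
```

If this succeeds on the file given in `data_files`, the file is readable and well-formed, which would point toward the caching step rather than the data or its permissions.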

How do I resolve this error? Do we need to grant some read/write permission, or do we need to change versions? I am on version 0.6.1.
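
Since both the error message and the "Downloading and preparing dataset" log lines come from the dataset-loading layer, the versions of the packages underneath torchtune may matter as well. Below is a small sketch for collecting them. The choice of `datasets` and `fsspec` as the extra packages to inspect is an assumption (they are the libraries that manage the cache and the `LocalFileSystem` object), not something confirmed as the cause here, and `installed_versions` is a hypothetical helper:

```python
from importlib import metadata


def installed_versions(packages=("torchtune", "datasets", "fsspec")) -> dict:
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the same environment that `tune run` uses gives a version list that can be attached to the report.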
