[BUG]: HuggingFace dataset materializer problem #1135
Comments
Hi @ghpu, thank you for reporting this issue. Is there a code snippet that could help me replicate it? Also, would you please post the output of …, …, and …?
Here is a minimal example producing the issue:

```python
from zenml.pipelines import pipeline
from zenml.steps import step
from zenml.steps import BaseParameters, Output

import datasets, transformers
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer


class Params(BaseParameters):
    model = "distilbert-base-uncased"
    dataset_name = "imdb"
    num_labels = 2
    label_col = 'label'
    text_col = 'text'
    batch_size = 16
    epochs = 3
    log_steps = 100
    output_max_length = 128
    max_length = 512
    learning_rate = 2e-5
    weight_decay = 0.01


@step
def load_dataset(params: Params) -> datasets.DatasetDict:
    data = datasets.load_dataset(params.dataset_name)
    return data


@step
def load_tokenizer(params: Params) -> transformers.PreTrainedTokenizerBase:
    tokenizer = AutoTokenizer.from_pretrained(params.model,
                                              model_max_length=params.max_length)
    return tokenizer


@step
def tokenize(params: Params,
             tokenizer: transformers.PreTrainedTokenizerBase,
             data: datasets.DatasetDict) -> datasets.DatasetDict:
    data = data.map(
        lambda exs: tokenizer(exs[params.text_col], truncation=True),
        batched=True)
    return data


@step
def train_model(params: Params,
                tokenized_data: datasets.DatasetDict,
                tokenizer: transformers.PreTrainedTokenizerBase) -> transformers.PreTrainedModel:
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    model = AutoModelForSequenceClassification.from_pretrained(params.model,
                                                               num_labels=params.num_labels)
    training_args = TrainingArguments(output_dir="./results/",
                                      learning_rate=params.learning_rate,
                                      per_device_train_batch_size=params.batch_size,
                                      per_device_eval_batch_size=params.batch_size,
                                      num_train_epochs=params.epochs,
                                      weight_decay=params.weight_decay)
    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=tokenized_data["train"],
                      eval_dataset=tokenized_data["test"],
                      tokenizer=tokenizer,
                      data_collator=data_collator)
    trainer.train()
    return model


@pipeline(enable_cache=False)
def test_pipeline(load_data,
                  load_tokenizer,
                  tokenize_data,
                  train_model):
    data = load_data()
    tokenizer = load_tokenizer()
    tokenized_data = tokenize_data(data=data, tokenizer=tokenizer)
    model = train_model(tokenized_data=tokenized_data, tokenizer=tokenizer)


def main(**kwargs):
    pipeline = test_pipeline(
        load_data=load_dataset(),
        load_tokenizer=load_tokenizer(),
        tokenize_data=tokenize(),
        train_model=train_model()
    )
    pipeline.run(unlisted=True)


if __name__ == "__main__":
    main()
```
Hi @ghpu, @morganveyret, thank you for reporting this issue and providing the minimal code example. We now have an open PR, and the fix will be released by the end of this week. However, if you want to use it before then, you can create a custom materializer.
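For anyone who needs a stop-gap before the release, a rough sketch of such a custom materializer is shown below. It follows the pre-0.31 handle_input/handle_return materializer API and swaps TemporaryDirectory for tempfile.mkdtemp so the extracted Arrow files outlive the materializer call. The class name, the DATASET_DIR constant, and the io_utils.copy_dir/DataArtifact imports are assumptions based on the ZenML code base around that release, not the contents of the actual PR, so adapt them to your version:

```python
import os
import tempfile

from datasets import Dataset, DatasetDict, load_from_disk
from zenml.artifacts import DataArtifact  # assumed import path for this ZenML version
from zenml.materializers.base_materializer import BaseMaterializer
from zenml.utils import io_utils  # assumed helper for copying directories to/from the artifact store

DATASET_DIR = "hf_datasets"  # illustrative subdirectory name inside the artifact URI


class PersistentHFDatasetMaterializer(BaseMaterializer):
    """Sketch of a materializer whose local dataset copy survives after loading."""

    ASSOCIATED_TYPES = (Dataset, DatasetDict)
    ASSOCIATED_ARTIFACT_TYPES = (DataArtifact,)

    def handle_input(self, data_type):
        super().handle_input(data_type)
        # mkdtemp is NOT removed automatically, so the Arrow cache files
        # referenced by the returned dataset remain on disk for the next step.
        temp_dir = tempfile.mkdtemp()
        io_utils.copy_dir(os.path.join(self.artifact.uri, DATASET_DIR), temp_dir)
        return load_from_disk(temp_dir)

    def handle_return(self, ds):
        super().handle_return(ds)
        # Save locally, then copy the whole directory into the artifact store.
        temp_dir = tempfile.mkdtemp()
        path = os.path.join(temp_dir, DATASET_DIR)
        ds.save_to_disk(path)
        io_utils.copy_dir(path, os.path.join(self.artifact.uri, DATASET_DIR))
```

A step can then be pointed at it with something like load_dataset().with_return_materializers(PersistentHFDatasetMaterializer) (again, per the step API of that release). Note that the directories created by mkdtemp are never deleted automatically, which is exactly the clean-up question raised in the report.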
This was fixed in 0.30.0.
Contact Details [Optional]
No response
System Information
zenml v0.30.0rc0 (the issue also applies to v0.22 and v0.23)
ZenML version: 0.30.0rc0
Install path: /opt/anaconda3/lib/python3.9/site-packages/zenml
Python version: 3.9.7
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '22.10'}
Environment: native
Integrations: ['github', 'huggingface', 'pillow', 'plotly', 'pytorch', 'pytorch_lightning', 'scipy', 'sklearn']
The current user is: 'default'
The active project is: 'default' (global)
The active stack is: 'default' (global)
┏━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┓
┃ COMPONENT_TYPE │ COMPONENT_NAME ┃
┠────────────────┼────────────────┨
┃ ARTIFACT_STORE │ default ┃
┠────────────────┼────────────────┨
┃ ORCHESTRATOR │ default ┃
┗━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┛
What happened?
When an Arrow dataset is passed between two steps through the HuggingFace dataset materializer, it cannot be used in the second step because its cache-file folder no longer exists.
The problem arises from the use of TemporaryDirectory in handle_input, which is destroyed on return even though the directory is still referenced in the dataset info. The folder must survive as long as the dataset is in use, so a simple workaround would be to use tempfile.mkdtemp instead of TemporaryDirectory.
However, we should still find a way to clean up properly after use.
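For illustration, a minimal standalone sketch (plain Python, independent of ZenML or datasets) of the lifetime difference behind the failure: a TemporaryDirectory disappears as soon as it is cleaned up or goes out of scope, while a directory from tempfile.mkdtemp stays around until it is deleted explicitly.

```python
import os
import tempfile

# TemporaryDirectory: the directory vanishes when the object is cleaned up,
# so anything still pointing at files inside it (like a dataset's cache files)
# breaks afterwards.
tmp = tempfile.TemporaryDirectory()
path_a = os.path.join(tmp.name, "data.arrow")
open(path_a, "w").close()
tmp.cleanup()                    # same effect as the object going out of scope
print(os.path.exists(path_a))    # False

# mkdtemp: the directory survives until it is removed explicitly,
# so files referenced later are still readable.
persistent = tempfile.mkdtemp()
path_b = os.path.join(persistent, "data.arrow")
open(path_b, "w").close()
print(os.path.exists(path_b))    # True -- the caller is responsible for cleanup
```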
Reproduction steps
No response
Relevant log output
No response
Code of Conduct