
[BUG]: HuggingFace dataset materializer problem #1135

Closed · ghpu opened this issue Dec 5, 2022 · 4 comments
Labels: bug (Something isn't working)

ghpu commented Dec 5, 2022

Contact Details [Optional]

No response

System Information

zenml v0.30.0rc0 (also applies to v0.22 and v0.23)

ZenML version: 0.30.0rc0
Install path: /opt/anaconda3/lib/python3.9/site-packages/zenml
Python version: 3.9.7
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '22.10'}
Environment: native
Integrations: ['github', 'huggingface', 'pillow', 'plotly', 'pytorch', 'pytorch_lightning', 'scipy', 'sklearn']

The current user is: 'default'
The active project is: 'default' (global)
The active stack is: 'default' (global)

    Stack Configuration        

┏━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┓
┃ COMPONENT_TYPE │ COMPONENT_NAME ┃
┠────────────────┼────────────────┨
┃ ARTIFACT_STORE │ default        ┃
┠────────────────┼────────────────┨
┃ ORCHESTRATOR   │ default        ┃
┗━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┛

What happened?

While passing an Arrow dataset between two steps through the HuggingFace dataset materializer, the dataset cannot be used in the second step because its cache-file folder no longer exists.

The problem arises from the use of TemporaryDirectory in handle_input, which is destroyed on return, while the directory is still referenced in the dataset's info. The folder must survive as long as the dataset is in use, so a simple workaround would be to use tempfile.mkdtemp instead of TemporaryDirectory.

But we should still find a way to clean up properly after use.
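
As a standalone illustration of the lifetime difference (plain Python, independent of ZenML's actual materializer code; the file name here is made up):

import pathlib
import tempfile

# With TemporaryDirectory, the directory is deleted as soon as the context
# manager exits, so anything still pointing at files inside it breaks:
with tempfile.TemporaryDirectory() as tmp:
    cache_file = pathlib.Path(tmp) / "data.arrow"
    cache_file.write_bytes(b"...")
assert not cache_file.exists()  # gone after the with-block

# With mkdtemp, the directory persists until removed explicitly, so a
# memory-mapped dataset could keep reading its cache files across steps:
tmp = tempfile.mkdtemp()
cache_file = pathlib.Path(tmp) / "data.arrow"
cache_file.write_bytes(b"...")
assert cache_file.exists()  # survives; the caller must clean it up later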

Reproduction steps

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
ghpu added the bug label on Dec 5, 2022
dnth self-assigned this on Dec 5, 2022
dnth (Contributor) commented Dec 5, 2022

Hi @ghpu, thank you for reporting this issue. Is there a code snippet that could help me replicate it?
It would also be helpful if you could post the full error traceback here. :)

Also, could you please post the output of the following commands?

python -c "import zenml.environment; print(zenml.environment.get_system_details())"

zenml status

zenml stack describe

morganveyret commented

Here is a minimal example reproducing the issue:

from zenml.pipelines import pipeline
from zenml.steps import step
from zenml.steps import BaseParameters, Output

import datasets, transformers
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

class Params(BaseParameters):
    model = "distilbert-base-uncased"
    dataset_name = "imdb"
    num_labels = 2
    label_col = 'label'
    text_col = 'text'
    batch_size = 16
    epochs = 3
    log_steps = 100
    output_max_length = 128
    max_length = 512
    learning_rate = 2e-5
    weight_decay = 0.01

@step
def load_dataset(params: Params) -> datasets.DatasetDict:
    data = datasets.load_dataset(params.dataset_name)
    return data


@step
def load_tokenizer(params: Params) -> transformers.PreTrainedTokenizerBase:
    tokenizer = AutoTokenizer.from_pretrained(params.model,
                                              model_max_length=params.max_length)
    return tokenizer

@step
def tokenize(params: Params,
             tokenizer: transformers.PreTrainedTokenizerBase,
             data: datasets.DatasetDict) -> datasets.DatasetDict:
    data = data.map(
        lambda exs: tokenizer(exs[params.text_col], truncation=True),
        batched=True)
    return data

@step
def train_model(params: Params,
                  tokenized_data: datasets.DatasetDict,
                  tokenizer: transformers.PreTrainedTokenizerBase) -> transformers.PreTrainedModel:
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    model = AutoModelForSequenceClassification.from_pretrained(params.model,
                                                               num_labels=params.num_labels)
    training_args = TrainingArguments(output_dir="./results/",
                                      learning_rate=params.learning_rate,
                                      per_device_train_batch_size=params.batch_size,
                                      per_device_eval_batch_size=params.batch_size,
                                      num_train_epochs=params.epochs,
                                      weight_decay=params.weight_decay)
    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=tokenized_data["train"],
                      eval_dataset=tokenized_data["test"],
                      tokenizer=tokenizer,
                      data_collator=data_collator)
    trainer.train()
    return model


@pipeline(enable_cache=False)
def test_pipeline(load_data,
                  load_tokenizer,
                  tokenize_data,
                  train_model):
    data = load_data()
    tokenizer = load_tokenizer()
    tokenized_data = tokenize_data(data=data, tokenizer=tokenizer)
    model = train_model(tokenized_data=tokenized_data, tokenizer=tokenizer)

def main(**kwargs):
    pipeline_instance = test_pipeline(
        load_data=load_dataset(),
        load_tokenizer=load_tokenizer(),
        tokenize_data=tokenize(),
        train_model=train_model()
    )
    pipeline_instance.run(unlisted=True)


if __name__ == "__main__":
    main()

safoinme (Contributor) commented Dec 5, 2022

Hi @ghpu, @morganveyret, thank you for reporting this issue and providing the minimal code example. We now have an open PR that will be released by the end of this week. However, if you want the fix before it's out, you can create a custom materializer CustomHFDatasetMaterializer with the same code and attach it to a step using @step(output_materializers=CustomHFDatasetMaterializer), as sketched below.
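
For illustration, a minimal sketch of wiring up such a materializer; only the @step(output_materializers=...) usage comes from the comment above, while the module path and the materializer implementation are placeholders for your own copy of the patched code:

import datasets
from zenml.steps import step

# Hypothetical module holding a copy of the patched materializer code
# (e.g. one that loads the dataset via tempfile.mkdtemp instead of a
# self-deleting TemporaryDirectory):
from my_project.materializers import CustomHFDatasetMaterializer

@step(output_materializers=CustomHFDatasetMaterializer)
def load_dataset() -> datasets.DatasetDict:
    # The output of this step is now serialized and deserialized by the
    # custom materializer, keeping the cache files alive downstream.
    return datasets.load_dataset("imdb")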

schustmi (Contributor) commented

This was fixed in 0.30.0.
