[BUG]: HuggingFace dataset materializer problem #1135
Comments
Hi @ghpu, thank you for reporting this issue. Is there a code snippet that could help me replicate it? Also, would you please post the output of …, …, and …?
Here is a minimal example producing the issue:

```python
from zenml.pipelines import pipeline
from zenml.steps import step
from zenml.steps import BaseParameters, Output

import datasets, transformers
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer


class Params(BaseParameters):
    model = "distilbert-base-uncased"
    dataset_name = "imdb"
    num_labels = 2
    label_col = 'label'
    text_col = 'text'
    batch_size = 16
    epochs = 3
    log_steps = 100
    output_max_length = 128
    max_length = 512
    learning_rate = 2e-5
    weight_decay = 0.01


@step
def load_dataset(params: Params) -> datasets.DatasetDict:
    data = datasets.load_dataset(params.dataset_name)
    return data


@step
def load_tokenizer(params: Params) -> transformers.PreTrainedTokenizerBase:
    tokenizer = AutoTokenizer.from_pretrained(params.model,
                                              model_max_length=params.max_length)
    return tokenizer


@step
def tokenize(params: Params,
             tokenizer: transformers.PreTrainedTokenizerBase,
             data: datasets.DatasetDict) -> datasets.DatasetDict:
    data = data.map(
        lambda exs: tokenizer(exs[params.text_col], truncation=True),
        batched=True)
    return data


@step
def train_model(params: Params,
                tokenized_data: datasets.DatasetDict,
                tokenizer: transformers.PreTrainedTokenizerBase) -> transformers.PreTrainedModel:
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    model = AutoModelForSequenceClassification.from_pretrained(params.model,
                                                               num_labels=params.num_labels)
    training_args = TrainingArguments(output_dir="./results/",
                                      learning_rate=params.learning_rate,
                                      per_device_train_batch_size=params.batch_size,
                                      per_device_eval_batch_size=params.batch_size,
                                      num_train_epochs=params.epochs,
                                      weight_decay=params.weight_decay)
    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=tokenized_data["train"],
                      eval_dataset=tokenized_data["test"],
                      tokenizer=tokenizer,
                      data_collator=data_collator)
    trainer.train()
    return model


@pipeline(enable_cache=False)
def test_pipeline(load_data,
                  load_tokenizer,
                  tokenize_data,
                  train_model):
    data = load_data()
    tokenizer = load_tokenizer()
    tokenized_data = tokenize_data(data=data, tokenizer=tokenizer)
    model = train_model(tokenized_data=tokenized_data, tokenizer=tokenizer)


def main(**kwargs):
    pipeline = test_pipeline(
        load_data=load_dataset(),
        load_tokenizer=load_tokenizer(),
        tokenize_data=tokenize(),
        train_model=train_model()
    )
    pipeline.run(unlisted=True)


if __name__ == "__main__":
    main()
```
Hi @ghpu, @morganveyret, thank you for reporting this issue and providing the minimal code example. We now have an open PR, and the fix will be released by the end of this week. However, if you want to use it before then, you can create a custom materializer.
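For anyone who needs a stop-gap before the release, a rough sketch of such a custom materializer is shown below. It follows the pre-0.31 handle_input/handle_return materializer API and swaps TemporaryDirectory for tempfile.mkdtemp so the extracted Arrow files outlive the materializer call. The class name, the DATASET_DIR constant, and the io_utils.copy_dir/DataArtifact imports are assumptions based on the ZenML code base around that release, not the contents of the actual PR, so adapt them to your version:

```python
import os
import tempfile

from datasets import Dataset, DatasetDict, load_from_disk
from zenml.artifacts import DataArtifact  # assumed import path for this ZenML version
from zenml.materializers.base_materializer import BaseMaterializer
from zenml.utils import io_utils  # assumed helper for copying directories to/from the artifact store

DATASET_DIR = "hf_datasets"  # illustrative subdirectory name inside the artifact URI


class PersistentHFDatasetMaterializer(BaseMaterializer):
    """Sketch of a materializer whose local dataset copy survives after loading."""

    ASSOCIATED_TYPES = (Dataset, DatasetDict)
    ASSOCIATED_ARTIFACT_TYPES = (DataArtifact,)

    def handle_input(self, data_type):
        super().handle_input(data_type)
        # mkdtemp is NOT removed automatically, so the Arrow cache files
        # referenced by the returned dataset remain on disk for the next step.
        temp_dir = tempfile.mkdtemp()
        io_utils.copy_dir(os.path.join(self.artifact.uri, DATASET_DIR), temp_dir)
        return load_from_disk(temp_dir)

    def handle_return(self, ds):
        super().handle_return(ds)
        # Save locally, then copy the whole directory into the artifact store.
        temp_dir = tempfile.mkdtemp()
        path = os.path.join(temp_dir, DATASET_DIR)
        ds.save_to_disk(path)
        io_utils.copy_dir(path, os.path.join(self.artifact.uri, DATASET_DIR))
```

A step can then be pointed at it with something like load_dataset().with_return_materializers(PersistentHFDatasetMaterializer) (again, per the step API of that release). Note that the directories created by mkdtemp are never deleted automatically, which is exactly the clean-up question raised in the report.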
This was fixed in 0.30.0.
Contact Details [Optional]
No response
System Information
zenml v0.30.0rc0 (the issue also applies to v0.22 and v0.23)
ZenML version: 0.30.0rc0
Install path: /opt/anaconda3/lib/python3.9/site-packages/zenml
Python version: 3.9.7
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '22.10'}
Environment: native
Integrations: ['github', 'huggingface', 'pillow', 'plotly', 'pytorch', 'pytorch_lightning', 'scipy', 'sklearn']
The current user is: 'default'
The active project is: 'default' (global)
The active stack is: 'default' (global)
┏━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┓
┃ COMPONENT_TYPE │ COMPONENT_NAME ┃
┠────────────────┼────────────────┨
┃ ARTIFACT_STORE │ default ┃
┠────────────────┼────────────────┨
┃ ORCHESTRATOR │ default ┃
┗━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┛
What happened?
When an Arrow dataset is passed between two steps through the HuggingFace dataset materializer, it cannot be used in the second step because its cache-file folder no longer exists.
The problem arises from the use of TemporaryDirectory in handle_input, which is destroyed on return even though the directory is still referenced in the dataset info. The folder must survive as long as the dataset is in use, so a simple workaround would be to use tempfile.mkdtemp instead of TemporaryDirectory.
However, we should still find a way to clean up properly after use.
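For illustration, a minimal standalone sketch (plain Python, independent of ZenML or datasets) of the lifetime difference behind the failure: a TemporaryDirectory disappears as soon as it is cleaned up or goes out of scope, while a directory from tempfile.mkdtemp stays around until it is deleted explicitly.

```python
import os
import tempfile

# TemporaryDirectory: the directory vanishes when the object is cleaned up,
# so anything still pointing at files inside it (like a dataset's cache files)
# breaks afterwards.
tmp = tempfile.TemporaryDirectory()
path_a = os.path.join(tmp.name, "data.arrow")
open(path_a, "w").close()
tmp.cleanup()                    # same effect as the object going out of scope
print(os.path.exists(path_a))    # False

# mkdtemp: the directory survives until it is removed explicitly,
# so files referenced later are still readable.
persistent = tempfile.mkdtemp()
path_b = os.path.join(persistent, "data.arrow")
open(path_b, "w").close()
print(os.path.exists(path_b))    # True -- the caller is responsible for cleanup
```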
Reproduction steps
No response
Relevant log output
No response
Code of Conduct