System Info
accelerate is on the latest version; tested via both `pip install -U accelerate` and `pip install git+https://github.com/huggingface/accelerate`.
Information
My own modified scripts
The official example scripts
Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
While trying to fine-tune LLMs via torch/transformers on a Kaggle TPU v3-8, I get an error saying that accelerate does not count TPUs as a device:
Error: `RuntimeError: There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'.`
To make sure, I also tested the GoogleCloudPlatform example (a torch TPU fine-tune) and got the same exact error. The error is thrown on `trainer = SFTTrainer(...)`. You can see the full error traceback below:
Full Error Traceback:
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[48], line 4
      1 from trl import SFTTrainer
      2 from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
----> 4 trainer = SFTTrainer(
      5     model=base_model,
      6     train_dataset=data,
      7     args=TrainingArguments(
      8         per_device_train_batch_size=BATCH_SIZE,  # This is actually the global batch size for SPMD.
      9         num_train_epochs=1,
     10         max_steps=-1,
     11         output_dir="/output_dir",
     12         optim="adafactor",
     13         logging_steps=1,
     14         dataloader_drop_last=True,  # Required for SPMD.
     15         fsdp="full_shard",
     16         fsdp_config=fsdp_config,
     17     ),
     18     peft_config=lora_config,
     19     dataset_text_field="quote",
     20     max_seq_length=max_seq_length,
     21     packing=True,
     22 )

File /usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:101, in _deprecate_arguments.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     99     message += "\n\n" + custom_message
    100 warnings.warn(message, FutureWarning)
--> 101 return f(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:401, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs, eval_packing)
    395 if tokenizer.padding_side is not None and tokenizer.padding_side != "right":
    396     warnings.warn(
    397         "You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to "
    398         "overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code."
    399     )
--> 401 super().__init__(
    402     model=model,
    403     args=args,
    404     data_collator=data_collator,
    405     train_dataset=train_dataset,
    406     eval_dataset=eval_dataset,
    407     tokenizer=tokenizer,
    408     model_init=model_init,
    409     compute_metrics=compute_metrics,
    410     callbacks=callbacks,
    411     optimizers=optimizers,
    412     preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    413 )
    415 # Add tags for models that have been loaded with the correct transformers version
    416 if hasattr(self.model, "add_model_tags"):

File /usr/local/lib/python3.10/site-packages/transformers/trainer.py:411, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    408 self.deepspeed = None
    409 self.is_in_train = False
--> 411 self.create_accelerator_and_postprocess()
    413 # memory metrics - must set up as early as possible
    414 self._memory_tracker = TrainerMemoryTracker(self.args.skip_memory_metrics)

File /usr/local/lib/python3.10/site-packages/transformers/trainer.py:4858, in Trainer.create_accelerator_and_postprocess(self)
   4855 args.update(accelerator_config)
   4857 # create accelerator object
-> 4858 self.accelerator = Accelerator(**args)
   4859 # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
   4860 self.gather_function = self.accelerator.gather_for_metrics

File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:349, in Accelerator.__init__(self, device_placement, split_batches, mixed_precision, gradient_accumulation_steps, cpu, dataloader_config, deepspeed_plugin, fsdp_plugin, megatron_lm_plugin, rng_types, log_with, project_dir, project_config, gradient_accumulation_plugin, step_scheduler_with_optimizer, kwargs_handlers, dynamo_backend, deepspeed_plugins)
    345     raise ValueError(f"FSDP requires PyTorch >= {FSDP_PYTORCH_VERSION}")
    347 if fsdp_plugin is None:  # init from env variables
    348     fsdp_plugin = (
--> 349         FullyShardedDataParallelPlugin() if os.environ.get("ACCELERATE_USE_FSDP", "false") == "true" else None
    350     )
    351 else:
    352     if not isinstance(fsdp_plugin, FullyShardedDataParallelPlugin):

File <string>:21, in __init__(self, sharding_strategy, backward_prefetch, mixed_precision_policy, auto_wrap_policy, cpu_offload, ignored_modules, state_dict_type, state_dict_config, optim_state_dict_config, limit_all_gathers, use_orig_params, param_init_fn, sync_module_states, forward_prefetch, activation_checkpointing, cpu_ram_efficient_loading, transformer_cls_names_to_wrap, min_num_params)

File /usr/local/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:1684, in FullyShardedDataParallelPlugin.__post_init__(self)
   1682     device = torch.xpu.current_device()
   1683 else:
-> 1684     raise RuntimeError(
   1685         "There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'."
   1686     )
   1687 # Create a function that will be used to initialize the parameters of the model
   1688 # when using `sync_module_states`
   1689 self.param_init_fn = lambda x: x.to_empty(device=device, recurse=False)

RuntimeError: There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'.
```
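For context, the `fsdp_config` and `lora_config` referenced in the call above are not shown in the traceback. A minimal sketch of what they look like in this kind of setup; the exact values below are illustrative assumptions modeled on the PyTorch/XLA SPMD FSDP examples, not copied from my notebook:

```python
# Sketch only -- illustrative values, assuming a PyTorch/XLA FSDPv2-style TPU setup.
from peft import LoraConfig

# LoRA adapter config passed to SFTTrainer as peft_config (values assumed)
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# FSDP settings forwarded via TrainingArguments(fsdp_config=...);
# the xla / xla_fsdp_v2 flags are what the TPU SPMD examples use (assumed here).
fsdp_config = {
    "fsdp_transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True,
}
```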
Upgraded transformers, peft and trl to the latest versions, but got the same error.
Expected behavior
accelerate should detect the TPU and not throw an error from `accelerate/utils/dataclasses.py`.
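To illustrate what I mean: the check in `FullyShardedDataParallelPlugin.__post_init__` (bottom of the traceback) only probes CUDA/NPU/XPU before raising. A rough sketch of the kind of XLA-aware fallback I would expect; this is not Accelerate's actual code, and the `torch_xla` branch is my assumption:

```python
import torch

def pick_fsdp_init_device():
    # Same spirit as the existing check: prefer an available accelerator device.
    if torch.cuda.is_available():
        return torch.device("cuda", torch.cuda.current_device())
    # Assumption: fall back to a TPU/XLA device when torch_xla is installed,
    # instead of raising immediately.
    try:
        import torch_xla.core.xla_model as xm
        return xm.xla_device()  # e.g. an 'xla:0' device on a Kaggle TPU v3-8
    except ImportError:
        raise RuntimeError(
            "There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'."
        )
```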
The error `RuntimeError: There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'` is not thrown on `transformers==4.38.2`, and a llama-3 fine-tune completes successfully on the TPU VM.
However, llama-3.1 requires an upgraded version of transformers, so it's a dead end.