
CUDA OOM doing reading comprehension on A10 24GB VRAM GPU #81

Open
@tleyden

Description

Running the reading comprehension pipeline against a subset of the nuclear patent dataset on an A10 (24 GB VRAM) fails with the CUDA out-of-memory error below:

12/05/2023 16:14:48 - INFO - dalm.pipelines.reading_comprehension_pipeline - LLM RC dataset generated text of length 2415 from context of length 670
12/05/2023 16:14:48 - INFO - dalm.pipelines.reading_comprehension_pipeline - Writing unprocessed LLM output to context_data_c8307498-165e-49b6-b073-214fbe9bb8e0.csv8_0.json
12/05/2023 16:14:48 - INFO - dalm.pipelines.reading_comprehension_pipeline - Writing Q & A chat completions of length 9 to context_data_c8307498-165e-49b6-b073-214fbe9bb8e0.csv8_0.json
12/05/2023 16:15:17 - INFO - dalm.pipelines.reading_comprehension_pipeline - LLM RC dataset generated text of length 2855 from context of length 11202
12/05/2023 16:15:17 - INFO - dalm.pipelines.reading_comprehension_pipeline - Writing unprocessed LLM output to context_data_c8307498-165e-49b6-b073-214fbe9bb8e0.csv9_0.json
12/05/2023 16:15:17 - INFO - dalm.pipelines.reading_comprehension_pipeline - Writing Q & A chat completions of length 9 to context_data_c8307498-165e-49b6-b073-214fbe9bb8e0.csv9_0.json
/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py:1101: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
12/05/2023 16:15:41 - INFO - dalm.pipelines.reading_comprehension_pipeline - LLM RC dataset generated text of length 2240 from context of length 2841
12/05/2023 16:15:41 - WARNING - dalm.datasets.reading_comprehension_generation.utils - Found a question with no answer: {'question': 's and answer task:', 'answer': 'TBD'}.  Skipping.
12/05/2023 16:15:41 - INFO - dalm.pipelines.reading_comprehension_pipeline - Writing unprocessed LLM output to context_data_c8307498-165e-49b6-b073-214fbe9bb8e0.csv10_0.json
12/05/2023 16:15:41 - INFO - dalm.pipelines.reading_comprehension_pipeline - Writing Q & A chat completions of length 7 to context_data_c8307498-165e-49b6-b073-214fbe9bb8e0.csv10_0.json

12/05/2023 16:15:42 - ERROR - root - Training failed with exception: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 22.20 GiB total capacity; 17.54 GiB already allocated; 327.12 MiB free; 20.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "//train_generator.py", line 153, in <module>
    create_reading_comprehension_dataset_and_train()
  File "//train_generator.py", line 134, in create_reading_comprehension_dataset_and_train
    pipeline(
  File "/opt/conda/lib/python3.10/site-packages/dalm/pipelines/reading_comprehension_pipeline.py", line 146, in pipeline
    for index, text_identifier, context, gen_text in llm_rc_dataset_generator:
  File "/opt/conda/lib/python3.10/site-packages/dalm/datasets/reading_comprehension_generation/synthetic_based.py", line 119, in generate_synthetic_dataset
    gen_text = generate_synthetic_data(model_pipeline, chunk_, generation_params)
  File "/opt/conda/lib/python3.10/site-packages/dalm/datasets/reading_comprehension_generation/synthetic_based.py", line 82, in generate_synthetic_data
    outputs = model_pipeline(prompt, **generation_params)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 208, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1140, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1147, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 271, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1009, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 897, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 626, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 286, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 1845, in softmax
    ret = input.softmax(dim, dtype=dtype)
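
For anyone hitting the same limit: one possible mitigation (not verified on this dataset) is to follow the allocator hint in the OOM message and/or load the generator model in 4-bit so the weights plus the KV cache for long contexts (e.g. the 11202-character chunk in the log above) fit in 24 GB. The sketch below uses the transformers API directly rather than the DALM pipeline's internal wiring; the model name, allocator value, and generation settings are placeholders/assumptions, not tested values.

```python
import os

# Allocator hint from the OOM message; set it before CUDA is initialized.
# The 128 MB split size is a guess, not a tuned value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder generator model

# 4-bit quantization (bitsandbytes) roughly quarters the weight memory,
# leaving more headroom for activations and the KV cache on a 24 GB A10.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Capping new tokens bounds the generation-side memory; very long dataset
# chunks may additionally need to be truncated/split before prompting.
outputs = generator(
    "example prompt built from a dataset chunk",
    max_new_tokens=512,
    do_sample=True,
)
print(outputs[0]["generated_text"])
```

If the OOM persists even with a quantized model, splitting the oversized contexts into smaller chunks before they reach the generation step is probably the more robust fix, since attention memory grows with input length.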
