
langchain_huggingface: Fix multiple GPU usage bug in from_model_id function #23628

Open · wants to merge 15 commits into master

Conversation

kenchanLOL

  • Description:
    • Pass device_map into model_kwargs.
    • Remove the unused device_map argument from the hf_pipeline function call.
  • Issue: How do I use multiple GPUs when using a model with Hugging Face? #13128
    When using the from_model_id function to load a Hugging Face model for text generation across multiple GPUs, the model defaults to loading on the CPU despite multiple GPUs being available, even when using the expected format:
llm = HuggingFacePipeline.from_model_id(
    model_id="model-id",
    task="text-generation",
    device_map="auto",
)

Currently, to enable multiple GPUs, we have to pass the setting in this format instead:

llm = HuggingFacePipeline.from_model_id(
    model_id="model-id",
    task="text-generation",
    device=None,
    model_kwargs={
        "device_map": "auto",
    }
)

This issue arises due to improper handling of the device and device_map parameters.

  • Explanation:
  1. In from_model_id, the model is created using model_kwargs and then passed as the model argument of the pipeline function. So at that point, to load the model across multiple GPUs, "device_map" needs to be set to "auto" within model_kwargs; otherwise, the model defaults to loading on the CPU.
  2. The device_map variable in from_model_id is not used correctly. In the pipeline function's source code in transformers:
  • The device_map variable is stored in the model_kwargs dictionary (lines 867-878 of transformers/src/transformers/pipelines/__init__.py).
    if device_map is not None:
        ......
        model_kwargs["device_map"] = device_map
  • The model is constructed with model_kwargs, which contains the device_map value, ONLY IF model is a string (lines 893-903 of transformers/src/transformers/pipelines/__init__.py).
    if isinstance(model, str) or framework is None:
        model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["pt"]}
        framework, model = infer_framework_load_model( ... , **model_kwargs, )
  • Consequently, since a model object is already passed to the pipeline function, the device_map variable from from_model_id is never used.
  3. The device_map variable in from_model_id not only goes unused but also causes errors. Without explicitly setting device=None, attempting to load the model on multiple GPUs may result in the following error:
 Device has 2 GPUs available. Provide device={deviceId} to `from_model_id` to use available GPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.
 Traceback (most recent call last):
   File "foo.py", line 15, in <module>
     llm = HuggingFacePipeline.from_model_id(
   File "foo\site-packages\langchain_huggingface\llms\huggingface_pipeline.py", line 217, in from_model_id
     pipeline = hf_pipeline(
   File "foo\lib\site-packages\transformers\pipelines\__init__.py", line 1108, in pipeline
     return pipeline_class(model=model, framework=framework, task=task, **kwargs)
   File "foo\lib\site-packages\transformers\pipelines\text_generation.py", line 96, in __init__
     super().__init__(*args, **kwargs)
   File "foo\lib\site-packages\transformers\pipelines\base.py", line 835, in __init__
     raise ValueError(
 ValueError: The model has been loaded with `accelerate` and therefore cannot be moved to a specific device. Please discard the `device` argument when creating your pipeline object.

This error occurs because the default values of device and device_map in from_model_id are -1 and None, respectively. When device_map is supplied only through model_kwargs, the function's own device_map parameter stays None, so the check (device_map is not None and device < 0) is skipped, device remains -1, and the pipeline function later raises an error when trying to move the accelerate-loaded model back to the CPU.

if device_map is not None and device < 0:
    device = None
if device is not None and device < 0 and cuda_device_count > 0:
    logger.warning(
        "Device has %d GPUs available. "
        "Provide device={deviceId} to `from_model_id` to use available"
        "GPUs for execution. deviceId is -1 (default) for CPU and "
        "can be a positive integer associated with CUDA device id.",
        cuda_device_count,
    )
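
For context, a minimal sketch of what this PR proposes, based on the description above (merge device_map into model_kwargs before the model is created); this is an illustration, not the exact diff:

    if device_map is not None:
        if model_kwargs is None:
            model_kwargs = {}
        if "device_map" in model_kwargs:
            # avoid ambiguity when device_map is passed twice
            raise ValueError(
                "`device_map` was passed both as an argument and inside `model_kwargs`; "
                "please specify it only once."
            )
        # from_pretrained() reads device_map from model_kwargs and dispatches via accelerate
        model_kwargs["device_map"] = device_map
        # prevent the pipeline from trying to move the accelerate-loaded model afterwards
        device = None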

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

kenchanLOL and others added 3 commits June 28, 2024 04:48
- Avoid ambiguity of the device_map variable by raising an error when people try to pass device_map as a key-value pair in model_kwargs
- Remove the unused device_map variable from the hf_pipeline function call
@efriis efriis added the partner label Jun 28, 2024
@efriis efriis self-assigned this Jun 28, 2024
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jun 28, 2024

@dosubot dosubot bot added 🔌: huggingface Primarily related to HuggingFace integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 28, 2024

@SaladDay SaladDay left a comment


ok

@kenchanLOL kenchanLOL requested a review from SaladDay June 30, 2024 00:31
@kenchanLOL kenchanLOL marked this pull request as draft July 5, 2024 09:58
@kenchanLOL kenchanLOL marked this pull request as ready for review July 5, 2024 09:59
@efriis
Member

efriis commented Jul 12, 2024

@Jofthomas could you take a look and mark ready to review by the langchain team if it looks good on the huggingface side?

@efriis efriis marked this pull request as draft July 12, 2024 23:25
@kenchanLOL
Author

@efriis I wanted to kindly follow up on the pull request I submitted over a month ago. I understand that everyone is busy, but I would greatly appreciate it if you could take a moment to check it and let me know if there's anything further I need to address.

@Jofthomas
Collaborator

Hey, sorry for the wait, I will check it.

@efriis
Member

efriis commented Aug 23, 2024

Looking at the transformers.hf_pipeline implementation, I actually think the existing implementation is correct, as device_map is a full parameter in this function and probably shouldn't be plumbed through model_kwargs.

If this is incorrect, we should probably fix that in transformers instead of langchain.

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L564

@efriis efriis closed this Aug 23, 2024
@kenchanLOL
Author

kenchanLOL commented Aug 24, 2024

Thanks for the reply. Actually, my PR addresses the issue you mentioned. The current implementation has a problem where the device_map parameter doesn't work as expected unless it is plumbed through model_kwargs to correctly assign model weights based on the user setting. This issue arises from a tricky detail in the transformers.pipeline implementation.

The model parameter in transformers.pipeline accepts either a string or a PreTrainedModel object. When model is a string, the device_map parameter is read and stored as a key-value pair in model_kwargs, which is then passed into the from_pretrained() method to create a model object. But when model is already an object, device_map is never used. You can see this in https://github.com/huggingface/transformers/blob/0a7af19f4dc868bafc82f35eb7e8d13bac87a594/src/transformers/pipelines/__init__.py#L867-L903
However, in the current from_model_id implementation, the model is first created using model_id and model_kwargs, resulting in a PreTrainedModel object that is then passed into transformers.pipeline. This means that even if the device_map parameter from this function is also passed into transformers.pipeline, it is never actually used by the pipeline as we expect. Therefore, in the current implementation, if users really want to set the device_map for their model, they need to plumb it through model_kwargs instead of using the device_map parameter. My PR fixes this and makes the device_map parameter of this function work as expected.

I tried my best to clarify the reasoning, but the overlap in variable names between these two functions might make it a bit confusing.
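
A minimal illustration of the two code paths described here (a sketch: "gpt2" is just a small placeholder model, and device_map="auto" requires the accelerate package):

    from transformers import pipeline, AutoModelForCausalLM

    # Path 1: model passed as a string -> pipeline() forwards device_map into
    # model_kwargs and calls from_pretrained() itself, so the weights get dispatched.
    pipe = pipeline("text-generation", model="gpt2", device_map="auto")

    # Path 2: model passed as an already-constructed object -> from_pretrained()
    # has already run, so the device_map argument below is never consulted.
    model = AutoModelForCausalLM.from_pretrained("gpt2")  # plain CPU load
    pipe = pipeline("text-generation", model=model, tokenizer="gpt2", device_map="auto")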

@kenchanLOL
Author

@efriis I kindly ask you to reopen this PR, as it seems there might have been a misunderstanding regarding its purpose. I believe the changes I proposed actually address the concerns you raised. I've added a detailed explanation in the comments to clarify how the PR aligns with the goals you mentioned.

@efriis
Member

efriis commented Aug 30, 2024

doesn't this seem like a bug in transformers though? If so, I don't want logic bypassing their interface.

As a holdover, can't you just pass model_kwargs={"device_map": ...} manually without this PR?

@kenchanLOL
Author

kenchanLOL commented Aug 30, 2024

@efriis I don't believe this is a bug in transformers because it makes sense not to recreate the model object when the user already passes a model object into the function. However, while the current implementation allows for passing model_kwargs={"device_map": ...} as a workaround, this approach is not well-documented. The device_map parameter, as mentioned in the official documentation (https://python.langchain.com/v0.2/docs/integrations/llms/huggingface_pipelines/), cannot be used as described. The proper solution, which involves passing model_kwargs, is only mentioned in the issue I linked in my initial PR message. This lack of clear and accessible documentation makes it confusing for users like myself to utilize the device_map parameter for multi-GPU setups, despite following the official guide.

Additionally, I believe my implementation won't negatively impact existing users who pass device_map as model_kwargs={"device_map": ...}, since all I'm doing is overwriting the device_map value within model_kwargs. There are no significant logic changes, and the existing functionality remains intact.

GPU Inference
When running on a machine with GPU, you can specify the device=n parameter to put the model on the specified device. Defaults to -1 for CPU inference.

If you have multiple-GPUs and/or the model is too large for a single GPU, you can specify device_map="auto", which requires and uses the [Accelerate](https://huggingface.co/docs/accelerate/index) library to automatically determine how to load the model weights.

Note: both device and device_map should not be specified together and can lead to unexpected behavior.

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=0,  # replace with device_map="auto" to use the accelerate library.
    pipeline_kwargs={"max_new_tokens": 10},
)

gpu_chain = prompt | gpu_llm

question = "What is electroencephalography?"

print(gpu_chain.invoke({"question": question}))

@efriis
Member

efriis commented Sep 3, 2024

yes but doesn't this documentation make device_map seem like it should be injected into model_kwargs["device_map"] by transformers.pipeline?

https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline

@efriis
Member

efriis commented Sep 3, 2024

(not recreating the model object - just the routing of the parameter)

@kenchanLOL
Author

@efriis, I’m not entirely clear on the point you made. Both model_kwargs and device_map are only utilized during the model initialization stage. In the documentation you shared, all examples pass a string as the model input parameter, so the model initialization occurs within the pipeline function. However, in the current implementation of from_model_id, this initialization happens before calling transformers.pipeline, where the model input parameter is already a model object.

As a result, regardless of the value or data type of model_kwargs or device_map passed to transformers.pipeline, they have no impact since they are not used at that stage. The reason why passing device_map as model_kwargs["device_map"] seems to work is that model_kwargs is used by the from_pretrained method to create the model before these values are ever passed to transformers.pipeline.

I understand that device_map is just a shortcut for setting model_kwargs["device_map"]. While everything works fine without this PR, I hope to replicate this shortcut from transformers to provide developers with a consistent experience between langchain_huggingface and transformers.

In summary, the existing device_map parameter in from_model_id is ineffective and misleading as currently implemented. It should either be removed or fixed to function as described in the documentation.

_model_kwargs = model_kwargs or {}
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
try:
    if task == "text-generation":
        if backend == "openvino":
            try:
                from optimum.intel.openvino import (  # type: ignore[import]
                    OVModelForCausalLM,
                )
            except ImportError:
                raise ValueError(
                    "Could not import optimum-intel python package. "
                    "Please install it with: "
                    "pip install 'optimum[openvino,nncf]' "
                )
            try:
                # use local model
                model = OVModelForCausalLM.from_pretrained(
                    model_id, **_model_kwargs
                )
            except Exception:
                # use remote model
                model = OVModelForCausalLM.from_pretrained(
                    model_id, export=True, **_model_kwargs
                )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_id, **_model_kwargs
            )

pipeline = hf_pipeline(
    task=task,
    model=model,
    tokenizer=tokenizer,
    device=device,
    device_map=device_map,
    batch_size=batch_size,
    model_kwargs=_model_kwargs,
    **_pipeline_kwargs,
)
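
(For contrast, after this PR the call would roughly take the following shape — a sketch reflecting the stated change of dropping the unused device_map argument, not the exact diff:)

    # Sketch (not the exact diff): device_map has already been merged into
    # _model_kwargs above, so it is no longer forwarded to hf_pipeline.
    pipeline = hf_pipeline(
        task=task,
        model=model,
        tokenizer=tokenizer,
        device=device,
        batch_size=batch_size,
        model_kwargs=_model_kwargs,
        **_pipeline_kwargs,
    )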

@efriis
Member

efriis commented Sep 8, 2024

I think you might be confusing the tokenizer and the pipeline object.

In order to help me understand your case, could you write 2 code snippets:

  1. an initialization of transformers pipeline(model_obj, device_map="auto") that works
  2. an initialization of HuggingFacePipeline.from_model_id(device_map="auto") that doesn't work as expected

pipeline (renamed hf_pipeline in this implementation) takes the same device_map argument in this instance, so I think the behavior should be identical to what transformers does when passed a device_map right?

@kenchanLOL
Author

@efriis First of all, I didn't mix up the tokenizer and the pipeline. The tokenizer is included for the sake of code completeness, and I double-checked the transformers source code to make sure it is not related to the bug we are discussing. My whole discussion is about the initialization and data type of the model variable only.
Second, here are the code snippets.

  1. an initialization of transformers pipeline(model_obj, device_map="auto") that works

    from transformers import pipeline
    pipe = pipeline(model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto")
    # checking GPU memory usage with nvidia-smi shows the model properly distributed across multiple GPUs

  2. an initialization of HuggingFacePipeline.from_model_id(device_map="auto") that doesn't work as expected

    from langchain_huggingface import HuggingFacePipeline
    pipe = HuggingFacePipeline.from_model_id(
        model_id="meta-llama/Meta-Llama-3-8B-Instruct",
        task="text-generation",
        device_map="auto"
    )
    # the following message is printed in the terminal, showing that no GPU device is used:
    # Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

Lastly, for your question, the answer is no for the current implementation, and that's exactly why I am making this PR to fix this counter-intuitive bug. Also, I realized that this bug exists in both HuggingFacePipeline implementations: the one in langchain_huggingface and the one in langchain_community.
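
(As an aside, a programmatic way to confirm the placement instead of nvidia-smi — a sketch that assumes the model was dispatched by accelerate, which records the layout in hf_device_map; "gpt2" is just a small placeholder model:)

    from transformers import pipeline

    # hf_device_map only exists when the model was loaded with a device_map,
    # i.e. dispatched by accelerate (which must be installed for device_map="auto").
    pipe = pipeline(model="gpt2", device_map="auto")
    print(getattr(pipe.model, "hf_device_map", None))  # e.g. {'': 0} when it fits on one GPU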

@efriis
Member

efriis commented Sep 11, 2024

thank you for clarifying! I misunderstood the original point - sorry about that!

If we pass a model object instead of a model string to pipeline(model=model_obj, device_map="auto"), it will fail to set device_map (and that's the behavior we see in from_model_id, because we're initializing from_pretrained model objects instead of passing the model string directly).

I'll reopen and plan on merging after the 0.3 changes go in. The main thing that would be great to get into the 0.3 releases would be raising an exception instead of a warning when both are set.

@efriis efriis reopened this Sep 11, 2024
@efriis efriis marked this pull request as ready for review September 11, 2024 23:25
@dosubot dosubot bot added the langchain Related to the langchain package label Sep 11, 2024
if model_kwargs is None:
    model_kwargs = {}
model_kwargs["device_map"] = device_map
device = None
Member

Suggested change: remove the line `device = None`.

Author

@kenchanLOL kenchanLOL Sep 12, 2024

Same reason as the comment above: this line was added as a temporary fix. If the default value of device is changed, this line is no longer needed.


if device_map is not None:
    if device is not None:
        logger.warning(
Member

Suggested change: replace `logger.warning(` with `raise ValueError(`.

Author

@kenchanLOL kenchanLOL Sep 12, 2024

If we simply escalate this from a warning to an error, it would always be triggered whenever we set device_map, because the default value of device is -1 (CPU) in the current implementation. I checked the transformers documentation and realized that the default value of device for pipeline changed from -1 to None. I tested both values and they behave the same, loading the model onto the CPU; the only difference is the log message printed in the terminal.

pipe = HuggingFacePipeline.from_model_id(model_id="microsoft/phi-2", task="text-generation", device=None)
# message: Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

pipe = HuggingFacePipeline.from_model_id(model_id="microsoft/phi-2", task="text-generation", device=-1)
# message: Device has 1 GPUs available. Provide device={deviceId} to `from_model_id` to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.

Although the log message for device=-1 looks better, given that it is a legacy value that could cause conflicts, updating the default to None will align the behavior with the latest practices and prevent unexpected errors in the future. So I think we should also change the default value of device in line 77 while changing this from a warning to an error.
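
(A minimal sketch of the validation being discussed, assuming the default of device is changed from -1 to None as proposed; illustration only, not the merged diff:)

    device = None        # proposed new default (was -1)
    device_map = "auto"  # example value a caller might pass

    # With device defaulting to None, this error only fires when a caller
    # explicitly sets both arguments.
    if device_map is not None and device is not None:
        raise ValueError(
            "Please specify either `device` or `device_map`, not both."
        )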

@@ -218,7 +230,6 @@ def from_model_id(
    model=model,
    tokenizer=tokenizer,
    device=device,
Member

Suggested change: keep `device=device,` and add `device_map=device_map,` back on the next line.

Would it make sense to keep this passthrough in case transformers uses it in the future?


Author

Similar reason as above. There is also a similar parameter validation in pipeline that checks whether both device and device_map are not None, so it would always be triggered, given that the current default value for device is -1 and device_map is also not None (either "auto" or a dictionary). So we can't simply keep the passthrough without also considering the device parameter's value.

kenchanLOL and others added 2 commits September 12, 2024 11:42
…ce_pipeline.py

Co-authored-by: Erick Friis <erickfriis@gmail.com>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
@kenchanLOL
Author

@efriis Sorry, I am not very familiar with git and accidentally committed all of your suggestions. In short, your suggestions might cause a bug without also changing the default value of the device parameter. Please check my comments on your suggestions for details.

@ccurme ccurme removed the langchain Related to the langchain package label Sep 20, 2024
@kenchanLOL
Author

@efriis I’ve reviewed your suggestions and made the necessary changes accordingly. Please have a look and let me know if anything else needs further adjustment.

@kenchanLOL
Author

@efriis
I wanted to kindly follow up on this PR. I believe it's nearly complete and ready for review. When you have a moment, I'd appreciate any feedback or approval. I understand you may be busy, so I just wanted to ensure this doesn't slip through the cracks.

Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: huggingface Primarily related to HuggingFace integrations partner size:S This PR changes 10-29 lines, ignoring generated files.
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

6 participants