Enable BNB multi-backend support #31098

Merged on Sep 24, 2024 (66 commits)

jiqing-feng
Contributor

@jiqing-feng jiqing-feng commented May 29, 2024

Refer to 1227, 1178, and 1206. bitsandbytes now supports a CPU backend, so we can remove the CUDA restriction in transformers.

@jiqing-feng jiqing-feng marked this pull request as draft May 29, 2024 06:14
@jiqing-feng jiqing-feng changed the title enaable cpu bnb path [WIP] enaable cpu bnb path May 29, 2024
@jiqing-feng jiqing-feng changed the title [WIP] enaable cpu bnb path [WIP] enable cpu bnb path May 29, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@jiqing-feng jiqing-feng marked this pull request as ready for review July 4, 2024 05:54
@jiqing-feng
Contributor Author

Hi @younesbelkada @SunMarc @ArthurZucker. I think this PR is ready for review, since bitsandbytes now provides an installation guide for CPU. See here.

@SunMarc
Member

SunMarc commented Jul 4, 2024

cc @Titus-von-Koeller

@Titus-von-Koeller
Contributor

@jiqing-feng, we decided this likely still needs some work. I'll look into it more this week.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

src/transformers/utils/import_utils.py (outdated)
Comment on lines 61 to 62
    if not is_accelerate_available():
        raise ImportError("Using `bitsandbytes` 4-bit quantization requires Accelerate: `pip install accelerate`")
Contributor

Somewhat unrelated to this PR, but: why is this the case? 🤔

Contributor

In BNB, quantization happens when the tensor is moved via .to(device), which so far has meant CUDA... Does that answer your question?

Contributor Author

Not sure about it.

@jiqing-feng
Contributor Author

jiqing-feng commented Jul 17, 2024

Refer to 1243. I updated the check for the case where CUDA is not available; please review it. Thanks! @SunMarc

@Titus-von-Koeller please help check whether this is sufficient for the multi-backend-refactor branch.
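
For readers following along, the availability check discussed here can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `check_bnb_backend` is a hypothetical helper, and a `SimpleNamespace` stands in for the imported bitsandbytes module so the snippet runs without it. The `is_multi_backend_refactor_preview` attribute name is taken from the commented-out check that appears in the traceback further down this thread.

```python
from types import SimpleNamespace


def check_bnb_backend(cuda_available: bool, bnb_module) -> None:
    """Raise if there is no CUDA device and bnb is not a multi-backend build."""
    if cuda_available:
        return  # CUDA path: nothing extra to verify in this sketch
    # Assumption: multi-backend preview builds expose this boolean flag.
    if not getattr(bnb_module, "is_multi_backend_refactor_preview", False):
        raise RuntimeError(
            "Current bitsandbytes only supports CUDA; switch to the "
            "multi-backend-refactor build to use other backends."
        )


# Stand-ins for the real module (hypothetical, for illustration only).
cuda_only_bnb = SimpleNamespace()
multi_backend_bnb = SimpleNamespace(is_multi_backend_refactor_preview=True)

check_bnb_backend(cuda_available=False, bnb_module=multi_backend_bnb)  # accepted
try:
    check_bnb_backend(cuda_available=False, bnb_module=cuda_only_bnb)
except RuntimeError as err:
    print(f"rejected: {err}")
```

Using `getattr` with a default keeps the check safe on older bitsandbytes releases that do not define the flag at all.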

@Titus-von-Koeller
Contributor

OK, I started looking into this on the Intel machine we have access to. Based on the code of this PR, I'm still getting other errors when running the tests after disabling the test skips that previously required CUDA.

Almost all of the tests fail with the error below, and none of them pass.

______________________________________________________ Bnb4BitGPT2Test.test_rwkv_4bit _______________________________________________________

self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_rwkv_4bit>

    def setUp(self):
        super().setUp()
    
        # Models and tokenizer
        self.model_fp16 = AutoModelForCausalLM.from_pretrained(
            self.model_name, torch_dtype=torch.float16, device_map="auto"
        )
>       self.model_4bit = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_4bit=True, device_map="auto")

tests/quantization/bnb/test_4bit.py:122: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/models/auto/auto_factory.py:564: in from_pretrained
    return model_class.from_pretrained(
src/transformers/modeling_utils.py:3834: in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <transformers.quantizers.quantizer_bnb_4bit.Bnb4BitHfQuantizer object at 0x7f0219252070>, args = ()
kwargs = {'device_map': OrderedDict([('', 'cpu')])}, device_map_without_lm_head = {'': 'cpu'}

    def validate_environment(self, *args, **kwargs):
        if not is_accelerate_available():
            raise ImportError("Using `bitsandbytes` 4-bit quantization requires Accelerate: `pip install accelerate`")
        if not is_bitsandbytes_available():
            raise ImportError(
                "Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`"
            )
        # if not torch.cuda.is_available():
        #    import bitsandbytes as bnb
    
        #    if not getattr(bnb, "is_multi_backend_refactor_preview", False):
        #        raise RuntimeError(
        #            "Current bitsandbytes only support cuda, please switch to multi_backend_refactor to support multi backends."
        #        )
    
        if kwargs.get("from_tf", False) or kwargs.get("from_flax", False):
            raise ValueError(
                "Converting into 4-bit or 8-bit weights from tf/flax weights is currently not supported, please make"
                " sure the weights are in PyTorch format."
            )
    
        device_map = kwargs.get("device_map", None)
        if (
            device_map is not None
            and isinstance(device_map, dict)
            and not self.quantization_config.llm_int8_enable_fp32_cpu_offload
        ):
            device_map_without_lm_head = {
                key: device_map[key] for key in device_map.keys() if key not in self.modules_to_not_convert
            }
            if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
>               raise ValueError(
                    "Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the "
                    "quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules "
                    "in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to "
                    "`from_pretrained`. Check "
                    "https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu "
                    "for more details. "
                )
E               ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.

src/transformers/quantizers/quantizer_bnb_4bit.py:91: ValueError
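
The check that raises here can be reproduced in isolation. The following is a minimal sketch of the logic visible in the traceback, with the quantizer state passed in as plain arguments; `find_offloaded_modules` is a hypothetical name and not a transformers function.

```python
from typing import Dict, Iterable, List, Union


def find_offloaded_modules(
    device_map: Dict[str, Union[int, str]],
    modules_to_not_convert: Iterable[str] = (),
    offload_devices: Iterable[str] = ("cpu", "disk"),
) -> List[str]:
    """Return device-map keys that would be quantized but sit on CPU or disk."""
    skip = set(modules_to_not_convert)
    blocked = set(offload_devices)
    return [
        key
        for key, device in device_map.items()
        if key not in skip and device in blocked
    ]


# On the CPU-only machine, device_map="auto" resolved to {"": "cpu"} (see kwargs
# in the traceback), so the whole model counts as offloaded and the check raises.
print(find_offloaded_modules({"": "cpu"}))

# A map where only the non-quantized lm_head sits on the CPU passes the check.
print(find_offloaded_modules({"transformer": 0, "lm_head": "cpu"}, ["lm_head"]))
```

A non-empty result corresponds to the ValueError above; an empty list corresponds to the check passing.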

@jiqing-feng
Contributor Author

jiqing-feng commented Jul 22, 2024

> OK, I started looking into this on the Intel machine we have access to. Based on the code of this PR, I'm still getting other errors when running the tests after disabling the test skips that previously required CUDA.
>
> Almost all of the tests fail with the error below, and none of them pass.

Could you try passing None or "cpu" as device_map, since we are testing on CPU? I assume device_map="auto" is meant for CUDA; am I right? @SunMarc

@Titus-von-Koeller
Contributor

Titus-von-Koeller commented Jul 22, 2024

Yeah, I'm working on a solution.

To see whether there are further issues when running the test script from the original Intel 4bit PR or the Transformers bnb-related integration test suite, I temporarily modified the check like this:

            if "disk" in device_map_without_lm_head.values():
            # if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():

My idea is to check whether BNB is multi-backend enabled and, in that case, skip all the GPU-related checks and allow CPU as well.

@SunMarc In this case, do you think it's safe to do something like:

if ("cpu" in device_map_without_lm_head.values() or bnb_is_multibackend_enabled) or "disk" in device_map_without_lm_head.values():

?
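
Read literally, the condition above evaluates to True whenever the multi-backend flag is set, because the flag is OR-ed into the branch that raises. A minimal sketch of the stated intent (allow CPU placement only on multi-backend builds, still reject disk) might look like the following. `should_reject_device_map` is a hypothetical helper and `bnb_is_multibackend_enabled` is treated as a plain boolean; this is an illustration, not the code that was merged.

```python
from typing import Dict, Union


def should_reject_device_map(
    device_map_without_lm_head: Dict[str, Union[int, str]],
    bnb_is_multibackend_enabled: bool,
) -> bool:
    """Return True when the quantizer should refuse this device map."""
    devices = set(device_map_without_lm_head.values())
    if "disk" in devices:
        return True  # disk offload stays unsupported in this sketch
    # CPU placement is only acceptable when bnb ships a CPU backend.
    return "cpu" in devices and not bnb_is_multibackend_enabled


print(should_reject_device_map({"": "cpu"}, bnb_is_multibackend_enabled=False))
print(should_reject_device_map({"": "cpu"}, bnb_is_multibackend_enabled=True))
```

Keeping the disk case separate mirrors the temporary modification quoted earlier in this comment, where only "disk" remained in the condition.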

Currently, for tests/quantization/bnb/test_4bit.py I'm still getting a bunch of failures, but most are related to the model being moved to cuda(0) or to NotImplementedError. I'm working on a fix for those, and they are not related to the Intel implementation in BNB.

However, there's one failing test that caught my attention and whose root cause I couldn't figure out yet. Could you please help me figure out what's going on with that one, @jiqing-feng? See below:

______________________________________________ Pipeline4BitTest.test_pipeline _______________________________________________
                                                                                                                             
self = <bnb.test_4bit.Pipeline4BitTest testMethod=test_pipeline>                                                             
                                                                                                                             
    def test_pipeline(self):                                  
        r"""                                                                                                                 
        The aim of this test is to verify that the mixed 4bit is compatible with `pipeline` from transformers. Since         
        we used pipline for inference speed benchmarking we want to make sure that this feature does not break anything      
        on pipline.                                                                                                          
        """                                                                                                                  
        # self._clear_cuda_cache()                                                                                           
        self.pipe = pipeline(                                                                                                
            "text-generation",                                                                                               
            model=self.model_name,                                                                                           
            model_kwargs={"device_map": "auto", "load_in_4bit": True, "torch_dtype": torch.float16},                         
            max_new_tokens=self.MAX_NEW_TOKENS,                                                                              
        )                                                                                                                    
                                                                                                                             
        # Real second forward pass                                                                                           
>       pipeline_output = self.pipe(self.input_text)                                                                         
                                                                                                                             
tests/quantization/bnb/test_4bit.py:462:                                                                                     
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/pipelines/text_generation.py:262: in __call__                                                               
    return super().__call__(text_inputs, **kwargs)                                                                           
src/transformers/pipelines/base.py:1254: in __call__                                                                         
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)                                    
src/transformers/pipelines/base.py:1261: in run_single                                                                       
    model_outputs = self.forward(model_inputs, **forward_params)                                                             
src/transformers/pipelines/base.py:1161: in forward                                                                          
    model_outputs = self._forward(model_inputs, **forward_params)  
src/transformers/pipelines/text_generation.py:351: in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)          
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/utils/_contextlib.py:115: in decorate_context                 
    return func(*args, **kwargs)                                                                                             
src/transformers/generation/utils.py:1969: in generate                                                                       
    result = self._sample(                                                                                                   
src/transformers/generation/utils.py:2912: in _sample                                                                        
    outputs = self(**model_inputs, return_dict=True)                                                                         
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl              
    return self._call_impl(*args, **kwargs)                                                                                  
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl                      
    return forward_call(*args, **kwargs)                                                                                     
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward                             
    output = module._old_forward(*args, **kwargs)                                                                            
src/transformers/models/bloom/modeling_bloom.py:848: in forward                                                              
    transformer_outputs = self.transformer(                                                                                  
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl              
    return self._call_impl(*args, **kwargs)                                                                                  
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl                      
    return forward_call(*args, **kwargs)                                                                                     
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward                             
    output = module._old_forward(*args, **kwargs)                                                                            
src/transformers/models/bloom/modeling_bloom.py:712: in forward                                                              
    outputs = block(                                                                                                         
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl              
    return self._call_impl(*args, **kwargs)                   
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl                      
    return forward_call(*args, **kwargs)                                                                                     
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward                             
    output = module._old_forward(*args, **kwargs)                                                                            
src/transformers/models/bloom/modeling_bloom.py:400: in forward                                                              
    attn_outputs = self.self_attention(                                                                                      
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl              
    return self._call_impl(*args, **kwargs)                                                                                  
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl                      
    return forward_call(*args, **kwargs)                                                                                     
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward                             
    output = module._old_forward(*args, **kwargs)                                                                            
src/transformers/models/bloom/modeling_bloom.py:251: in forward                                                              
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]                             
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl              
    return self._call_impl(*args, **kwargs)                                                                                  
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl                      
    return forward_call(*args, **kwargs)                                                                                     
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward                             
    output = module._old_forward(*args, **kwargs)                                                                            
../bnb/bitsandbytes/nn/modules.py:475: in forward                                                                            
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)                                
../bnb/bitsandbytes/autograd/_functions.py:586: in matmul_4bit                                                               
    out = F.gemv_4bit(A, B.t(), out, state=quant_state)                                                                      
../bnb/bitsandbytes/functional.py:1506: in gemv_4bit                                                                         
    return backends[A.device.type].gemv_4bit(                                                                                
../bnb/bitsandbytes/backends/cpu.py:171: in gemv_4bit                                                                        
    return gemm_4bit_impl(A, B, out, transposed_A, transposed_B, state)                                                      
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
                                                                                                                             
A = tensor([[[ 0.0030,  0.2297,  0.1515,  ..., -0.7241, -0.2993, -0.6665],                                                   
         [-0.8872, -0.4702,  0.6714,  ...,  0.2...62,  0.1533,  ..., -0.0767, -0.4731,  0.0826],                             
         [-0.1199,  0.2257,  0.4138,  ...,  0.1516,  0.3250, -0.5029]]])                                                     
B = tensor([231, 102, 109,  ...,  44, 100, 203], dtype=torch.uint8), out = None, transposed_A = False, transposed_B = False  
state = <bitsandbytes.utils.QuantState object at 0x7f6d1f90d760>                                                             
                                                                                                                             
    def gemm_4bit_impl(                                                                                                      
        A: torch.Tensor,                                                                                                     
        B: torch.Tensor,                                                                                                     
        out: Optional[torch.Tensor] = None,                                                                                  
        transposed_A=False,                                                                                                  
        transposed_B=False,                                                                                                  
        state: QuantState = None,                                                                                            
    ) -> torch.Tensor:                                                                                                       
        """                                                                                                                  
        Matrix-matrix multiplication with 4-bit quantization.                                                                
                                                                                                                             
        Parameters                                                                                                           
        ----------                                                                                                           
        A : torch.Tensor
            The first input tensor. Usually the activation tensor.
        B : torch.Tensor
            The second input tensor. Usually the weight tensor.
        out : torch.Tensor
            The output tensor.
        transposed_A : bool
            Whether A is transposed
        transposed_B : bool
            Whether B is transposed
        state : QuantState
            Contains quantization info, such as blocksize and dtype
     
        Returns
        -------
        torch.Tensor:
            GEMM output tensor.
        """
        if ipex_cpu and _ipex_cpu_version_prereq(2, 3) and hasattr(state, "op_context"):
            assert state.op_context is not None
            output = torch.ops.torch_ipex.ipex_woq_linear(A, state.op_context.get_data_handle())
        else:
            dqB = dequantize_4bit_impl(B, state, blocksize=state.blocksize)
>           output = torch.matmul(A, dqB)
E           RuntimeError: expected m1 and m2 to have the same dtype, but got: float != c10::Half

../bnb/bitsandbytes/backends/cpu_xpu_common.py:527: RuntimeError
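
The RuntimeError above comes from multiplying a float32 activation by a float16 dequantized weight. One possible way to avoid it, shown here as a hedged sketch rather than the change actually made in bitsandbytes, is to cast the dequantized weight to the activation dtype before the matmul. `matmul_matching_dtype` is a hypothetical helper, and the tensors are random stand-ins with made-up shapes.

```python
import torch


def matmul_matching_dtype(A: torch.Tensor, dqB: torch.Tensor) -> torch.Tensor:
    """Multiply A by dqB after casting dqB to A's dtype."""
    if dqB.dtype != A.dtype:
        dqB = dqB.to(A.dtype)
    return torch.matmul(A, dqB)


# Activation in float32, as in the log above.
A = torch.randn(1, 4, 8)
# Stand-in for the dequantized weight, which arrives in float16.
dqB = torch.randn(8, 16).to(torch.float16)

out = matmul_matching_dtype(A, dqB)
print(out.shape, out.dtype)
```

Casting toward the activation dtype keeps the output dtype consistent with what the calling layer passed in; casting the activation to the weight dtype instead would be the other option, with different precision trade-offs.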

@jiqing-feng
Contributor Author

jiqing-feng commented Jul 30, 2024

Hi @Titus-von-Koeller, since we have fixed issue 1285, do you have any updates?

@Titus-von-Koeller
Contributor

Hey @jiqing-feng,

Yes, I'm actively working on it and currently struggling with these remaining failing tests. Do any of these catch your eye? Could you help fix them?

=============================================================================================== FAILURES ================================================================================================
_____________________________________________________________________________ Bnb4BitTest.test_generate_quality_dequantize ______________________________________________________________________________

self = <bnb.test_4bit.Bnb4BitTest testMethod=test_generate_quality_dequantize>

    def test_generate_quality_dequantize(self):
        r"""
        Test that loading the model and unquantize it produce correct results
        """
        bnb_config = BitsAndBytesConfig(load_in_4bit=True)
    
        model_4bit = AutoModelForCausalLM.from_pretrained(
            self.model_name, quantization_config=bnb_config, device_map="auto"
        )
    
        model_4bit.dequantize()
    
        encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
>       output_sequences = model_4bit.generate(input_ids=encoded_input["input_ids"].to(self.device), max_new_tokens=10)

tests/quantization/bnb/test_4bit.py:285: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/utils/_contextlib.py:115: in decorate_context
    return func(*args, **kwargs)
src/transformers/generation/utils.py:1969: in generate
    result = self._sample(
src/transformers/generation/utils.py:2912: in _sample
    outputs = self(**model_inputs, return_dict=True)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/bloom/modeling_bloom.py:848: in forward
    transformer_outputs = self.transformer(
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/bloom/modeling_bloom.py:712: in forward
    outputs = block(
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/bloom/modeling_bloom.py:400: in forward
    attn_outputs = self.self_attention(
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/bloom/modeling_bloom.py:251: in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Linear(in_features=2048, out_features=6144, bias=True)
input = tensor([[[ 0.0030,  0.2297,  0.1515,  ..., -0.7241, -0.2993, -0.6665],
         [-0.8872, -0.4702,  0.6714,  ...,  0.2...-0.4731,  0.0826],
         [-0.1199,  0.2257,  0.4138,  ...,  0.1516,  0.3250, -0.5029]]],
       dtype=torch.float16)

    def forward(self, input: Tensor) -> Tensor:
>       return F.linear(input, self.weight, self.bias)
E       RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x2048 and 6144x2048)

../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/linear.py:116: RuntimeError
___________________________________________________________________________________ Bnb4BitTestTraining.test_training ___________________________________________________________________________________

self = <bnb.test_4bit.Bnb4BitTestTraining testMethod=test_training>

    def test_training(self):
        if version.parse(importlib.metadata.version("bitsandbytes")) < version.parse("0.37.0"):
            self.skipTest(reason="This test requires bitsandbytes >= 0.37.0")
    
        # Step 1: freeze all parameters
        model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_4bit=True)
    
>       self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})

tests/quantization/bnb/test_4bit.py:544: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_id...entwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=512, out_features=50272, bias=False)
)
name = 'hf_device_map'

    def __getattr__(self, name: str) -> Any:
        if '_parameters' in self.__dict__:
            _parameters = self.__dict__['_parameters']
            if name in _parameters:
                return _parameters[name]
        if '_buffers' in self.__dict__:
            _buffers = self.__dict__['_buffers']
            if name in _buffers:
                return _buffers[name]
        if '_modules' in self.__dict__:
            modules = self.__dict__['_modules']
            if name in modules:
                return modules[name]
>       raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
E       AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'

../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1709: AttributeError
_______________________________________________________________________________ Bnb4BitGPT2Test.test_fp32_4bit_conversion _______________________________________________________________________________

self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_fp32_4bit_conversion>

    def test_fp32_4bit_conversion(self):
        r"""
        Test whether it is possible to mix both `4bit` and `fp32` weights when using `keep_in_fp32_modules` correctly.
        """
        model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small", load_in_4bit=True, device_map="auto")
>       self.assertTrue(model.decoder.block[0].layer[2].DenseReluDense.wo.weight.dtype == torch.float32)
E       AssertionError: False is not true

tests/quantization/bnb/test_4bit.py:334: AssertionError
___________________________________________________________________________ Bnb4BitGPT2Test.test_generate_quality_dequantize ____________________________________________________________________________

self = <bnb.test_4bit.Bnb4BitGPT2Test testMethod=test_generate_quality_dequantize>

    def test_generate_quality_dequantize(self):
        r"""
        Test that loading the model and unquantize it produce correct results
        """
        bnb_config = BitsAndBytesConfig(load_in_4bit=True)
    
        model_4bit = AutoModelForCausalLM.from_pretrained(
            self.model_name, quantization_config=bnb_config, device_map="auto"
        )
    
        model_4bit.dequantize()
    
        encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
>       output_sequences = model_4bit.generate(input_ids=encoded_input["input_ids"].to(self.device), max_new_tokens=10)

tests/quantization/bnb/test_4bit.py:285: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/utils/_contextlib.py:115: in decorate_context
    return func(*args, **kwargs)
src/transformers/generation/utils.py:1969: in generate
    result = self._sample(
src/transformers/generation/utils.py:2912: in _sample
    outputs = self(**model_inputs, return_dict=True)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/gpt2/modeling_gpt2.py:1315: in forward
    transformer_outputs = self.transformer(
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/gpt2/modeling_gpt2.py:1129: in forward
    outputs = block(
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/gpt2/modeling_gpt2.py:614: in forward
    attn_outputs = self.attn(
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
src/transformers/models/gpt2/modeling_gpt2.py:517: in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1532: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1541: in _call_impl
    return forward_call(*args, **kwargs)
../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/accelerate/hooks.py:169: in new_forward
    output = module._old_forward(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Linear(in_features=1600, out_features=4800, bias=True)
input = tensor([[[ 0.1334, -0.0541, -0.0432,  ...,  0.0113,  0.1115,  0.0245],
         [-0.1196,  0.0271, -0.1114,  ...,  0.0...-0.3340,  0.0841],
         [-0.1744, -0.0175, -0.0184,  ..., -0.5654, -0.4402, -0.0274]]],
       dtype=torch.float16)

    def forward(self, input: Tensor) -> Tensor:
>       return F.linear(input, self.weight, self.bias)
E       RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1600 and 4800x1600)

../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/linear.py:116: RuntimeError

@Titus-von-Koeller
Contributor

Particularly the tests with

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1600 and 4800x1600)

seem to me that they might be related to the Intel backend logic.
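For readers less familiar with this message: a product of an (m x k) matrix with a (k' x n) matrix is only defined when k equals k'. The sketch below is plain Python with two hypothetical helpers, `matmul_shape` and `transpose` (neither is a PyTorch API). It replays the shape rule for the sizes quoted in the error and shows that the same second operand would be accepted if it were transposed. Reading that as a hint that a weight ends up in the wrong orientation is an assumption, not a confirmed diagnosis.

```python
from typing import Tuple

Shape = Tuple[int, int]


def matmul_shape(mat1: Shape, mat2: Shape) -> Shape:
    """Return the shape of mat1 @ mat2, or raise if the inner dimensions differ.

    Hypothetical helper for illustration only: it mirrors the rule behind
    the error message, not PyTorch's actual implementation.
    """
    (m, k1), (k2, n) = mat1, mat2
    if k1 != k2:
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied ({m}x{k1} and {k2}x{n})"
        )
    return (m, n)


def transpose(shape: Shape) -> Shape:
    """Swap rows and columns of a 2-D shape."""
    rows, cols = shape
    return (cols, rows)


if __name__ == "__main__":
    activations = (4, 1600)  # sizes copied from the failing test
    operand = (4800, 1600)   # second operand as reported in the message

    try:
        matmul_shape(activations, operand)
    except ValueError as err:
        print(err)

    # The transposed operand has a matching inner dimension (1600).
    print(matmul_shape(activations, transpose(operand)))
```

The check is deliberately limited to shapes, so it runs without PyTorch or bitsandbytes installed.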

@Titus-von-Koeller
Contributor

Ok, so I just pushed my changes with which I ran all the tests. Find attached also the 8-bit related test results. There are a bunch of failures, essentially three types (just search the logs for E ):

  1. RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16
  2. AttributeError: 'NoneType' object has no attribute 'device'
  3. AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'

No idea about the root causes yet.

@jiqing-feng I saw you also changed the HF 8bit quantizer. Not sure to what extent these tests should be working based on the Intel 8bit BNB PR (cc @Xia-Weiwen) and which parts show an issue on the BNB side and which might be related to further fixes needed in the tests on the Transformer side. Would extremely appreciate your input on that.

See transf_multi-backend_8bit-tests.log.

Do the other changes I introduced seem sensible to you? Would be happy to hear your opinion.

cc @SunMarc do you have an idea of how error type 3 could come to be? looks more like sth on the Transformer side to me..

@jiqing-feng
Contributor Author

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1600 and 4800x1600)
seem to me that they might be related to the Intel backend logic.

This PR could fix this error

@Titus-von-Koeller
Contributor

@jiqing-feng Unfortunately, it doesn't, see the transf_multi-backend_4bit-tests.log:

❯ rg -o 'E\s\s+(.*)' ~/Downloads/transf_multi-backend_4bit-tests.log -r '$1' | sort | uniq -c | sort -rn

   4 RuntimeError: mat1 and mat2 shapes cannot be multiplied (7x768 and 3072x768)
   3 RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x2048 and 6144x2048)
   2 RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1600 and 4800x1600)
   2 RuntimeError: mat1 and mat2 shapes cannot be multiplied (12x512 and 2048x512)
   1 AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'
   1 AssertionError: False is not true
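If `rg` is not available, the same tally can be approximated in Python. This is a minimal sketch, assuming the log is a pytest output in which error lines contain `E` followed by at least two whitespace characters, as in the pattern above; `summarize_errors` is a hypothetical helper name, and ties may be ordered differently than with `sort -rn`.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Same pattern as the rg call: "E", two or more whitespace chars, then the message.
ERROR_LINE = re.compile(r"E\s\s+(.*)")


def summarize_errors(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count identical error messages and return them most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Stand-in lines; in practice iterate over the opened log file instead.
    sample = [
        "E       AttributeError: 'NoneType' object has no attribute 'device'",
        ">       return F.linear(input, self.weight, self.bias)",
        "E       AttributeError: 'NoneType' object has no attribute 'device'",
        "E       AssertionError: False is not true",
    ]
    for message, count in summarize_errors(sample):
        print(f"{count:4d} {message}")
```

Passing an open file object works as well, since iterating over it yields lines and `.` in the pattern does not match the trailing newline.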

You can validate yourself with the following commands:

RUN_SLOW=1 pytest tests/quantization/bnb/test_4bit.py -rsx -v
RUN_SLOW=1 pytest tests/quantization/bnb/test_mixed_int8.py -rsx -v

Since it's practical to have this overview of failed tests, I'm also pasting the current status of failing tests for the 8-bit tests:

❯ rg -o 'E\s\s+(.*)' ~/Downloads/transf_multi-backend_8bit-tests.log -r '$1' | sort | uniq -c | sort -rn

   9 AttributeError: 'NoneType' object has no attribute 'device'
   2 RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16
   1 AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'

Who could help me figure out how to get the 8-bit tests working from the Intel side? Would that be @Xia-Weiwen? I invited him to the new #bitsandbytes-intel-collab Slack channel. Could you ask him internally to accept the invite and/or let me know who else I should invite?

@jiqing-feng
Contributor Author

jiqing-feng commented Aug 1, 2024

  1. For self and mat2 must have the same dtype, but got Half and BFloat16: The 4bit case has been fixed by 1285 (fix dtype mismatch bitsandbytes-foundation/bitsandbytes#1285). The 8bit case comes from here. I think we should pass the original dtype so that we can dequantize the model back to its original dtype. WDYT? I fixed it in this commit; please let me know your opinion.
  2. For mat1 and mat2 shapes cannot be multiplied: The 4bit has been fixed by 1300.
  3. For 'OPTForCausalLM' object has no attribute 'hf_device_map': I suppose the CPU model doesn't have device_map or hf_device_map, we need to discuss with the transformers maintainer whether we should enable device_map in the CPU model, or we should disable device_map check in CPU model. And we cannot pass device_map='auto' in CPU models.
  4. For 'NoneType' object has no attribute 'device': I cannot reproduce the error if I set device_map="cpu".

@jiqing-feng
Contributor Author

Who could help me figure out how to get the 8-bit tests working from the Intel side? Would that be @Xia-Weiwen? I invited him to the new #bitsandbytes-intel-collab Slack channel. Could you ask him internally to accept the invite and/or let me know who else I should invite?

I have communicated with him and he should be in the group.

@Titus-von-Koeller
Contributor

Hey @SunMarc 🤗

For 'OPTForCausalLM' object has no attribute 'hf_device_map': I suppose the CPU model doesn't have device_map or hf_device_map, we need to discuss with the transformers maintainer whether we should enable device_map in the CPU model, or we should disable device_map check in CPU model. And we cannot pass device_map='auto' in CPU models.

Do you have an opinion on this or do you have a suggestion whom of our colleagues we could pull in to help answer this question?

@Titus-von-Koeller
Contributor

Titus-von-Koeller commented Aug 1, 2024

@jiqing-feng

Current status of open failing tests:

4-bit Test Error Summary:
      4 RuntimeError: mat1 and mat2 shapes cannot be multiplied (7x768 and 3072x768)
      1 AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'

8-bit Test Error Summary:
      9 AttributeError: 'NoneType' object has no attribute 'device'
      1 AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'

  1. For self and mat2 must have the same dtype, but got Half and BFloat16 [...]

YES, these are all fixed, thanks 🤗 !

  1. For mat1 and mat2 shapes cannot be multiplied: The 4bit has been fixed by 1300.

That PR fixed most of them, but not all:

all fixed

3x RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x2048 and 6144x2048)
2x RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1600 and 4800x1600)
2x RuntimeError: mat1 and mat2 shapes cannot be multiplied (12x512 and 2048x512)

remaining in 4-bit, even with current fix:

4x RuntimeError: mat1 and mat2 shapes cannot be multiplied (7x768 and 3072x768)

  1. For 'OPTForCausalLM' object has no attribute 'hf_device_map':

Agreed, also not an expert on this, let's see what Marc says.

  1. For 'NoneType' object has no attribute 'device': I cannot produce the error if I set device_map="cpu".

I tried replacing device_map="auto" in transformers/src/transformers/quantizers/quantizer_bnb_8bit.py with device_map="auto" if torch.cuda.is_available() else "cpu", but that didn't change the test results regarding NoneType at all:

9x AttributeError: 'NoneType' object has no attribute 'device' for the 8 bit test before or after the replacement. Is that what you meant?

If it's sth else, please try changing the tests accordingly and see for yourself if the tests pass or not. On the Intel dev VM that you also have access to I created a command integration that you can run from any directory and that will output the test logs to stdout and also to file in the ~/src/workbench/multi-backend/logs dir. Please feel free to work there and share the results here. The below style of summary will be written out as summary.log in the just-mentioned directory.

@jiqing-feng
Contributor Author

Hi @Titus-von-Koeller

For 4bit matmul dim mismatch: Fixed by 1301

For 8bit None type error: Fixed by 1303

@Titus-von-Koeller
Contributor

Titus-von-Koeller commented Aug 2, 2024

Ok, all tests passing now, but the below:

=================================== FAILURES ===================================
______________________ Bnb4BitTestTraining.test_training _______________________

self = <bnb.test_4bit.Bnb4BitTestTraining testMethod=test_training>

    def test_training(self):
        if version.parse(importlib.metadata.version("bitsandbytes")) < version.parse("0.37.0"):
            self.skipTest(reason="This test requires bitsandbytes >= 0.37.0")
    
        # Step 1: freeze all parameters
        model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_4bit=True)
    
>       self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})

tests/quantization/bnb/test_4bit.py:523: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_id...entwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=512, out_features=50272, bias=False)
)
name = 'hf_device_map'

    def __getattr__(self, name: str) -> Any:
        if '_parameters' in self.__dict__:
            _parameters = self.__dict__['_parameters']
            if name in _parameters:
                return _parameters[name]
        if '_buffers' in self.__dict__:
            _buffers = self.__dict__['_buffers']
            if name in _buffers:
                return _buffers[name]
        if '_modules' in self.__dict__:
            modules = self.__dict__['_modules']
            if name in modules:
                return modules[name]
>       raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
E       AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'

../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1709: AttributeError


_____________________ MixedInt8TestTraining.test_training ______________________

self = <bnb.test_mixed_int8.MixedInt8TestTraining testMethod=test_training>

    def test_training(self):
        if version.parse(importlib.metadata.version("bitsandbytes")) < version.parse("0.37.0"):
            self.skipTest(reason="This test requires bitsandbytes>=0.37.0")
    
        # Step 1: freeze all parameters
        model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True)
    
>       self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})

tests/quantization/bnb/test_mixed_int8.py:858: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_id...entwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=512, out_features=50272, bias=False)
)
name = 'hf_device_map'

    def __getattr__(self, name: str) -> Any:
        if '_parameters' in self.__dict__:
            _parameters = self.__dict__['_parameters']
            if name in _parameters:
                return _parameters[name]
        if '_buffers' in self.__dict__:
            _buffers = self.__dict__['_buffers']
            if name in _buffers:
                return _buffers[name]
        if '_modules' in self.__dict__:
            modules = self.__dict__['_modules']
            if name in modules:
                return modules[name]
>       raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
E       AttributeError: 'OPTForCausalLM' object has no attribute 'hf_device_map'

../../.condax/mamba/envs/bnb/lib/python3.8/site-packages/torch/nn/modules/module.py:1709: AttributeError

@Titus-von-Koeller
Contributor

Titus-von-Koeller commented Aug 2, 2024

Ok, so with this change I could avoid the device map error:

-        self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
+        if torch.cuda.is_available():
+            self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
+        else:
+            self.assertTrue(all(param.device.type == "cpu" for param in model.parameters()))

It seems that on GPU loading goes through Accelerate, so the model has hf_device_map; on CPU it doesn't, so a different check is needed there.
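The same branching can also be written without depending on CUDA availability, by testing for the attribute itself. This is only a sketch under assumptions: `devices_in_use` is a hypothetical helper, not part of Transformers or Accelerate, and the objects in the usage part are stand-ins built with `SimpleNamespace` rather than real models.

```python
from types import SimpleNamespace
from typing import Any, Set


def devices_in_use(model: Any) -> Set[str]:
    """Collect device names, preferring hf_device_map when the model has one.

    Hypothetical helper: it relies only on the two attributes discussed
    above (`hf_device_map` and `parameters()`), so stand-in objects work.
    """
    device_map = getattr(model, "hf_device_map", None)
    if device_map is not None:
        return {str(device) for device in device_map.values()}
    # No device map: fall back to inspecting each parameter's device type.
    return {param.device.type for param in model.parameters()}


if __name__ == "__main__":
    # Stand-in for a model dispatched onto device 0.
    gpu_like = SimpleNamespace(hf_device_map={"": 0})

    # Stand-in for a CPU model that has parameters but no hf_device_map.
    cpu_param = SimpleNamespace(device=SimpleNamespace(type="cpu"))
    cpu_like = SimpleNamespace(parameters=lambda: [cpu_param, cpu_param])

    print(devices_in_use(gpu_like))
    print(devices_in_use(cpu_like))
```

Using `getattr` with a default avoids the `AttributeError` shown in the traceback, because the lookup failure is turned into `None` instead of propagating.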

However, this brought to light two more failures:

4-bit Test Error Summary:
      1 RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::Half != float

8-bit Test Error Summary:
      1 RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::Half != float

Find the detailed test failure logs here

Reproduce with:

export RUN_SLOW=1 && pytest tests/quantization/bnb/test_4bit.py::Bnb4BitTestTraining::test_training -rsx -v && pytest tests/quantization/bnb/test_mixed_int8.py::MixedInt8TestTraining::test_training

All other tests are passing now.

@jiqing-feng
Contributor Author

jiqing-feng commented Aug 5, 2024

Hi @Titus-von-Koeller

For RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::Half != float: the error comes from the default dtype. The default dtype of the OPT model is float16, which matches the bnb dtype on CUDA, whereas on CPU the bnb dtype is bfloat16. By the same reasoning, the error would also appear on CUDA if you replaced the model with a bf16 model such as Mistral.

So we can set the model dtype to fp32 by replacing model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True) with model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True, torch_dtype=torch.float32). The error then disappears.

@jiqing-feng jiqing-feng changed the title [WIP] enable cpu bnb path Enable cpu bnb path Aug 5, 2024
Member

@SunMarc SunMarc left a comment

Gave my thoughts on a few issues. Can we rebase this PR on main also ? There are unrelated commits in the diff

@jiqing-feng
Contributor Author

Hi @SunMarc @akx, you were right. We should keep is_bitsandbytes_available() as simple as possible and move all bnb-relevant functions to integration/bitsandbytes.py. There are 2 cases in is_bitsandbytes_available():

  1. Return torch.cuda.is_available() if bnb < 0.43.1 because this version of bnb only supports cuda
  2. Return True if bnb >= 0.43.1, other checks will be performed in validate_environment.
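The two cases can be sketched as a small pure function. This is an illustration under assumptions, not the code that was merged: `bnb_usable` and `parse_version` are hypothetical names, the installed version and CUDA availability are passed in as arguments so the snippet runs without bitsandbytes or PyTorch, and the version parsing is a simplification that ignores pre-release ordering.

```python
import re
from typing import Tuple

# Threshold taken from the comment above; treat it as illustrative.
MULTI_BACKEND_MIN_VERSION = (0, 43, 1)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string.

    Simplified on purpose: '0.43.1.dev0' becomes (0, 43, 1), so
    pre-release suffixes are ignored rather than ordered properly.
    """
    numbers = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def bnb_usable(installed_version: str, cuda_available: bool) -> bool:
    """Hypothetical stand-in for the availability check described above."""
    if parse_version(installed_version) < MULTI_BACKEND_MIN_VERSION:
        # Case 1: older releases only support CUDA.
        return cuda_available
    # Case 2: newer releases; further checks are left to validate_environment.
    return True


if __name__ == "__main__":
    for version, cuda in [("0.42.0", False), ("0.42.0", True), ("0.43.1", False)]:
        print(version, cuda, bnb_usable(version, cuda))
```

Keeping the function free of imports from the libraries it describes makes both branches easy to exercise on a machine without a GPU.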

@akx
Contributor

akx commented Sep 14, 2024

Can we rebase this PR on main also ? There are unrelated commits in the diff

Seconded. It's rather impossible to re-review now.

Co-authored-by: Aarni Koskela <akx@iki.fi>
@jiqing-feng
Contributor Author

Can we rebase this PR on main also ? There are unrelated commits in the diff

Seconded. It's rather impossible to re-review now.

Unrelated changes have been reverted.

@SunMarc
Member

SunMarc commented Sep 17, 2024

Nice @jiqing-feng! With that, we just need to rerun the tests in order to merge the PR. cc @Titus-von-Koeller

@Titus-von-Koeller
Contributor

I'll do a final review today and tmr and then merge. Was busy this week preparing for presenting at Pytorch conf and traveling.

Thanks everyone for the great work 🤗

cc @jiqing-feng

Contributor

@Titus-von-Koeller Titus-von-Koeller left a comment

Re-checked everything, all todos, threads are resolved, reran the tests on CUDA, ROCm and Intel CPU. All green and ready to merge.

@Titus-von-Koeller Titus-von-Koeller merged commit 11c27dd into huggingface:main Sep 24, 2024
23 checks passed
avishaiElmakies pushed a commit to avishaiElmakies/transformers that referenced this pull request Sep 25, 2024
* enable cpu bnb path

* fix style

* fix code style

* fix 4 bit path

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* add multi backend refactor tests

* fix style

* tweak 4bit quantizer + fix corresponding tests

* tweak 8bit quantizer + *try* fixing corresponding tests

* fix dequant bnb 8bit

* account for Intel CPU in variability of expected outputs

* enable cpu and xpu device map

* further tweaks to account for Intel CPU

* fix autocast to work with both cpu + cuda

* fix comments

* fix comments

* switch to testing_utils.torch_device

* allow for xpu in multi-gpu tests

* fix tests 4bit for CPU NF4

* fix bug with is_torch_xpu_available needing to be called as func

* avoid issue where test reports attr err due to other failure

* fix formatting

* fix typo from resolving of merge conflict

* polish based on last PR review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fix CI

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix error log

* fix error msg

* add \n in error log

* make quality

* rm bnb cuda restriction in doc

* cpu model don't need dispatch

* fix doc

* fix style

* check cuda avaliable in testing

* fix tests

* Update docs/source/en/model_doc/chameleon.md

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix doc

* fix check multibackends

* fix import sort

* remove check torch in bnb

* docs: update bitsandbytes references with multi-backend info

* docs: fix small mistakes in bnb paragraph

* run formatting

* reveret bnb check

* move bnb multi-backend check to import_utils

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix bnb check

* minor fix for bnb

* check lib first

* fix code style

* Revert "run formatting"

This reverts commit ac108c6.

* fix format

* give warning when bnb version is low and no cuda found]

* fix device assignment check to be multi-device capable

* address akx feedback on get_avlbl_dev fn

* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
ArthurZucker added a commit that referenced this pull request Oct 10, 2024
* add sdpa to OPT

* chore: remove redundant whitespace in OPTDecoder class

* fixup

* bug fix

* add sdpa and attention generate test

* fixup

* Refactor OPTAttention forward method for improved readability and maintainability

* undo refactor for _shape and key,val states

* add OPT to doc, fixup didn't find it for some reason

* change order

* change default attn_implemntation in testing to eager

* [run-slow] opt

* change test_eager_matches_sdpa_generate to the one llama

* Update default attention implementation in testing common

* [run-slow] opt

* remove uneeded print

* [run-slow] opt

* refactor model testers to have attn_implementation="eager"

* [run-slow] opt

* convert test_eager_matches_sdpa_generate to opt-350M

* bug fix when creating mask for opt

* [run-slow] opt

* if layer head mask default to eager

* if head mask is not none fall to eager

* [run-slow] opt

* Update src/transformers/models/opt/modeling_opt.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Clean up Unpack imports (#33631)

clean up Unpack imports

* Fix DPT /Dinov2 sdpa regression on main (#33660)

* fallback to eager if output attentions.

* fix copies

* handle dependency errors in check_imports (#33622)

* handle dependency errors in check_imports

* change log level to warning

* add back self.max_position_embeddings = config.max_position_embeddings (#33550)

* add back self.max_position_embeddings = config.max_position_embeddings

* fix-copies

* Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (#33613)

fix llavaqwen2 model conversion

* Uniformize kwargs for Udop processor and update docs (#33628)

* Add optional kwargs and uniformize udop

* cleanup Unpack

* nit Udop

* Generation: deprecate `PreTrainedModel` inheriting from `GenerationMixin`  (#33203)

* Enable BNB multi-backend support (#31098)

* enable cpu bnb path

* fix style

* fix code style

* fix 4 bit path

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* add multi backend refactor tests

* fix style

* tweak 4bit quantizer + fix corresponding tests

* tweak 8bit quantizer + *try* fixing corresponding tests

* fix dequant bnb 8bit

* account for Intel CPU in variability of expected outputs

* enable cpu and xpu device map

* further tweaks to account for Intel CPU

* fix autocast to work with both cpu + cuda

* fix comments

* fix comments

* switch to testing_utils.torch_device

* allow for xpu in multi-gpu tests

* fix tests 4bit for CPU NF4

* fix bug with is_torch_xpu_available needing to be called as func

* avoid issue where test reports attr err due to other failure

* fix formatting

* fix typo from resolving of merge conflict

* polish based on last PR review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fix CI

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix error log

* fix error msg

* add \n in error log

* make quality

* rm bnb cuda restriction in doc

* cpu model don't need dispatch

* fix doc

* fix style

* check cuda avaliable in testing

* fix tests

* Update docs/source/en/model_doc/chameleon.md

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix doc

* fix check multibackends

* fix import sort

* remove check torch in bnb

* docs: update bitsandbytes references with multi-backend info

* docs: fix small mistakes in bnb paragraph

* run formatting

* reveret bnb check

* move bnb multi-backend check to import_utils

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix bnb check

* minor fix for bnb

* check lib first

* fix code style

* Revert "run formatting"

This reverts commit ac108c6.

* fix format

* give warning when bnb version is low and no cuda found]

* fix device assignment check to be multi-device capable

* address akx feedback on get_avlbl_dev fn

* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Fix error string after refactoring into get_chat_template (#33652)

* Fix error string after refactoring into get_chat_template

* Take suggestion from CR

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

---------

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* uniformize git processor (#33668)

* uniformize git processor

* update doctring

* Modular `transformers`: modularity and inheritance for new model additions (#33248)

* update exampel

* update

* push the converted diff files for testing and ci

* correct one example

* fix class attributes and docstring

* nits

* oups

* fixed config!

* update

* nitd

* class attributes are not matched against the other, this is missing

* fixed overwriting self.xxx now onto the attributes I think

* partial fix, now order with docstring

* fix docstring order?

* more fixes

* update

* fix missing docstrings!

* examples don't all work yet

* fixup

* nit

* updated

* hick

* update

* delete

* update

* update

* update

* fix

* all default

* no local import

* fix more diff

* some fix related to "safe imports"

* push fixed

* add helper!

* style

* add a check

* all by default

* add the

* update

* FINALLY!

* nit

* fix config dependencies

* man that is it

* fix fix

* update diffs

* fix the last issue

* re-default to all

* alll the fixes

* nice

* fix properties vs setter

* fixup

* updates

* update dependencies

* make sure to install what needs to be installed

* fixup

* quick fix for now

* fix!

* fixup

* update

* update

* updates

* whitespaces

* nit

* fix

* simplify everything, and make it file agnostic (should work for image processors)

* style

* finish fixing all import issues

* fixup

* empty modeling should not be written!

* Add logic to find who depends on what

* update

* cleanup

* update

* update gemma to support positions

* some small nits

* this is the correct docstring for gemma2

* fix merging of docstrings

* update

* fixup

* update

* take doc into account

* styling

* update

* fix hidden activation

* more fixes

* final fixes!

* fixup

* fixup instruct  blip video

* update

* fix bugs

* align gemma2 with the rest as well

* updats

* revert

* update

* more reversiom

* grind

* more

* arf

* update

* order will matter

* finish del stuff

* update

* rename to modular

* fixup

* nits

* update makefile

* fixup

* update order of the checks!

* fix

* fix docstring that has a call inside

* fiix conversion check

* style

* add some initial documentation

* update

* update doc

* some fixup

* updates

* yups

* Mostly todo gimme a minut

* update

* fixup

* revert some stuff

* Review docs for the modular transformers (#33472)

Docs

* good update

* fixup

* mmm current updates lead to this code

* okay, this fixes it

* cool

* fixes

* update

* nit

* updates

* nits

* fix doc

* update

* revert bad changes

* update

* updates

* proper update

* update

* update?

* up

* update

* cool

* nits

* nits

* bon bon

* fix

* ?

* minimise changes

* update

* update

* update

* updates?

* fixed gemma2

* kind of a hack

* nits

* update

* remove `diffs` in favor of `modular`

* fix make fix copies

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>

* Fix CIs post merging modular transformers (#33681)

update

* Fixed docstring for cohere model regarding unavailability of prune_he… (#33253)

* Fixed docstring for cohere model regarding unavailability of prune_head() methods

The docstring mentions that cohere model supports prune_heads() methods. I have fixed the docstring by explicitly mentioning that it doesn't support that functionality.

* Update src/transformers/models/cohere/modeling_cohere.py

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>

* Generation tests: update imagegpt input name, remove unused functions (#33663)

* Improve Error Messaging for Flash Attention 2 on CPU (#33655)

Update flash-attn error message on CPU

Rebased to latest branch

* Gemma2: fix config initialization (`cache_implementation`) (#33684)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used (#33556)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used

* Fixed formatting with `ruff`.

* Uniformize kwargs for image-text-to-text processors (#32544)

* uniformize FUYU processor kwargs

* Uniformize instructblip processor kwargs

* Fix processor kwargs and tests Fuyu, InstructBlip, Kosmos2

* Uniformize llava_next processor

* Fix save_load test for processor with chat_template only as extra init args

* Fix import Unpack

* Fix Fuyu Processor import

* Fix FuyuProcessor import

* Fix FuyuProcessor

* Add defaults for specific kwargs kosmos2

* Fix Udop to return BatchFeature instead of BatchEncoding and uniformize kwargs

* Add tests processor Udop

* remove Copied from in processing Udop as change of input orders caused by BatchEncoding -> BatchFeature

* Fix overwrite tests kwargs processors

* Add warnings and BC for changes in processor inputs order, change docs, add BC for text_pair as arg for Udop

* Fix processing test fuyu

* remove unnecessary pad_token check in instructblip ProcessorTest

* Fix BC tests and cleanup

* FIx imports fuyu

* Uniformize Pix2Struct

* Fix wrong name for FuyuProcessorKwargs

* Fix slow tests reversed inputs align fuyu llava-next, change udop warning

* Fix wrong logging import udop

* Add check images text input order

* Fix copies

* change text pair handling when positional arg

* rebase on main, fix imports in test_processing_common

* remove optional args and udop uniformization from this PR

* fix failing tests

* remove unnecessary test, fix processing utils and test processing common

* cleanup Unpack

* cleanup

* fix conflict grounding dino

* 🚨🚨 Setting default behavior of assisted decoding (#33657)

* tests: fix pytorch tensor placement errors (#33485)

This commit fixes the following errors:
* Fix "expected all tensors to be on the same device" error
* Fix "can't convert device type tensor to numpy"

According to the PyTorch documentation, torch.Tensor.numpy(force=False)
performs the conversion only if the tensor is on the CPU (plus a few other
restrictions), which is not the case here. We need force=True since we just
need the data and don't care about tensor coherency.

Fixes: #33517
See: https://pytorch.org/docs/2.4/generated/torch.Tensor.numpy.html

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

* bump tokenizers, fix added tokens fast (#32535)

* update based on tokenizers release

* update

* nits

* update

* revert re addition

* don't break that yet

* fmt

* revert unwanted

* update tokenizers version

* update dep table

* update

* update in conversion script as well

* some fix

* revert

* fully revert

* fix training

* remove set trace

* fixup

* update

* update

* [Pixtral] Improve docs, rename model (#33491)

* Improve docs, rename model

* Fix style

* Update repo id

* fix code quality after merge

* HFQuantizer implementation for compressed-tensors library (#31704)

* Add compressed-tensors HFQuantizer implementation

* flag serializable as False

* run

* revive lines deleted by ruff

* fixes to load+save from sparseml, edit config to quantization_config, and load back

* address satrat comment

* compressed_tensors to compressed-tensors and revert back is_serializable

* rename quant_method from sparseml to compressed-tensors

* tests

* edit tests

* clean up tests

* make style

* cleanup

* cleanup

* add test skip for when compressed tensors is not installed

* remove pydantic import + style

* delay torch import in test

* initial docs

* update main init for compressed tensors config

* make fix-copies

* docstring

* remove fill_docstring

* Apply suggestions from code review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* review comments

* review comments

* comments - suppress warnings on state dict load, tests, fixes

* bug-fix - remove unnecessary call to apply quant lifecycle

* run_compressed compatability

* revert changes not needed for compression

* no longer need unexpected keys fn

* unexpected keys not needed either

* Apply suggestions from code review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* add to_diff_dict

* update docs and expand testing

* Update _toctree.yml with compressed-tensors

* Update src/transformers/utils/quantization_config.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* update doc

* add note about saving a loaded model

---------

Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Sara Adkins <sara@neuralmagic.com>
Co-authored-by: Sara Adkins <sara.adkins65@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Dipika <dipikasikka1@gmail.com>

* update model card for opt

* add batch size to inference table

* [slow-run] opt

* [run-slow] opt

---------

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Co-authored-by: Avishai Elmakies <avishai.elma@cs.huji.ac.il>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Tibor Reiss <75096465+tibor-reiss@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Muhammad Naufil <m.naufil1@gmail.com>
Co-authored-by: sizhky <yyeshr@gmail.com>
Co-authored-by: Umar Butler <umar@umar.au>
Co-authored-by: Jonathan Mamou <jonathan.mamou@intel.com>
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Sara Adkins <sara@neuralmagic.com>
Co-authored-by: Sara Adkins <sara.adkins65@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Dipika <dipikasikka1@gmail.com>
@cbensimon
Contributor

@cbensimon you may be interested here for ZeroGPU especially in combination with #33122.

Thanks @matthewdouglas for the pointer, I'll look at this

NielsRogge added a commit to NielsRogge/transformers that referenced this pull request Oct 21, 2024
* add sdpa to OPT

* chore: remove redundant whitespace in OPTDecoder class

* fixup

* bug fix

* add sdpa and attention generate test

* fixup

* Refactor OPTAttention forward method for improved readability and maintainability

* undo refactor for _shape and key,val states

* add OPT to doc, fixup didn't find it for some reason

* change order

* change default attn_implemntation in testing to eager

* [run-slow] opt

* change test_eager_matches_sdpa_generate to the one llama

* Update default attention implementation in testing common

* [run-slow] opt

* remove uneeded print

* [run-slow] opt

* refactor model testers to have attn_implementation="eager"

* [run-slow] opt

* convert test_eager_matches_sdpa_generate to opt-350M

* bug fix when creating mask for opt

* [run-slow] opt

* if layer head mask default to eager

* if head mask is not none fall to eager

* [run-slow] opt

* Update src/transformers/models/opt/modeling_opt.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Clean up Unpack imports (huggingface#33631)

clean up Unpack imports

* Fix DPT /Dinov2 sdpa regression on main (huggingface#33660)

* fallback to eager if output attentions.

* fix copies

* handle dependency errors in check_imports (huggingface#33622)

* handle dependency errors in check_imports

* change log level to warning

* add back self.max_position_embeddings = config.max_position_embeddings (huggingface#33550)

* add back self.max_position_embeddings = config.max_position_embeddings

* fix-copies

* Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (huggingface#33613)

fix llavaqwen2 model conversion

* Uniformize kwargs for Udop processor and update docs (huggingface#33628)

* Add optional kwargs and uniformize udop

* cleanup Unpack

* nit Udop

* Generation: deprecate `PreTrainedModel` inheriting from `GenerationMixin`  (huggingface#33203)

* Enable BNB multi-backend support (huggingface#31098)

* enable cpu bnb path

* fix style

* fix code style

* fix 4 bit path

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* add multi backend refactor tests

* fix style

* tweak 4bit quantizer + fix corresponding tests

* tweak 8bit quantizer + *try* fixing corresponding tests

* fix dequant bnb 8bit

* account for Intel CPU in variability of expected outputs

* enable cpu and xpu device map

* further tweaks to account for Intel CPU

* fix autocast to work with both cpu + cuda

* fix comments

* fix comments

* switch to testing_utils.torch_device

* allow for xpu in multi-gpu tests

* fix tests 4bit for CPU NF4

* fix bug with is_torch_xpu_available needing to be called as func

* avoid issue where test reports attr err due to other failure

* fix formatting

* fix typo from resolving of merge conflict

* polish based on last PR review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fix CI

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix error log

* fix error msg

* add \n in error log

* make quality

* rm bnb cuda restriction in doc

* cpu model don't need dispatch

* fix doc

* fix style

* check cuda avaliable in testing

* fix tests

* Update docs/source/en/model_doc/chameleon.md

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix doc

* fix check multibackends

* fix import sort

* remove check torch in bnb

* docs: update bitsandbytes references with multi-backend info

* docs: fix small mistakes in bnb paragraph

* run formatting

* reveret bnb check

* move bnb multi-backend check to import_utils

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix bnb check

* minor fix for bnb

* check lib first

* fix code style

* Revert "run formatting"

This reverts commit ac108c6.

* fix format

* give warning when bnb version is low and no cuda found]

* fix device assignment check to be multi-device capable

* address akx feedback on get_avlbl_dev fn

* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Fix error string after refactoring into get_chat_template (huggingface#33652)

* Fix error string after refactoring into get_chat_template

* Take suggestion from CR

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

---------

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* uniformize git processor (huggingface#33668)

* uniformize git processor

* update doctring

* Modular `transformers`: modularity and inheritance for new model additions (huggingface#33248)

* update exampel

* update

* push the converted diff files for testing and ci

* correct one example

* fix class attributes and docstring

* nits

* oups

* fixed config!

* update

* nitd

* class attributes are not matched against the other, this is missing

* fixed overwriting self.xxx now onto the attributes I think

* partial fix, now order with docstring

* fix docstring order?

* more fixes

* update

* fix missing docstrings!

* examples don't all work yet

* fixup

* nit

* updated

* hick

* update

* delete

* update

* update

* update

* fix

* all default

* no local import

* fix more diff

* some fix related to "safe imports"

* push fixed

* add helper!

* style

* add a check

* all by default

* add the

* update

* FINALLY!

* nit

* fix config dependencies

* man that is it

* fix fix

* update diffs

* fix the last issue

* re-default to all

* alll the fixes

* nice

* fix properties vs setter

* fixup

* updates

* update dependencies

* make sure to install what needs to be installed

* fixup

* quick fix for now

* fix!

* fixup

* update

* update

* updates

* whitespaces

* nit

* fix

* simplify everything, and make it file agnostic (should work for image processors)

* style

* finish fixing all import issues

* fixup

* empty modeling should not be written!

* Add logic to find who depends on what

* update

* cleanup

* update

* update gemma to support positions

* some small nits

* this is the correct docstring for gemma2

* fix merging of docstrings

* update

* fixup

* update

* take doc into account

* styling

* update

* fix hidden activation

* more fixes

* final fixes!

* fixup

* fixup instruct  blip video

* update

* fix bugs

* align gemma2 with the rest as well

* updats

* revert

* update

* more reversiom

* grind

* more

* arf

* update

* order will matter

* finish del stuff

* update

* rename to modular

* fixup

* nits

* update makefile

* fixup

* update order of the checks!

* fix

* fix docstring that has a call inside

* fiix conversion check

* style

* add some initial documentation

* update

* update doc

* some fixup

* updates

* yups

* Mostly todo gimme a minut

* update

* fixup

* revert some stuff

* Review docs for the modular transformers (huggingface#33472)

Docs

* good update

* fixup

* mmm current updates lead to this code

* okay, this fixes it

* cool

* fixes

* update

* nit

* updates

* nits

* fix doc

* update

* revert bad changes

* update

* updates

* proper update

* update

* update?

* up

* update

* cool

* nits

* nits

* bon bon

* fix

* ?

* minimise changes

* update

* update

* update

* updates?

* fixed gemma2

* kind of a hack

* nits

* update

* remove `diffs` in favor of `modular`

* fix make fix copies

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>

* Fix CIs post merging modular transformers (huggingface#33681)

update

* Fixed docstring for cohere model regarding unavailability of prune_he… (huggingface#33253)

* Fixed docstring for cohere model regarding unavailability of prune_head() methods

The docstring mentions that cohere model supports prune_heads() methods. I have fixed the docstring by explicitly mentioning that it doesn't support that functionality.

* Update src/transformers/models/cohere/modeling_cohere.py

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>

* Generation tests: update imagegpt input name, remove unused functions (huggingface#33663)

* Improve Error Messaging for Flash Attention 2 on CPU (huggingface#33655)

Update flash-attn error message on CPU

Rebased to latest branch

* Gemma2: fix config initialization (`cache_implementation`) (huggingface#33684)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used (huggingface#33556)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used

* Fixed formatting with `ruff`.

* Uniformize kwargs for image-text-to-text processors (huggingface#32544)

* uniformize FUYU processor kwargs

* Uniformize instructblip processor kwargs

* Fix processor kwargs and tests Fuyu, InstructBlip, Kosmos2

* Uniformize llava_next processor

* Fix save_load test for processor with chat_template only as extra init args

* Fix import Unpack

* Fix Fuyu Processor import

* Fix FuyuProcessor import

* Fix FuyuProcessor

* Add defaults for specific kwargs kosmos2

* Fix Udop to return BatchFeature instead of BatchEncoding and uniformize kwargs

* Add tests processor Udop

* remove Copied from in processing Udop as change of input orders caused by BatchEncoding -> BatchFeature

* Fix overwrite tests kwargs processors

* Add warnings and BC for changes in processor inputs order, change docs, add BC for text_pair as arg for Udop

* Fix processing test fuyu

* remove unnecessary pad_token check in instructblip ProcessorTest

* Fix BC tests and cleanup

* FIx imports fuyu

* Uniformize Pix2Struct

* Fix wrong name for FuyuProcessorKwargs

* Fix slow tests reversed inputs align fuyu llava-next, change udop warning

* Fix wrong logging import udop

* Add check images text input order

* Fix copies

* change text pair handling when positional arg

* rebase on main, fix imports in test_processing_common

* remove optional args and udop uniformization from this PR

* fix failing tests

* remove unnecessary test, fix processing utils and test processing common

* cleanup Unpack

* cleanup

* fix conflict grounding dino

* 🚨🚨 Setting default behavior of assisted decoding (huggingface#33657)

* tests: fix pytorch tensor placement errors (huggingface#33485)

This commit fixes the following errors:
* Fix "expected all tensors to be on the same device" error
* Fix "can't convert device type tensor to numpy"

According to the PyTorch documentation, torch.Tensor.numpy(force=False)
performs the conversion only if the tensor is on the CPU (plus a few other
restrictions), which is not the case here. We need force=True since we just
need the data and don't care about tensor coherency.

Fixes: huggingface#33517
See: https://pytorch.org/docs/2.4/generated/torch.Tensor.numpy.html

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

* bump tokenizers, fix added tokens fast (huggingface#32535)

* update based on tokenizers release

* update

* nits

* update

* revert re addition

* don't break that yet

* fmt

* revert unwanted

* update tokenizers version

* update dep table

* update

* update in conversion script as well

* some fix

* revert

* fully revert

* fix training

* remove set trace

* fixup

* update

* update

* [Pixtral] Improve docs, rename model (huggingface#33491)

* Improve docs, rename model

* Fix style

* Update repo id

* fix code quality after merge

* HFQuantizer implementation for compressed-tensors library (huggingface#31704)

* Add compressed-tensors HFQuantizer implementation

* flag serializable as False

* run

* revive lines deleted by ruff

* fixes to load+save from sparseml, edit config to quantization_config, and load back

* address satrat comment

* compressed_tensors to compressed-tensors and revert back is_serializable

* rename quant_method from sparseml to compressed-tensors

* tests

* edit tests

* clean up tests

* make style

* cleanup

* cleanup

* add test skip for when compressed tensors is not installed

* remove pydantic import + style

* delay torch import in test

* initial docs

* update main init for compressed tensors config

* make fix-copies

* docstring

* remove fill_docstring

* Apply suggestions from code review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* review comments

* review comments

* comments - suppress warnings on state dict load, tests, fixes

* bug-fix - remove unnecessary call to apply quant lifecycle

* run_compressed compatability

* revert changes not needed for compression

* no longer need unexpected keys fn

* unexpected keys not needed either

* Apply suggestions from code review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* add to_diff_dict

* update docs and expand testing

* Update _toctree.yml with compressed-tensors

* Update src/transformers/utils/quantization_config.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* update doc

* add note about saving a loaded model

---------

Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Sara Adkins <sara@neuralmagic.com>
Co-authored-by: Sara Adkins <sara.adkins65@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Dipika <dipikasikka1@gmail.com>

* update model card for opt

* add batch size to inference table

* [slow-run] opt

* [run-slow] opt

---------

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Co-authored-by: Avishai Elmakies <avishai.elma@cs.huji.ac.il>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Tibor Reiss <75096465+tibor-reiss@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Muhammad Naufil <m.naufil1@gmail.com>
Co-authored-by: sizhky <yyeshr@gmail.com>
Co-authored-by: Umar Butler <umar@umar.au>
Co-authored-by: Jonathan Mamou <jonathan.mamou@intel.com>
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Sara Adkins <sara@neuralmagic.com>
Co-authored-by: Sara Adkins <sara.adkins65@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Dipika <dipikasikka1@gmail.com>
8 participants