Add configs to run int4 inference #37
Conversation
Amazing work with adding int4-support, Reza!
```diff
@@ -191,6 +191,7 @@ def write_checkponts_json():
     mp_size=world_size,
     base_dir=repo_root,
     dtype=getattr(torch, infer_dtype),
+    quantization_bits=8 if args.dtype == 'int8' else 4,
```
what happens with `--dtype float16`?

probably best to set this in kwargs only if a quantization dtype is provided
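A minimal sketch of that suggestion, assuming the surrounding variables (`world_size`, `repo_root`, `infer_dtype`, `args`, `model`) from the script and the `quantization_bits` kwarg this PR introduces:

```python
import torch
import deepspeed

# Build the init_inference kwargs and only attach quantization_bits when a
# quantized dtype was actually requested; --dtype float16 stays untouched.
kwargs = dict(
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
)
if args.dtype in ("int8", "int4"):
    kwargs["quantization_bits"] = 8 if args.dtype == "int8" else 4

model = deepspeed.init_inference(model, **kwargs)
```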
The quantization-bit should not be used when running in half-precision. But I agree, we can do it in the kwargs and only for the quantized inference mode.
these demos are already used by many users, so let's make them nice and clean configuration-wise, so it's clear to the reader which bits should be enabled and when.
```diff
@@ -227,7 +228,7 @@ def write_checkponts_json():
     # dynamically extend to support larger bs by repetition
     input_sentences *= math.ceil(args.batch_size / len(input_sentences))

-    generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False)
+    generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True)
```
this is already a very different type of change.

If int4 requires `do_sample=True`, then again, let's change it only if it's `--dtype int4`
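A sketch of keeping the change scoped to int4 only, reusing the names from the diff above (`num_tokens`, `args`):

```python
# Greedy decoding by default; only sample if int4 really turns out to need it.
generate_kwargs = dict(
    max_new_tokens=num_tokens,
    do_sample=(args.dtype == "int4"),
)
```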
Sure, I will double-check with `do_sample=False` again to see if the generated text makes sense. If not, I will set it to true for int4.
I just checked with `do_sample=False` and I see the text is produced in the same way as for FP16 and INT8:
in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for deep learning. It is a Python library, and it is also a framework. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and
So, I am gonna turn it off for now.
Also should probably assert if I can take care of that.
Sounds good to me. Thanks @stas00
```diff
@@ -100,7 +100,7 @@ def get_checkpoint_files(model_name_or_path):


 model_name = args.name
-infer_dtype = args.dtype
+infer_dtype = args.dtype if args.dtype != 'int4' else 'int8'
```
would it make for a more user-friendly API to

- keep the `dtype` intact
- drop `quantization_bits`
- let `deepspeed.init_inference` derive the number of bits from `dtype`?

not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information twice, in `dtype` and in `quantization_bits`
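A sketch of the mapping this proposal implies, whether it lives in the demo script or inside `deepspeed.init_inference` (illustrative only; since torch has no int4 dtype, the int4 case still loads weights via the int8 path, as the diff above shows; `args` is assumed from the script):

```python
# Illustrative mapping from the user-facing --dtype string to
# (torch dtype used for loading, quantization bits); None means no quantization.
DTYPE_MAP = {
    "float16": ("float16", None),
    "int8": ("int8", 8),
    "int4": ("int8", 4),  # torch.int4 does not exist, so int4 rides on the int8 path
}

infer_dtype, quantization_bits = DTYPE_MAP[args.dtype]
```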
oh, wait, `torch.int4` still doesn't exist, does it? let's find the feature request.
still not implemented: pytorch/pytorch#74627

so that's why you had to do the odd workarounds, right?
I guess we can drop it once it's implemented, @stas00? For now, this might be the best way to do it.
see #37 (comment)
it's pointless to wait, since they won't have `int3` and `int12`
> would it make for a more user-friendly API to
>
> - keep the `dtype` intact
> - drop `quantization_bits`
> - let `deepspeed.init_inference` derive the number of bits from `dtype`?
>
> not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information twice, in `dtype` and in `quantization_bits`
@stas00 and @RezaYazdaniAminabadi - just clarifying that we have introduced a new DeepSpeedInferenceConfig that can be passed to init_inference. We are keeping it backwards compatible but if we are okay to make changes to this file, I would advocate for writing a config dictionary for DeepSpeed and pass that to init_inference instead of the various kwargs. Please see here for an example: https://gist.github.com/awan-10/6e3d5c756be3a876522e860c6bbf702d#file-bloom-ds-inference-py-L173
Also, see the docs for the new config: https://deepspeed.readthedocs.io/en/latest/inference-init.html
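A rough sketch of the config-dict style, loosely following the linked gist; the key names used here (`tensor_parallel`, `replace_with_kernel_inject`, `checkpoint`, `base_dir`) are taken from the DeepSpeed inference docs and should be verified against the installed version, and the variables (`world_size`, `repo_root`, `checkpoints_json`, `infer_dtype`, `model`) are assumed from the script:

```python
import torch
import deepspeed

# One config object instead of scattered kwargs.
ds_config = {
    "dtype": getattr(torch, infer_dtype),
    "tensor_parallel": {"tp_size": world_size},
    "replace_with_kernel_inject": True,
    "base_dir": repo_root,
    "checkpoint": checkpoints_json,
}

model = deepspeed.init_inference(model, config=ds_config)
```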
That definitely works.

@awan-10, may I suggest you make the inference config accept `dict_or_path`, just like `zero` does? For some users it might be easier to write out a separate file.
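A hypothetical illustration of the `dict_or_path` idea; this is the behavior being requested here, not an existing DeepSpeed API:

```python
import json

def load_inference_config(dict_or_path):
    """Accept either an in-memory dict or a path to a JSON file,
    mirroring how a ZeRO config can be supplied."""
    if isinstance(dict_or_path, dict):
        return dict_or_path
    with open(dict_or_path) as f:
        return json.load(f)

# Either form would then work:
#   deepspeed.init_inference(model, config=load_inference_config(ds_config))
#   deepspeed.init_inference(model, config=load_inference_config("ds_inference_config.json"))
```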
@stas00 - thanks for the suggestion. Created an issue so we can track it: microsoft/DeepSpeed#2532. Mike and I will work on it.
Thank you very much, @awan-10
OK, I think I understand the limitations of pytorch, and it'll get only worse when you try [...]. I propose we break the currently proposed API and draw a better one: have only 2 user-configurable args related to how deepspeed-inference operates.

Now the API is simple, unambiguous and future-proof (as in [...]). For back-compat [...].

What do you think, Reza?
Huh?
Hi @stas00,

In that case, we [...]

In this case, we can simply pass the bits to the DeepSpeed-inference config: [...]
may I suggest that for the just-added [...] — why not have a flat structure of simple key=value pairs? Once you've got the info on your side you can re-arrange it to any nesting level you want.
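An illustration of the flat-vs-nested point; the key names are made up for the example and are not DeepSpeed's actual schema:

```python
# What the user writes: flat, simple key=value pairs.
flat = {"dtype": "int8", "tp_size": 8, "kernel_inject": True}

# What the library can rebuild internally, at whatever nesting level it prefers.
nested = {
    "dtype": flat["dtype"],
    "tensor_parallel": {"tp_size": flat["tp_size"]},
    "replace_with_kernel_inject": flat["kernel_inject"],
}
```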
I agree, let me work on that and I'll fix it.
@RezaYazdaniAminabadi -- please see my comment above. #37 (comment)
thanks @awan-10. Please go ahead and push your changes. |
Add some minor config changes to support int4 inference through DeepSpeed-Inference.
The int4 support itself will be added to DeepSpeed through a separate DeepSpeed PR.
cc: @stas00