Adding the compression tutorial on GPT distillation and quantization #2197

minjiaz · 2022-08-09T00:49:23Z

I have resolved most of the compatibility issues in GPT distillation and quantization. This PR restores the tutorial on the GPT compression examples and updates the WikiText-2 and LAMBADA evaluation results based on recent tests. The plan is to merge this PR after the GPT examples are merged into the DeepSpeed-Megatron repo.

@RezaYazdaniAminabadi

* Fix the layer-past for GPT based models (microsoft#2196) * Add gradient_average flag support for sparse grads (microsoft#2188) * Add gradient_average flag support for sparse grads * formatting fixes * Add tests Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Adding additional instructiosn in the compression tutorial on pre-training distillation and quantization for GPT (microsoft#2197) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Log user config exactly (microsoft#2201) * Fix the tensor-slicing copy for qkv parameters (microsoft#2198) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Refactor Distributed Tests (microsoft#2180) Refactor Distributed unit tests * fix table syntax (microsoft#2204) Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Correctly detect offload configuration (microsoft#2208) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * add cuda 11.7 (microsoft#2211) * add cuda 11.7 * formatting * use torch 1.9 (microsoft#2215) * [zero-3] print warning once and support torch parameter (microsoft#2127) * print warning only once. * add support for torch param and only warn on gpu 0 * remove type checking. will be done on a new PR with more tests. Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Add support of OPT models (microsoft#2205) * add opt replace policy * simplify inf. api * fix opt replace policy * fix use-cash & add relu * Add support of custom MLP act. function * Revert "simplify inf. api" This reverts commit 9e910fc. * fix the inference API (temp. solution) * fix code formatting * add unit tests for OPT models. 
* refactor pre-attention layer norm configuration * add support of opt-350m model * refactor the HF model config initialization * fix hf model config issue Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * fix typos in readme. (microsoft#2218) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * [device abstraction] add device abstraction to allow other device than CUDA be used * Fix regression w. dist_init_required (microsoft#2225) * add doc for new bert example (microsoft#2224) * Remove the random-generator from context during inference (microsoft#2228) * Fix the tensor-slicing copy for qkv parameters * remove the random-generator from context during inference * formatting Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * allow saving ckpt w/o ckpt json + bloom copy fix (microsoft#2237) * Correctly detect zero_offload (microsoft#2213) * Correctly detect offload configuration * Correctly detect offload configuration * Handle deprecated cpu offload setting * Correcly detect zero_offload setting * Minor tweak Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * update videos (microsoft#2249) * Refactor dist tests: Checkpointing (microsoft#2202) Refactor distributed tests: checkpointing Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Make OPT policy backward compatible with pre-OPT transformers versions (microsoft#2254) * fix ds-inference without policy (microsoft#2247) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * bump to 0.7.2 * Enable contiguous gradients with Z1+MoE (microsoft#2250) MoE training with zero stage 1 only works with `contiguous gradients=True`. 
* [rebase-202208] additional changes needed when rebase to 202208 * [rebase] cleanup direct cuda usage after merge * Correctly detect CPU optimizer usage (microsoft#2257) * Correctly detect CPU optimizer usage * Update nv-transformers-v100.yml (microsoft#2259) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * [precommit] fix pre-commit issues * Update half precision header guards (microsoft#2261) * fix microsoft#2240: wrong time unit in flops_profiler (microsoft#2241) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * bump to 0.7.3 * Add blob storage to CI runners (microsoft#2260) Add blob storage to CI runners and enable for transformers cache on inference tests * Update replace_module.py, test-gptj.py related fix (microsoft#2269) Fix RuntimeError: Boolean value of Tensor with more than one value is ambiguous when running test-gptj.py * Fix OrderedDict import for python3.6 (microsoft#2267) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Ds inference/fix mp2 (microsoft#2270) * Trajepl: nebula load fix (microsoft#2182) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: chenguo <chenguo@microsoft.com> * prevent torch ext folder mkdir at tmp (microsoft#2274) * Ds-inference Int8 support through ZeroQuant technology (microsoft#2217) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * add a new unit test for cuda ops (microsoft#2278) Co-authored-by: cmikeh2 <connorholmes@microsoft.com> * Add to codeowners file (microsoft#2279) * [pin_memory] make pin_memory select device type * Memory Access Utility (microsoft#2276) Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Fp32 accuracy bug fix (microsoft#2285) Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org> Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com> * Refactor universal checkpointing and tensor fragments (microsoft#2253) * Refactor universal checkpointing and tensor fragments * 
Formatting * [ds-inference] fix progress bar (microsoft#2286) when loading the non-sharded checkpoint update the progress bar (fix by @RezaYazdaniAminabadi) - I've just tested it to work. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Offload all gradients to nvme (microsoft#2282) * fused bias relu unittest (microsoft#2297) * fix for pytest picking up local deepspeed dir instead of installed deepspeed (microsoft#2299) * Fix for Zero3 when MP>1 and at least one batch param undefined (microsoft#2289) Co-authored-by: anthony.301 <anthony.301@mri.cluster> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * [downstream] merge from xpu support downstream * Unit test for bias add kernel (microsoft#2298) * added unit test * Update pt_binding.cpp * formatting * Update test_bias_add.py * Update relu.cu with mem_access_utils (microsoft#2306) * Add tensor parallel inference unit tests (microsoft#2232) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com> * Fix the residual add mp scaling for GPTNeoX (microsoft#2310) * Add unit tests for residual_add kernels (microsoft#2307) * add inference eval scripts (microsoft#2303) * Upgrade P40 tests to torch 1.8 (microsoft#2316) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * ZeRO-Inference blog (microsoft#2271) * ZeRO-Inference blog * ZeRO-Inference blog * Format fixes * Apply feedback * Feedback * Update docs/_posts/2022-08-27-zero-inference.md Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Update docs/_posts/2022-08-27-zero-inference.md Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Address feedback * Format fixes * More tweaks * long sequence, nvme offload * Add image Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * ZeRO-Inference blog - wrap up (microsoft#2321) * 
ZeRO-Inference blog - Update README (microsoft#2322) * refactor to use mem_access (microsoft#2317) * add quant unit test (microsoft#2315) * add quant unit test * add codeowner * format fix * fix undefined symbol: curandSetPseudoRandomGeneratorSeed * modify ref fn name and add comment * add comments * add 4bit quant 16groups * fix * modify groups in ref code * parameterize tensor shape * single param * detach tensor * remove -lcurand flag * add back -lcurand flag Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * only override forward if using cuda-graph (microsoft#2291) * Add more options to inference benchmark (microsoft#2325) * bump to 0.7.4 * MOE residual matmult unit test (microsoft#2323) MOE residual matmul unit tests Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * [device] port cuda device to literal_device() in new tests * MOE matmult with memaccess (microsoft#2336) * Fix formatting * Remove redundant variable * Refactor residual add kernels (microsoft#2333) Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * [accel_runtime] add pin_memory to accelerator runtime interface. 
* mem access for quantize kernel (microsoft#2331) * mem access for quantize kernel * format * format fp32 * modify quant kernel * modify quant kernel2 * modify format * format * fix comments in pytest * fix comments in pytest * format * rerun Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * increase min pre-commit versions (microsoft#2346) * Extend scratch buffer for long prompts (microsoft#2212) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fix zero docs (microsoft#2350) * Inference profiling updates/fixes (microsoft#2348) (microsoft#2349) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Kernel Data Conversion Utility (microsoft#2327) * Unify macro definitions and constants in a single file * Conversion utility implementation. 
* Fix reversion from formatting * Bugfixes after testing with correct DeepSpeed * Inline markers are available on both HIP + CUDA * Add Onebit Optimzers in __init__ (microsoft#2340) Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * [accelerator abstraction] merge from microsoft#2320 * docs(mixture-of-experts-inference): fix typo in tuto (microsoft#2345) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * download cifar to blob storage (microsoft#2342) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Refactor gptj_residual_add kernels for better readability (microsoft#2358) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * Updated issue templates (microsoft#2363) * Update issue templates * fix cuda invalid config error in dequant kernel (microsoft#2362) * format * remove round fn * Add missing pytest fixture scope (microsoft#2353) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Extend residual_add kernel tests to conver pre_attn_norm (microsoft#2354) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Refactor fused_bias_residual kernels for better readability (microsoft#2356) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Capture error message during sweep tests (microsoft#2351) * Collect error messages in results.csv Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * fix an exception when recursively casting dicts to fp16 (microsoft#2370) * Refactor remaining distributed tests (microsoft#2216) * batch of refactored tests * more test refactoring * fp16 test refactor * more refactors * added DistributedFixture class * applied DistributedFixture to first batch of tests as a trial * added DistributedFixture test and documentation * last tests * fixes for refactored tests * remove subdirs in workflow files 
* fix pytest syntax error * fix another syntax error * update imports * use DistFixture with elastic checkpoint test * missing import * update to shared class tmpdir for elastic test * moved test files * avoid duplicate test file name * last refactor and moving test files * formatting * fix broken import * testing forked AMD tests * update abstract method * use blob storage for accelerate and transformers tests * upgrade torch for acclerate CI Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix the MLP output tensor's shape (microsoft#2380) * allow building with latest CUDA (11.8), it is backwards compatible (microsoft#2390) * pin transformers version for unit tests (microsoft#2402) * Change type to tuple in replace_wo_policy isinstance check (microsoft#2387) Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type. Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Molly Smith <mosm@microsoft.com> Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Checkpoint backwards-compatbility workaround (microsoft#2384) * Add predicated global load (microsoft#2373) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * change call site of literal_device, on_accel_device and accel_runtime to get_accelerator() call * add new interface definition from olruwase/accelerator_abstraction * MII blog post (microsoft#2418) Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Fix figure reference (microsoft#2419) * [docs] update news items * [docs] add mii repo link * Add SLURM Multinode Runner (microsoft#2404) Signed-off-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> 
* Fix issue with corrupted output on long generation for GPT (microsoft#2359) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * MII blog title update on Readme * DeepSpeed-MII title change in website * Fix GPT Neo-X multi-gpu inference (microsoft#2401) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * MII-Public and MII-Azure subheading in mii post * CI fixes related to triton (microsoft#2422) * [docs] update mii blog title (microsoft#2423) * add SD injection policy (microsoft#2381) Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * [accelerator abstraction] remove name() from interface, device_name() should be used. * merge with master (ec13da6) * fix checkpoint loading when it is a dictionary (microsoft#2425) * Make error regex more generic in collect_results.py (microsoft#2415) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fixes microsoft#2389 (microsoft#2411) truncating expert param storage for checkpointing Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Fix for inference gpt-j test (microsoft#2430) * fix for gpt-j failing due to tokenizer error * limit number of gpt-j tokens generated due to low memory * Fixing bug 2361 (microsoft#2410) * fixing bug 2361 * adding pytest for config initialization * chaning expected output to FusedAdam * remove print statement * running yapf on modified files * running pre-commit formatting Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Universal checkpoint for zero stage 1 (microsoft#2284) * Refactor universal checkpointing and tensor fragments * Formatting * Support zero stage1; Expand TP dim * Remove debug prints * Detect sharded optimizer state * Format fixes * Encode 
reshaping guide * More symbolic constants Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * only add deps if extra is explictly called (microsoft#2432) * Add TestInjectionPolicy inference unittest class for testing custom injection policies (microsoft#2426) This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies. This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API. The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified. This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see microsoftGH-2387). Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * [memory estimators] new config args sync (microsoft#2431) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * parallelize writing of layer checkpoint files across data parallel instances (microsoft#1419) * parallelize layer checkpoints across data parallel groups * use partition_uniform to determine start/end index values * formatting fix * config: add option for parallel write of layer checkpoints in pipeline stage * yapf fixes * enable parallel layer write according to config param * avoid extraneous makedir when rank 0 writes all layers Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix broken link to DeepSpeed Megatron fork (microsoft#2440) Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> * bump to 0.7.5 * [OpBuilder] Add op builder abstraction * convert op builder usage in merged code * merge diff files from upstream * [OpBuilder] add create_op_builder interface in abstract_accelerator.py * remove files that is deleted from upstream * [OpBuilder] add left over op builder 
usage in tests * [OpBuilder] fix op builder usage in tests * [OpBuilder] fix <op builder>.NAME usage in tests to follow op builder abstraction design * import get_accelerator from deepspeed.accelerator directly * [OpBuilder] remove unused function and sync with main * add missing import * revert changes in device.py to avoid conflict with main * fix alexnet_model to use /tmp instead of /blob * Mingzhi/solve pr108 b (microsoft#115) * move ALL_OPs from __init__.py to all_Op.py to solve circular import * delete deepspeedexamples * fix import * fix regression (microsoft#117) * fix pin_memory * fix regression * fix error Signed-off-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Mikhail Druzhinin <dipetm@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Kamal Raj <kamalraj97@gmail.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Zhihong Chen <gdst_czh@163.com> Co-authored-by: Siddharth Singh <siddharth9820@gmail.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: 叶志晟 <yzs981130@126.com> Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com> Co-authored-by: trajep <trajepl@gmail.com> Co-authored-by: chenguo <chenguo@microsoft.com> Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Quentin Anthony <qganthony@yahoo.com> Co-authored-by: anthony.301 <anthony.301@mri.cluster> Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com> 
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> Co-authored-by: Saeyeol Lee <78332687+l4d2boomer@users.noreply.github.com> Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai> Co-authored-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> Co-authored-by: Matt Smith <matt@mjksmith.com> Co-authored-by: Thomas-MMJ <112830596+Thomas-MMJ@users.noreply.github.com> Co-authored-by: lekurile <113481193+lekurile@users.noreply.github.com> Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Molly Smith <mosm@microsoft.com> Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal> Co-authored-by: Andrey Chernykh <andrew.chernyh@gmail.com> Co-authored-by: Alexander Jipa <alexander.jipa@gmail.com> Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: Adam Moody <moody20@llnl.gov> Co-authored-by: AGUL <mingzhi.liu@intel.com>

Adding additional instructiosn in the compression tutorial on pre-training distillation and quantization for GPT

0948d60

minjiaz requested review from jeffra, conglongli, yaozhewei and xiaoxiawu-microsoft August 9, 2022 00:49

minjiaz requested review from samyam, tjruwase, ShadenSmith, awan-10, cli99, eltonzheng, RezaYazdaniAminabadi, duli2012, mrwyattii, arashb and samadejacobs as code owners August 9, 2022 00:49

conglongli approved these changes Aug 9, 2022

conglongli enabled auto-merge (squash) August 9, 2022 01:03

tjruwase and others added 2 commits August 9, 2022 05:29

Merge branch 'master' into minjiaz/compression-tutorial-addon

337b8df

Merge branch 'master' into minjiaz/compression-tutorial-addon

bccc227

jeffra disabled auto-merge August 9, 2022 16:14

jeffra approved these changes Aug 9, 2022

jeffra merged commit f82846d into master Aug 9, 2022

jeffra deleted the minjiaz/compression-tutorial-addon branch August 9, 2022 16:14
