[RFC] add device abstraction to allow other device than CUDA be used #2221

Merged
merged 86 commits into from
Mar 7, 2023
0a849d5
[device abstraction] add device abstraction to allow other device tha…
delock Aug 16, 2022
e4f40f0
Merge branch '202208-base' into 202208
delock Aug 24, 2022
4a216ea
[rebase-202208] additional changes needed when rebase to 202208
delock Aug 24, 2022
2137642
Merge branch '20220824-base' into 20220824
delock Aug 24, 2022
089657e
[rebase] cleanup direct cuda usage after merge
delock Aug 24, 2022
d5a8424
[precommit] fix pre-commit issues
delock Aug 25, 2022
96d0765
Merge branch 'master' into gma/device-abstraction
tjruwase Aug 30, 2022
ac64c7a
[pin_memory] make pin_memory select device type
delock Sep 1, 2022
02c3a57
Merge branch 'master' into gma/device-abstraction
delock Sep 8, 2022
522b24b
[downstream] merge from xpu support downstream
delock Sep 9, 2022
a3b1e02
Merge branch 'master' into gma/device-abstraction
tjruwase Sep 12, 2022
4557c33
Merge branch 'master' into gma/device-abstraction
tjruwase Sep 13, 2022
2ef7d6c
Merge branch 'up-master' into gma/merge-upstream-20220921
delock Sep 21, 2022
9656321
[device] port cuda device to literal_device() in new tests
delock Sep 21, 2022
65729e3
[accel_runtime] add pin_memory to accelerator runtime interface.
delock Sep 22, 2022
f94d53e
[accelerator abstraction] merge from #2320
delock Sep 26, 2022
6005abe
Merge branch 'up-master' into gma/device-abstraction
delock Sep 26, 2022
31c0997
change call site of literal_device, on_accel_device and accel_runtime…
delock Oct 12, 2022
1785c26
add new interface definition from olruwase/accelerator_abstraction
delock Oct 12, 2022
17203a4
[accelerator abstraction] remove name() from interface, device_name()…
delock Oct 14, 2022
e8daea6
merge with master (ec13da6ba7cabc44bb4745a64a208b8580792954)
delock Oct 14, 2022
cfd23ed
Merge branch 'up-master' into gma/device-abstraction
delock Oct 14, 2022
13bbbdf
[OpBuilder] Add op builder abstraction
delock Oct 23, 2022
06e39a5
Merge branch 'up-master' into gma/device-abstraction
delock Oct 23, 2022
257490f
convert op builder usage in merged code
delock Oct 23, 2022
c93b999
[OpBuilder] add create_op_builder interface in abstract_accelerator.py
delock Oct 23, 2022
9858d42
[OpBuilder] fix op builder usage in tests
delock Oct 23, 2022
68ce006
[OpBuilder] fix <op builder>.NAME usage in tests to follow op builder…
delock Oct 23, 2022
4b62dab
import get_accelerator from deepspeed.accelerator directly
delock Oct 23, 2022
c5b2070
[OpBuilder] remove unused function and sync with main
delock Oct 23, 2022
9532843
add missing get_accelerator import
delock Oct 25, 2022
0729695
fix obsolete name in CPU Adam which should be create_op_builder
delock Oct 25, 2022
be517d8
fix create_op_builder calls
delock Oct 25, 2022
3af870f
fix misuse of new accelerator abstraction interface in tests
delock Oct 25, 2022
8fa64b9
Merge from downstream for bug fixing
delock Oct 28, 2022
4873538
merge from downstream
delock Nov 3, 2022
61b10b0
remove SYCL_KERNEL specific code
delock Nov 4, 2022
457d281
Merge branch 'up-master(9cfcf7431a02a)' into gma/device-abstraction
delock Nov 8, 2022
fea4604
Merge branch 'up-master(6f77da1bae506)' into gma/device-abstraction
delock Nov 10, 2022
f80a907
Merge branch 'up-master(3ca9878d8e92a)' into gma/device-abstraction
delock Nov 10, 2022
3b0b14c
merge from downstream for bugs fixes
delock Nov 10, 2022
b375e46
Merge branch 'up-master(be5ec506bd5219a)' into gma/device-abstraction
delock Nov 11, 2022
18b3c95
fix torch.cuda in new files
delock Nov 11, 2022
97695f5
use OpBuilder name symbol, improve env_report, fix typo, fix get_acce…
delock Nov 13, 2022
93e157b
Merge branch 'master' into gma/device-abstraction
tjruwase Nov 13, 2022
b1c5384
fix missing () in get_accelerator for ds_attention.py
delock Nov 14, 2022
91fb948
import deepspeed.accelerator.get_accelerator only when torch_availabl…
delock Nov 14, 2022
8f89c2b
Merge branch 'up-master' into gma/device-abstraction
delock Dec 1, 2022
26e628d
Change reference of InferenceSpecializedBuilder to name string, Infer…
delock Dec 1, 2022
91f5cb2
convert new code with CUDA references
delock Dec 1, 2022
5a1ae0e
remove unneeded get_accelerator import in op_builder/__init__.py
delock Dec 1, 2022
05842b6
[setup] fix build error when pytorch is not installed in environment
delock Dec 1, 2022
24d2b38
Handle the case when torch is not installed during deepspeed installa…
delock Dec 1, 2022
c26e5d4
Merge branch 'master' into gma/device-abstraction
tjruwase Dec 2, 2022
4116ba5
Merge branch 'up-master' into gma/device-abstraction
delock Jan 8, 2023
bea648f
port new cuda specific code
delock Jan 8, 2023
94253d4
revert changes in __init__.py since new mechanism no longer requires …
delock Jan 8, 2023
2acad48
Merge branch 'up-master' into gma/device-abstraction
delock Jan 27, 2023
77af66a
use old op builder interface
delock Jan 27, 2023
8ec0905
Merge branch 'up-master' into gma/device-abstraction
delock Jan 27, 2023
bd9d275
remove bypass code in set_accelerator_visible
delock Jan 27, 2023
f1e75ff
revert changes in quantizer according to latest op builder interface
delock Jan 27, 2023
9860282
Merge branch 'master' into gma/device-abstraction
delock Jan 30, 2023
c26da46
port additional torch.cuda code in deepspeed
delock Jan 27, 2023
cb46cf4
Merge branch 'master' into gma/device-abstraction
delock Jan 31, 2023
b74a47c
Merge branch 'master' into gma/device-abstraction
delock Feb 3, 2023
6e55729
Merge branch 'master' into gma/device-abstraction
delock Feb 6, 2023
3c186d2
Merge branch 'master' into gma/device-abstraction
delock Feb 7, 2023
667c878
follow comments
delock Feb 9, 2023
d693dad
Merge branch 'up-master' into gma/device-abstraction
delock Feb 9, 2023
7a9e7ea
fix format
delock Feb 9, 2023
538148b
fix new code with cuda specific code
delock Feb 9, 2023
af8cee2
Merge branch 'master' into gma/device-abstraction
delock Feb 11, 2023
3dd816c
Merge branch 'master' into gma/device-abstraction
delock Feb 15, 2023
abf31b6
Merge branch 'master' into gma/device-abstraction
delock Feb 17, 2023
9539def
Merge branch 'master' into gma/device-abstraction
delock Feb 20, 2023
6ac4de4
Merge branch 'master' into gma/device-abstraction
delock Feb 22, 2023
b551304
Merge branch 'master' into gma/device-abstraction
delock Feb 22, 2023
238dc1e
port cuda specific code in module injection
delock Feb 23, 2023
da254d7
Merge branch 'master' into gma/device-abstraction
delock Feb 24, 2023
33ace54
Merge branch 'master' into gma/device-abstraction
delock Feb 26, 2023
3d572bb
Merge branch 'up-master' into gma/device-abstraction
delock Mar 1, 2023
4f9f6c2
add licensing message
delock Mar 1, 2023
e92fd92
Merge branch 'master' into gma/device-abstraction
delock Mar 2, 2023
136ba27
Merge branch 'master' into gma/device-abstraction
tjruwase Mar 7, 2023
9569b46
Merge branch 'master' into gma/device-abstraction
jeffra Mar 7, 2023
27 changes: 17 additions & 10 deletions benchmarks/communication/all_gather.py
@@ -2,6 +2,7 @@

from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
+from deepspeed.accelerator import get_accelerator

import time

@@ -85,16 +86,20 @@ def run_all_gather(local_rank, args):
try:
mat = torch.ones(world_size,
M,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+dtype=getattr(
+torch,
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
# Delete original mat to avoid OOM
del mat
-torch.cuda.empty_cache()
+get_accelerator().empty_cache()
output = torch.zeros(input.nelement() * world_size,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+dtype=getattr(
+torch,
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
@@ -123,15 +128,17 @@ def run_all_gather(local_rank, args):
try:
mat = torch.ones(elements_per_gpu,
dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
# multiply each GPU's tensor by the rank to ease debugging
input = ((mat.mul_(float(global_rank))).view(-1))
# Delete original mat to avoid OOM
del mat
-torch.cuda.empty_cache()
-output = torch.zeros(elements_per_gpu * world_size,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+get_accelerator().empty_cache()
+output = torch.zeros(
+elements_per_gpu * world_size,
+dtype=getattr(torch,
+args.dtype)).to(get_accelerator().device_name(local_rank))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
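
The substitution made throughout this file, `.cuda(local_rank)` becoming `.to(get_accelerator().device_name(local_rank))`, can be illustrated without DeepSpeed or PyTorch installed. The sketch below is not DeepSpeed's implementation: `StubCudaAccelerator`, `StubXpuAccelerator`, and `placement_for` are hypothetical names, and the returned strings assume the usual `type:index` form of PyTorch device strings. Only the method name `device_name` comes from the diff.

```python
class StubCudaAccelerator:
    """Hypothetical stand-in for a CUDA accelerator object (not DeepSpeed's class)."""

    def device_name(self, device_index=None):
        # Assumption: device strings follow PyTorch's "type" / "type:index" form.
        return "cuda" if device_index is None else f"cuda:{device_index}"


class StubXpuAccelerator:
    """Hypothetical stand-in for a non-CUDA accelerator object."""

    def device_name(self, device_index=None):
        return "xpu" if device_index is None else f"xpu:{device_index}"


def placement_for(accelerator, local_rank):
    # The benchmark passes this string to Tensor.to(...) instead of calling
    # Tensor.cuda(local_rank), so the call site no longer names a device type.
    return accelerator.device_name(local_rank)


if __name__ == "__main__":
    for acc in (StubCudaAccelerator(), StubXpuAccelerator()):
        print(placement_for(acc, 0))
```

Because the call site only sees a device string, supporting another device type means supplying a different accelerator object rather than editing each benchmark.
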
10 changes: 7 additions & 3 deletions benchmarks/communication/all_reduce.py
@@ -2,6 +2,7 @@

from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
+from deepspeed.accelerator import get_accelerator

import time

@@ -64,8 +65,10 @@ def run_all_reduce(local_rank, args):
try:
mat = torch.ones(world_size,
M,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+dtype=getattr(
+torch,
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
@@ -88,7 +91,8 @@ def run_all_reduce(local_rank, args):
try:
mat = torch.ones(elements_per_gpu,
dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
19 changes: 12 additions & 7 deletions benchmarks/communication/all_to_all.py
@@ -2,6 +2,7 @@

from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
+from deepspeed.accelerator import get_accelerator

import time

@@ -63,8 +64,10 @@ def run_all_to_all(local_rank, args):
try:
mat = torch.ones(world_size,
M,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+dtype=getattr(
+torch,
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
assert mat.numel() % world_size == 0, f"tensor cannot be divided in {world_size} chunks"
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
@@ -88,15 +91,17 @@ def run_all_to_all(local_rank, args):
try:
mat = torch.ones(elements_per_gpu,
dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
assert mat.numel() % world_size == 0, f"tensor with {mat.numel()} elements cannot be divided in {world_size} chunks"
input = ((mat.mul_(float(global_rank))).view(-1))
# Delete original mat to avoid OOM
del mat
-torch.cuda.empty_cache()
-output = torch.zeros(elements_per_gpu,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+get_accelerator().empty_cache()
+output = torch.zeros(
+elements_per_gpu,
+dtype=getattr(torch,
+args.dtype)).to(get_accelerator().device_name(local_rank))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
10 changes: 7 additions & 3 deletions benchmarks/communication/broadcast.py
@@ -3,6 +3,7 @@
import torch
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
+from deepspeed.accelerator import get_accelerator

import time

@@ -65,8 +66,10 @@ def run_broadcast(local_rank, args):
try:
mat = torch.ones(world_size,
M,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+dtype=getattr(
+torch,
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
@@ -89,7 +92,8 @@ def run_broadcast(local_rank, args):
try:
mat = torch.ones(elements_per_gpu,
dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
3 changes: 2 additions & 1 deletion benchmarks/communication/constants.py
@@ -1,9 +1,10 @@
'''Copyright The Microsoft DeepSpeed Team'''
+from deepspeed.accelerator import get_accelerator

DEFAULT_WARMUPS = 5
DEFAULT_TRIALS = 50
DEFAULT_TYPE = 'float'
-DEFAULT_BACKEND = 'nccl'
+DEFAULT_BACKEND = get_accelerator().communication_backend_name()
DEFAULT_UNIT = 'Gbps'
DEFAULT_DIST = 'deepspeed'
DEFAULT_MAXSIZE = 24
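
With this change `DEFAULT_BACKEND` is no longer a literal; it is whatever the active accelerator reports when the module is imported. A minimal sketch of that idea follows, using a hypothetical `StubAccelerator` class and `build_parser` helper in place of DeepSpeed's real objects. The backend names `nccl` and `ccl` are taken from the `--backend` choices in `utils.py` in this PR; which accelerator reports which name is an assumption here.

```python
import argparse


class StubAccelerator:
    """Hypothetical stand-in; only the method name comes from the diff."""

    def __init__(self, backend_name):
        self._backend_name = backend_name

    def communication_backend_name(self):
        return self._backend_name


def build_parser(accelerator):
    # The default is resolved from the accelerator rather than hardcoded.
    default_backend = accelerator.communication_backend_name()
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend",
                        type=str,
                        default=default_backend,
                        choices=["nccl", "ccl"],
                        help="Communication library to use")
    return parser


if __name__ == "__main__":
    args = build_parser(StubAccelerator("ccl")).parse_args([])
    print(args.backend)
```

An explicit `--backend` on the command line still overrides the accelerator-derived default, provided the value is one of the listed choices.
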
10 changes: 7 additions & 3 deletions benchmarks/communication/pt2pt.py
@@ -2,6 +2,7 @@

from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
+from deepspeed.accelerator import get_accelerator

import time

@@ -83,8 +84,10 @@ def run_pt2pt(local_rank, args):
try:
mat = torch.ones(world_size,
M,
-dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+dtype=getattr(
+torch,
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
@@ -107,7 +110,8 @@ def run_pt2pt(local_rank, args):
try:
mat = torch.ones(elements_per_gpu,
dtype=getattr(torch,
-args.dtype)).cuda(local_rank)
+args.dtype)).to(
+get_accelerator().device_name(local_rank))
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
13 changes: 7 additions & 6 deletions benchmarks/communication/utils.py
@@ -5,6 +5,7 @@
import math
import argparse
from benchmarks.communication.constants import *
+from deepspeed.accelerator import get_accelerator

global dist

@@ -14,7 +15,7 @@ def init_torch_distributed(backend):
import torch.distributed as dist
torch.distributed.init_process_group(backend)
local_rank = int(os.environ['LOCAL_RANK'])
-torch.cuda.set_device(local_rank)
+get_accelerator().set_device(local_rank)


def init_deepspeed_comm(backend):
@@ -23,7 +24,7 @@ def init_deepspeed_comm(backend):
import deepspeed.comm as dist
deepspeed.init_distributed(dist_backend=backend)
local_rank = int(os.environ['LOCAL_RANK'])
-torch.cuda.set_device(local_rank)
+get_accelerator().set_device(local_rank)


def init_processes(local_rank, args):
@@ -101,14 +102,13 @@ def get_metric_strings(args, tput, busbw, duration):


def sync_all():
-torch.cuda.synchronize()
+get_accelerator().synchronize()
dist.barrier()


def max_numel(comm_op, dtype, mem_factor, local_rank, args):
dtype_size = _element_size(dtype)
-max_memory_per_gpu = torch.cuda.get_device_properties(
-local_rank).total_memory * mem_factor
+max_memory_per_gpu = get_accelerator().total_memory(local_rank) * mem_factor
if comm_op == 'all_reduce' or comm_op == 'pt2pt' or comm_op == 'broadcast':
elements_per_gpu = int(max_memory_per_gpu // dtype_size)
elif comm_op == 'all_gather':
@@ -185,7 +185,8 @@ def benchmark_parser():
parser.add_argument("--backend",
type=str,
default=DEFAULT_BACKEND,
-choices=['nccl'],
+choices=['nccl',
+'ccl'],
help='Communication library to use')
parser.add_argument("--dist",
type=str,
7 changes: 4 additions & 3 deletions benchmarks/inference/bert-bench.py
@@ -5,6 +5,7 @@
import deepspeed
import argparse
from transformers import pipeline
+from deepspeed.accelerator import get_accelerator

parser = argparse.ArgumentParser()
parser.add_argument("--model", "-m", type=str, help="hf model name")
@@ -46,7 +47,7 @@ def print_latency(latency_set, title, warmup=3):
print("\t999 Latency: {0:8.2f} ms".format(p999 * 1000))


-deepspeed.init_distributed("nccl")
+deepspeed.init_distributed()

print(args.model, args.max_tokens, args.dtype)

@@ -75,10 +76,10 @@ def print_latency(latency_set, title, warmup=3):
times = []
mtimes = []
for i in range(args.trials):
-torch.cuda.synchronize()
+get_accelerator().synchronize()
start = time.time()
r = pipe(f"Hello I'm a {mask} model")
-torch.cuda.synchronize()
+get_accelerator().synchronize()
end = time.time()
responses.append(r)
times.append((end - start))
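
The timing loop above brackets each pipeline call with `synchronize()` so that queued asynchronous device work is finished before the clock is read. A sketch of the same pattern follows, written against a hypothetical `CountingAccelerator` stub and a hypothetical `time_calls` helper; the stub's `synchronize()` only counts calls, whereas a real accelerator would block until device work completes.

```python
import time


class CountingAccelerator:
    """Hypothetical stub; records how often synchronize() is called."""

    def __init__(self):
        self.sync_calls = 0

    def synchronize(self):
        self.sync_calls += 1


def time_calls(accelerator, fn, trials):
    """Time `fn` `trials` times, synchronizing before and after each call."""
    times = []
    for _ in range(trials):
        accelerator.synchronize()  # drain pending work before starting the clock
        start = time.time()
        fn()
        accelerator.synchronize()  # wait for fn's work before stopping the clock
        end = time.time()
        times.append(end - start)
    return times


if __name__ == "__main__":
    acc = CountingAccelerator()
    latencies = time_calls(acc, lambda: sum(range(1000)), trials=3)
    print(len(latencies), acc.sync_calls)
```
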
7 changes: 4 additions & 3 deletions benchmarks/inference/gpt-bench.py
@@ -6,6 +6,7 @@
import deepspeed
import argparse
from transformers import pipeline
+from deepspeed.accelerator import get_accelerator

parser = argparse.ArgumentParser()
parser.add_argument("--model", "-m", type=str, help="hf model name")
@@ -63,7 +64,7 @@ def print_latency(latency_set, title, warmup=3):
print("\t999 Latency: {0:8.2f} ms".format(p999 * 1000))


-deepspeed.init_distributed("nccl")
+deepspeed.init_distributed()

if args.local_rank == 0:
print("BENCHMARK SETTINGS:")
@@ -102,10 +103,10 @@ def print_latency(latency_set, title, warmup=3):
times = []
mtimes = []
for i in range(args.trials):
-torch.cuda.synchronize()
+get_accelerator().synchronize()
start = time.time()
r = pipe("DeepSpeed is", do_sample=False, max_new_tokens=args.max_tokens)
-torch.cuda.synchronize()
+get_accelerator().synchronize()
end = time.time()
responses.append(r)
times.append(end - start) # / (args.max_tokens - 3))
13 changes: 8 additions & 5 deletions deepspeed/module_inject/containers/base.py
@@ -5,6 +5,7 @@
import torch

from deepspeed.ops.transformer.inference.config import DeepSpeedInferenceConfig
+from deepspeed.accelerator import get_accelerator


class BaseConvolutionContainer(ABC):
@@ -216,12 +217,14 @@ def copy_data_to_new_module(self):
self.module.mlp.attn_nb = self.attn_nb
else:
self.module.mlp.attn_nw.data.copy_(
-self.attn_nw.to(torch.cuda.current_device()))
+self.attn_nw.to(get_accelerator().current_device_name()))
self.module.mlp.attn_nb.data.copy_(
-self.attn_nb.to(torch.cuda.current_device()))
+self.attn_nb.to(get_accelerator().current_device_name()))

-self.module.norm_w.data.copy_(self.input_nw.to(torch.cuda.current_device()))
-self.module.norm_b.data.copy_(self.input_nb.to(torch.cuda.current_device()))
+self.module.norm_w.data.copy_(
+self.input_nw.to(get_accelerator().current_device_name()))
+self.module.norm_b.data.copy_(
+self.input_nb.to(get_accelerator().current_device_name()))

def transpose(self):
self.transpose_attention()
@@ -241,5 +244,5 @@ def transpose_impl(self, data):
data = data.contiguous()
data.reshape(-1).copy_(data.transpose(-1, -2).contiguous().reshape(-1))
data = data.reshape(data.shape[-1], data.shape[-2])
-data.to(torch.cuda.current_device())
+data.to(get_accelerator().current_device_name())
return data