[Embedding] Add GPU fused embedding ops. #64

Open
wants to merge 49 commits into
base: main
Commits
a566c1c
Update fused embedding modelzoo perf benchmark
RandyInterfish Jan 10, 2022
4567918
Update perf benchmark under docs
RandyInterfish Jan 10, 2022
222e123
Merge branch 'main' of https://github.com/nvzhou/DeepRec-public into …
RandyInterfish Jan 10, 2022
fe129e8
Add nvtx to fused embedding ops
RandyInterfish Jan 14, 2022
13906e4
refactor a little big
RandyInterfish Jan 20, 2022
95a08ff
Update: add unique sub op and use cu event to sync
RandyInterfish Jan 27, 2022
f2ca5ea
Minor change
RandyInterfish Feb 9, 2022
1cf4c57
minor update
RandyInterfish Feb 10, 2022
22e936e
minor update
RandyInterfish Feb 11, 2022
3fab0e8
Update: minor change
RandyInterfish Feb 14, 2022
d9765a5
temp update
RandyInterfish Feb 14, 2022
3b00778
kernel impl compile pass
RandyInterfish Feb 15, 2022
7589dc1
Update: pre-lookup ready. Unit tests all passed
RandyInterfish Feb 22, 2022
d65cec1
postlookup grad done except for unit test
RandyInterfish Mar 1, 2022
cca0b38
Unit test passed
RandyInterfish Mar 10, 2022
279a6de
python api works right
RandyInterfish Mar 10, 2022
6569c30
Update: try to optimize
RandyInterfish Mar 15, 2022
6c29295
Split pre_embedding_lookup
RandyInterfish Mar 15, 2022
84209bd
Update: python a[i
RandyInterfish Mar 16, 2022
78dab1d
Merge branch 'main' into features/gpu_embedding_fusion
RandyInterfish Mar 16, 2022
8c5b929
Merge branch 'features/gpu_embedding_fusion' into features/gpu_embedd…
RandyInterfish Mar 16, 2022
572c7d8
Update: modifying partition_select op
RandyInterfish Mar 17, 2022
c449de3
Update: modify partition_select
RandyInterfish Mar 18, 2022
22506c5
3rd version. Code modifying complete. No compile yet
RandyInterfish Mar 18, 2022
85fc956
Update: compilee pass
RandyInterfish Mar 18, 2022
04d4e60
add more partition strategies
RandyInterfish Mar 19, 2022
241e086
Add one test
RandyInterfish Mar 19, 2022
75a3bd4
Add more unit tests
RandyInterfish Mar 19, 2022
b738015
post op ut passed
RandyInterfish Mar 19, 2022
9ce8da1
ut all passed
RandyInterfish Mar 19, 2022
cf96164
optimize prune and fill moore
RandyInterfish Mar 19, 2022
99cd77e
Minor fixed
RandyInterfish Mar 20, 2022
e9115c3
Update: fix bug and update perf number for modelzoo
RandyInterfish Mar 22, 2022
091748e
Merge branch 'main' of https://github.com/nvzhou/DeepRec-public into …
RandyInterfish Mar 30, 2022
d59c9ce
update api def and golden
RandyInterfish Mar 30, 2022
3ecff7f
Update doc
RandyInterfish Apr 7, 2022
6a13c5b
Ajust interface to V2
RandyInterfish Apr 21, 2022
896d87e
Merge branch 'main' into features/gpu_embedding_fusion
RandyInterfish Apr 26, 2022
901fa8a
Make embedding fusion v1 and v2 compatible
RandyInterfish Apr 26, 2022
132b2ae
Update: delete comments
RandyInterfish Jun 28, 2022
400e0bc
Merge branch 'main' of https://github.com/alibaba/DeepRec into featur…
RandyInterfish Jun 28, 2022
676a324
temp
RandyInterfish Jul 1, 2022
18c62ad
prune and fill with sparse weight seems fine
RandyInterfish Jul 4, 2022
6daa4f4
added sparse_weight to postlookup and grad
RandyInterfish Jul 5, 2022
adf5dae
sparse_weight seems okay
RandyInterfish Jul 13, 2022
d352bfc
Merge branch 'main' of https://github.com/alibaba/DeepRec into featur…
RandyInterfish Jul 13, 2022
14a680b
modify modelzoo
RandyInterfish Jul 15, 2022
aa9f524
Fix unique_with_counts pre-volta hang issue
RandyInterfish Jul 29, 2022
4dfc639
Merge branch 'main' of https://github.com/alibaba/DeepRec into featur…
RandyInterfish Sep 6, 2022
Merge branch 'main' of https://github.com/alibaba/DeepRec into features/gpu_embedding_fusion
RandyInterfish committed Sep 6, 2022
commit 4dfc639113fb8fc84b98cea738a663068812bd17
20 changes: 0 additions & 20 deletions README.md
Original file line number Diff line number Diff line change
@@ -43,24 +43,12 @@ DeepRec has super large-scale distributed training capability, supporting model

**CPU Platform**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-developer:deeprec-dev-cpu-py36-ubuntu18.04
```

Docker Hub repository

``````
alideeprec/deeprec-build:deeprec-dev-cpu-py36-ubuntu18.04
``````

**GPU Platform**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-developer:deeprec-dev-gpu-py36-cu110-ubuntu18.04
```

Docker Hub repository

```
alideeprec/deeprec-build:deeprec-dev-gpu-py36-cu110-ubuntu18.04
```
@@ -100,19 +88,11 @@ $ pip3 install /tmp/tensorflow_pkg/tensorflow-1.15.5+${version}-cp36-cp36m-linux

#### Image for CPU

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2206-cpu-py36-ubuntu18.04
```
Docker Hub repository
```
alideeprec/deeprec-release:deeprec2206-cpu-py36-ubuntu18.04
```

#### Image for GPU CUDA11.0
```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2206-gpu-py36-cu110-ubuntu18.04
```
Docker Hub repository

```
alideeprec/deeprec-release:deeprec2206-gpu-py36-cu110-ubuntu18.04
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 0 additions & 2 deletions triton/BUILD → addons/triton/BUILD
@@ -1,8 +1,6 @@
# Description:
# TRITON API.

exports_files(["tf_triton_version_script.lds"])

cc_library(
name = "triton_tf",
visibility = ["//visibility:public"],
File renamed without changes.
File renamed without changes.
23 changes: 23 additions & 0 deletions cibuild/Dockerfile/Dockerfile.py3.6-cu112-ubuntu18.04
@@ -0,0 +1,23 @@
FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu18.04

RUN apt-get update && \
apt-get install -y --allow-unauthenticated \
wget \
cmake \
git \
unzip \
curl \
libssl-dev \
libcurl4-openssl-dev \
zlib1g-dev \
python3 \
python3-dev \
python3-pip \
&& apt-get clean && \
ln -sf python3 /usr/bin/python && \
ln -sf pip3 /usr/bin/pip

RUN pip install astor==0.8.1
RUN pip install numpy==1.16.6
RUN pip install protobuf==3.17.3
RUN pip install --no-deps keras-preprocessing==1.0.5
23 changes: 23 additions & 0 deletions cibuild/Dockerfile/Dockerfile.py3.6-cu117-ubuntu18.04
@@ -0,0 +1,23 @@
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu18.04

RUN apt-get update && \
apt-get install -y --allow-unauthenticated \
wget \
cmake \
git \
unzip \
curl \
libssl-dev \
libcurl4-openssl-dev \
zlib1g-dev \
python3 \
python3-dev \
python3-pip \
&& apt-get clean && \
ln -sf python3 /usr/bin/python && \
ln -sf pip3 /usr/bin/pip

RUN pip install astor==0.8.1
RUN pip install numpy==1.16.6
RUN pip install protobuf==3.17.3
RUN pip install --no-deps keras-preprocessing==1.0.5
2 changes: 1 addition & 1 deletion cibuild/gpu-ut/gpu-python-ut.sh
@@ -114,7 +114,7 @@ for i in $(seq 1 3); do
[ $i -gt 1 ] && echo "WARNING: cmd execution failed, will retry in $((i-1)) times later" && sleep 2
ret=0
bazel test -c opt --config=cuda --verbose_failures --test_env='NVIDIA_TF32_OVERRIDE=0' \
--run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute \
--run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute --config=opt \
--test_timeout="300,450,1200,3600" --local_test_jobs=20 --test_output=errors \
-- $TF_BUILD_BAZEL_TARGET && break || ret=$?
done
2 changes: 1 addition & 1 deletion configure.py
@@ -1586,7 +1586,7 @@ def main():
print('Preconfigured Bazel build configs. You can use any of the below by '
'adding "--config=<>" to your build command. See .bazelrc for more '
'details.')
config_info_line('mkl', 'Build with MKL support.')
config_info_line('mkl_threadpool', 'Build with oneDNN support.')
config_info_line('monolithic', 'Config for mostly static monolithic build.')
config_info_line('gdr', 'Build with GDR support.')
config_info_line('verbs', 'Build with libverbs support.')
55 changes: 55 additions & 0 deletions docs/AdamW-Optimizer.md
@@ -0,0 +1,55 @@
# AdamW Optimizer
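## Introduction
The AdamW optimizer supports Embedding Variable. Compared with the Adam optimizer, it adds weight decay.

This is an implementation of the AdamW optimizer described in "Decoupled Weight Decay Regularization" by Loshchilov & Hutter (https://arxiv.org/abs/1711.05101).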
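
To make "decoupled weight decay" concrete, here is a minimal pure-Python sketch of a single AdamW update. It is not DeepRec's implementation: the function name `adamw_step` and its list-based state are hypothetical, and it assumes the decay term is applied directly to the weight without being scaled by the learning rate (the form used in the paper when the schedule multiplier is 1). How `tf.train.AdamWOptimizer` scales the decay term is not stated here, so treat that detail as an assumption.

```python
import math


def adamw_step(theta, grad, m, v, t, weight_decay,
               learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One AdamW step on plain Python lists; returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        # Standard Adam moment estimates with bias correction.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        adam_update = learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
        # Decoupled decay: shrink the weight directly instead of adding
        # weight_decay * p to the gradient. Assumption: the decay term is
        # not multiplied by the learning rate.
        new_theta.append(p - adam_update - weight_decay * p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


# Weights of 1.0 with a constant gradient of 2.0 (hypothetical values).
theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adamw_step(theta, [2.0, 2.0], m, v, t=1,
                         weight_decay=0.01, learning_rate=0.1)
print(theta)
```

On the first step each weight moves from 1.0 to roughly 0.89: about 0.1 comes from the Adam term and 0.01 from the decay term. With `weight_decay=0.0` the function reduces to plain Adam.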


## User Interface
For training, simply define `tf.train.AdamWOptimizer`; it is used the same way as other native TF optimizers. The definition is as follows:
```python
class AdamWOptimizer(DecoupledWeightDecayExtension, adam.AdamOptimizer):
  def __init__(self,
               weight_decay,
               learning_rate=0.001,
               beta1=0.9,
               beta2=0.999,
               epsilon=1e-8,
               use_locking=False,
               name="AdamW"):

# Usage:
optimizer = tf.train.AdamWOptimizer(
    weight_decay=weight_decay_new,
    learning_rate=learning_rate_new,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-8)
```
## Usage Example
```python
import tensorflow as tf

var = tf.get_variable("var_0", shape=[10,16],
initializer=tf.ones_initializer(tf.float32))

emb = tf.nn.embedding_lookup(var, tf.cast([0,1,2,5,6,7], tf.int64))
fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')

gs= tf.train.get_or_create_global_step()
opt = tf.train.AdamWOptimizer(weight_decay=0.01, learning_rate=0.1)

g_v = opt.compute_gradients(loss)
train_op = opt.apply_gradients(g_v)

init = tf.global_variables_initializer()

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
with tf.Session(config=sess_config) as sess:
  sess.run([init])
  print(sess.run([emb, train_op, loss]))
  print(sess.run([emb, train_op, loss]))
  print(sess.run([emb, train_op, loss]))
```

42 changes: 8 additions & 34 deletions docs/DeepRec-Compile-And-Install.md
@@ -4,44 +4,28 @@

**CPU Base Docker Image**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-developer:deeprec-base-cpu-py36-ubuntu18.04
```

Docker Hub repository
```
alideeprec/deeprec-base:deeprec-base-cpu-py36-ubuntu18.04
```

**GPU(cuda11.0) Base Docker Image**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-developer:deeprec-base-gpu-py36-cu110-ubuntu18.04
```
**GPU Base Docker Image**

Docker Hub repository
```
alideeprec/deeprec-base:deeprec-base-gpu-py36-cu110-ubuntu18.04
```
| CUDA VERSION | IMAGE |
| ------------ | --------------------------------------------------------------- |
| CUDA 11.0.3 | alideeprec/deeprec-base:deeprec-base-gpu-py36-cu110-ubuntu18.04 |
| CUDA 11.2.2 | alideeprec/deeprec-base:deeprec-base-gpu-py36-cu112-ubuntu18.04 |
| CUDA 11.4.2 | alideeprec/deeprec-base:deeprec-base-gpu-py36-cu114-ubuntu18.04 |
| CUDA 11.6.1 | alideeprec/deeprec-base:deeprec-base-gpu-py36-cu116-ubuntu18.04 |
| CUDA 11.7.1 | alideeprec/deeprec-base:deeprec-base-gpu-py36-cu117-ubuntu18.04 |

**CPU Dev Docker (with bazel cache)**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-developer:deeprec-dev-cpu-py36-ubuntu18.04
```

Docker Hub repository
```
alideeprec/deeprec-build:deeprec-dev-cpu-py36-ubuntu18.04
```

**GPU(cuda11.0) Dev Docker (with bazel cache)**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-developer:deeprec-dev-gpu-py36-cu110-ubuntu18.04
```

Docker Hub repository
```
alideeprec/deeprec-build:deeprec-dev-gpu-py36-cu110-ubuntu18.04
```
@@ -110,22 +94,12 @@ pip3 install /tmp/tensorflow_pkg/tensorflow-1.15.5+${version}-cp36-cp36m-linux_x

**GPU CUDA11.0 Image**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2206-gpu-py36-cu110-ubuntu18.04
```

Docker Hub repository
```
alideeprec/deeprec-release:deeprec2206-gpu-py36-cu110-ubuntu18.04
```

**CPU Image**

```
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2206-cpu-py36-ubuntu18.04
```

Docker Hub repository
```
alideeprec/deeprec-release:deeprec2206-cpu-py36-ubuntu18.04
```
2 changes: 1 addition & 1 deletion docs/Embedding-Variable-GPU.md
@@ -37,4 +37,4 @@ with tf.device('/gpu:0'):
initializer=tf.ones_initializer(tf.dtypes.float32))
```

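Note: the GPU version of EmbeddingVariable currently cannot be used together with TensorFlow's built-in Saver. We will fix this issue later.
Note: GPU EV does not currently support incremental checkpoint. If it is used, the EV-related ops are placed on the CPU. We will fix this issue later.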
72 changes: 69 additions & 3 deletions docs/Embedding-Variable.md
@@ -106,7 +106,7 @@ W = tf.feature_column.embedding_column(categorical_column=columns,
initializer=tf.ones_initializer(tf.dtypes.float32))

ids={}
ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[1,1],[2,2],[3,3],[4,4]], values=tf.cast([1,2,3,4,5], tf.dtypes.int64), dense_shape=[5, 4])
ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[1,1],[2,2],[3,3],[4,4]], values=tf.cast([1,2,3,4,5], tf.dtypes.int64), dense_shape=[5, 5])

emb = tf.feature_column.input_layer(ids, [W])
fun = tf.multiply(emb, 2.0, name='multiply')
@@ -137,7 +137,7 @@ W = feature_column.embedding_column(sparse_id_column=columns,
initializer=tf.ones_initializer(tf.dtypes.float32))

ids={}
ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[1,1],[2,2],[3,3],[4,4]], values=tf.cast([1,2,3,4,5], tf.dtypes.int64), dense_shape=[5, 4])
ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[1,1],[2,2],[3,3],[4,4]], values=tf.cast([1,2,3,4,5], tf.dtypes.int64), dense_shape=[5, 5])

emb = feature_column_ops.input_from_feature_columns(columns_to_tensors=ids, feature_columns=[W])
fun = tf.multiply(emb, 2.0, name='multiply')
@@ -147,6 +147,69 @@ g_v = opt.compute_gradients(loss)
train_op = opt.apply_gradients(g_v)
init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init)
  print("init global done")
  print(sess.run([emb, train_op,loss]))
  print(sess.run([emb, train_op,loss]))
  print(sess.run([emb, train_op,loss]))
```
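
The hunks above change `dense_shape` from `[5, 4]` to `[5, 5]`. The PR does not state the reason, but a plausible one is that the index `[4, 4]` is out of range for a tensor with only 4 columns. The sketch below is a hypothetical helper (`indices_fit_dense_shape` is not a TensorFlow function) that checks this bound in plain Python:

```python
def indices_fit_dense_shape(indices, dense_shape):
    """Return True if every index lies inside dense_shape."""
    return all(
        len(index) == len(dense_shape)
        and all(0 <= i < size for i, size in zip(index, dense_shape))
        for index in indices
    )


indices = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]
# Column 4 does not exist when the second dimension is 4.
print(indices_fit_dense_shape(indices, [5, 4]))  # False
# With 5 columns every index is in range.
print(indices_fit_dense_shape(indices, [5, 5]))  # True
```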
Using the `sequence_categorical_column_with_embedding` interface:
```python
import tensorflow as tf
from tensorflow.python.feature_column import sequence_feature_column


columns = sequence_feature_column.sequence_categorical_column_with_embedding(key="col_emb", dtype=tf.dtypes.int32)
W = tf.feature_column.embedding_column(categorical_column=columns,
dimension=3,
initializer=tf.ones_initializer(tf.dtypes.float32))

ids={}
ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[0,1],[1,1],[2,2],[3,3],[4,4]], \
values=tf.cast([1,3,2,3,4,5], tf.dtypes.int64),
dense_shape=[5, 5])

emb, length = tf.contrib.feature_column.sequence_input_layer(ids, [W])
fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')
opt = tf.train.FtrlOptimizer(0.1, l1_regularization_strength=2.0, l2_regularization_strength=0.00001)
g_v = opt.compute_gradients(loss)
train_op = opt.apply_gradients(g_v)
init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init)
  print("init global done")
  print(sess.run([emb, train_op,loss]))
  print(sess.run([emb, train_op,loss]))
  print(sess.run([emb, train_op,loss]))
```
Using the `weighted_categorical_column` interface:
```python
import tensorflow as tf


categorical_column = tf.feature_column.categorical_column_with_embedding("col_emb", dtype=tf.dtypes.int64)

ids={}
ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[0,1],[1,1],[2,2],[3,3],[4,3],[4,4]], \
values=tf.cast([1,3,2,3,4,5,3], tf.dtypes.int64), dense_shape=[5, 5])
ids['weight'] = [[2.0],[5.0],[4.0],[8.0],[3.0],[1.0],[2.5]]

columns = tf.feature_column.weighted_categorical_column(categorical_column, 'weight')

W = tf.feature_column.embedding_column(categorical_column=columns,
dimension=3,
initializer=tf.ones_initializer(tf.dtypes.float32))
emb = tf.feature_column.input_layer(ids, [W])
fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')
opt = tf.train.FtrlOptimizer(0.1, l1_regularization_strength=2.0, l2_regularization_strength=0.00001)
g_v = opt.compute_gradients(loss)
train_op = opt.apply_gradients(g_v)
init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init)
  print("init global done")
@@ -186,9 +249,11 @@ emb_var = tf.feature_column.categorical_column_with_embedding("var", ev_option=e
class InitializerOption(object):
def __init__(self,
initializer = None,
default_value_dim = 4096):
default_value_dim = 4096,
default_value_no_permission = .0):
self.initializer = initializer
self.default_value_dim = default_value_dim
self.default_value_no_permission = default_value_no_permission
if default_value_dim <=0:
print("default value dim must larger than 1, the default value dim is set to default 4096.")
default_value_dim = 4096
@@ -197,6 +262,7 @@ class InitializerOption(object):

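- `initializer`: The initializer used by the Embedding Variable. If it is not configured, the EV default, a truncated normal initializer, is used.
- `default_value_dim`: The number of default values generated. It can be set with reference to the hash bucket size or the number of features. The default is 4096.
- `default_value_no_permission`: When the admission feature is used, the default embedding value returned for a feature that has not been admitted.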
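
As a rough illustration of what `default_value_no_permission` means, the sketch below models a lookup table with a frequency-based admission rule in plain Python. Everything in it is hypothetical and is not DeepRec's implementation: the class `ToyAdmissionTable`, the `filter_freq` threshold, and the rule that a key is admitted once it has been seen `filter_freq` times are assumptions made only for this example.

```python
from collections import Counter


class ToyAdmissionTable:
    """Hypothetical stand-in for an embedding table with admission."""

    def __init__(self, dim, filter_freq,
                 default_value_no_permission=0.0, init_value=1.0):
        self.dim = dim
        self.filter_freq = filter_freq
        self.default_value_no_permission = default_value_no_permission
        self.init_value = init_value
        self._counts = Counter()  # how often each key has been looked up
        self._rows = {}           # embeddings of admitted keys only

    def lookup(self, key):
        self._counts[key] += 1
        if self._counts[key] < self.filter_freq:
            # Not admitted yet: return the configured default, store nothing.
            return [self.default_value_no_permission] * self.dim
        # Admitted: create the row on first use, then reuse it.
        return self._rows.setdefault(key, [self.init_value] * self.dim)


table = ToyAdmissionTable(dim=3, filter_freq=2)
print(table.lookup(7))  # first sighting, not admitted
print(table.lookup(7))  # second sighting, admitted
```

The first lookup returns a vector filled with `default_value_no_permission`, and the second returns the initialized row, because the key has then reached the assumed threshold.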



Binary file added docs/Embedding-Variable/img_2.jpg
3 changes: 2 additions & 1 deletion docs/Estimator-Compile-And-Install.md
@@ -51,7 +51,8 @@ alideeprec/deeprec-build:deeprec-dev-gpu-py36-cu110-ubuntu18.04
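Because DeepRec adds distributed protocols such as grpc++ and star_server, using DeepRec with the native Estimator fails Estimator's checks when features such as grpc++ and star_server are used. We therefore provide an Estimator built for DeepRec.

Repository: [https://github.com/AlibabaPAI/estimator](https://github.com/AlibabaPAI/estimator)
Branch: deeprec

Development branch: master; latest release branch: deeprec2206

## Building Estimator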

You are viewing a condensed version of this merge commit. You can view the full changes here.