Rank task graph merge master #9440

Merged
merged 208 commits into from
Nov 22, 2022
72cd819
Use Primitive in Scalar Pow Grad (#8620)
MARD1NO Sep 15, 2022
fc3b675
Add higher order derivative for loss function (#9070)
pingzhuu Sep 15, 2022
1a5c00e
Add higher order derivative for softmax and activation (#9032)
pingzhuu Sep 15, 2022
b12c84e
add higher order derivative for pool (#9096)
pingzhuu Sep 15, 2022
493f1bd
Cross Entropy supports probability targets (#9064)
marigoold Sep 16, 2022
cc18748
Fix nvjpegDecodeParamsSetROI (#9101)
liujuncheng Sep 16, 2022
1233c26
add series op : adaptive_max_pool1d/2d/3d (#9023)
doombeaker Sep 16, 2022
89f8504
one_embedding physical_block_size change to 4096 (#9017)
guo-ran Sep 16, 2022
5059731
OneEmbedding add ONEFLOW_ONE_EMBEDDING_DISABLE_PIPELINE (#9098)
guo-ran Sep 16, 2022
2137d64
develop eager AMP (#9088)
hjchen2 Sep 18, 2022
d33f161
refine worker seed (#9102)
Flowingsun007 Sep 19, 2022
8239803
Dev GroupNorm (#7784)
MARD1NO Sep 19, 2022
91fec7e
Introduce bfloat16 type (#9067)
clackhan Sep 19, 2022
3a0145e
Refine check in ibverbs (#8974)
shangguanshiyuan Sep 19, 2022
3c8a3ce
Support padding_idx in OneEmbedding (#8998)
MARD1NO Sep 20, 2022
dae54d4
launch oneflow kernels in code generated with MLIR (#8980)
howin98 Sep 20, 2022
d78228f
interpolate api align (#9118)
BBuf Sep 21, 2022
8878f0b
Fix masked select op bug (#9120)
BBuf Sep 21, 2022
2d60b9a
align with pytorch RANK env (#9111)
BBuf Sep 21, 2022
60a1bcb
Add oneflow hub (#9116)
BBuf Sep 21, 2022
38e85c2
fix where op data_type infer bug (#9121)
BBuf Sep 21, 2022
a743e36
fix like op infer dtype (#9127)
guo-ran Sep 22, 2022
fa3c1ca
elementwise.cuh remove template parameter tail (#9128)
liujuncheng Sep 22, 2022
364ec76
fix_global_tensor_detach_bug (#9134)
clackhan Sep 22, 2022
c445338
Add deform_conv2d op (#9095)
small1945 Sep 22, 2022
301a99e
Fix inplace mul 0size check bug (#9132)
BBuf Sep 23, 2022
7c3e9a3
Align round op to support round half to even (#9135)
small1945 Sep 23, 2022
ac5fa96
rm dict in module apply (#9137)
strint Sep 23, 2022
1c60df1
one_embedding support broadcast table_ids (#9109)
guo-ran Sep 23, 2022
7c59c35
refine error message for framework (#9104)
farmerzhang1 Sep 23, 2022
9867aff
Fix loss scale precision (#9126)
leaves-zwx Sep 23, 2022
3adbc8c
one embedding eager (#8984)
guo-ran Sep 24, 2022
3ef2eb1
module.to aligned with pytorch (#9083)
daquexian Sep 24, 2022
049f6b9
eager global zero_grad update sbp from b to p (#8853)
daquexian Sep 25, 2022
37b38f6
Support inplace scatter (#9016)
mosout Sep 25, 2022
25691a2
Dev linalg cross (#8979)
mosout Sep 25, 2022
8026787
add nansum (#9113)
marigoold Sep 26, 2022
8487d48
Feat eager global tensor indexing (#9138)
wyg1997 Sep 26, 2022
f616758
Add lr_scale for optimizers (#9008)
leaves-zwx Sep 26, 2022
d052e00
fix_ctc_loss_error_with_float_target_input (#9143)
clackhan Sep 26, 2022
2f816e2
Inplace masked fill (#9133)
doombeaker Sep 27, 2022
867e377
Fix numpy>=1.23.0 advance indexing code (#9139)
wyg1997 Sep 27, 2022
5a13db4
add_tensor_new_full_func (#9149)
clackhan Sep 27, 2022
794fe3f
As strided regist more dtype (#9150)
BBuf Sep 27, 2022
f67ff82
Auto Parallel (#8891)
Yipeng1994 Sep 27, 2022
e36c160
refine oneflow op infer dtype error message (#9155)
BBuf Sep 27, 2022
6e17442
Fix to_global PyArg_ParseTupleAndKeywords (#9158)
liujuncheng Sep 27, 2022
d92264b
Implement exponential_ and multinomial (#9073)
Ldpe2G Sep 27, 2022
28620d7
Disable IB when there are no active IB devices (#9115)
liujuncheng Sep 28, 2022
19fbaf0
fix lru_cache offset (#9162)
guo-ran Sep 28, 2022
96d8a33
Rename cast to global and cast from global (#9151)
clackhan Sep 28, 2022
af52056
Refine datatype error message part2 (#9168)
BBuf Sep 28, 2022
352ba70
support tensor.triu_ (#9159)
Flowingsun007 Sep 28, 2022
fcf205c
tensor.copy_ support stride (#9142)
Flowingsun007 Sep 28, 2022
34a7304
PersistentTable add read_only flag (#9145)
liujuncheng Sep 29, 2022
3999e44
avg_pool_nd support half (#9170)
Flowingsun007 Sep 29, 2022
54c528b
fix new_ones size parameter (#9161)
Ldpe2G Sep 29, 2022
9bf3ee8
hot-fix (#9191)
Flowingsun007 Sep 29, 2022
ff43e19
skip env var check and calculate local rank if not given (#9183)
daquexian Sep 29, 2022
2431e49
set to_contiguous to amp clear list (#9171)
Flowingsun007 Sep 29, 2022
fca713f
add tensor.nansum (#9182)
marigoold Sep 30, 2022
21f0bf9
Add slight cost for different sbp in 1 device (#9172)
Yipeng1994 Sep 30, 2022
474a453
refine_to_contiguous_dtype_register (#9196)
Flowingsun007 Sep 30, 2022
62ed326
skip autocast for non-user op (#9199)
hjchen2 Sep 30, 2022
104f52d
`copy_` support numpy fp16 (#9189)
daquexian Sep 30, 2022
4d011fd
fix matmul 0 size input error (#9147)
Ldpe2G Oct 1, 2022
d849396
Feat functional scalar tensor parameter (#9190)
wyg1997 Oct 6, 2022
67e9bab
Fix broadcast fmod grad (#8865)
shangguanshiyuan Oct 6, 2022
1b2879c
Feat straighten compress memory (#9094)
Yipeng1994 Oct 7, 2022
d3b430f
Add contains magic method (#9185)
BBuf Oct 8, 2022
8d9ed86
Build cuda 11.8 (#9204)
liujuncheng Oct 8, 2022
f6b594a
export unsorted segment sum (#9206)
guo-ran Oct 8, 2022
b112d9d
Optimize OneEmbedding Save Snapshot (#9112)
MARD1NO Oct 8, 2022
48d3c78
Add Tensor.scatter_add & refine scatter (#9201)
mosout Oct 9, 2022
fca9c38
optimize layernorm need padding cols perf (#9195)
guo-ran Oct 10, 2022
aabb48a
Support Inplace behavior in Type Promotion (#9200)
MARD1NO Oct 10, 2022
14fd135
Fix Broadcast Matmul check (#9213)
MARD1NO Oct 10, 2022
be986e2
Export MultiTensor Update and FuseUpdateCast to GraphConfig (#9209)
MARD1NO Oct 10, 2022
c72c11c
fix bug of matmul dim check in `oneflow.bmm` (#9215)
marigoold Oct 10, 2022
3e66217
Regist arange fp16 (#9202)
BBuf Oct 10, 2022
bf2f23f
Fix graph out argstree type judge (#9211)
strint Oct 11, 2022
a1ccdd4
fix ConcatFunctor error message (#9225)
liujuncheng Oct 11, 2022
ed29710
Check async errors after kernel launched (#9226)
liujuncheng Oct 11, 2022
aee190b
Skip unnecessary passes (#9219)
liujuncheng Oct 11, 2022
211061f
one_embedding fix typo (#9230)
guo-ran Oct 11, 2022
ee222a1
[GetAsyncError] Add op name to error message (#9228)
liujuncheng Oct 11, 2022
1639c2b
[JobBuildAndInferCtx]Remove an inefficient check (#9229)
liujuncheng Oct 12, 2022
5429e72
Fix linalg cross 0-size input error (#9232)
liujuncheng Oct 12, 2022
d5a090d
Add silu to amp list (#9233)
liujuncheng Oct 12, 2022
6e3e521
Disable CUDA virtual arch compilation (#9236)
liujuncheng Oct 12, 2022
0a11909
Support set/get_default_dtype interface (#9227)
wyg1997 Oct 12, 2022
dee3bf4
Enhance doctest error message (#9237)
wyg1997 Oct 12, 2022
be54056
Feat: script to import oneflow as torch globally (#9160)
farmerzhang1 Oct 12, 2022
a041551
add time and mem log tools (#9164)
strint Oct 12, 2022
65fed36
support bool for `oneflow.nn.functional.pad` (#9234)
marigoold Oct 13, 2022
52d9502
Feat: rand/randn support float16 kernel (#9238)
wyg1997 Oct 13, 2022
1c69248
reduce auto tick generate time (#9235)
strint Oct 13, 2022
259b8a3
TensorIndexing support float16 (#9247)
wyg1997 Oct 13, 2022
cb68854
Add cudnn handle pool (#9243)
clackhan Oct 13, 2022
54da4f3
Added error message for CUDA device incompatibility (#9250)
liujuncheng Oct 13, 2022
8fdc8bc
Fix autograd.Function memory leak (#9249)
wyg1997 Oct 13, 2022
13876f4
Feat speed up mem reuse (#9210)
Yipeng1994 Oct 14, 2022
2d66762
fix bug: segfault when argmax has 0 size tensor as input (#9242)
Ldpe2G Oct 14, 2022
0481335
fix_half_check_of_reduce_mean (#9014)
Flowingsun007 Oct 14, 2022
95c0725
Support float16 for initializer operators (#9253)
wyg1997 Oct 14, 2022
1cd75ed
Add half clamp (#9241)
MARD1NO Oct 14, 2022
f40d711
[CUDA]CheckVersionCompatibility (#9257)
liujuncheng Oct 14, 2022
f8c0ead
Feat: monkeypatching pytorch (#9256)
farmerzhang1 Oct 15, 2022
855f09a
support destory_rdma (#9246)
Flowingsun007 Oct 17, 2022
c449acb
add bincount (#9156)
marigoold Oct 17, 2022
93d19f3
ONEFLOW_STREAM_ENABLE_H2D_STREAM (#9205)
lixinqi Oct 17, 2022
e227e5a
Modify generator.manual_seed to return generator rather than None (#9…
marigoold Oct 18, 2022
c5d1465
Dev add tensor bernoulli (#9261)
marigoold Oct 18, 2022
295cb60
Multi tensor update (#9252)
rejoicesyc Oct 18, 2022
914c39e
fix a typo in readme (#9268)
QiJune Oct 19, 2022
a6c7f6b
support nested asyncs.thread (#9270)
lixinqi Oct 19, 2022
f97f09f
OneEmbedding add smart decay sparse adam (#9176)
guo-ran Oct 19, 2022
eb83ef9
upgrade clang-tidy used in ninja of_tidy (#9263)
daquexian Oct 20, 2022
047a856
Feat/compile time count (#9245)
strint Oct 20, 2022
aec4221
fix random_normal (#9274)
guo-ran Oct 21, 2022
d5f4f4a
Flip and upsample bilinear support fp16 (#9284)
BBuf Oct 21, 2022
9dcd8f2
Fix PruneAmpWhiteIdentityOpPass (#9276)
leaves-zwx Oct 21, 2022
ff15bd2
support api flow.randn_like (#9283)
Flowingsun007 Oct 21, 2022
78ee5d3
remove dry run, add sanitizers to ci (#8670)
daquexian Oct 22, 2022
22eabed
add build config for RTX 40xx GPUs (#9290)
zhaoyongke Oct 23, 2022
6a5ee73
Bool support for triu (#9291)
liujuncheng Oct 24, 2022
87921d1
Refix PruneAmpWhiteIdentityOpPass (#9294)
leaves-zwx Oct 24, 2022
7105814
fix concat #8833 (#9275)
hhhfccz Oct 24, 2022
b21eee7
support half for masked_fill (#9292)
liujuncheng Oct 24, 2022
c433cb8
Fix BatchNorm performance (#9298)
liujuncheng Oct 24, 2022
47ea07e
slice update cpu kernel multi_thread loop (#9264)
BBuf Oct 24, 2022
41bfe98
fix inplace bug in `tensor.masked_fill_` (#9295)
marigoold Oct 25, 2022
bd6b00f
fix_inplace_copy_bug (#9301)
clackhan Oct 25, 2022
b2091bc
FusedMultiHeadAttentionInference (#9287)
liujuncheng Oct 25, 2022
1bd4505
Fix compile warnings (#9302)
liujuncheng Oct 25, 2022
7934a4a
Set the default value of CUDA_STATIC to OFF when CUDA version is grea…
liujuncheng Oct 26, 2022
723d1bb
Reduce pass time cost (#9281)
strint Oct 26, 2022
c586c72
Refactor get sbp signature (#9304)
Yipeng1994 Oct 27, 2022
de2f20d
Fix type error for entering a single tensor using concat op (#9316)
small1945 Oct 27, 2022
ee07704
Add more sbp signature print functions for log and debug (#9293)
leaves-zwx Oct 28, 2022
5591585
Release/nightly cu118 (#9308)
jackalcooper Oct 29, 2022
40d4d91
Fix different dtype in slice_update (#9331)
wyg1997 Oct 29, 2022
09c408d
Fix FlattenOp GetSbp (#9322)
leaves-zwx Oct 29, 2022
ec90a01
Refactor ONEFLOW_MLIR_PREFER_NHWC to support more ops (#9335)
jackalcooper Oct 30, 2022
fcf3d39
distributions.Categorical support logits not None (#9332)
Ldpe2G Oct 31, 2022
61475d0
avoid extra gpu memory usage in flow.save (#9328)
daquexian Oct 31, 2022
4e155dd
Use primitive to replace Ndarray::BroadcastBinary (#9311)
liujuncheng Oct 31, 2022
d4ecad2
Block forward support modification (#9336)
strint Oct 31, 2022
c4b7abd
Add log sum exp api (#9333)
clackhan Oct 31, 2022
e925a44
Feat: isclose and allclose (#9280)
farmerzhang1 Nov 1, 2022
a8bd9b1
Refactor random op with consistent data (#9299)
wyg1997 Nov 1, 2022
43e3fe2
bool tensor slice_update use masked_fill when possible (#9324)
Flowingsun007 Nov 1, 2022
f84772b
Move tensor apis to cpython (#9303)
Flowingsun007 Nov 1, 2022
9840775
Add gelu_tanh op and kernel (#9343)
leaves-zwx Nov 1, 2022
d3f13ed
refine_test_maxpool2d_channel_last (#9344)
Flowingsun007 Nov 2, 2022
a3841f5
Refactor normal initializer (#9307)
wyg1997 Nov 2, 2022
e99a309
Support fp16 in constant folding (#9337)
mosout Nov 2, 2022
60b7ec5
fix exp overflow with minus max trick (#9353)
Ldpe2G Nov 2, 2022
65c5740
Fix occasional bug in random_op data test (#9354)
wyg1997 Nov 3, 2022
75768e9
Dev add gumbel softmax (#9208)
hhhfccz Nov 3, 2022
36a21f6
Fix the inconsistent behavior of slice update (#9321)
small1945 Nov 3, 2022
204dd70
enable autocast for that op which has nocast arguments (#9362)
hjchen2 Nov 4, 2022
7dc55e3
Add NHWC format for group norm (#9368)
liujuncheng Nov 4, 2022
65fd4d9
Enable ZeRO with auto parallel (#9288)
Yipeng1994 Nov 4, 2022
1ce12c4
Feat unbalanced split nd sbp (#9310)
Yipeng1994 Nov 4, 2022
b41b3ac
Add upsample_nearest_2d to amp clear list (#9366)
liujuncheng Nov 4, 2022
b347659
fix cuda integral type closeness computation (#9346)
farmerzhang1 Nov 5, 2022
9111fdd
Add fused linear (#9369)
liujuncheng Nov 5, 2022
252ccea
Support fp16 on some cpu operators (#9374)
mosout Nov 5, 2022
020699f
Scalar math kernels support inplace (#9372)
liujuncheng Nov 7, 2022
463cca8
Optimize GroupNorm NHWC with FastDivmod (#9373)
liujuncheng Nov 7, 2022
90d39a7
GradAcc Mem V5: Part 0-4 (#8961)
chengtbf Nov 7, 2022
95cf4f8
fix the bug of fill_tensor_ of support fp16 & autocast (#9375)
Yipeng1994 Nov 7, 2022
db44958
Allocate in instruction computation (#9282)
lixinqi Nov 7, 2022
2304f04
Disable conv algorithm search in eager mode (#9376)
liujuncheng Nov 7, 2022
d33298f
Add FusedGroupNormSilu (#9387)
liujuncheng Nov 7, 2022
80d6dff
Update fmt (#9392)
mosout Nov 8, 2022
e1d5b63
FusedConvBias (#9395)
liujuncheng Nov 9, 2022
2cfabdf
fix batchnorm infer dtype failed in half inference (#9388)
BBuf Nov 10, 2022
9b49046
fix_logsumexp_overflow_error (#9385)
clackhan Nov 10, 2022
1667cdf
Refactor uniform initializer (#9384)
small1945 Nov 10, 2022
968235e
Feat module to local (#9400)
strint Nov 10, 2022
b0bec65
Update tensor constructor to fix issue #9403 (#9404)
daquexian Nov 10, 2022
98ae463
Optimize fast_gelu half specialization (#9408)
liujuncheng Nov 11, 2022
f5e2be4
Impl of fused_bias_add_scale_mask_softmax_dropout (#9401)
leaves-zwx Nov 11, 2022
848de58
Fix bug when autograd.grad meet tensor.grad is not None (#9402)
wyg1997 Nov 13, 2022
b2bbfdf
Optimize UpsampleNearest2D 2X (#9415)
liujuncheng Nov 14, 2022
2595a9d
Add MaxUnpool op (#9309)
marigoold Nov 14, 2022
3d132f4
bypass StopIteration error in dataloader delete_shm (#9393)
daquexian Nov 14, 2022
2a04b71
Impl of fused_fast_gelu_mul (#9397)
leaves-zwx Nov 14, 2022
623b481
Add autograd engine debug graph (#9412)
wyg1997 Nov 14, 2022
b2ae7e2
Optimize transpose identity (#9416)
liujuncheng Nov 14, 2022
96d72af
Optimize fmha transpose (#9417)
liujuncheng Nov 15, 2022
be258a2
Fix the usage of argument end_factor in LinearLR (#9421)
leaves-zwx Nov 15, 2022
8e7e149
Fix lazy scalar tensor indexing (#9420)
wyg1997 Nov 15, 2022
d40f214
GroupedMatmulBias (#9413)
liujuncheng Nov 15, 2022
1894d07
Optim upsample backward (#9424)
BBuf Nov 15, 2022
ab9d76c
Speed up the training (#9278)
Yipeng1994 Nov 17, 2022
ee60d5e
Profiling item (#9394)
lixinqi Nov 17, 2022
2308037
KernelPriority (#9427)
liujuncheng Nov 17, 2022
e61d4a3
Graph rename v2 (#9351)
strint Nov 17, 2022
0185f14
Cherry-pick IR changes (#9430)
jackalcooper Nov 18, 2022
ad3bc85
[hotfix] remove cuda half unittest in maxunpool (#9436)
marigoold Nov 18, 2022
501cdbb
Fix checkpoint v2 (#9437)
jackalcooper Nov 18, 2022
49618c5
merge master
strint Nov 18, 2022
7989453
fix to pass compile
strint Nov 18, 2022
497f1af
rm check to pass test
strint Nov 18, 2022
a8a79e9
fix TaskGraph init
strint Nov 22, 2022
2 changes: 1 addition & 1 deletion .github/workflows/canary.yml
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ jobs:
- name: Checkout Oneflow-Inc/oneflow
if: ${{ github.event.inputs.oneflow-ref == '' }}
uses: actions/checkout@v2
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux
id: build-cuda
with:
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/on_merge.yml
Expand Up @@ -15,6 +15,6 @@ jobs:
if: github.event.pull_request.merged == true
runs-on: ubuntu-latest
steps:
- uses: Oneflow-Inc/get-oneflow/update-benchmark-history@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/update-benchmark-history@support-cu118
name: Update benchmark history
timeout-minutes: 10
7 changes: 4 additions & 3 deletions .github/workflows/release.yml
Expand Up @@ -33,7 +33,7 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
Expand All @@ -45,6 +45,7 @@ jobs:
release
oneflow-src: ${{ env.ONEFLOW_SRC }}
entries: |
cu118
cu116
cu112
cu102
Expand Down Expand Up @@ -74,7 +75,7 @@ jobs:
python3 -m pip install -U setuptools wheel --user
python3 -m pip install oss2 --user
- uses: actions/checkout@v2
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry !='cpu' }}
with:
Expand All @@ -97,7 +98,7 @@ jobs:
3.8
3.9
3.10
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry =='cpu' }}
with:
Expand Down
4 changes: 2 additions & 2 deletions .github/workflows/simple.yml
Expand Up @@ -245,7 +245,7 @@ jobs:
repository: Oneflow-Inc/conda-env
ref: 30a7f00eb48ee9009d85a848e720823e5054c66b
path: conda-env
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build with gcc7
if: ${{ matrix.build-type == 'gcc7'}}
with:
Expand All @@ -254,7 +254,7 @@ jobs:
oneflow-build-env: conda
conda-env-file: conda-env/dev/gcc7/environment-v2.yml
conda-env-name: oneflow-dev-gcc7-v2
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build with clang10
if: ${{ matrix.build-type == 'clang10'}}
with:
Expand Down
103 changes: 82 additions & 21 deletions .github/workflows/test.yml
Expand Up @@ -15,7 +15,7 @@ env:
FLOW_VISION_SRC: flow_vision
FLOW_VISION_COMMIT: ca8ebc663b58667cf8cd1b6ef0c861522780b7bb
LIBAI_SRC: libai
LIBAI_COMMIT: 7d31d9781e5f2d559dc0820f599e0bed798488ca
LIBAI_COMMIT: 94eb85ff0131e8dfce953a3a916de7a4f897c647
ONEFLOW_FACE_SRC: oneflow_face
ONEFLOW_FACE_COMMIT: 110a97e8d5737a1f1856281a7df556a5ac8f06de
ONEFLOW_IREE_SRC: oneflow_iree
Expand All @@ -29,7 +29,7 @@ jobs:
runs-on: ubuntu-latest
if: github.event.pull_request.draft == false && github.base_ref == 'master' && contains(github.event.pull_request.requested_reviewers.*.login, 'oneflow-ci-bot')
steps:
- uses: Oneflow-Inc/get-oneflow/priority-pr@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/priority-pr@support-cu118
name: Check priority PR closed
id: save-cache
timeout-minutes: 5
Expand Down Expand Up @@ -163,7 +163,7 @@ jobs:
fi
echo "is_secrets_accessible=1" >> $GITHUB_ENV
- name: Wait for GPU slot
uses: Oneflow-Inc/get-oneflow/wait-for-gpu@support-iree-ci
uses: Oneflow-Inc/get-oneflow/wait-for-gpu@support-cu118
if: env.is_secrets_accessible == '1'
timeout-minutes: 90
continue-on-error: true
Expand All @@ -187,7 +187,7 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
Expand All @@ -201,6 +201,8 @@ jobs:
entries: |
cu102
cpu
cpu-asan-ubsan
cpu-tsan
llvm13

build-oneflow:
Expand Down Expand Up @@ -234,7 +236,7 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
Expand All @@ -248,7 +250,7 @@ jobs:
run: |
echo "::error file=test.yml,line=204,col=10::steps.save-cache.outputs.cache-hit != matrix.cache-hit"
exit 1
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux ${{ matrix.entry }}
id: build-cpu
if: ${{ matrix.entry =='cpu' && !matrix.cache-hit }}
Expand All @@ -270,7 +272,28 @@ jobs:
python-versions: |
3.7
3.8
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux ${{ matrix.entry }}
id: build-cpu-sanitizers
if: ${{ (matrix.entry == 'cpu-asan-ubsan' || matrix.entry == 'cpu-tsan') && !matrix.cache-hit }}
with:
cmake-init-cache: ${{ env.ONEFLOW_SRC }}/cmake/caches/ci/${{ matrix.entry }}.cmake
build-script: ${{ env.ONEFLOW_SRC }}/ci/manylinux/build.sh
run-lit: false
oneflow-src: ${{ env.ONEFLOW_SRC }}
oneflow-build-env: manylinux
wheelhouse-dir: ${{ env.WHEELHOUSE_DIR }}
clear-wheelhouse-dir: true
self-hosted: ${{ contains(matrix.runs-on, 'self-hosted') }}
cuda-version: none
manylinux-cache-dir: ${{ env.MANYLINUX_CACHE_DIR }}
docker-run-use-system-http-proxy: false
docker-run-use-lld: true
retry-failed-build: true
clean-ccache: ${{ contains(github.event.pull_request.labels.*.name, 'need-clean-ccache') }}
python-versions: |
3.8
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux ${{ matrix.entry }}
id: build-cuda
if: ${{ matrix.entry =='cu102' && !matrix.cache-hit }}
Expand All @@ -290,7 +313,7 @@ jobs:
clean-ccache: ${{ contains(github.event.pull_request.labels.*.name, 'need-clean-ccache') }}
python-versions: |
3.7
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry == 'llvm13' && !matrix.cache-hit }}
with:
Expand Down Expand Up @@ -329,7 +352,7 @@ jobs:
})
- name: Upload packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && matrix.entry != 'llvm13' && matrix.entry != 'cu102_xla' }}
uses: Oneflow-Inc/get-oneflow/digest/upload@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/upload@support-cu118
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
Expand All @@ -340,7 +363,7 @@ jobs:
dst-dir: cpack
- name: Upload whl
if: ${{ !fromJson(matrix.cache-hit) && matrix.entry != 'llvm13' && matrix.entry != 'cu102_xla' }}
uses: Oneflow-Inc/get-oneflow/digest/upload@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/upload@support-cu118
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
Expand All @@ -365,7 +388,7 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
Expand Down Expand Up @@ -396,7 +419,7 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
Expand Down Expand Up @@ -472,7 +495,7 @@ jobs:
if: ${{ contains(matrix.runs-on, 'self-hosted') }}
run: |
docker rm -f ${{ env.TEST_CONTAINER_NAME }} || true
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
Expand All @@ -488,7 +511,7 @@ jobs:
exit 1
- name: Download wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: download-digest
timeout-minutes: 10
with:
Expand All @@ -498,7 +521,7 @@ jobs:
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Get primary node
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
uses: Oneflow-Inc/get-oneflow/master-address@support-iree-ci
uses: Oneflow-Inc/get-oneflow/master-address@support-cu118
id: get-primary-node
with:
rank: ${{ matrix.rank }}
Expand Down Expand Up @@ -631,7 +654,7 @@ jobs:
TEST_CONTAINER_NAME: "pr-${{ github.event.pull_request.number }}-run-id-${{ github.run_id }}-${{ matrix.entry }}-test"
TEST_MANYLINUX_CONTAINER_NAME: "pr-${{ github.event.pull_request.number }}-run-id-${{ github.run_id }}-${{ matrix.entry }}-test-manylinux"
TEST_WITH_TF_IMG_TAG: registry.cn-beijing.aliyuncs.com/oneflow/test-with-tf-2.3.0:2f831e9354298a11447578e869d983959feb046f
TEST_MANYLINUX_IMG_TAG: registry.cn-beijing.aliyuncs.com/oneflow/manylinux2014_x86_64_cuda10.2:4fd9cc268bbe59c6245ca3941b8264fd256a8670
TEST_MANYLINUX_IMG_TAG: registry.cn-beijing.aliyuncs.com/oneflow/manylinux2014_x86_64_cuda10.2:190c92408855fe17ae664f2de1a9d6f484b2da2b
SSH_TANK_HOST: 192.168.1.13
SSH_TANK_PATH: /tank
METRICS_DIR: metrics
Expand Down Expand Up @@ -689,7 +712,7 @@ jobs:
if: ${{ contains(matrix.runs-on, 'self-hosted') }}
run: |
docker rm -f ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} || true
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
Expand All @@ -705,14 +728,34 @@ jobs:
exit 1
- name: Download wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: download-digest
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
entry: ${{ matrix.compute-platform }}
ssh-tank-host: ${{ env.SSH_TANK_HOST }}
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Download ASAN and UBSAN wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && matrix.device == 'cpu' }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: asan-ubsan-download-digest
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
entry: cpu-asan-ubsan
ssh-tank-host: ${{ env.SSH_TANK_HOST }}
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Download TSAN wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && matrix.device == 'cpu' }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: tsan-download-digest
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
entry: cpu-tsan
ssh-tank-host: ${{ env.SSH_TANK_HOST }}
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Enable TF container
if: ${{ fromJSON(matrix.is-single-client) }}
run: |
Expand Down Expand Up @@ -765,6 +808,11 @@ jobs:
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && !fromJson(matrix.is-xla) }}
run: |
unzip ${{ env.ONEFLOW_CPACK_PATH }}/liboneflow-ci-linux.zip
- name: Unzip packed sanitized liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && !fromJson(matrix.is-xla) && matrix.device == 'cpu' }}
run: |
unzip ${{ steps.asan-ubsan-download-digest.outputs.entry-dir }}/cpack/liboneflow-ci-linux.zip -d asan-ubsan
unzip ${{ steps.tsan-download-digest.outputs.entry-dir }}/cpack/liboneflow-ci-linux.zip -d tsan
- name: Start container
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
working-directory: ${{ env.ONEFLOW_SRC }}
Expand Down Expand Up @@ -825,6 +873,13 @@ jobs:
timeout-minutes: 20
run: |
docker exec -e ONEFLOW_SERVING_DEBUG=1 ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} ./liboneflow-ci-linux/bin/oneflow_cpp_api_testexe --gtest_filter=-Api.embedding*
- name: Exe test (C++ API with sanitizers)
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' && matrix.device == 'cpu' }}
timeout-minutes: 10
run: |
docker exec -e UBSAN_OPTIONS=suppressions=.ubsan-suppressions -e ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1 -e LSAN_OPTIONS=suppressions=.lsan-suppressions ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} ./asan-ubsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe --gtest_filter=Api.graph_\*
# Run 5 times to avoid false positive because of occasional lack of stack info
docker exec -e TSAN_OPTIONS="history_size=7 suppressions=.tsan-suppressions" ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} bash -c "./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe"
- name: Test container
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
run: |
Expand Down Expand Up @@ -950,7 +1005,7 @@ jobs:
timeout-minutes: 30
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' && matrix.device == 'cuda' }}
run: |
docker exec -e ONEFLOW_TEST_DEVICE_NUM=4 -w $PWD/${{ env.ONEFLOW_FACE_SRC }} ${{ env.TEST_CONTAINER_NAME }} python3 -m oneflow.distributed.launch --nproc_per_node 4 -m unittest -f tests/train/test_train.py
docker exec -e ONEFLOW_TEST_DEVICE_NUM=4 -w $PWD/${{ env.ONEFLOW_FACE_SRC }} ${{ env.TEST_CONTAINER_NAME }} python3 -m oneflow.distributed.launch --nproc_per_node 4 -m pytest tests/train/test_train.py
- name: oneflow_iree test
timeout-minutes: 45
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' }}
Expand Down Expand Up @@ -978,10 +1033,16 @@ jobs:
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' }}
run: |
docker exec -e ONEFLOW_TEST_DIR=$PWD/python/oneflow/test/tensor ${{ env.TEST_CONTAINER_NAME }} bash ci/test/generic_test_multi_client.sh
- name: Test mocking torch by script
run: |
docker exec ${{ env.TEST_CONTAINER_NAME }} bash -x ci/test/test_mock_script.sh
- name: Test mocking torch by function
run: |
docker exec ${{ env.TEST_CONTAINER_NAME }} bash -x ci/test/test_mock_function.sh
- name: Benchmark Test
timeout-minutes: 100
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'benchmark' && matrix.device == 'cuda' }}
uses: Oneflow-Inc/get-oneflow/pytest-benchmark@support-iree-ci
uses: Oneflow-Inc/get-oneflow/pytest-benchmark@support-cu118
with:
collect-path: ${{ env.FLOW_VISION_SRC }}/benchmark
container-name: ${{ env.TEST_CONTAINER_NAME }}
Expand Down Expand Up @@ -1043,7 +1104,7 @@ jobs:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
fetch-depth: 0
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
Expand Down
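Reviewer note on the "Exe test (C++ API with sanitizers)" step in test.yml: the TSAN run retries the test binary by chaining the same command five times with `||`, so it stops at the first success. A minimal sketch of the same behavior as a loop is shown here. The helper name `retry` is hypothetical and is not part of this PR or of the OneFlow CI scripts; it only illustrates the pattern.

```shell
# Hypothetical helper (not in the PR): run a command up to N times and
# stop at the first success, mirroring the chained `cmd || cmd || ...` form.
retry() {
  attempts="$1"   # maximum number of attempts
  shift           # the remaining arguments are the command to run
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0    # first success ends the loop
    fi
    i=$((i + 1))
  done
  return 1        # every attempt failed
}

# Example use, assuming the TSAN build was unzipped to ./tsan as in the workflow:
# retry 5 ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe
```

As with the chained form, a real data race that reproduces on every run still fails the step; only intermittent failures are absorbed by the retries.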
1 change: 1 addition & 0 deletions .lsan-suppressions
@@ -0,0 +1 @@
leak:CommandT
9 changes: 9 additions & 0 deletions .tsan-suppressions
@@ -0,0 +1,9 @@
# These four group of functions are designed to be thread unsafe,
# it's user's responsibility to use them correctly.
race:ThreadUnsafe
race:thread_unsafe
race:flying_instruction_cnt
race:total_erased_instruction_cnt
race:ToShape
# glog
race:google::
2 changes: 2 additions & 0 deletions .ubsan-suppressions
@@ -0,0 +1,2 @@
# llvm
vptr:Class.cpp