【MetaX】Merge MetaX's modifications to mxmaca/2.6 branch (#68534)
* fix windows bug for common lib (#60308)

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* Update inference_lib.cmake

* [Dy2St] Disable `test_bert` on CPU (#60173) (#60324)

Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com>

* [Cherry-pick] fix weight quant kernel bug when n div 64 != 0 (#60184)

* fix weight-only quant kernel error for n div 64 != 0

* code style fix

* tile (#60261)

* add chunk allocator posix_memalign return value check (#60208) (#60495)

* fix chunk allocator posix_memalign return value check;test=develop

* fix chunk allocator posix_memalign return value check;test=develop

* fix chunk allocator posix_memalign return value check;test=develop
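
A minimal sketch of the kind of check these commits describe (helper name is illustrative, not Paddle's API): `posix_memalign` reports failure through its return value rather than `errno`, so ignoring it can leave a garbage pointer in use.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: allocate aligned memory and actually check the result.
// posix_memalign returns 0 on success, or EINVAL/ENOMEM on failure, and
// leaves *ptr unmodified when it fails.
void* AlignedAlloc(std::size_t alignment, std::size_t size) {
  void* ptr = nullptr;
  int err = posix_memalign(&ptr, alignment, size);
  if (err != 0) {
    throw std::runtime_error("posix_memalign failed with error code " +
                             std::to_string(err));
  }
  return ptr;
}
```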

* update 2023 security advisory, test=document_fix (#60532)

* fix fleetutil get_online_pass_interval bug2; test=develop (#60545)

* fix fused_rope diff (#60217) (#60593)

* [cherry-pick]fix fleetutil get_online_pass_interval bug3 (#60620)

* fix fleetutil get_online_pass_interval bug3; test=develop

* fix fleetutil get_online_pass_interval bug3; test=develop

* fix fleetutil get_online_pass_interval bug3; test=develop

* [cherry-pick]update pdsa-2023-019 (#60649)

* update 2023 security advisory, test=document_fix

* update pdsa-2023-019, test=document_fix

* [Dy2St][2.6] Disable `test_grad` on release/2.6 (#60662)

* fix bug of ci (#59926) (#60785)

* [Dy2St][2.6] Disable `test_transformer` on `release/2.6` and update README (#60786)

* [Dy2St][2.6] Disable `test_transformer` on release/2.6 and update README

* [Docs] Update latest release version in README (#60691)

* restore order

* [Dy2St][2.6] Increase `test_transformer` and `test_mobile_net` ut time (#60829) (#60875)

* [Cherry-pick] fix set_value with scalar grad (#60930)

* Fix set value grad (#59034)

* first fix the UT

* fix set value grad

* polish code

* add static mode backward test

* always has input valuetensor

* add dygraph test

* Fix shape error in combined-indexing setitem (#60447)

* add ut

* fix shape error in combine-indexing

* fix ut

* Set value with scalar (#60452)

* set_value with scalar

* fix ut

* remove test_pir

* remove one test since 2.6 does not support uint8 add

* [cherry-pick] This PR enables offset of generator for custom devices. (#60616) (#60772)

* fix core dump when fallback gather_nd_grad and MemoryAllocateHost (#61067)

* fix qat tests (#61211) (#61284)

* [Security] fix draw security problem (#61161) (#61338)

* fix draw security problem

* fix _decompress security problem (#61294) (#61337)

* Fix CVE-2024-0521 (#61032) (#61287)

This uses shlex for safe command parsing to fix arbitrary code injection

Co-authored-by: ndren <andreien@proton.me>

* [Security] fix security problem for prune_by_memory_estimation (#61382)

* Fix OS command injection in prune_by_memory_estimation

* Fix code style

* [Security] fix security problem for run_cmd (#61285) (#61398)

* fix security problem for run_cmd

* [Security] fix download security problem (#61162) (#61388)

* fix download security problem

* check eval for security (#61389)

* [cherry-pick] adapt c_embedding to phi namespace for custom devices (#60774) (#61045)

Co-authored-by: Tian <121000916+SylarTiaNII@users.noreply.github.com>

* [CherryPick] Fix issue 60092 (#61427)

* fix issue 60092

* update

* update

* update

* Fix unique (#60840) (#61044)

* fix unique kernel, row to num_out

* cinn(py-dsl): skip eval string in python-dsl (#61380) (#61586)

* remove _wget (#61356) (#61569)

* remove _wget

* remove _wget

* remove wget test

* fix layer_norm decompose dtype bugs, polish code (#61631)

* fix doc style (#61688)

* merge (#61866)

* [security] refine _get_program_cache_key (#61827) (#61896)

* security, refine _get_program_cache_key

* repeat_interleave support bf16 dtype (#61854) (#61899)

* repeat_interleave support bf16 dtype

* support bf16 on cpu

* Support Fake GroupWise Quant (#61900)

* fix launch when elastic run (#61847) (#61878)

* [Paddle-TRT] fix solve (#61806)

* [Cherry-Pick] Fix CacheKV Quant Bug (#61966)

* fix cachekv quant problem

* add unittest

* Synchronized the Paddle 2.4 adaptation changes

* clear third_party dependencies

* change submodules to right commits

* build pass with cpu only

* build success with maca

* build success with cutlass and fused kernels

* build with flash_attn and mccl

* build with test, fix some bugs

* fix some bugs

* fixed some compilation bugs

* fix bug in previous commit

* fix bug with split when col_size is bigger than 256

* add row_limit to show full kernel name

* add env.sh

Change-Id: I6fded2761a44af952a4599691e19a1976bd9b9d1

* add shape record

Change-Id: I273f5a5e97e2a31c1c8987ee1c3ce44a6acd6738

* modify paddle version

Change-Id: I97384323c38066e22562a6fe8f44b245cbd68f98

* Wuzhao optimized the performance of the elementwise kernel.

Change-Id: I607bc990415ab5ff7fb3337f628b3ac765d3186c

* fix split when dtype is fp16

Change-Id: Ia55d31d11e6fa214d555326a553eaee3e928e597

* fix bug in previous commit

Change-Id: I0fa66120160374da5a774ef2c04f133a54517069

* adapt flash_attn new C API

Change-Id: Ic669be18daee9cecbc8542a14e02cdc4b8d429ba

* change eigen path

Change-Id: I514c0028e16d19a3084656cc9aa0838a115fc75c

* modify mcname -> replaced_name

Change-Id: Idc520d2db200ed5aa32da9573b19483d81a0fe9e

* fix some build bugs

Change-Id: I50067dfa3fcaa019b5736f4426df6d4e5f64107d

* add PADDLE_ENABLE_SAME_RAND_A100

Change-Id: I2d4ab6ed0b5fac3568562860b0ba1c4f8e346c61

* remove redundant warning, add patch from 2.6.1

Change-Id: I958d5bebdc68eb42fe433c76a3737330e00a72aa

* improve VectorizedBroadcastKernel

(cherry picked from commit 19069b26c0bf05a80cc834162db072f6b8aa2536)
Change-Id: Iaf5719d72ab52adbedc40d4788c52eb1ce4d517c
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* fix bugs

(cherry picked from commit b007853a75dbd5de63028f4af82c15a5d3d81f7c)
Change-Id: Iaec0418c384ad2c81c354ef09d81f3e9dfcf82f1
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* split ElementwiseDivGrad

(cherry picked from commit eb6470406b7d440c135a3f7ff68fbed9494e9c1f)
Change-Id: I60e8912be8f8d40ca83a54af1493adfa2962b2d6
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* VectorizedElementwiseKernel can now use vecSize = 8

(cherry picked from commit a873000a6c3bc9e2540e178d460e74e15a3d4de5)
Change-Id: Ia703b1e9e959558988fcd09182387da839d33922
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
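
A hedged illustration of what a wider vectorization factor means in practice — the vector type, kernel, and ReLU body below are invented for the example, not the actual kernel. Each thread moves 8 contiguous elements per trip, assuming suitably aligned pointers and a length that is a multiple of the vector width.

```cpp
// Illustrative aligned-vector type: 8 floats loaded/stored as a unit
// (the compiler splits this into two 128-bit memory transactions).
template <typename T, int VecSize>
struct alignas(sizeof(T) * VecSize) Vec {
  T val[VecSize];
};

template <int VecSize>
__global__ void ReluVectorized(const float* in, float* out, int n) {
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * VecSize;
  if (idx >= n) return;  // assumes n is a multiple of VecSize
  auto v = *reinterpret_cast<const Vec<float, VecSize>*>(in + idx);
#pragma unroll
  for (int k = 0; k < VecSize; ++k) {
    v.val[k] = fmaxf(v.val[k], 0.0f);
  }
  *reinterpret_cast<Vec<float, VecSize>*>(out + idx) = v;
}

// Launch sketch: ReluVectorized<8><<<grid, 256>>>(in, out, n);
```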

* improve ModulatedDeformableCol2imCoordGpuKernel: 1. block size 512 -> 64; 2. FastDivMod; 3. fix VL1; 4. remove DmcnGetCoordinateWeight divergent branches.

(cherry picked from commit 82c914bdd29f0eef87a52b229ff84bc456a1beeb)
Change-Id: I60b1fa9a9c89ade25e6b057c38e08616a24fa5e3
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
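
Several of these kernel commits mention FastDivMod. The sketch below shows the multiply-and-shift trick behind it, modeled on the pattern Paddle uses elsewhere but simplified for illustration; it assumes the divisor and all dividends fit in 31 bits (tensor indices), which keeps the 32-bit add from overflowing.

```cpp
#include <cstdint>

// Division by a runtime-constant divisor, precomputed once on the host:
// q = (umulhi(n, multiplier) + n) >> shift replaces a slow per-thread
// integer divide with a multiply, an add, and a shift.
struct FastDivMod {
  uint32_t divisor;
  uint32_t shift;       // smallest s with 2^s >= divisor
  uint32_t multiplier;  // floor(2^32 * (2^s - divisor) / divisor) + 1

  explicit FastDivMod(uint32_t d) : divisor(d) {
    for (shift = 0; shift < 32; ++shift) {
      if ((1u << shift) >= d) break;
    }
    uint64_t one = 1;
    multiplier =
        static_cast<uint32_t>(((one << 32) * ((one << shift) - d)) / d + 1);
  }

  __device__ __forceinline__ uint32_t Div(uint32_t n) const {
    uint32_t hi = __umulhi(n, multiplier);  // floor(n * multiplier / 2^32)
    return (hi + n) >> shift;
  }
  __device__ __forceinline__ uint32_t Mod(uint32_t n) const {
    return n - Div(n) * divisor;
  }
};
```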

* Optimize depthwise_conv2d_grad compute (InputGrad):
1. use shared memory to optimize data loads from global memory;
2. different block sizes for different input shapes;
3. FastDivMod for input-shape division, >> and & for stride division.

(cherry picked from commit b34a5634d848f3799f5a8bcf884731dba72d3b20)
Change-Id: I0d8f22f2a2b9d99dc9fbfc1fb69b7bed66010229
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
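
Point 3 above pairs FastDivMod with an even cheaper special case, sketched here: when a stride is known to be a power of two, divide and modulo collapse to a shift and a mask.

```cpp
#include <cstdint>

// log2_stride is assumed precomputed on the host (stride == 1 << log2_stride).
__device__ __forceinline__ void DivModPow2(uint32_t n, uint32_t log2_stride,
                                           uint32_t* q, uint32_t* r) {
  *q = n >> log2_stride;                // n / stride
  *r = n & ((1u << log2_stride) - 1u);  // n % stride
}
```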

* improve VectorizedBroadcastKernel with LoadType = 2 (kMixed)

(cherry picked from commit 728b9547f65e096b45f39f096783d2bb49e8556f)
Change-Id: I282dd8284a7cde54061780a22b397133303f51e5
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* fix ElementwiseDivGrad

(cherry picked from commit 5f99c31904e94fd073bdd1696c3431cccaa376cb)
Change-Id: I3ae0d6c01eec124d12fa226a002b10d0c40f820c
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Revert "Optimize depthwise_conv2d_grad compute (InputGrad):"

This reverts commit b34a5634d848f3799f5a8bcf884731dba72d3b20.

(cherry picked from commit 398f5cde81e2131ff7014edfe1d7beaaf806adbb)
Change-Id: I637685b91860a7dea6df6cbba0ff2cf31363e766
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve ElementwiseDivGrad and ElementwiseMulGrad

(cherry picked from commit fe32db418d8f075e083f31dca7010398636a6e67)
Change-Id: I4f7e0f2b5afd4e704ffcd7258def63afc43eea9c
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve FilterBBoxes

(cherry picked from commit fe4655e86b92f5053fa886af49bf199307960a05)
Change-Id: I35003420292359f8a41b19b7ca2cbaae17dc5b45
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve deformable_conv_grad op: 1. adaptive block size; 2. FastDivMod; 3. move ldg up.

(cherry picked from commit a7cb0ed275a3488f79445ef31456ab6560e9de43)
Change-Id: Ia89df4e5a26de64baae4152837d2ce3076c56df1
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve ModulatedDeformableIm2colGpuKernel: 1. adaptive block size; 2. FastDivMod; 3. move ldg up.

(cherry picked from commit 4fb857655d09f55783d9445b91a2d953ed14d0b8)
Change-Id: I7df7f3af7b4615e5e96d33b439e5276be6ddb732
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve KeBNBackwardData: replace 1.0/sqrt with rsqrt

(cherry picked from commit 333cba7aca1edf7a0e87623a0e55e230cd1e9451)
Change-Id: Ic808d42003677ed543621eb22a797f0ab7751baa
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
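
The change is small enough to show inline; this is an illustrative before/after, not the actual kernel code. `rsqrtf` maps to a single hardware reciprocal-square-root, whereas `1.0f / sqrtf(x)` costs a square root plus a full-precision divide; results are close but not bit-identical.

```cpp
__device__ __forceinline__ float InvStdBefore(float var, float eps) {
  return 1.0f / sqrtf(var + eps);  // sqrt followed by a divide
}

__device__ __forceinline__ float InvStdAfter(float var, float eps) {
  return rsqrtf(var + eps);        // one intrinsic
}
```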

* Improve KeBNBackwardData, FilterGradAddupGpuKernel kernels. Improve nonzero and masked_select (forward only) OP.

(cherry picked from commit c907b40eb3f9ded6ee751e522c2a97a353ac93bd)
Change-Id: I7f4845405e64e7599134a8c497f464ac04dead88
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Optimize depthwise_conv2d:
1. 256-thread block launch for small-shape InputGrad;
2. FastDivMod in InputGrad and FilterGrad;
3. shared memory to hold output_grad_data for small shapes.

(cherry picked from commit f9f29bf7b8d929fb95eb1153a79d8a6b96d5b6d2)
Change-Id: I1a3818201784031dbedc320286ea5f4802dbb6b1
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors.

(cherry picked from commit 3bd200f262271a333b3947326442b86af7fb6da1)
Change-Id: I57c94cc5e709be8926e1b21da14b653cb18eabc3
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Revert "Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors."

This reverts commit 3bd200f262271a333b3947326442b86af7fb6da1.

(cherry picked from commit 86ed8adaa8c20d3c824eecb0ee1e10d365bcea37)
Change-Id: I5b8b7819fdf99255c65fe832d5d77f8e439bdecb
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve ScatterInitCUDAKernel and ScatterCUDAKernel

(cherry picked from commit cddb01a83411c45f68363248291c0c4685e60b24)
Change-Id: Ie106ff8d65c21a8545c40636f021b73f3ad84587
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* fix bugs and make the code easier to read

(cherry picked from commit 07ea3acf347fda434959c8c9cc3533c0686d1836)
Change-Id: Id7a727fd18fac4a662f8af1bf6c6b5ebc6233c9f
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Optimize FilterGrad and InputGradSpL

Use a temporary to hold the ldg data in the loop so that computation
and ldg latency can overlap with each other.

(cherry picked from commit 7ddab49d868cdb6deb7c3e17c5ef9bbdbab86c3e)
Change-Id: I46399594d1d7f76b78b9860e483716fdae8fc7d6
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
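
A hedged sketch of that overlap pattern — the kernel body is invented for illustration: the next iteration's global load is issued into a register before the current value is consumed, so the load latency hides behind the arithmetic.

```cpp
__global__ void ScalePipelined(const float* __restrict__ in,
                               float* __restrict__ out, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  if (i >= n) return;
  float cur = __ldg(&in[i]);  // prime the pipeline
  for (int j = i; j < n; j += stride) {
    float nxt = 0.0f;
    if (j + stride < n) {
      nxt = __ldg(&in[j + stride]);  // issue the next load early
    }
    out[j] = alpha * cur * cur + cur;  // compute while the load is in flight
    cur = nxt;                         // rotate the register
  }
}
```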

* Improve CheckFiniteAndUnscaleKernel by moving address accesses to shared memory and making a single thread do more work.

(cherry picked from commit 631ffdda2847cda9562e591dc87b3f529a51a978)
Change-Id: Ie9ffdd872ab06ff34d4daf3134d6744f5221e41e
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
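
A loose sketch of those two ideas, not the actual kernel: the per-tensor pointer table is staged in shared memory once per block, and each thread then walks many elements per tensor in a grid-stride loop instead of handling one.

```cpp
__global__ void UnscaleMany(float* const* ptrs, const int* sizes,
                            int num_tensors, float inv_scale,
                            bool* found_inf) {
  __shared__ float* s_ptrs[64];  // assumes num_tensors <= 64
  for (int t = threadIdx.x; t < num_tensors; t += blockDim.x) {
    s_ptrs[t] = ptrs[t];  // one global pointer read per tensor, per block
  }
  __syncthreads();
  for (int t = 0; t < num_tensors; ++t) {
    float* data = s_ptrs[t];
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < sizes[t];
         i += gridDim.x * blockDim.x) {  // thread coarsening
      float v = data[i] * inv_scale;
      if (!isfinite(v)) *found_inf = true;  // benign race: only writes true
      data[i] = v;
    }
  }
}
```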

* Optimize SwinTransformer

1. LayerNormBackward: remove the if statement; the loop now always runs
VPT times so the compiler can emit ldg128, with a bool flag controlling
whether the write is performed;
2. ContiguousCaseOneFunc: keep the division result in a temporary to do
fewer divisions

(cherry picked from commit 422d676507308d26f6107bed924424166aa350d3)
Change-Id: I37aab7e2f97ae6b61c0f50ae4134f5eb1743d429
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
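
An illustrative reading of point 1 — kernel and names are invented here, and row_len == blockDim.x * VPT is assumed: the loop always runs its fixed VPT iterations, which lets the compiler keep the per-thread 128-bit vector loads, while a boolean gates the store instead of an early-exit branch.

```cpp
template <int VPT>
__global__ void ScaleRowsPredicated(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    const bool* __restrict__ row_mask,
                                    float alpha, int row_len) {
  const float* src = in + blockIdx.x * row_len + threadIdx.x * VPT;
  float buf[VPT];
#pragma unroll
  for (int k = 0; k < VPT; ++k) {
    buf[k] = alpha * src[k];  // fixed trip count -> vectorizable ldg128
  }
  if (row_mask[blockIdx.x]) {  // bool gates the write, not the loop
    float* dst = out + blockIdx.x * row_len + threadIdx.x * VPT;
#pragma unroll
    for (int k = 0; k < VPT; ++k) {
      dst[k] = buf[k];
    }
  }
}
```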

* Optimize LayerNormBackwardComputeGradInputWithSmallFeatureSize

Set blockDim.z so the block size is always 512 and each block can
handle several batches; all threads then loop 4 times for better
performance.

(cherry picked from commit 7550c90ca29758952fde13eeea74857ece41908b)
Change-Id: If24de87a0af19ee07e29ac2e7e237800f0181148
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
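
A hedged sketch of that launch-shape choice — the helper name and cutoffs are mine: rather than shrinking the block when the feature dimension is small, keep 512 threads and let blockDim.z carry several batches per block.

```cpp
#include <cuda_runtime.h>

inline dim3 SmallFeatureBlock(int feature_size) {
  int x = 32;
  while (x < feature_size && x < 128) {
    x <<= 1;  // power-of-two lanes across the feature dimension
  }
  return dim3(x, 1, 512 / x);  // always 512 threads; z batches per block
}
```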

* improve KeMatrixTopK: 1. fix private memory; 2. modify max grid size; 3. change it to a 64-lane warp reduce.

(cherry picked from commit a346af182b139dfc7737e5f6473dc394b21635d7)
Change-Id: I6c8d8105fd77947c662e6d22a0d15d7bad076bde
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
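
For reference, the shuffle-based reduction pattern behind that change, as a generic sketch: a 32-lane warp is assumed below, while the commit targets 64-lane wavefronts on MetaX hardware, so the starting offset would differ there.

```cpp
__device__ __forceinline__ float WarpReduceMax(float v) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    v = fmaxf(v, __shfl_down_sync(0xffffffffu, v, offset));
  }
  return v;  // lane 0 ends up with the warp-wide maximum
}
```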

* Modify LayerNorm Optimization

May have a loss diff with the old optimization without atomicAdd.

(cherry picked from commit 80b0bcaa9a307c94dbeda658236fd75e104ccccc)
Change-Id: I4a7c4ec2a0e885c2d581dcebc74464830dae7637
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve roi_align op: 1. adaptive block size; 2. FastDivMod.

(cherry picked from commit cc421d7861c359740de0d2870abcfde4354d8c71)
Change-Id: I55c049e951f93782af1c374331f44b521ed75dfe
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* add a workaround for parameter dislocation when calling BatchedGEMM<float16>.

Change-Id: I5788c73a9c45f65e60ed5a88d16a473bbb888927

* fix McFlashAttn string

Change-Id: I8b34f02958ddccb3467f639daaac8044022f3d34

* [C500-27046] fix wb issue

Change-Id: I77730da567903f43ef7a9992925b90ed4ba179c7

* Support compiling external ops

Change-Id: I1b7eb58e7959daff8660ce7889ba390cdfae0c1a

* support flash_attn varlen API and support ARM build

Change-Id: I94d422c969bdb83ad74262e03efe38ca85ffa673

* Add a copyright notice

Change-Id: I8ece364d926596a40f42d973190525d9b8224d99

* Modify some third-party dependency addresses to public network addresses

---------

Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
Co-authored-by: risemeup1 <62429225+risemeup1@users.noreply.github.com>
Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>
Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com>
Co-authored-by: Wang Bojun <105858416+wwbitejotunn@users.noreply.github.com>
Co-authored-by: lizexu123 <39205361+lizexu123@users.noreply.github.com>
Co-authored-by: danleifeng <52735331+danleifeng@users.noreply.github.com>
Co-authored-by: Vigi Zhang <VigiZhang@users.noreply.github.com>
Co-authored-by: tianhaodongbd <137985359+tianhaodongbd@users.noreply.github.com>
Co-authored-by: zyfncg <zhangyunfei07@baidu.com>
Co-authored-by: JYChen <zoooo0820@qq.com>
Co-authored-by: zhaohaixu <49297029+zhaohaixu@users.noreply.github.com>
Co-authored-by: Spelling <33216444+raining-dark@users.noreply.github.com>
Co-authored-by: zhouzj <41366441+zzjjay@users.noreply.github.com>
Co-authored-by: wanghuancoder <wanghuan29@baidu.com>
Co-authored-by: ndren <andreien@proton.me>
Co-authored-by: Nguyen Cong Vinh <80946737+vn-ncvinh@users.noreply.github.com>
Co-authored-by: Ruibin Cheung <beinggod@foxmail.com>
Co-authored-by: Tian <121000916+SylarTiaNII@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>
Co-authored-by: zhuyipin <yipinzhu@outlook.com>
Co-authored-by: 6clc <chaoliu.lc@foxmail.com>
Co-authored-by: Wenyu <wenyu.lyu@gmail.com>
Co-authored-by: Xianduo Li <30922914+lxd-cumt@users.noreply.github.com>
Co-authored-by: Wang Xin <xinwang614@gmail.com>
Co-authored-by: Chang Xu <molixu7@gmail.com>
Co-authored-by: wentao yu <yuwentao126@126.com>
Co-authored-by: zhink <33270771+zhink@users.noreply.github.com>
Co-authored-by: handiz <35895648+ZhangHandi@users.noreply.github.com>
Co-authored-by: zhimin Pan <zhimin.pan@metax-tech.com>
Co-authored-by: m00891 <Zequn.Yang@metax-tech.com>
Co-authored-by: shuliu <shupeng.liu@metax-tech.com>
Co-authored-by: Yanxin Zhou <yanxin.zhou@metax-tech.com>
Co-authored-by: Zhao Wu <zhao.wu@metax-tech.com>
Co-authored-by: m00932 <xiangrong.yi@metax-tech.com>
Co-authored-by: Fangzhou Feng <fangzhou.feng@metax-tech.com>
Co-authored-by: junwang <jun.wang@metax-tech.com>
Co-authored-by: m01097 <qimeng.du@metax-tech.com>
1 parent e032331 commit b102bc4
Showing 310 changed files with 8,435 additions and 4,921 deletions.
40 changes: 13 additions & 27 deletions .gitmodules
@@ -1,6 +1,7 @@
[submodule "third_party/protobuf"]
path = third_party/protobuf
url = https://github.com/protocolbuffers/protobuf.git
tag = paddle
ignore = dirty
[submodule "third_party/pocketfft"]
path = third_party/pocketfft
@@ -21,10 +22,11 @@
[submodule "third_party/utf8proc"]
path = third_party/utf8proc
url = https://github.com/JuliaStrings/utf8proc.git
tag = v2.6.1
ignore = dirty
[submodule "third_party/warpctc"]
path = third_party/warpctc
url = https://github.com/baidu-research/warp-ctc.git
url = http://pdegit.metax-internal.com/pde-ai/warp-ctc.git
ignore = dirty
[submodule "third_party/warprnnt"]
path = third_party/warprnnt
@@ -33,10 +35,12 @@
[submodule "third_party/xxhash"]
path = third_party/xxhash
url = https://github.com/Cyan4973/xxHash.git
tag = v0.6.5
ignore = dirty
[submodule "third_party/pybind"]
path = third_party/pybind
url = https://github.com/pybind/pybind11.git
tag = v2.4.3
ignore = dirty
[submodule "third_party/threadpool"]
path = third_party/threadpool
@@ -45,39 +49,25 @@
[submodule "third_party/zlib"]
path = third_party/zlib
url = https://github.com/madler/zlib.git
tag = v1.2.8
ignore = dirty
[submodule "third_party/glog"]
path = third_party/glog
url = https://github.com/google/glog.git
ignore = dirty
[submodule "third_party/eigen3"]
path = third_party/eigen3
url = https://gitlab.com/libeigen/eigen.git
ignore = dirty
[submodule "third_party/snappy"]
path = third_party/snappy
url = https://github.com/google/snappy.git
ignore = dirty
[submodule "third_party/cub"]
path = third_party/cub
url = https://github.com/NVIDIA/cub.git
ignore = dirty
[submodule "third_party/cutlass"]
path = third_party/cutlass
url = https://github.com/NVIDIA/cutlass.git
ignore = dirty
[submodule "third_party/xbyak"]
path = third_party/xbyak
url = https://github.com/herumi/xbyak.git
tag = v5.81
ignore = dirty
[submodule "third_party/mkldnn"]
path = third_party/mkldnn
url = https://github.com/oneapi-src/oneDNN.git
ignore = dirty
[submodule "third_party/flashattn"]
path = third_party/flashattn
url = https://github.com/PaddlePaddle/flash-attention.git
ignore = dirty
[submodule "third_party/gtest"]
path = third_party/gtest
url = https://github.com/google/googletest.git
@@ -98,15 +88,11 @@
path = third_party/rocksdb
url = https://github.com/Thunderbrook/rocksdb
ignore = dirty
[submodule "third_party/absl"]
path = third_party/absl
url = https://github.com/abseil/abseil-cpp.git
ignore = dirty
[submodule "third_party/jitify"]
path = third_party/jitify
url = https://github.com/NVIDIA/jitify.git
[submodule "third_party/cutlass"]
path = third_party/cutlass
url = http://pdegit.metax-internal.com/pde-ai/cutlass.git
ignore = dirty
[submodule "third_party/cccl"]
path = third_party/cccl
url = https://github.com/NVIDIA/cccl.git
[submodule "third_party/eigen3"]
path = third_party/eigen3
url = ssh://gerrit.metax-internal.com:29418/MACA/library/mcEigen
ignore = dirty
23 changes: 20 additions & 3 deletions CMakeLists.txt
@@ -1,3 +1,4 @@
# 2024 - Modified by MetaX Integrated Circuits (Shanghai) Co., Ltd. All Rights Reserved.
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -24,7 +25,7 @@ endif()
# https://cmake.org/cmake/help/v3.0/policy/CMP0026.html?highlight=cmp0026
cmake_policy(SET CMP0026 OLD)
cmake_policy(SET CMP0079 NEW)
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake" $ENV{CMAKE_MODULE_PATH})
set(PADDLE_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR})
set(PADDLE_BINARY_DIR ${CMAKE_CURRENT_BINARY_DIR})

@@ -92,6 +93,7 @@ endif()

if(WITH_GPU AND NOT APPLE)
enable_language(CUDA)
set(CMAKE_CUDA_COMPILER_VERSION 11.6)
message(STATUS "CUDA compiler: ${CMAKE_CUDA_COMPILER}, version: "
"${CMAKE_CUDA_COMPILER_ID} ${CMAKE_CUDA_COMPILER_VERSION}")
endif()
@@ -255,7 +257,7 @@ option(WITH_SYSTEM_BLAS "Use system blas library" OFF)
option(WITH_DISTRIBUTE "Compile with distributed support" OFF)
option(WITH_BRPC_RDMA "Use brpc rdma as the rpc protocal" OFF)
option(ON_INFER "Turn on inference optimization and inference-lib generation"
ON)
OFF)
option(WITH_CPP_DIST "Install PaddlePaddle C++ distribution" OFF)
option(WITH_GFLAGS "Compile PaddlePaddle with gflags support" OFF)
################################ Internal Configurations #######################################
@@ -283,7 +285,7 @@ option(
OFF)
option(WITH_LITE "Compile Paddle Fluid with Lite Engine" OFF)
option(WITH_CINN "Compile PaddlePaddle with CINN" OFF)
option(WITH_NCCL "Compile PaddlePaddle with NCCL support" ON)
option(WITH_NCCL "Compile PaddlePaddle with NCCL support" OFF)
option(WITH_RCCL "Compile PaddlePaddle with RCCL support" ON)
option(WITH_XPU_BKCL "Compile PaddlePaddle with BAIDU KUNLUN XPU BKCL" OFF)
option(WITH_CRYPTO "Compile PaddlePaddle with crypto support" ON)
@@ -474,6 +476,21 @@ if(WITH_GPU)
# so include(cudnn) needs to be in front of include(third_party/lite)
include(cudnn) # set cudnn libraries, must before configure
include(tensorrt)

include_directories("$ENV{MACA_PATH}/tools/cu-bridge/include")
include_directories("$ENV{MACA_PATH}/include")
include_directories("$ENV{MACA_PATH}/include/mcblas")
include_directories("$ENV{MACA_PATH}/include/mcr")
include_directories("$ENV{MACA_PATH}/include/mcdnn")
include_directories("$ENV{MACA_PATH}/include/mcsim")
include_directories("$ENV{MACA_PATH}/include/mcsparse")
include_directories("$ENV{MACA_PATH}/include/mcfft")
include_directories("$ENV{MACA_PATH}/include/mcrand")
include_directories("$ENV{MACA_PATH}/include/common")
include_directories("$ENV{MACA_PATH}/include/mcsolver")
include_directories("$ENV{MACA_PATH}/include/mctx")
include_directories("$ENV{MACA_PATH}/include/mcpti")
include_directories("$ENV{MACA_PATH}/mxgpu_llvm/include")
# there is no official support of nccl, cupti in windows
if(NOT WIN32)
include(cupti)
183 changes: 183 additions & 0 deletions NOTICE
@@ -0,0 +1,183 @@
The following files may have been modified by MetaX Integrated Circuits (Shanghai) Co., Ltd. in 2024.

.gitmodules
CMakeLists.txt
cmake/cuda.cmake
cmake/cudnn.cmake
cmake/cupti.cmake
cmake/external/brpc.cmake
cmake/external/cryptopp.cmake
cmake/external/cutlass.cmake
cmake/external/dgc.cmake
cmake/external/dlpack.cmake
cmake/external/eigen.cmake
cmake/external/flashattn.cmake
cmake/external/jemalloc.cmake
cmake/external/lapack.cmake
cmake/external/libmct.cmake
cmake/external/mklml.cmake
cmake/external/protobuf.cmake
cmake/external/pybind11.cmake
cmake/external/utf8proc.cmake
cmake/flags.cmake
cmake/generic.cmake
cmake/inference_lib.cmake
cmake/nccl.cmake
cmake/third_party.cmake
env.sh
paddle/fluid/distributed/fleet_executor/test/interceptor_ping_pong_with_brpc_test.cc
paddle/fluid/eager/api/manual/eager_manual/forwards/multiply_fwd_func.cc
paddle/fluid/eager/auto_code_generator/eager_generator.cc
paddle/fluid/eager/auto_code_generator/generator/eager_gen.py
paddle/fluid/framework/details/build_strategy.cc
paddle/fluid/framework/distributed_strategy.proto
paddle/fluid/inference/api/resource_manager.cc
paddle/fluid/inference/api/resource_manager.h
paddle/fluid/inference/tensorrt/plugin/layernorm_shift_partition_op.cu
paddle/fluid/inference/tensorrt/plugin/matmul_op_int8_plugin.h
paddle/fluid/inference/tensorrt/plugin/preln_residual_bias_plugin.cu
paddle/fluid/memory/allocation/CMakeLists.txt
paddle/fluid/memory/allocation/allocator_facade.cc
paddle/fluid/operators/CMakeLists.txt
paddle/fluid/operators/correlation_op.cu
paddle/fluid/operators/elementwise/elementwise_op_function.h
paddle/fluid/operators/fused/CMakeLists.txt
paddle/fluid/operators/fused/attn_gemm_int8.h
paddle/fluid/operators/fused/cublaslt.h
paddle/fluid/operators/fused/fused_gate_attention.h
paddle/fluid/operators/fused/fused_gemm_epilogue_op.cu
paddle/fluid/operators/fused/fused_layernorm_residual_dropout_bias.h
paddle/fluid/operators/fused/fused_multi_transformer_int8_op.cu
paddle/fluid/operators/fused/fused_multi_transformer_op.cu
paddle/fluid/operators/fused/fused_multi_transformer_op.cu.h
paddle/fluid/operators/fused/fused_softmax_mask.cu.h
paddle/fluid/operators/math/inclusive_scan.h
paddle/fluid/operators/matmul_op.cc
paddle/fluid/operators/row_conv_op.cu
paddle/fluid/operators/sparse_attention_op.cu
paddle/fluid/platform/cuda_graph_with_memory_pool.cc
paddle/fluid/platform/device/gpu/cuda/cuda_helper.h
paddle/fluid/platform/device/gpu/cuda_helper_test.cu
paddle/fluid/platform/device/gpu/gpu_types.h
paddle/fluid/platform/device_context.h
paddle/fluid/platform/dynload/CMakeLists.txt
paddle/fluid/platform/dynload/cublas.h
paddle/fluid/platform/dynload/cublasLt.cc
paddle/fluid/platform/dynload/cublasLt.h
paddle/fluid/platform/dynload/cusparseLt.h
paddle/fluid/platform/init.cc
paddle/fluid/platform/init_phi_test.cc
paddle/fluid/pybind/eager_legacy_op_function_generator.cc
paddle/fluid/pybind/fleet_py.cc
paddle/fluid/pybind/pybind.cc
paddle/phi/api/profiler/profiler.cc
paddle/phi/backends/dynload/CMakeLists.txt
paddle/phi/backends/dynload/cublas.h
paddle/phi/backends/dynload/cublasLt.cc
paddle/phi/backends/dynload/cublasLt.h
paddle/phi/backends/dynload/cuda_driver.h
paddle/phi/backends/dynload/cudnn.h
paddle/phi/backends/dynload/cufft.h
paddle/phi/backends/dynload/cupti.h
paddle/phi/backends/dynload/curand.h
paddle/phi/backends/dynload/cusolver.h
paddle/phi/backends/dynload/cusparse.h
paddle/phi/backends/dynload/cusparseLt.h
paddle/phi/backends/dynload/dynamic_loader.cc
paddle/phi/backends/dynload/flashattn.h
paddle/phi/backends/dynload/nccl.h
paddle/phi/backends/dynload/nvjpeg.h
paddle/phi/backends/dynload/nvrtc.h
paddle/phi/backends/dynload/nvtx.h
paddle/phi/backends/gpu/cuda/cuda_device_function.h
paddle/phi/backends/gpu/cuda/cuda_helper.h
paddle/phi/backends/gpu/forwards.h
paddle/phi/backends/gpu/gpu_context.cc
paddle/phi/backends/gpu/gpu_context.h
paddle/phi/backends/gpu/gpu_decls.h
paddle/phi/backends/gpu/gpu_resources.cc
paddle/phi/backends/gpu/gpu_resources.h
paddle/phi/backends/gpu/rocm/rocm_device_function.h
paddle/phi/core/custom_kernel.cc
paddle/phi/core/distributed/check/nccl_dynamic_check.h
paddle/phi/core/distributed/comm_context_manager.h
paddle/phi/core/enforce.h
paddle/phi/core/flags.cc
paddle/phi/core/visit_type.h
paddle/phi/kernels/funcs/aligned_vector.h
paddle/phi/kernels/funcs/blas/blas_impl.cu.h
paddle/phi/kernels/funcs/blas/blaslt_impl.cu.h
paddle/phi/kernels/funcs/broadcast_function.h
paddle/phi/kernels/funcs/concat_and_split_functor.cu
paddle/phi/kernels/funcs/cublaslt.h
paddle/phi/kernels/funcs/deformable_conv_functor.cu
paddle/phi/kernels/funcs/distribution_helper.h
paddle/phi/kernels/funcs/dropout_impl.cu.h
paddle/phi/kernels/funcs/elementwise_base.h
paddle/phi/kernels/funcs/elementwise_grad_base.h
paddle/phi/kernels/funcs/fused_gemm_epilogue.h
paddle/phi/kernels/funcs/gemm_int8_helper.h
paddle/phi/kernels/funcs/inclusive_scan.h
paddle/phi/kernels/funcs/layer_norm_impl.cu.h
paddle/phi/kernels/funcs/math_cuda_utils.h
paddle/phi/kernels/funcs/reduce_function.h
paddle/phi/kernels/funcs/scatter.cu.h
paddle/phi/kernels/funcs/top_k_function_cuda.h
paddle/phi/kernels/funcs/weight_only_gemv.cu
paddle/phi/kernels/fusion/cutlass/utils/cuda_utils.h
paddle/phi/kernels/fusion/gpu/attn_gemm.h
paddle/phi/kernels/fusion/gpu/fused_dropout_add_utils.h
paddle/phi/kernels/fusion/gpu/fused_dropout_helper.h
paddle/phi/kernels/fusion/gpu/fused_layernorm_residual_dropout_bias.h
paddle/phi/kernels/fusion/gpu/fused_linear_param_grad_add_kernel.cu
paddle/phi/kernels/fusion/gpu/fused_softmax_mask_upper_triangle_utils.h
paddle/phi/kernels/fusion/gpu/fused_softmax_mask_utils.h
paddle/phi/kernels/fusion/gpu/mmha_util.cu.h
paddle/phi/kernels/gpu/accuracy_kernel.cu
paddle/phi/kernels/gpu/amp_kernel.cu
paddle/phi/kernels/gpu/batch_norm_grad_kernel.cu
paddle/phi/kernels/gpu/contiguous_kernel.cu
paddle/phi/kernels/gpu/decode_jpeg_kernel.cu
paddle/phi/kernels/gpu/deformable_conv_grad_kernel.cu
paddle/phi/kernels/gpu/depthwise_conv.h
paddle/phi/kernels/gpu/dist_kernel.cu
paddle/phi/kernels/gpu/flash_attn_grad_kernel.cu
paddle/phi/kernels/gpu/flash_attn_kernel.cu
paddle/phi/kernels/gpu/flash_attn_utils.h
paddle/phi/kernels/gpu/gelu_funcs.h
paddle/phi/kernels/gpu/generate_proposals_kernel.cu
paddle/phi/kernels/gpu/group_norm_kernel.cu
paddle/phi/kernels/gpu/interpolate_grad_kernel.cu
paddle/phi/kernels/gpu/kthvalue_kernel.cu
paddle/phi/kernels/gpu/llm_int8_linear_kernel.cu
paddle/phi/kernels/gpu/masked_select_kernel.cu
paddle/phi/kernels/gpu/nonzero_kernel.cu
paddle/phi/kernels/gpu/roi_align_grad_kernel.cu
paddle/phi/kernels/gpu/roi_align_kernel.cu
paddle/phi/kernels/gpu/strided_copy_kernel.cu
paddle/phi/kernels/gpu/top_k_kernel.cu
paddle/phi/kernels/gpu/top_p_sampling_kernel.cu
paddle/phi/kernels/gpu/unique_consecutive_functor.h
paddle/phi/kernels/gpu/unique_kernel.cu
paddle/phi/kernels/gpudnn/conv_cudnn_v7.h
paddle/phi/kernels/gpudnn/softmax_gpudnn.h
paddle/phi/kernels/impl/deformable_conv_grad_kernel_impl.h
paddle/phi/kernels/impl/llm_int8_matmul_kernel_impl.h
paddle/phi/kernels/impl/matmul_kernel_impl.h
paddle/phi/kernels/impl/multi_dot_kernel_impl.h
paddle/phi/kernels/primitive/datamover_primitives.h
paddle/phi/kernels/primitive/kernel_primitives.h
paddle/phi/tools/CMakeLists.txt
paddle/utils/flat_hash_map.h
patches/eigen/TensorReductionGpu.h
python/paddle/base/framework.py
python/paddle/distributed/launch/controllers/watcher.py
python/paddle/profiler/profiler_statistic.py
python/paddle/utils/cpp_extension/cpp_extension.py
python/paddle/utils/cpp_extension/extension_utils.py
test/CMakeLists.txt
test/cpp/CMakeLists.txt
test/cpp/jit/CMakeLists.txt
test/cpp/new_executor/CMakeLists.txt
test/legacy_test/test_flash_attention.py
tools/ci_op_benchmark.sh
2 changes: 1 addition & 1 deletion README.md
@@ -20,7 +20,7 @@ PaddlePaddle is originated from industrial practices with dedication and commitm

## Installation

### Latest PaddlePaddle Release: [v2.5](https://github.com/PaddlePaddle/Paddle/tree/release/2.5)
### Latest PaddlePaddle Release: [v2.6](https://github.com/PaddlePaddle/Paddle/tree/release/2.6)

Our vision is to enable deep learning for everyone via PaddlePaddle.
Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest features of PaddlePaddle.
4 changes: 2 additions & 2 deletions README_cn.md
@@ -18,9 +18,9 @@

## 安装

### PaddlePaddle最新版本: [v2.5](https://github.com/PaddlePaddle/Paddle/tree/release/2.5)
### PaddlePaddle 最新版本: [v2.6](https://github.com/PaddlePaddle/Paddle/tree/release/2.6)

跟进PaddlePaddle最新特性请参考我们的[版本说明](https://github.com/PaddlePaddle/Paddle/releases)
跟进 PaddlePaddle 最新特性请参考我们的[版本说明](https://github.com/PaddlePaddle/Paddle/releases)

### 安装最新稳定版本:
```
2 changes: 1 addition & 1 deletion README_ja.md
@@ -20,7 +20,7 @@ PaddlePaddle は、工業化に対するコミットメントを持つ工業的

## インストール

### PaddlePaddle の最新リリース: [v2.5](https://github.com/PaddlePaddle/Paddle/tree/release/2.5)
### PaddlePaddle の最新リリース: [v2.6](https://github.com/PaddlePaddle/Paddle/tree/release/2.6)

私たちのビジョンは、PaddlePaddle を通じて、誰もが深層学習を行えるようにすることです。
PaddlePaddle の最新機能を追跡するために、私たちの[リリースのお知らせ](https://github.com/PaddlePaddle/Paddle/releases)を参照してください。