GPU linear operators #163

Open · wants to merge 50 commits into base: main
5c2e71c
init design (no working implementation)
Abdelrahman912 Dec 18, 2024
8636a49
init setup
Abdelrahman912 Dec 20, 2024
34d8b79
make cuda work for the test kernel
Abdelrahman912 Dec 21, 2024
f7bd306
add gpu subdof and init gpu update operator
Abdelrahman912 Jan 6, 2025
7eedd51
add subdof iterator
Abdelrahman912 Jan 6, 2025
16846e4
minor fix in sdh iterator
Abdelrahman912 Jan 6, 2025
5787e82
first working example (not refined)
Abdelrahman912 Jan 9, 2025
c58a14f
working example (not refined)
Abdelrahman912 Jan 9, 2025
324dbd0
Merge branch 'main' into gpu-operators
Abdelrahman912 Jan 9, 2025
139fc83
minor adjustment
Abdelrahman912 Jan 9, 2025
babc06c
minor refinement
Abdelrahman912 Jan 13, 2025
fe064cf
init add coeffs
Abdelrahman912 Jan 13, 2025
6bdae08
add to extension (doesn't dispatch to ext tho)
Abdelrahman912 Jan 14, 2025
c5c8b0b
minor edit
Abdelrahman912 Jan 14, 2025
4d3f7de
move to ext (working implementation)
Abdelrahman912 Jan 15, 2025
ac9c807
minor fix
Abdelrahman912 Jan 15, 2025
0d075a8
add coefficients tests for gpu
Abdelrahman912 Jan 17, 2025
863ab05
minor refinements in coefficients
Abdelrahman912 Jan 17, 2025
0daf36c
minor fix
Abdelrahman912 Jan 17, 2025
7ba3139
restructure PR913
Abdelrahman912 Jan 21, 2025
077572f
change gpu -> device
Abdelrahman912 Jan 21, 2025
34c4536
some restructuring for memory allocation
Abdelrahman912 Jan 24, 2025
0e41a01
fix extension
Abdelrahman912 Jan 25, 2025
f512b88
optimize global mem
Abdelrahman912 Jan 25, 2025
cfc7cdd
minor fix for cuda op
Abdelrahman912 Jan 25, 2025
b59927d
pre kernel launch adapt for dh
Abdelrahman912 Feb 1, 2025
eb93c8d
Merge branch 'main' into gpu-operators
Abdelrahman912 Feb 1, 2025
d4d3bab
minor fix
Abdelrahman912 Feb 1, 2025
23f5413
init fix dh adapt error
Abdelrahman912 Feb 5, 2025
5e5216d
fix memory leak
Abdelrahman912 Feb 6, 2025
d50f7fa
pr review fix
Abdelrahman912 Feb 7, 2025
32c5763
init qp
Abdelrahman912 Feb 11, 2025
7bcdd88
minor fix
Abdelrahman912 Feb 11, 2025
8fe74d2
change operators and fuse mem alloc
Abdelrahman912 Feb 12, 2025
b6be5d4
subdof loop (i.e. get rid of unsafe cuda convert)
Abdelrahman912 Feb 13, 2025
110c7ca
Merge branch 'gpu-operators' into add-quadpoints-in-quadvalues
Abdelrahman912 Feb 13, 2025
ba95bae
use quad points instead of quad values
Abdelrahman912 Feb 13, 2025
97c6b00
minor fix
Abdelrahman912 Feb 14, 2025
69530cc
fix 1
Abdelrahman912 Feb 17, 2025
7339d1c
add complex test function and other stuff
Abdelrahman912 Feb 17, 2025
59fa5b4
fix CI (hopefully!)
Abdelrahman912 Feb 18, 2025
72a8f0e
minor fix
Abdelrahman912 Feb 18, 2025
569ef85
add n_threads and n_blocks as kwargs + change cuda default strategy n…
Abdelrahman912 Feb 18, 2025
a2a8bcd
fix SVector
Abdelrahman912 Feb 18, 2025
a0e28f4
add benchmarks + solve major bugs in memory allocation
Abdelrahman912 Feb 19, 2025
2215395
remove nested closures
Abdelrahman912 Feb 19, 2025
81a6529
add line + remove unnecessary code
Abdelrahman912 Feb 19, 2025
0c1d58f
Merge branch 'main' into gpu-operators
Abdelrahman912 Feb 19, 2025
d1bd648
add benchmarks
Abdelrahman912 Feb 20, 2025
291f3be
change analytical function + add PEALinearOperator benchmark
Abdelrahman912 Feb 20, 2025
3 changes: 2 additions & 1 deletion Project.toml
@@ -55,11 +55,12 @@ julia = "1.10"

[extras]
Aqua = "4c88cf16-eb10-579e-8560-4a9242c79595"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
Tensors = "48a634ad-e948-5137-8d70-aa71f2a747f4"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Aqua", "DelimitedFiles", "Pkg", "StaticArrays", "Tensors", "Test"]
test = ["Aqua", "DelimitedFiles", "Pkg", "StaticArrays", "Tensors", "Test", "CUDA"]
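Adding CUDA to the test targets only makes it available during testing; for `CuThunderboltExt` to load conditionally, the package also needs CUDA wired up as a weak dependency. A hypothetical sketch of those Project.toml entries follows (the CUDA UUID is taken from the `[extras]` section above; the actual entries in the repository may differ):

```toml
# Hypothetical extension wiring; shown for illustration only.
[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
CuThunderboltExt = "CUDA"
```

With this in place, Julia loads the extension module automatically once both Thunderbolt and CUDA are imported in the same session.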
67 changes: 67 additions & 0 deletions benchmarks/benchmarks-cuda-linear-form.jl
@@ -0,0 +1,67 @@
using BenchmarkTools, Thunderbolt, StaticArrays, Ferrite, CUDA

# The following is needed to force the grid to use Float32.
left = Tensor{1, 3, Float32}((-1.0, -1.0, -1.0))
right = Tensor{1, 3, Float32}((1.0, 1.0, 1.0))

grid = generate_grid(Hexahedron, (500, 100, 100), left, right)

ip_collection = LagrangeCollection{1}()
qr_collection = QuadratureRuleCollection(2)
dh = DofHandler(grid)
add!(dh, :u, getinterpolation(ip_collection, first(grid.cells)))
close!(dh)
cs = CartesianCoordinateSystem(grid)
protocol = AnalyticalTransmembraneStimulationProtocol(
AnalyticalCoefficient((x,t) -> cos(2Ο€ * t) * exp(-norm(x)^2), CoordinateSystemCoefficient(cs)),
[SVector((0.f0, 1.f0))]
)



#############################
# CPU operator Benchmarking #
#############################

linop = Thunderbolt.LinearOperator(
zeros(ndofs(dh)),
protocol,
qr_collection,
dh,
)

@benchmark Thunderbolt.update_operator!($linop,$0.0)


#############################
# GPU operator Benchmarking #
#############################

cuda_strategy = Thunderbolt.CudaAssemblyStrategy()
# Notes on the launch configuration:
# These values are based on the optimal occupancy of my GPU (an Nvidia GeForce RTX 3050 Ti with 4 GB VRAM), measured with Nsight Compute.
# The number of threads per block is 384, and the number of blocks (the number of SMs) is 20.
cuda_op = Thunderbolt.init_linear_operator(cuda_strategy, protocol, qr_collection, dh; n_threads = 384, n_blocks = 20);

## benchmark with BenchmarkTools
@btime Thunderbolt.update_operator!($cuda_op,$0.f0)

# benchmark with CUDA/Nvidia tools
# Nsight Compute command: ncu --mode=launch julia
# Note: run it twice; the first run includes compilation time.
Thunderbolt.update_operator!(cuda_op,0.f0) # warm up
CUDA.@profile trace=true Thunderbolt.update_operator!(cuda_op,0.f0)


######################################
# CPU threaded operator Benchmarking #
######################################

plinop = Thunderbolt.PEALinearOperator(
zeros(ndofs(dh)),
qr_collection,
protocol,
dh,
);

@benchmark Thunderbolt.update_operator!($plinop,$0.0)
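The benchmark above hard-codes `n_threads = 384` and `n_blocks = 20` for one specific GPU. A portable alternative is to query CUDA.jl's occupancy API at runtime. The sketch below is illustrative only: `dummy_kernel` is a stand-in, not Thunderbolt's actual assembly kernel, and the resulting numbers will differ per device.

```julia
# Hypothetical sketch: derive a launch configuration from the device
# instead of hard-coding it. `launch_configuration` is CUDA.jl's
# occupancy API; the kernel here is a stand-in for illustration.
using CUDA

function dummy_kernel(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(x) && (x[i] += 1.0f0)
    return nothing
end

x = CUDA.zeros(Float32, 1024)
kernel = @cuda launch=false dummy_kernel(x)
config = launch_configuration(kernel.fun)
# config.threads: occupancy-optimal threads per block for this kernel;
# config.blocks: number of blocks that saturates the device.
@show config.threads config.blocks
```

Values obtained this way could then be passed as the `n_threads`/`n_blocks` keyword arguments to `init_linear_operator`, avoiding a per-GPU manual Nsight Compute tuning pass.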
13 changes: 10 additions & 3 deletions benchmarks/benchmarks-linear-form.jl
@@ -1,6 +1,6 @@
using BenchmarkTools, Thunderbolt, StaticArrays
using BenchmarkTools, Thunderbolt, StaticArrays, Ferrite

grid = generate_grid(celltype, (1,1,1))
grid = generate_grid(Hexahedron, (1,1,1))
cell_cache = Ferrite.CellCache(grid)
reinit!(cell_cache,1)

@@ -9,12 +9,19 @@ ip = getinterpolation(ip_collection, grid.cells[1])
qr_collection = QuadratureRuleCollection(2)
qr = getquadraturerule(qr_collection, grid.cells[1])
cv = CellValues(qr, ip)
dh = DofHandler(grid)
add!(dh, :u, getinterpolation(ip_collection, first(grid.cells)))
close!(dh)
sdh = first(dh.subdofhandlers)
ac = AnalyticalCoefficient(
(x,t) -> norm(x)+t,
CoordinateSystemCoefficient(
CartesianCoordinateSystem(grid)
)
)

coeff_cache = Thunderbolt.setup_coefficient_cache(ac, qr, sdh)

b = zeros(8)
element_cache = Thunderbolt.AnalyticalCoefficientElementCache(ac, [SVector((0.0,1.0))], cv)
element_cache = Thunderbolt.AnalyticalCoefficientElementCache(coeff_cache, [SVector((0.0,1.0))], cv)
@btime Thunderbolt.assemble_element!($b, $cell_cache, $element_cache, 0.0)
98 changes: 31 additions & 67 deletions ext/CuThunderboltExt.jl
@@ -3,83 +3,42 @@ module CuThunderboltExt
using Thunderbolt

import CUDA:
CUDA, CuArray, CuVector, CUSPARSE,
threadIdx, blockIdx, blockDim, @cuda,
CUDABackend, launch_configuration
CUDA, CuArray, CuVector, CUSPARSE,blockDim,blockIdx,gridDim,threadIdx,
threadIdx, blockIdx, blockDim, @cuda, @cushow,
CUDABackend, launch_configuration, device, cu,cudaconvert

import Thunderbolt:
UnPack.@unpack,
SimpleMesh,
SparseMatrixCSR, SparseMatrixCSC,
AbstractSemidiscreteFunction, AbstractPointwiseFunction, solution_size,
AbstractPointwiseSolverCache,
GPUDofHandlerData, GPUSubDofHandlerData, GPUDofHandler,
GPUGrid
AbstractPointwiseSolverCache,assemble_element!,
LinearOperator,QuadratureRuleCollection,
AnalyticalCoefficientElementCache,AnalyticalCoefficientCache,CartesianCoordinateSystemCache,
setup_element_cache,update_operator!,init_linear_operator,FieldCoefficientCache, CudaAssemblyStrategy, floattype,inttype,
convert_vec_to_concrete,deep_adapt,AbstractElementAssembly,GeneralLinearOperator

import Ferrite:
AbstractDofHandler

import Adapt:
Adapt, adapt_structure, adapt

# ---------------------- Generic part ------------------------
function _convert_subdofhandler_to_gpu(cell_dofs, cell_dofs_offset, sdh::SubDofHandler)
GPUSubDofHandler(
cell_dofs,
cell_dofs_offset,
adapt(typeof(cell_dofs), collect(sdh.cellset)),
Tuple(sym for sym in sdh.field_names),
Tuple(sym for sym in sdh.field_n_components),
sdh.ndofs_per_cell.x,
)
end
import Thunderbolt.FerriteUtils:
StaticInterpolationValues,StaticCellValues, allocate_device_mem,
CellIterator, mem_size, cellmem,ncells,celldofsview,
DeviceDofHandlerData, DeviceSubDofHandler, DeviceDofHandler, DeviceGrid,
cellfe, AbstractDeviceGlobalMem, AbstractDeviceSharedMem,AbstractDeviceCellIterator,AbstractCellMem,
FeMemShape, KeMemShape, KeFeMemShape, DeviceCellIterator,DeviceOutOfBoundCellIterator,DeviceCellCache,
FeCellMem, KeCellMem, KeFeCellMem,NoCellMem,AbstractMemShape

function Adapt.adapt_structure(to::Type{CUDABackend}, dh::DofHandler{sdim}) where sdim
grid = adapt_structure(to, dh.grid)
# field_names = Tuple(sym for sym in dh.field_names)
IndexType = eltype(dh.cell_dofs)
IndexVectorType = CuVector{IndexType}
cell_dofs = adapt(IndexVectorType, dh.cell_dofs)
cell_dofs_offset = adapt(IndexVectorType, dh.cell_dofs_offset)
cell_to_sdh = adapt(IndexVectorType, dh.cell_to_subdofhandler)
    subdofhandlers = Tuple(_convert_subdofhandler_to_gpu(cell_dofs, cell_dofs_offset, sdh) for sdh in dh.subdofhandlers)
gpudata = GPUDofHandlerData(
grid,
subdofhandlers,
# field_names,
cell_dofs,
cell_dofs_offset,
cell_to_sdh,
dh.ndofs.x,
)
return GPUDofHandler(dh, gpudata)
end

import Ferrite:
AbstractDofHandler,get_grid,CellIterator,get_node_coordinate,getcoordinates,get_coordinate_eltype,getcells,
get_node_ids,get_coordinate_type,nnodes

import StaticArrays:
SVector,MVector

import Adapt:
Adapt, adapt_structure, adapt, @adapt_structure

function Adapt.adapt_structure(to::Type{CUDABackend}, grid::Grid{sdim, cell_type, T}) where {sdim, cell_type, T}
node_type = typeof(first(grid.nodes))
cells = Adapt.adapt_structure(to, grid.cells)
nodes = Adapt.adapt_structure(to, grid.nodes)
#TODO subdomain info
return GPUGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes)
end

# function Thunderbolt.setup_operator(protocol::Thunderbolt.AnalyticalTransmembraneStimulationProtocol, solver::Thunderbolt.AbstractSolver, dh::GPUDofHandler, field_name::Symbol, qr)
# ip = dh.dh.subdofhandlers[1].field_interpolations[1]
# ip_g = Ferrite.geometric_interpolation(typeof(getcells(Ferrite.get_grid(dh), 1)))
# qr = QuadratureRule{Ferrite.getrefshape(ip_g)}(Ferrite.getorder(ip_g)+1)
# cv = CellValues(qr, ip, ip_g) # TODO replace with GPUCellValues
# return PEALinearOperator(
# zeros(ndofs(dh)),
# AnalyticalCoefficientElementCache(
# protocol.f,
# protocol.nonzero_intervals,
# cv,
# ),
# dh,
# )
# end
# ---------------------- Generic part ------------------------

# Pointwise cuda solver wrapper
function _gpu_pointwise_step_inner_kernel_wrapper!(f, t, Ξ”t, cache::AbstractPointwiseSolverCache)
@@ -99,8 +58,8 @@ function Thunderbolt._pointwise_step_outer_kernel!(f::AbstractPointwiseFunction,
return true
end

_allocate_matrix(dh::GPUDofHandler, A::SparseMatrixCSR, ::CuVector) = CuSparseMatrixCSR(A)
_allocate_matrix(dh::GPUDofHandler, A::SparseMatrixCSC, ::CuVector) = CuSparseMatrixCSC(A)
_allocate_matrix(dh::DeviceDofHandler, A::SparseMatrixCSR, ::CuVector) = CuSparseMatrixCSR(A)
_allocate_matrix(dh::DeviceDofHandler, A::SparseMatrixCSC, ::CuVector) = CuSparseMatrixCSC(A)

Thunderbolt.create_system_vector(::Type{<:CuVector{T}}, f::AbstractSemidiscreteFunction) where T = CUDA.zeros(T, solution_size(f))
Thunderbolt.create_system_vector(::Type{<:CuVector{T}}, dh::DofHandler) where T = CUDA.zeros(T, ndofs(dh))
@@ -121,4 +80,9 @@ function Thunderbolt.adapt_vector_type(::Type{<:CuVector}, v::VT) where {VT <: V
return CuVector(v)
end

include("cuda/cuda_operator.jl")
include("cuda/cuda_memalloc.jl")
include("cuda/cuda_adapt.jl")
include("cuda/cuda_iterator.jl")

end
67 changes: 67 additions & 0 deletions ext/cuda/cuda_adapt.jl
@@ -0,0 +1,67 @@
###################
## adapt Buffers ##
###################

@adapt_structure KeFeGlobalMem
@adapt_structure FeGlobalMem
@adapt_structure KeGlobalMem

#####################################
## Shallow adaption for DofHandler ##
#####################################
function Adapt.adapt_structure(strategy::CudaAssemblyStrategy, dh::DofHandler)
IT = inttype(strategy)
grid = _adapt(strategy, dh.grid)
cell_dofs = dh.cell_dofs .|> (i -> convert(IT,i)) |> cu
cell_dofs_offset = dh.cell_dofs_offset .|> (i -> convert(IT,i)) |> cu
cell_to_sdh = dh.cell_to_subdofhandler .|> (i -> convert(IT,i)) |> cu
dh_data = DeviceDofHandlerData(
grid,
cell_dofs,
cell_dofs_offset,
cell_to_sdh,
convert(IT,dh.ndofs))
subdofhandlers = dh.subdofhandlers .|> (sdh -> _adapt(strategy, sdh,dh_data))
return DeviceDofHandler(dh,subdofhandlers)
end

_symbols_to_int(symbols,IT::Type) = 1:length(symbols) .|> (sym -> convert(IT, sym))


function _adapt(strategy::CudaAssemblyStrategy, sdh::SubDofHandler,dh_data::DeviceDofHandlerData)
IT = inttype(strategy)
cellset = sdh.cellset |> collect .|> (x -> convert(IT, x)) |> cu
field_names = _symbols_to_int(sdh.field_names,IT) |> cu
field_interpolations = sdh.field_interpolations |> convert_vec_to_concrete |> cu
ndofs_per_cell = sdh.ndofs_per_cell
return DeviceSubDofHandler(cellset, field_names, field_interpolations, ndofs_per_cell,dh_data)
end

function _adapt(::CudaAssemblyStrategy, grid::Grid{sdim, cell_type, T}) where {sdim, cell_type, T}
node_type = typeof(first(grid.nodes))
cells = grid.cells |> convert_vec_to_concrete |> cu
nodes = grid.nodes |> cu
#TODO subdomain info
return DeviceGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes)
end

########################
## adapt Coefficients ##
########################
function Adapt.adapt_structure(::CudaAssemblyStrategy, element_cache::AnalyticalCoefficientElementCache)
cc = adapt_structure(CuArray, element_cache.cc)
    nz_intervals = adapt(CuArray, element_cache.nonzero_intervals)
sv = adapt_structure(CuArray, element_cache.cv)
return AnalyticalCoefficientElementCache(cc, nz_intervals, sv)
end

function Adapt.adapt_structure(::CudaAssemblyStrategy, cysc::FieldCoefficientCache)
elementwise_data = adapt(CuArray, cysc.elementwise_data)
cv = adapt_structure(CuArray, cysc.cv)
return FieldCoefficientCache(elementwise_data, cv)
end
function Adapt.adapt_structure(::CudaAssemblyStrategy, sphdf::SpatiallyHomogeneousDataField)
    timings = adapt(CuArray, sphdf.timings)
    data = adapt(CuArray, sphdf.data)
return SpatiallyHomogeneousDataField(timings, data)
end
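The adapters above all follow the same Adapt.jl pattern: rebuild a host-side struct field by field, converting each array field to a device representation, so that the resulting object is GPU-safe. A minimal self-contained sketch of that pattern, using made-up types rather than Thunderbolt's own, might look like this:

```julia
# Minimal sketch of the Adapt.jl pattern used above, with hypothetical types.
# Adapt rebuilds a host struct field by field so the result can be passed to a
# kernel, which is what the DofHandler/coefficient adapters in this file do.
using Adapt

struct Cache{V}
    data::V        # host: Vector; device: a CuDeviceVector after conversion
    scale::Float32 # isbits fields are carried over unchanged
end

# Rebuild the struct with each array field adapted to the target backend.
Adapt.adapt_structure(to, c::Cache) = Cache(adapt(to, c.data), c.scale)

c = Cache([1.0f0, 2.0f0], 0.5f0)
c_host = adapt(Array, c)  # no-op on the CPU; with CUDA loaded, adapt(CuArray, c) moves the data
```

Defining `adapt_structure` per type (or applying `@adapt_structure`, as done for the global-memory buffers at the top of this file) is what lets `cu`/`adapt` recurse through nested structs like the dof handler and its subdofhandlers.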