
GPU linear operators #163

Draft · wants to merge 48 commits into main

Conversation

Abdelrahman912 (Collaborator)
Just initial rough ideas for the design of GPU linear operators 🧑‍🎄

@termi-official termi-official changed the title init design (no working implementation) GPU operators Dec 19, 2024
@termi-official termi-official linked an issue Dec 23, 2024 that may be closed by this pull request
@termi-official termi-official changed the title GPU operators GPU linear operators Jan 13, 2025
@termi-official (Owner) left a comment

Just a quick review.

Can you also add a benchmark script for CPU vs GPU in benchmarks/operators/linear-operators/? You can use https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl as baseline.

Project.toml (outdated, resolved)
Project.toml (outdated, resolved)
Project.toml (outdated, resolved)
Comment on lines 566 to 567
struct BackendCUDA <: AbstractBackend end
struct BackendCPU <: AbstractBackend end
termi-official (Owner)

Is there any reason why we do not use KernelAbstractions backends for the dispatch? If there is not, then please wait before adapting it. We might need to discuss this one in more detail (or even in a separate PR).

Abdelrahman912 (Collaborator, author)

The thing is that KA doesn't allow dynamic memory allocation for shared arrays; all the interfaces they provide are for static shared memory, whereas I try to use dynamic memory for local vectors and matrices when they fit. I thought we could use their interfaces, but installing a whole library just to use their structs wouldn't be optimal.
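For context, a minimal sketch of the dynamically sized shared memory in question, written against CUDA.jl directly (illustrative only: the kernel and its names are not from this PR, and the point of contrast is that KernelAbstractions' `@localmem` requires the size to be a compile-time constant):

```julia
using CUDA

# Illustrative kernel (not from the PR): the shared buffer size `n` is chosen
# at launch time via the `shmem` keyword, which KA's static @localmem cannot do.
function scale_kernel!(out, n)
    buf = CuDynamicSharedArray(Float32, n)   # dynamically sized shared memory
    i = threadIdx().x
    if i <= n
        buf[i] = Float32(i)
        out[i] = 2f0 * buf[i]
    end
    return nothing
end

n = 8
out = CUDA.zeros(Float32, n)
@cuda threads = n shmem = n * sizeof(Float32) scale_kernel!(out, n)
```

This requires a CUDA-capable GPU and is shown only to illustrate why the static KA interface was a constraint here.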

@@ -10,7 +10,7 @@ struct QuadratureValuesIterator{VT,XT}
return new{V, Nothing}(v, nothing)
end
function QuadratureValuesIterator(v::V, cell_coords::VT) where {V, VT <: AbstractArray}
reinit!(v, cell_coords)
#reinit!(v, cell_coords)
termi-official (Owner)

Why?

Abdelrahman912 (Collaborator, author)

Because the cell values object is a shared instance across kernels, it can't store the coords, right?! So I'd rather store the coords in the cell cache, which is unique to every thread.

termi-official (Owner)

I see. I need some more time to think about this, though, and about how to make it compatible with the CPU assembly.

termi-official (Owner)

Can we comment this back in and make reinit!(v, cell_coords) a no-op on the GPU here?
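A minimal sketch of what such a dispatch-based no-op could look like (`StaticCellValuesSketch` is a hypothetical stand-in, not the PR's actual type):

```julia
# Hypothetical stand-in for the GPU-side cell values type.
struct StaticCellValuesSketch end

# Generic fallback: on the CPU path, reinit! would do real work.
reinit!(v, cell_coords) = error("reinit! not implemented for $(typeof(v))")

# GPU path: keep the call site in the iterator intact, but make it a no-op.
reinit!(v::StaticCellValuesSketch, cell_coords) = v

v = StaticCellValuesSketch()
@assert reinit!(v, nothing) === v   # the call stays, but does nothing
```

This keeps QuadratureValuesIterator identical on both backends; only the method table decides whether reinitialization happens.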

src/ferrite-addons/PR883.jl (outdated, resolved)
@termi-official (Owner) left a comment

Next batch of comments.

ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
nodes = Adapt.adapt_structure(to, grid.nodes |> cu)
#TODO subdomain info
return GPUGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes)
end
termi-official (Owner)

Suggested change
end
end

ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/gpu/gpudofhandler.jl (outdated, resolved)
@termi-official (Owner) left a comment

Thanks for the work on this!

Here now the next batch of comments from my side.

ext/cuda/cuda_memalloc.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
src/gpu/gpu_utils.jl (resolved)
evaluate_coefficient(coeff::SpectralTensorCoefficientCache, cell_cache, qp::QuadraturePoint, t) = _evaluate_coefficient(coeff, cell_cache, qp, t)


evaluate_coefficient(coeff::SpectralTensorCoefficientCache, cell_cache::FerriteUtils.DeviceCellCache, qp::FerriteUtils.StaticQuadratureValues, t) = _evaluate_coefficient(coeff, cell_cache, qp, t)
termi-official (Owner)

Actually, the idea is that all the stuff like FerriteUtils.StaticQuadratureValues goes into the coefficient caches, exactly to avoid this kind of issue.

Abdelrahman912 (Collaborator, author)

Sorry, but I didn't really get what you mean by moving FerriteUtils.StaticQuadratureValues into the coefficient cache?
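For illustration, one possible reading of the suggestion above, sketched with hypothetical names (`DeviceCoefficientCacheSketch` is not a type from this PR; it just shows how carrying the device-specific type inside the cache removes the need for per-device `evaluate_coefficient` methods):

```julia
# Hypothetical cache that carries the device-specific quadrature-value type
# (e.g. a FerriteUtils.StaticQuadratureValues) as one of its own fields.
struct DeviceCoefficientCacheSketch{C, Q}
    coeff::C
    qv::Q
end

# Single generic entry point: no CPU/GPU method duplication at this level,
# because the device-specific data is already inside the cache.
evaluate_coefficient(cache::DeviceCoefficientCacheSketch, cell_cache, qp, t) =
    cache.coeff(qp, t)

cache = DeviceCoefficientCacheSketch((qp, t) -> 2.0 * t, nothing)
@assert evaluate_coefficient(cache, nothing, nothing, 3.0) == 6.0
```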

src/utils.jl (outdated, resolved)
test/gpu/runtests.jl (outdated, resolved)
test/gpu/test_coefficients.jl (outdated, resolved)
test/gpu/test_coefficients.jl (outdated, resolved)
@termi-official (Owner) left a comment

Looks great already. I added some minor comments, mostly regarding formatting and code style.

I think as a final step we should do some benchmarking. For this purpose, can you do something analogous to

https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl

for different (sufficiently small) grid sizes to compare CPU and GPU assembly performance? It can be simply two files generating some machine-readable output for now.
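A possible shape for such a driver, sketched with plain Base timing and a dummy assembly stand-in (`benchmark_assembly` and `assemble_dummy!` are hypothetical names; a real script would time the actual operator update for each grid size, and BenchmarkTools would give more robust statistics):

```julia
using Printf

# Minimum-of-N timing helper; returns milliseconds.
function benchmark_assembly(f, op; samples = 20)
    best = Inf
    for _ in 1:samples
        best = min(best, @elapsed f(op))
    end
    return best * 1e3
end

# Dummy stand-in for operator assembly, just to show the driver shape.
assemble_dummy!(b) = (fill!(b, 1.0); b)

# Machine-readable CSV output: backend, grid size, time in ms.
for n in (8, 16, 32)
    b = zeros(n * n)
    t = benchmark_assembly(assemble_dummy!, b)
    @printf "cpu,%d,%.4f\n" n t
end
```

A GPU twin of this file would only swap in the device operator and print `gpu,...` rows, so the two outputs can be joined for plotting.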

src/modeling/core/coefficients.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
ext/cuda/cuda_memalloc.jl (outdated, resolved)
ext/cuda/cuda_memalloc.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
detJdV::T
N::SVector{NumN, N_t}
dNdx::SVector{NumN, dNdx_t}
M::SVector{NumM, M_t}
weight::T
position::NTuple{dim, T}
termi-official (Owner)

What is "position" and what is "dim"?

termi-official (Owner)

This is just the QR right? If so, then we can just use it directly with SVector as storage type (see https://github.com/Ferrite-FEM/Ferrite.jl/blob/6eead259fc17802389b9d412f797ba59c1d0add5/src/Quadrature/quadrature.jl#L52-L61)

Abdelrahman912 (Collaborator, author)

I tried this solution before, as well as using QuadraturePoint, when I was initially doing this, but for some reason they couldn't be adapted for the GPU, which is why I had to go for the raw version.

These are my adapt functions for qr:

function Adapt.adapt_structure(to, cv::CellValues)
    fv = Adapt.adapt(to, StaticInterpolationValues(cv.fun_values))
    gm = Adapt.adapt(to, StaticInterpolationValues(cv.geo_mapping))
    n_quadpoints = cv.qr.weights |> length
    #weights = Adapt.adapt(to, ntuple(i -> cv.qr.weights[i], n_quadpoints))
    #ξs = Adapt.adapt(to, ntuple(i -> Adapt.adapt_structure(to, cv.qr.points[i]), n_quadpoints))
    qr = Adapt.adapt_structure(to, cv.qr)
    return StaticCellValues(fv, gm, qr)
end

function Adapt.adapt_structure(to, qr::QuadratureRule{shape}) where {shape}
    N = qr.weights |> length
    WT = qr.weights |> eltype
    VT = qr.points |> eltype
    weights = Adapt.adapt_structure(to, SVector{N, WT}(qr.weights))
    points = Adapt.adapt_structure(to, SVector{N, VT}(qr.points))
    return QuadratureRule{shape}(weights, points)
end

Error message:

    .cv is of type Thunderbolt.FerriteUtils.StaticCellValues{Thunderbolt.FerriteUtils.StaticInterpolationValues{Lagrange{RefQuadrilateral, 1, Nothing}, 4, 4, Float64, SMatrix{4, 4, Vec{2, Float64}, 16}, 16}, Thunderbolt.FerriteUtils.StaticInterpolationValues{Lagrange{RefQuadrilateral, 1, Nothing}, 4, 4, Float64, SMatrix{4, 4, Vec{2, Float64}, 16}, 16}} which is not isbits.
      .qr is of type QuadratureRule which is not isbits.
        .weights is of type Any which is not isbits.
        .points is of type Any which is not isbits.


Only bitstypes, which are "plain data" types that are immutable
and contain no references to other values, can be used in GPU kernels.
For more information, see the `Base.isbitstype` function.

I have a working alternative, which is as follows:

struct StaticCellValues{FV, GM, Nqp, T, dim}
    fv::FV # StaticInterpolationValues
    gm::GM # StaticInterpolationValues
    weights::NTuple{Nqp, T}
    ξs::NTuple{Nqp, Vec{dim, T}} # quadrature points
end

struct StaticQuadratureValues{T, N_t, dNdx_t, M_t, NumN, NumM, dim, Ti <: Integer} <: AbstractQuadratureValues
    detJdV::T
    N::SVector{NumN, N_t}
    dNdx::SVector{NumN, dNdx_t}
    M::SVector{NumM, M_t}
    weight::T
    ξ::Vec{dim, T}
    idx::Ti
end

termi-official (Owner)

What do you get for typeof(cv.qr) after the adapt call?

Abdelrahman912 (Collaborator, author)

typeof(qr) = QuadratureRule{RefQuadrilateral, SVector{4, Float64}, SVector{4, Vec{2, Float64}}}


src/ferrite-addons/gpu/device_grid.jl (outdated, resolved)
test/gpu/test_operators.jl (outdated, resolved)
@kylebeggs (Collaborator) left a comment

I had a few really minor comments, left on the relevant lines. The only major one is that the API for constructing operators doesn't feel unified: the existence of LinearOperator and GeneralLinearOperator seems to overlap. I believe the end goal should be to use GeneralLinearOperator for any backend, but that would make LinearOperator obsolete. Perhaps the PR is just not finished, or there are plans to address this later?

ext/CuThunderboltExt.jl (outdated, resolved)
src/ferrite-addons/gpu/device_dofhandler.jl (outdated, resolved)
src/ferrite-addons/gpu/device_grid.jl (resolved)
src/gpu/gpu_utils.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
@termi-official (Owner)

The only major comment is I feel like the API for constructing operators is not unified.

The APIs unfortunately diverged, as the initially unified API did not work out as I wanted. I have not prioritized this part yet, as it is currently considered internal API. I will come back to it after the first release. The long-term plan is to move the operators into a separate package, as it should be possible to construct them in a quite generic way.

The existence of LinearOperator and GeneralLinearOperator seems to overlap? I believe the end goal should be to use GeneralLinearOperator for any backend, but then that makes LinearOperator obsolete. Perhaps the PR is just not finished or there are plans to address this later?

Exactly. The "old" idea for the operator API was that we want different kinds of operators for different kinds of assembly types (e.g. ElementAssemblyParallelLinearOperator, ColorParallelLinearOperator) and different device backends. However, after some experimentation we figured out that it should be quite easy to have the different operators dispatch on the general type (e.g. LinearOperator, NonlinearOperator) plus some additional struct, which I denote the "strategy", that controls how exactly matrix and vector information is put together. Doing this split cleanly is future work, though. The goal for this PR is to get a GPU baseline ready for EP simulations. Does this explain the state?
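A rough sketch of that split, with all names hypothetical (these types are not in the PR; the point is one generic operator type whose assembly is selected by dispatching on a strategy field):

```julia
abstract type AbstractAssemblyStrategy end
struct SequentialAssemblyStrategy <: AbstractAssemblyStrategy end
struct ElementAssemblyStrategy    <: AbstractAssemblyStrategy end  # e.g. GPU per-element

# One generic operator type; the strategy field controls assembly.
struct LinearOperatorSketch{V, S <: AbstractAssemblyStrategy}
    b::V
    strategy::S
end

# Dispatch on the strategy, not on a per-backend operator type.
update_operator!(op::LinearOperatorSketch{<:Any, SequentialAssemblyStrategy}) =
    (fill!(op.b, 1.0); op)   # stand-in for sequential CPU assembly
update_operator!(op::LinearOperatorSketch{<:Any, ElementAssemblyStrategy}) =
    (fill!(op.b, 2.0); op)   # stand-in for device element-wise assembly

op = LinearOperatorSketch(zeros(4), SequentialAssemblyStrategy())
update_operator!(op)
@assert all(op.b .== 1.0)
```

Under this scheme a GeneralLinearOperator-style type would disappear: adding a backend means adding a strategy plus its methods, not a new operator type.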

@kylebeggs (Collaborator)

@termi-official yes, that explains it! Glad to hear the new approach is the long-term API goal because I think it's the best approach!

@termi-official (Owner) left a comment

Where are the benchmark files? I would like to measure the performance of the implementation on different GPUs.

Comment on lines +155 to +157
weight::T
ξ::Vec{dim,T}
idx::Ti
termi-official (Owner)

This is just QuadraturePoint right?

Abdelrahman912 (Collaborator, author)

Ah yes, I tried to use QuadraturePoint here, but there was a dependency error and also an Adapt issue, so I put the fields in a more or less raw format, without wrapping them in an additional struct, since they are already in StaticQuadratureValues.

ext/cuda/cuda_operator.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
@Abdelrahman912 (Collaborator, author)

Where are the benchmark files? I would like to measure the performance of the implementation on different GPUs.

Working on it right now.

Development

Successfully merging this pull request may close these issues.

GPU assembly of linear forms
3 participants