
GPU linear operators #163

Draft · wants to merge 48 commits into main

Conversation

Abdelrahman912 (Collaborator)
Just initial rough ideas for the design of GPU linear operators 🧑‍🎄

@termi-official termi-official changed the title init design (no working implementation) GPU operators Dec 19, 2024
@termi-official termi-official linked an issue Dec 23, 2024 that may be closed by this pull request
@termi-official termi-official changed the title GPU operators GPU linear operators Jan 13, 2025
@termi-official (Owner) left a comment

Just a quick review.

Can you also add a benchmark script for CPU vs GPU in benchmarks/operators/linear-operators/? You can use https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl as baseline.

Project.toml (outdated, resolved)
Project.toml (outdated, resolved)
Project.toml (outdated, resolved)
Comment on lines 566 to 567
struct BackendCUDA <: AbstractBackend end
struct BackendCPU <: AbstractBackend end
termi-official (Owner)

Is there any reason why we do not use KernelAbstractions backends for the dispatch? If there is not, then please wait before adapting it. We might need to discuss this one in more detail (or even in a separate PR).

Abdelrahman912 (Collaborator, author)

The thing is that KA doesn't allow dynamic memory allocation for shared arrays; all the interfaces they provide are for static shared memory, whereas I try to use dynamic memory for local vectors and matrices when they fit. I thought we could use their interfaces, but installing a whole library just to use their structs wouldn't be optimal.
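For context, a minimal sketch of the dynamically sized shared memory in question, written against CUDA.jl directly (illustrative only: the kernel and its names are not from this PR, and the point of contrast is that KernelAbstractions' `@localmem` requires the size to be a compile-time constant):

```julia
using CUDA

# Illustrative kernel (not from the PR): the shared buffer size `n` is chosen
# at launch time via the `shmem` keyword, which KA's static @localmem cannot do.
function scale_kernel!(out, n)
    buf = CuDynamicSharedArray(Float32, n)   # dynamically sized shared memory
    i = threadIdx().x
    if i <= n
        buf[i] = Float32(i)
        out[i] = 2f0 * buf[i]
    end
    return nothing
end

n = 8
out = CUDA.zeros(Float32, n)
@cuda threads = n shmem = n * sizeof(Float32) scale_kernel!(out, n)
```

This requires a CUDA-capable GPU and is shown only to illustrate why the static KA interface was a constraint here.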

@@ -10,7 +10,7 @@ struct QuadratureValuesIterator{VT,XT}
return new{V, Nothing}(v, nothing)
end
function QuadratureValuesIterator(v::V, cell_coords::VT) where {V, VT <: AbstractArray}
reinit!(v, cell_coords)
#reinit!(v, cell_coords)
termi-official (Owner)

Why?

Abdelrahman912 (Collaborator, author)

Because the cell values object is a shared instance across kernels, it can't store the coords, right?! So I'd rather store the coords in the cell cache, which is unique to every thread.

termi-official (Owner)

I see. I need some more time to think about this, though, and about how to make it compatible with the CPU assembly.

termi-official (Owner)

Can we comment this back in and make reinit!(v, cell_coords) a no-op on the GPU here?
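A minimal sketch of what such a dispatch-based no-op could look like (`StaticCellValuesSketch` is a hypothetical stand-in, not the PR's actual type):

```julia
# Hypothetical stand-in for the GPU-side cell values type.
struct StaticCellValuesSketch end

# Generic fallback: on the CPU path, reinit! would do real work.
reinit!(v, cell_coords) = error("reinit! not implemented for $(typeof(v))")

# GPU path: keep the call site in the iterator intact, but make it a no-op.
reinit!(v::StaticCellValuesSketch, cell_coords) = v

v = StaticCellValuesSketch()
@assert reinit!(v, nothing) === v   # the call stays, but does nothing
```

This keeps QuadratureValuesIterator identical on both backends; only the method table decides whether reinitialization happens.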

src/ferrite-addons/PR883.jl (outdated, resolved)
@termi-official (Owner) left a comment

Next batch of comments.

ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
nodes = Adapt.adapt_structure(to, grid.nodes |> cu)
#TODO subdomain info
return GPUGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes)
end
termi-official (Owner)

Suggested change
end
end

ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/PR913.jl (outdated, resolved)
src/ferrite-addons/gpu/gpudofhandler.jl (outdated, resolved)
@termi-official (Owner) left a comment

Thanks for the work on this!

Here now the next batch of comments from my side.

ext/cuda/cuda_memalloc.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
ext/cuda/cuda_adapt.jl (outdated, resolved)
src/gpu/gpu_utils.jl (resolved)
evaluate_coefficient(coeff::SpectralTensorCoefficientCache, cell_cache, qp::QuadraturePoint, t) = _evaluate_coefficient(coeff, cell_cache, qp, t)


evaluate_coefficient(coeff::SpectralTensorCoefficientCache, cell_cache::FerriteUtils.DeviceCellCache, qp::FerriteUtils.StaticQuadratureValues, t) = _evaluate_coefficient(coeff, cell_cache, qp, t)
termi-official (Owner)

Actually, the idea is that all the stuff like FerriteUtils.StaticQuadratureValues goes into the coefficient caches, exactly to avoid this kind of issue.

Abdelrahman912 (Collaborator, author)

Sorry, but I didn't really get what you mean by moving FerriteUtils.StaticQuadratureValues into the coefficient cache?
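For illustration, one possible reading of the suggestion above, sketched with hypothetical names (`DeviceCoefficientCacheSketch` is not a type from this PR; it just shows how carrying the device-specific type inside the cache removes the need for per-device `evaluate_coefficient` methods):

```julia
# Hypothetical cache that carries the device-specific quadrature-value type
# (e.g. a FerriteUtils.StaticQuadratureValues) as one of its own fields.
struct DeviceCoefficientCacheSketch{C, Q}
    coeff::C
    qv::Q
end

# Single generic entry point: no CPU/GPU method duplication at this level,
# because the device-specific data is already inside the cache.
evaluate_coefficient(cache::DeviceCoefficientCacheSketch, cell_cache, qp, t) =
    cache.coeff(qp, t)

cache = DeviceCoefficientCacheSketch((qp, t) -> 2.0 * t, nothing)
@assert evaluate_coefficient(cache, nothing, nothing, 3.0) == 6.0
```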

src/utils.jl (outdated, resolved)
test/gpu/runtests.jl (outdated, resolved)
test/gpu/test_coefficients.jl (outdated, resolved)
test/gpu/test_coefficients.jl (outdated, resolved)
@termi-official (Owner) left a comment

Looks great already. I added some minor comments, mostly regarding formatting and code style.

I think as a final step we should do some benchmarking. For this purpose, can you do something analogous to

https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl

for different (sufficiently small) grid sizes to compare CPU and GPU assembly performance? It can be simply two files generating some machine-readable output for now.
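A possible shape for such a driver, sketched with plain Base timing and a dummy assembly stand-in (`benchmark_assembly` and `assemble_dummy!` are hypothetical names; a real script would time the actual operator update for each grid size, and BenchmarkTools would give more robust statistics):

```julia
using Printf

# Minimum-of-N timing helper; returns milliseconds.
function benchmark_assembly(f, op; samples = 20)
    best = Inf
    for _ in 1:samples
        best = min(best, @elapsed f(op))
    end
    return best * 1e3
end

# Dummy stand-in for operator assembly, just to show the driver shape.
assemble_dummy!(b) = (fill!(b, 1.0); b)

# Machine-readable CSV output: backend, grid size, time in ms.
for n in (8, 16, 32)
    b = zeros(n * n)
    t = benchmark_assembly(assemble_dummy!, b)
    @printf "cpu,%d,%.4f\n" n t
end
```

A GPU twin of this file would only swap in the device operator and print `gpu,...` rows, so the two outputs can be joined for plotting.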

src/modeling/core/coefficients.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
ext/cuda/cuda_memalloc.jl (outdated, resolved)
ext/cuda/cuda_memalloc.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
detJdV::T
N::SVector{NumN, N_t}
dNdx::SVector{NumN, dNdx_t}
M::SVector{NumM, M_t}
weight::T
position::NTuple{dim, T}
termi-official (Owner)

What is "position" and what is "dim"?

termi-official (Owner)

This is just the QR right? If so, then we can just use it directly with SVector as storage type (see https://github.com/Ferrite-FEM/Ferrite.jl/blob/6eead259fc17802389b9d412f797ba59c1d0add5/src/Quadrature/quadrature.jl#L52-L61)

Abdelrahman912 (Collaborator, author)

I tried this solution before, as well as using QuadraturePoint, when I was initially doing this, but for some reason they couldn't be adapted for the GPU, which is why I had to go for the raw version.

These are my adapt functions for qr:

function Adapt.adapt_structure(to, cv::CellValues)
    fv = Adapt.adapt(to, StaticInterpolationValues(cv.fun_values))
    gm = Adapt.adapt(to, StaticInterpolationValues(cv.geo_mapping))
    n_quadpoints = cv.qr.weights |> length
    #weights = Adapt.adapt(to, ntuple(i -> cv.qr.weights[i], n_quadpoints))
    #ξs = Adapt.adapt(to, ntuple(i -> Adapt.adapt_structure(to, cv.qr.points[i]), n_quadpoints))
    qr = Adapt.adapt_structure(to, cv.qr)
    return StaticCellValues(fv, gm, qr)
end

function Adapt.adapt_structure(to, qr::QuadratureRule{shape}) where {shape}
    N = qr.weights |> length
    WT = qr.weights |> eltype
    VT = qr.points |> eltype
    weights = Adapt.adapt_structure(to, SVector{N, WT}(qr.weights))
    points = Adapt.adapt_structure(to, SVector{N, VT}(qr.points))
    return QuadratureRule{shape}(weights, points)
end

Error message:

    .cv is of type Thunderbolt.FerriteUtils.StaticCellValues{Thunderbolt.FerriteUtils.StaticInterpolationValues{Lagrange{RefQuadrilateral, 1, Nothing}, 4, 4, Float64, SMatrix{4, 4, Vec{2, Float64}, 16}, 16}, Thunderbolt.FerriteUtils.StaticInterpolationValues{Lagrange{RefQuadrilateral, 1, Nothing}, 4, 4, Float64, SMatrix{4, 4, Vec{2, Float64}, 16}, 16}} which is not isbits.
      .qr is of type QuadratureRule which is not isbits.
        .weights is of type Any which is not isbits.
        .points is of type Any which is not isbits.


Only bitstypes, which are "plain data" types that are immutable
and contain no references to other values, can be used in GPU kernels.
For more information, see the `Base.isbitstype` function.

I have a working alternative, which is as follows:

struct StaticCellValues{FV, GM, Nqp, T, dim}
    fv::FV # StaticInterpolationValues
    gm::GM # StaticInterpolationValues
    weights::NTuple{Nqp, T}
    ξs::NTuple{Nqp, Vec{dim, T}} # quadrature points
end

struct StaticQuadratureValues{T, N_t, dNdx_t, M_t, NumN, NumM, dim, Ti <: Integer} <: AbstractQuadratureValues
    detJdV::T
    N::SVector{NumN, N_t}
    dNdx::SVector{NumN, dNdx_t}
    M::SVector{NumM, M_t}
    weight::T
    ξ::Vec{dim, T}
    idx::Ti
end

termi-official (Owner)

What do you get for typeof(cv.qr) after the adapt call?

Abdelrahman912 (Collaborator, author)

typeof(qr) = QuadratureRule{RefQuadrilateral, SVector{4, Float64}, SVector{4, Vec{2, Float64}}}


src/ferrite-addons/gpu/device_grid.jl (outdated, resolved)
test/gpu/test_operators.jl (outdated, resolved)
@kylebeggs (Collaborator) left a comment

I had a few really minor comments, left on the relevant lines. The only major one is that the API for constructing operators doesn't feel unified: the existence of LinearOperator and GeneralLinearOperator seems to overlap. I believe the end goal should be to use GeneralLinearOperator for any backend, but that would make LinearOperator obsolete. Perhaps the PR is just not finished, or there are plans to address this later?

ext/CuThunderboltExt.jl (outdated, resolved)
src/ferrite-addons/gpu/device_dofhandler.jl (outdated, resolved)
src/ferrite-addons/gpu/device_grid.jl (resolved)
src/gpu/gpu_utils.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
@termi-official (Owner)

The only major comment is I feel like the API for constructing operators is not unified.

The APIs unfortunately diverged, as the initially unified API did not work out as I wanted. I have not prioritized this part yet, as it is currently considered internal API. I will come back to it after the first release. The long-term plan is to move the operators into a separate package, as it should be possible to construct them in a quite generic way.

The existence of LinearOperator and GeneralLinearOperator seems to overlap? I believe the end goal should be to use GeneralLinearOperator for any backend, but then that makes LinearOperator obsolete. Perhaps the PR is just not finished or there are plans to address this later?

Exactly. The "old" idea for the operator API was that we want different kinds of operators for different kinds of assembly types (e.g. ElementAssemblyParallelLinearOperator, ColorParallelLinearOperator) and different device backends. However, after some experimentation we figured out that it should be quite easy to have the different operators dispatch on the general type (e.g. LinearOperator, NonlinearOperator) plus some additional struct, which I denote the "strategy", that controls how exactly matrix and vector information is put together. Doing this split cleanly is future work, though. The goal for this PR is to get a GPU baseline ready for EP simulations. Does this explain the state?
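A rough sketch of that split, with all names hypothetical (these types are not in the PR; the point is one generic operator type whose assembly is selected by dispatching on a strategy field):

```julia
abstract type AbstractAssemblyStrategy end
struct SequentialAssemblyStrategy <: AbstractAssemblyStrategy end
struct ElementAssemblyStrategy    <: AbstractAssemblyStrategy end  # e.g. GPU per-element

# One generic operator type; the strategy field controls assembly.
struct LinearOperatorSketch{V, S <: AbstractAssemblyStrategy}
    b::V
    strategy::S
end

# Dispatch on the strategy, not on a per-backend operator type.
update_operator!(op::LinearOperatorSketch{<:Any, SequentialAssemblyStrategy}) =
    (fill!(op.b, 1.0); op)   # stand-in for sequential CPU assembly
update_operator!(op::LinearOperatorSketch{<:Any, ElementAssemblyStrategy}) =
    (fill!(op.b, 2.0); op)   # stand-in for device element-wise assembly

op = LinearOperatorSketch(zeros(4), SequentialAssemblyStrategy())
update_operator!(op)
@assert all(op.b .== 1.0)
```

Under this scheme a GeneralLinearOperator-style type would disappear: adding a backend means adding a strategy plus its methods, not a new operator type.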

@kylebeggs (Collaborator)

@termi-official yes, that explains it! Glad to hear the new approach is the long-term API goal because I think it's the best approach!

@termi-official (Owner) left a comment

Where are the benchmark files? I would like to measure the performance of the implementation on different GPUs.

Comment on lines +155 to +157
weight::T
ξ::Vec{dim,T}
idx::Ti
termi-official (Owner)

This is just QuadraturePoint right?

Abdelrahman912 (Collaborator, author)

Ah yes, I tried to use QuadraturePoint here, but there was a dependency error and also an Adapt issue, so I put the fields in a more or less raw format, without wrapping them in an additional struct, since they are already in StaticQuadratureValues.

ext/cuda/cuda_operator.jl (outdated, resolved)
ext/cuda/cuda_operator.jl (outdated, resolved)
@Abdelrahman912 (Collaborator, author)

Where are the benchmark files? I would like to measure the performance of the implementation on different GPUs.

Working on it right now.

Development

Successfully merging this pull request may close these issues.

GPU assembly of linear forms
3 participants