-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU linear operators #163
base: main
Are you sure you want to change the base?
GPU linear operators #163
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick review.
Can you also add a benchmark script for CPU vs GPU in benchmarks/operators/linear-operators/
? You can use https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl as baseline.
src/discretization/operator.jl
Outdated
struct BackendCUDA <: AbstractBackend end | ||
struct BackendCPU <: AbstractBackend end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason why we do not use KernelAbstractions backends for the dispatch? If there is not, then please wait before adapting it. We might need to discuss this one in more detail (or even a separtate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The things is that KA doesn't allow dynamic memory allocations for shared arrays all the interfaces they provide are static shared and I try to use dynamic mem for local vectors and matrices if it can fit the memory. I thought that we can use their interfaces but it would be installing a whole library for just using their structs wouldn't be the optimal thing to do.
src/ferrite-addons/PR883.jl
Outdated
@@ -10,7 +10,7 @@ struct QuadratureValuesIterator{VT,XT} | |||
return new{V, Nothing}(v, nothing) | |||
end | |||
function QuadratureValuesIterator(v::V, cell_coords::VT) where {V, VT <: AbstractArray} | |||
reinit!(v, cell_coords) | |||
#reinit!(v, cell_coords) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because cell value is a shared instance across kernels so it can't store the coords, right?! so I rather store the coords in the cell cache which is unique to every thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I need some more time to think about this tho and how to make this compatible with the CPU assembly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we comment this in again and make reinit!(v, cell_coords) an no-op on the GPU here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next batch of comments.
ext/cuda/cuda_adapt.jl
Outdated
nodes = Adapt.adapt_structure(to, grid.nodes |> cu) | ||
#TODO subdomain info | ||
return GPUGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
end | |
end | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work on this!
Here now the next batch of comments from my side.
src/modeling/core/coefficients.jl
Outdated
evaluate_coefficient(coeff::SpectralTensorCoefficientCache, cell_cache, qp::QuadraturePoint, t) = _evaluate_coefficient(coeff, cell_cache, qp, t) | ||
|
||
|
||
evaluate_coefficient(coeff::SpectralTensorCoefficientCache, cell_cache::FerriteUtils.DeviceCellCache, qp::FerriteUtils.StaticQuadratureValues, t) = _evaluate_coefficient(coeff, cell_cache, qp, t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the idea is that all the stuff like FerriteUtils.StaticQuadratureValues
goes into the coefficient caches, to exactly avoid this kind of issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, but I didn't really get what you mean by moving FerriteUtils.StaticQuadratureValues
to coefficient cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great already. I added some minor comments, mostly regarding formatting and code style.
I think as a final step we should do some benchmarking. For this purpose, can you do somthing analogue to
https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl
for different (sufficiently small) grid sizes to compare CPU and GPU assembly performance? It can be simply two files generating some machine-readable output for now.
src/ferrite-addons/PR883.jl
Outdated
detJdV::T | ||
N::SVector{NumN, N_t} | ||
dNdx::SVector{NumN, dNdx_t} | ||
M::SVector{NumM, M_t} | ||
weight::T | ||
position::NTuple{dim, T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "position" and what is "dim"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just the QR right? If so, then we can just use it directly with SVector as storage type (see https://github.com/Ferrite-FEM/Ferrite.jl/blob/6eead259fc17802389b9d412f797ba59c1d0add5/src/Quadrature/quadrature.jl#L52-L61)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried this solution before as well as using QuadraturePoint when I was initially doing this but for some reason they weren't abe to be adapted on GPU that's why I had to go for the raw version.
These are my adapt function for qr:
function Adapt.adapt_structure(to, cv::CellValues)
fv = Adapt.adapt(to, StaticInterpolationValues(cv.fun_values))
gm = Adapt.adapt(to, StaticInterpolationValues(cv.geo_mapping))
n_quadoints = cv.qr.weights |> length
#weights = Adapt.adapt(to, ntuple(i -> cv.qr.weights[i], n_quadoints))
#ΞΎs = Adapt.adapt(to,ntuple(i -> Adapt.adapt_structure(to,cv.qr.points[i]), n_quadoints))
qr = Adapt.adapt_structure(to, cv.qr)
return StaticCellValues(fv, gm,qr)
end
function Adapt.adapt_structure(to, qr::QuadratureRule{shape}) where {shape}
N = qr.weights |> length
WT = qr.weights |> eltype
VT = qr.points |> eltype
weights = Adapt.adapt_structure(to,SVector{N,WT}(qr.weights))
points = Adapt.adapt_structure(to,SVector{N,VT}(qr.points))
return QuadratureRule{shape}(weights, points)
end
error message
.cv is of type Thunderbolt.FerriteUtils.StaticCellValues{Thunderbolt.FerriteUtils.StaticInterpolationValues{Lagrange{RefQuadrilateral, 1, Nothing}, 4, 4, Float64, SMatrix{4, 4, Vec{2, Float64}, 16}, 16}, Thunderbolt.FerriteUtils.StaticInterpolationValues{Lagrange{RefQuadrilateral, 1, Nothing}, 4, 4, Float64, SMatrix{4, 4, Vec{2, Float64}, 16}, 16}} which is not isbits.
.qr is of type QuadratureRule which is not isbits.
.weights is of type Any which is not isbits.
.points is of type Any which is not isbits.
Only bitstypes, which are "plain data" types that are immutable
and contain no references to other values, can be used in GPU kernels.
For more information, see the `Base.isbitstype` function.
I did a working alternative which is as follows:
struct StaticCellValues{FV, GM, Nqp, T,dim}
fv::FV # StaticInterpolationValues
gm::GM # StaticInterpolationValues
weights::NTuple{Nqp, T}
ΞΎs::NTuple{Nqp,Vec{dim,T}} # quadrature points
end
struct StaticQuadratureValues{T, N_t, dNdx_t, M_t, NumN, NumM,dim ,Ti<:Integer} <: AbstractQuadratureValues
detJdV::T
N::SVector{NumN, N_t}
dNdx::SVector{NumN, dNdx_t}
M::SVector{NumM, M_t}
weight::T
ΞΎ::Vec{dim,T}
idx::Ti
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you get for typeof(cv.qr)
after the adapt call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typeof(qr) = QuadratureRule{RefQuadrilateral, SVector{4, Float64}, SVector{4, Vec{2, Float64}}}
src/ferrite-addons/PR883.jl
Outdated
@@ -10,7 +10,7 @@ struct QuadratureValuesIterator{VT,XT} | |||
return new{V, Nothing}(v, nothing) | |||
end | |||
function QuadratureValuesIterator(v::V, cell_coords::VT) where {V, VT <: AbstractArray} | |||
reinit!(v, cell_coords) | |||
#reinit!(v, cell_coords) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we comment this in again and make reinit!(v, cell_coords) an no-op on the GPU here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a few really minor comments I left on the relevant lines. The only major comment is I feel like the API for constructing operators is not unified. The existence of LinearOperator
and GeneralLinearOperator
seems to overlap? I believe the end goal should be to use GeneralLinearOperator
for any backend, but then that makes LinearOperator
obsolete. Perhaps the PR is just not finished or there are plans to address this later?
The APIs unfortunately diverged, as the initially unified API unfortunately did not work out as I wanted. I have not prioritized this part yet, as it is currently considered internal API. I will come back to this after the first release. The long term plan is to move the operators into a separate package, as they should be able to construct in a quite generic way.
Exactly. The "old" idea for the operator API was that we want different kinds of operators for different kinds of assembly types (e.g. ElementAssemblyParallelLinearOperator, ColorParallelLinearOperator) and different device backends. However, after some experimentation we figured out that it should be quite easy to just have the different operators to dispatch the general type (e.g. LinearOperator, NonlinearOperator) and some additional struct, which I denote by "strategy", controls how exactly matrix and vector information is put together. However, doing this split cleanly is future work. The goal for this PR is to get a GPU baseline ready for EP simulations. Does this explain the state? |
@termi-official yes, that explains it! Glad to hear the new approach is the long-term API goal because I think it's the best approach! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are the benchmark files? I would like to measure the performance of the implementation on different GPUs.
weight::T | ||
ΞΎ::Vec{dim,T} | ||
idx::Ti |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just QuadraturePoint
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I tried to put Qudraturepoint here but there was a dependency error, also Adapt issue, so I put them in more or less raw format without encompassing them in an additional as they are already in StaticQuadratureValues
working on it rn. |
Just initial rough ideas for the design of GPU linear operators π§βπ