-
Notifications
You must be signed in to change notification settings - Fork 50
Initial phase of integrating new BL:AS 2 subroutines from merge_reduction… #61
Initial phase of integrating new BL:AS 2 subroutines from merge_reduction… #61
Conversation
if (mA.getAccess()) { | ||
auto scalOp1 = make_op<ScalarOp, prdOp2_struct>(_beta, vy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain what this change is here? As far as I can tell, it's cleaning up the code to pick one implementation, but I could be very wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. The main backend code for SYCL BLAS GEMV has changed so now whether it is row-major or column major, there are two separate computation. However, the make_op is the same for both, it has been moved affter if/else.
include/operations/blas2_trees.hpp
Outdated
value_type eval(IndexType i) { | ||
bool valid_thread(cl::sycl::nd_item<1> ndItem) { return true; } | ||
|
||
value_type eval(IndexType i) { // NOT VERIFIED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "NOT VERIFIED" mean in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added todo list for @josealiaga to check if this function is not used and extra, so he can let us know and we can remove it from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the method value_type eval(IndexType i) { .. } has to be defined in all structs.
In this case, it only implements the most simple implementation of the matrix-vector product by rows
include/queue/queue_base.hpp
Outdated
*/ | ||
inline bool has_local_memory() { return false; } | ||
/* | ||
@brief This class is used to allocated the a regin of memory on the device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/the a regin/a region/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
include/queue/queue_base.hpp
Outdated
class Queue_Interface<Sequential> { | ||
Queue_Interface() {} | ||
/* | ||
@brief This class is to determine whether or not the underlying device has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class? (same for other member functions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are various code style and documentation concerns. In general I think is great we have now level 2 implementation (thanks @josealiaga !) but we need now to refactor the codebase to make the code a bit clearer. We can do that on other pull request.
include/executors/executor_sycl.hpp
Outdated
// to the tree not the root | ||
// printf("Index %ld\n", index.get_global_id(0)); | ||
if (tree.valid_thread(index)) { // FIXME (Mehdi):: this should move | ||
// to the tree not the root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I understand the FIXME, can you create a gihub issue and describe it more clearly?
include/executors/executor_sycl.hpp
Outdated
@@ -263,7 +262,9 @@ class Executor<SYCL> { | |||
*/ | |||
Executor(cl::sycl::queue q) : q_interface(q){}; | |||
|
|||
cl::sycl::queue sycl_queue() const { return q_interface.sycl_queue(); } | |||
inline Queue_Interface<SYCL> &policy_handler() { return q_interface; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you return a reference to the policy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the queue interface contains the pointer mapper class which deals with allocation and deallocation of buffers and memory. The call to
implicitly-deleted copy constructor of 'Queue_Interface' error will rise in case of not returning the reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is the error :
'Queue_Interface' is implicitly deleted because field 'pointer_mapper' has a deleted copy
constructor
mutable cl::sycl::codeplay::PointerMapper pointer_mapper;
^
../../../sycl-blas/include/queue/pointer_mapper.hpp:310:3: note: 'PointerMapper' has been
explicitly marked deleted here
PointerMapper(const PointerMapper&) = delete;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used std::shared_ptr to remove the reference
include/executors/executor_sycl.hpp
Outdated
@@ -323,7 +324,7 @@ class Executor<SYCL> { | |||
*/ | |||
template <typename T> | |||
inline void copy_to_device(T *src, buffer_iterator<T> dst, size_t = 0) { | |||
sycl_queue().submit([&](cl::sycl::handler &cgh) { | |||
queue().submit([&](cl::sycl::handler &cgh) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think is clearer if you rename this to "get_queue" so it is not confused with a constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
auto localSize = 256; | ||
auto nWG = 512; | ||
|
||
auto localSize = ex.policy_handler().get_work_group_size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto localSize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
auto nWG = 512; | ||
|
||
auto localSize = ex.policy_handler().get_work_group_size(); | ||
auto nWG = 2 * localSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto nWG? Also, why 2? can you ellaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. @josealiaga could you elaborate why this is fixed for 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, first, read the attached document: SYCL_BLAS1_reduction.txt
If you analyze my original code, you will see that two alternatives are considered, related to the alternatives A and B in the attached document. In the merged code, only the case B appears, whereas the case A is the best one for OpenCL devices that I tested.
} | ||
} | ||
} | ||
return l.eval(frs_row, idWFC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a lot of duplicated code between the different variants of gemv, cannot this be shared on a separate struct?
Maybe it should be a separate issue that is smaller than this branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created
include/operations/blas2_trees.hpp
Outdated
l.eval(i) = scl * val + r2.eval(i); | ||
return val; | ||
} | ||
IndexType getSize() { return r1.getSize(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be const everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
if (rowid > frs_row) | ||
// This barrier is mandatory to be sure the data is on the shared | ||
// memory | ||
ndItem.barrier(cl::sycl::access::fence_space::local_space); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we ensure all threads go through this barrier?
The decision of thread cancelation is not in this part of the code so its difficult to understand if this is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code should be restructured and re-written and the barrier should be out side of the if statement. I have created a issue for it
// TO SOLVE IT, USE GLOBAL VALUES OF frs_col AND lst_col | ||
if ((!Upper && (((idWFC * dimWFC) + ((!Diag) ? 1 : 0)) > (lst_row - 1))) || | ||
(!Lower && | ||
((frs_row + ((!Diag) ? 1 : 0)) > ((idWFC * dimWFC + dimWFC) - 1)))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we explain what this condition means?
Extract the boolean decisions outside of the if, and use const bool temp variables to explain each step, otherwise its difficult to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@josealiaga could you please add some documentation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some BLAS routines (GER, SYR and SYR2) are computed by using the same nodes (Ger_Row and Ger_col), but changing the Single, Lower, Upper and Diag parameters.
If only one of the triangle of the matrix has to be computed (SYR and SYR2), the function verifies when the computation has to be made.
I don't remember if Diag can be false in some case, but I probably consider the more general case, and I also included it as parameter
wGSize |= (wGSize >> 4); | ||
wGSize |= (wGSize >> 8); | ||
wGSize |= (wGSize >> 16); | ||
#if defined(__x86_64__) || defined(_M_X64) || defined(__amd64) || \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract that set of macros onto a separate hader and say IS_64BIT or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are creating a syclblas.hpp file and this macro will be added there, in aseparate PR.
… branch.