Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Initial phase of integrating new BL:AS 2 subroutines from merge_reduction… #61

Merged
merged 13 commits into from
Aug 17, 2018

Conversation

mehdi-goli
Copy link
Collaborator

… branch.

@mehdi-goli mehdi-goli added the WIP label Aug 7, 2018
@mehdi-goli mehdi-goli changed the title Initial phase of integrating new BL:AS 2 subroutines from merge_reduction… [WIP] Initial phase of integrating new BL:AS 2 subroutines from merge_reduction… Aug 7, 2018
@mehdi-goli mehdi-goli removed the WIP label Aug 10, 2018
@mehdi-goli mehdi-goli changed the title [WIP] Initial phase of integrating new BL:AS 2 subroutines from merge_reduction… Initial phase of integrating new BL:AS 2 subroutines from merge_reduction… Aug 10, 2018
if (mA.getAccess()) {
auto scalOp1 = make_op<ScalarOp, prdOp2_struct>(_beta, vy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what this change is here? As far as I can tell, it's cleaning up the code to pick one implementation, but I could be very wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. The main backend code for SYCL BLAS GEMV has changed so now whether it is row-major or column major, there are two separate computation. However, the make_op is the same for both, it has been moved affter if/else.

value_type eval(IndexType i) {
bool valid_thread(cl::sycl::nd_item<1> ndItem) { return true; }

value_type eval(IndexType i) { // NOT VERIFIED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "NOT VERIFIED" mean in this context?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added todo list for @josealiaga to check if this function is not used and extra, so he can let us know and we can remove it from here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the method value_type eval(IndexType i) { .. } has to be defined in all structs.
In this case, it only implements the most simple implementation of the matrix-vector product by rows

*/
inline bool has_local_memory() { return false; }
/*
@brief This class is used to allocated the a regin of memory on the device
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/the a regin/a region/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

class Queue_Interface<Sequential> {
Queue_Interface() {}
/*
@brief This class is to determine whether or not the underlying device has
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class? (same for other member functions).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@Ruyk Ruyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are various code style and documentation concerns. In general I think is great we have now level 2 implementation (thanks @josealiaga !) but we need now to refactor the codebase to make the code a bit clearer. We can do that on other pull request.

// to the tree not the root
// printf("Index %ld\n", index.get_global_id(0));
if (tree.valid_thread(index)) { // FIXME (Mehdi):: this should move
// to the tree not the root
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understand the FIXME, can you create a gihub issue and describe it more clearly?

@@ -263,7 +262,9 @@ class Executor<SYCL> {
*/
Executor(cl::sycl::queue q) : q_interface(q){};

cl::sycl::queue sycl_queue() const { return q_interface.sycl_queue(); }
inline Queue_Interface<SYCL> &policy_handler() { return q_interface; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you return a reference to the policy?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the queue interface contains the pointer mapper class which deals with allocation and deallocation of buffers and memory. The call to
implicitly-deleted copy constructor of 'Queue_Interface' error will rise in case of not returning the reference

Copy link
Collaborator Author

@mehdi-goli mehdi-goli Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the error :
'Queue_Interface' is implicitly deleted because field 'pointer_mapper' has a deleted copy
constructor
mutable cl::sycl::codeplay::PointerMapper pointer_mapper;
^
../../../sycl-blas/include/queue/pointer_mapper.hpp:310:3: note: 'PointerMapper' has been
explicitly marked deleted here
PointerMapper(const PointerMapper&) = delete;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used std::shared_ptr to remove the reference

@@ -323,7 +324,7 @@ class Executor<SYCL> {
*/
template <typename T>
inline void copy_to_device(T *src, buffer_iterator<T> dst, size_t = 0) {
sycl_queue().submit([&](cl::sycl::handler &cgh) {
queue().submit([&](cl::sycl::handler &cgh) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is clearer if you rename this to "get_queue" so it is not confused with a constructor

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

auto localSize = 256;
auto nWG = 512;

auto localSize = ex.policy_handler().get_work_group_size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto localSize?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

auto nWG = 512;

auto localSize = ex.policy_handler().get_work_group_size();
auto nWG = 2 * localSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto nWG? Also, why 2? can you ellaborate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. @josealiaga could you elaborate why this is fixed for 2?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, first, read the attached document: SYCL_BLAS1_reduction.txt

If you analyze my original code, you will see that two alternatives are considered, related to the alternatives A and B in the attached document. In the merged code, only the case B appears, whereas the case A is the best one for OpenCL devices that I tested.

}
}
}
return l.eval(frs_row, idWFC);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a lot of duplicated code between the different variants of gemv, cannot this be shared on a separate struct?
Maybe it should be a separate issue that is smaller than this branch

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created

l.eval(i) = scl * val + r2.eval(i);
return val;
}
IndexType getSize() { return r1.getSize(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be const everywhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

if (rowid > frs_row)
// This barrier is mandatory to be sure the data is on the shared
// memory
ndItem.barrier(cl::sycl::access::fence_space::local_space);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we ensure all threads go through this barrier?
The decision of thread cancelation is not in this part of the code so its difficult to understand if this is correct

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should be restructured and re-written and the barrier should be out side of the if statement. I have created a issue for it

// TO SOLVE IT, USE GLOBAL VALUES OF frs_col AND lst_col
if ((!Upper && (((idWFC * dimWFC) + ((!Diag) ? 1 : 0)) > (lst_row - 1))) ||
(!Lower &&
((frs_row + ((!Diag) ? 1 : 0)) > ((idWFC * dimWFC + dimWFC) - 1)))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we explain what this condition means?
Extract the boolean decisions outside of the if, and use const bool temp variables to explain each step, otherwise its difficult to understand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josealiaga could you please add some documentation here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some BLAS routines (GER, SYR and SYR2) are computed by using the same nodes (Ger_Row and Ger_col), but changing the Single, Lower, Upper and Diag parameters.
If only one of the triangle of the matrix has to be computed (SYR and SYR2), the function verifies when the computation has to be made.
I don't remember if Diag can be false in some case, but I probably consider the more general case, and I also included it as parameter

wGSize |= (wGSize >> 4);
wGSize |= (wGSize >> 8);
wGSize |= (wGSize >> 16);
#if defined(__x86_64__) || defined(_M_X64) || defined(__amd64) || \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract that set of macros onto a separate hader and say IS_64BIT or something like that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are creating a syclblas.hpp file and this macro will be added there, in aseparate PR.

@mehdi-goli mehdi-goli merged commit b6329fc into codeplaysoftware:master Aug 17, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants