Question on design choice in merge-spmv shared memory loading

Great work!

I have a few questions regarding the **merge-spmv** implementation. In particular, I noticed that in the `csrgemv_merge` function, lines 366–371 contain the following code:

```cpp
#pragma unroll
for (int j = 0; j < TASKS_PER_THREAD; j++) {
    sa[j*THREADS_PER_BLOCK + threadIdx.x] =
        alpha * a[kmin + j*THREADS_PER_BLOCK + threadIdx.x] *
        x[colidx[kmin + j*THREADS_PER_BLOCK + threadIdx.x]];
    srowptr[j*THREADS_PER_BLOCK + threadIdx.x] =
        rowptr[imin + j*THREADS_PER_BLOCK + threadIdx.x];
}
```

When loading the row offsets (list A) and the per-nonzero products `alpha * a[k] * x[colidx[k]]` (list B) into shared memory, this code appears to do more work than necessary. Starting from the point `(imin, kmin)`, each block loads `THREADS_PER_BLOCK * TASKS_PER_THREAD` elements from list A and the same number again from list B, so it advances by a full tile both **to the right** and **downward** in the merge grid.

This seems to differ from the implementation described in the original merge-spmv paper ([https://ieeexplore.ieee.org/document/7877136](https://ieeexplore.ieee.org/document/7877136)). In merge-spmv, once the coordinates of the start and end points have been computed, the elements consumed from list A and list B *together* total `THREADS_PER_BLOCK * TASKS_PER_THREAD`.
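To make the comparison concrete, here is a minimal host-side sketch of the diagonal search as I understand it from the paper. All names (`Coord`, `merge_path_search`, `tile_extent`, `row_end`) are hypothetical and not taken from this repository; `row_end` is assumed to hold `rowptr[1..num_rows]`, and list B is assumed to be the implicit sequence `0, 1, ..., nnz-1`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical types and names, for illustration only.
struct Coord { int row; int nz; };

// Find where `diagonal` crosses the merge path of list A (row end offsets)
// and list B (the nonzero indices 0..nnz-1, so B[k] == k).
Coord merge_path_search(int diagonal, const std::vector<int>& row_end, int nnz) {
    const int num_rows = static_cast<int>(row_end.size());
    assert(diagonal >= 0 && diagonal <= num_rows + nnz);
    int lo = std::max(diagonal - nnz, 0);
    int hi = std::min(diagonal, num_rows);
    while (lo < hi) {
        const int mid = (lo + hi) / 2;
        // Compare A[mid] with B[diagonal - mid - 1].
        if (row_end[mid] <= diagonal - mid - 1) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return {lo, diagonal - lo};
}

// Number of list A and list B elements one tile consumes.
struct TileExtent { int rows; int nzs; };

TileExtent tile_extent(int tile, int items_per_tile,
                       const std::vector<int>& row_end, int nnz) {
    const int total = static_cast<int>(row_end.size()) + nnz;
    const int d0 = std::min(tile * items_per_tile, total);
    const int d1 = std::min(d0 + items_per_tile, total);
    const Coord start = merge_path_search(d0, row_end, nnz);
    const Coord end = merge_path_search(d1, row_end, nnz);
    // rows + nzs == d1 - d0, so never more than items_per_tile in total.
    return {end.row - start.row, end.nz - start.nz};
}
```

With `items_per_tile = THREADS_PER_BLOCK * TASKS_PER_THREAD`, a tile in this sketch reads `rows + nzs <= items_per_tile` elements across both lists, whereas the loop quoted above reads `items_per_tile` elements from each list.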

Could you please share the rationale behind this design choice? Do you expect these two approaches to exhibit any performance differences in practice?

