Bulk read write operation
With the methods below, any number of elements can be written to and read from the virtual array in a single call. This avoids much of the per-element locking latency and increases average bandwidth compared to reading/writing elements one by one.
std::vector<Obj> vec;
vec.push_back(Obj(1));
vec.push_back(Obj(1));

// write all vector elements to vArray, starting at index "i" of vArray
vArray.writeOnlySetN(i, vec);

// read the same 2 elements back
std::vector<Obj> result = vArray.readOnlyGetN(i, 2);

// write only the last element of the vector to vArray
vArray.writeOnlySet(i, vec, 1 /* vector index */, 1 /* number of elements to write */);
- Concurrent (non-overlapping) bulk write + bulk write: consistent
- Concurrent (overlapping or not) bulk read + bulk read: no problem
- Concurrent (overlapping) bulk write + bulk read: explicit thread-safety needed
- Concurrent (overlapping) bulk write + bulk write: explicit thread-safety needed
A bulk read/write operation may span multiple pages, and this causes multiple page-in/page-out operations to the graphics cards. This means atomicity is not preserved across page boundaries: all elements within a single page are still updated atomically (due to the page-lock mechanism), but multiple pages are not synchronized with each other during a bulk read/write. The user either needs to keep bulk operations non-overlapping with other bulk/scalar operations, or add explicit synchronization as in the sketch below.
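A minimal sketch of such user-side synchronization, assuming the vArray, vec, and index i from the snippet above (the mutex and its name are not part of the library; locking is the caller's responsibility):

#include <mutex>

std::mutex bulkGuard; // user-side lock; the library does not synchronize across pages

// writer thread: overlapping bulk write
{
    std::lock_guard<std::mutex> lock(bulkGuard);
    vArray.writeOnlySetN(i, vec);
}

// reader thread: overlapping bulk read of the same region
std::vector<Obj> snapshot;
{
    std::lock_guard<std::mutex> lock(bulkGuard);
    snapshot = vArray.readOnlyGetN(i, vec.size());
}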
If bulk accesses align perfectly with the paging (start index is a multiple of pageSize, i.e. index % pageSize == 0, and range = pageSize), then there is no need for any explicit thread-safety.
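For example, a page-aligned sweep over the whole array could look like the following sketch, assuming pageSize equals the page size passed to the constructor and n is a multiple of it:

// each bulk call covers exactly one page: the start index is a multiple of
// pageSize and the range equals pageSize, so no user-side locking is needed
for(size_t start = 0; start < n; start += pageSize)
{
    std::vector<Obj> page = vArray.readOnlyGetN(start, pageSize);
    // ... process the page ...
    vArray.writeOnlySetN(start, page);
}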
A bulk region can start anywhere and span any range, but the bulk methods themselves are single-threaded, so combining a very small page size with a very large bulk region creates many recursion steps inside the method. Since this is tail recursion, C++ compilers (g++, Visual C++, ...) can optimize it, but the number of iterations still grows and may slow the operation.
Test object:
struct Obj
{
    size_t x;
    Obj(){}
    Obj(size_t y){ x = y; }
};
Test array:
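// 200M elements with 4000-element pages; the trailing {5,10,5} list presumably weights how pages are distributed across the three cards (see the inline card comments)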
VirtualMultiArray<Obj> data1(200000000,gpus,4000,5,{5 /* gt1030 */,10 /* k420 pcie 8x */,5 /* k420 pcie 4x */});
Scalar test:
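// assumptions for this snippet: n is the array size (200000000) and data1 has been pre-filled so that element i holds Obj(i)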
#pragma omp parallel for num_threads(7)
for(size_t i=0;i<n;i++)
{
    Obj o = data1[i];
    if(o.x != i)
    {
        std::cout<<"!!!! err"<<std::endl;
    }
}
- result = 115 MB/s.
Bulk read test:
#pragma omp parallel for num_threads(7)
for(size_t i=0;i<n;i+=elementsPerRead)
{
    auto vec = data1.readOnlyGetN(i,elementsPerRead);
    for(int k=0;k<elementsPerRead;k++)
    {
        if(vec[k].x != i+k)
        {
            std::cout<<"err!!"<<std::endl;
        }
    }
}
- 5 elements per read = 277 MB/s
- 50 elements per read = 894 MB/s
- 500 elements per read = 1350 MB/s
- 5000 elements per read = 1600 MB/s <--- sweet spot, since page size is 4000
- 50000 elements per read = 650 MB/s
- Bulks perfectly overlapping pages (4000 elements per read, starting index also matching a page boundary): 1930 MB/s
- Integer multiples of the page size (8000 elements per read) = 1750 MB/s
- 8000-element pages, 8000 elements per read = 3160 MB/s
- 40000-element pages, 8000 elements per read = 2420 MB/s
- 40000-element pages, 500 elements per read = 1485 MB/s (10% better than 4000-element pages)
Concurrent read + write test with pageSize=4000:
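The destination array data2 is not declared in the original snippet; presumably it is a second VirtualMultiArray<Obj> created with the same settings as data1, e.g.:

VirtualMultiArray<Obj> data2(200000000,gpus,4000,5,{5,10,5}); // assumed: mirror of data1's construction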
#pragma omp parallel for num_threads(7)
for(size_t i=0;i<n;i+=elementsPerRead)
{
    auto vec = data1.readOnlyGetN(i,elementsPerRead);
    for(int k=0;k<elementsPerRead;k++)
    {
        if(vec[k].x != i+k)
        {
            std::cout<<"err!!"<<std::endl;
        }
    }
    data2.writeOnlySetN(i,vec);
}
- 5 elements per read/write = 73 MB/s (PCIe overhead is in effect: 2x operations per iteration = 1/2 the bandwidth)
- 50 elements per read/write = 362 MB/s (x4.95 over the previous size)
- 500 elements per read/write = 854 MB/s (x2.36 over the previous size)
- 5000 elements per read/write = 1550 MB/s (nearly the same as the read-only test)
- 8000 elements per read/write (page size = bulk size) = 2830 MB/s
With a bigger object, a page size of 128 elements, and an array of only 128k elements:
struct Obj
{
    Obj(){}
    Obj(size_t y){ x = y; }
    size_t x;
    char data[1000];
};
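The corresponding array construction is not shown; by analogy with data1 above it would presumably look something like this (device list and per-card weights are carried over as assumptions):

VirtualMultiArray<Obj> data1(128*1024, gpus, 128, 5, {5,10,5}); // assumed: 128k elements, 128-element pages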
- 5 elements per read/write: 920 MB/s
- 50 elements per read/write: 1950 MB/s
- 128 elements per read/write: 2600 MB/s
- 500 elements per read/write: 1330 MB/s
- 1000 elements per read/write: 750 MB/s