
Bulk read write operation


With the methods below, any number of elements can be written to and read from the virtual array in a single call. This avoids most of the per-element locking latency and increases average bandwidth compared to reading/writing elements one by one.

std::vector<Obj> vec;
vec.push_back(Obj(1));
vec.push_back(Obj(1));

// write all vector elements to vArray starting from index "i" of vArray
vArray.writeOnlySetN(i,vec);

// read the same 2 elements back
std::vector<Obj> result = vArray.readOnlyGetN(i,2);

// write only the last element of the vector (vector index 1, 1 element) to vArray at index i
vArray.writeOnlySet(i,vec,1 /* vector index */, 1 /* number of elements to write */);

Data Consistency

  • Concurrent (non-overlapping) bulk write + bulk write: consistent
  • Concurrent (overlapping or not) bulk read + bulk read: consistent
  • Concurrent (overlapping) bulk write + read: explicit thread-safety needed
  • Concurrent (overlapping) bulk write + bulk write: explicit thread-safety needed

A bulk read/write operation may span multiple pages, which triggers multiple page-in/page-out operations on the graphics cards. Atomicity is therefore not preserved across page boundaries: all elements within a single page are still updated atomically (thanks to the page-lock mechanism), but multiple pages are not synchronized with each other during a bulk read/write. The user must either keep concurrent bulk operations non-overlapping (regardless of whether the other operations are bulk or scalar) or add explicit thread-safety around overlapping ones.
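
For example, if two threads must touch overlapping regions, any ordinary external synchronization around the bulk calls is enough. Below is a minimal sketch, assuming the same vArray and Obj as above; the std::mutex is user-side locking, not part of the library:

#include <mutex>

std::mutex regionLock; // guards the overlapping index range (user-side, not part of the library)

// thread A: overlapping bulk write
{
	std::lock_guard<std::mutex> guard(regionLock);
	vArray.writeOnlySetN(i,vec);                      // may span several pages
}

// thread B: overlapping bulk read
{
	std::lock_guard<std::mutex> guard(regionLock);
	auto result = vArray.readOnlyGetN(i,vec.size());  // sees either old or new data, never a mix
}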

If bulk reads are perfectly aligned with paging (start index is a multiple of pageSize and range = pageSize), then there is no need for any explicit thread-safety.
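
A minimal sketch of such page-aligned bulk reads, assuming pageSize is the page size the array was constructed with and n is a multiple of it:

// page-aligned bulk reads: each call covers exactly one page, so no two
// threads ever share a page and no extra locking is required
#pragma omp parallel for num_threads(7)
for(size_t i=0;i<n;i+=pageSize)
{
	std::vector<Obj> page = vArray.readOnlyGetN(i,pageSize);
	// ... use page[0 .. pageSize-1] ...
}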

A bulk region can start anywhere and have any size, but the bulk methods are single-threaded internally, so a very small page size combined with a very large bulk region produces many recursion steps per call. Since it is tail recursion, C++ compilers (g++, Visual C++, ...) can optimize it into a loop, but the number of iterations still grows and may slow the operation. For example, a 1,000,000-element bulk write over 100-element pages takes roughly 10,000 internal steps.


Some benchmark data:

Test object:

struct Obj
{
	size_t x;
	Obj(){}
	Obj(size_t y){x=y;}
};

Test array:

VirtualMultiArray<Obj> data1(200000000,gpus,4000,5,{5 /* gt1030 */,10 /* k420 pcie 8x */,5 /* k420 pcie 4x */});

Scalar test:

#pragma omp parallel for num_threads(7)
for(size_t i=0;i<n;i++)
{
	Obj o=data1[i];
	if(o.x!=i)
	{
		std::cout<<"!!!! err"<<std::endl;
	}
}
  • result = 115 MB/s.

Bulk read:

#pragma omp parallel for num_threads(7)
for(size_t i=0;i<n;i+=elementsPerRead)
{	
	auto vec = data1.readOnlyGetN(i,elementsPerRead);
	for(int k=0;k<elementsPerRead;k++)
	{

		if(vec[k].x != i+k)
		{
			std::cout<<"err!!"<<std::endl;
		}
	}
}
  • 5 elements per read = 277 MB/s

  • 50 elements per read = 894 MB/s

  • 500 elements per read = 1350 MB/s

  • 5000 elements per read = 1600 MB/s <--- sweet spot since page size is 4000

  • 50000 elements per read = 650 MB/s

  • Perfectly overlapping pages with bulks (4000 elements per read, also matching starting index): 1930 MB/s

  • Integer multiples of page (8000 elements) = 1750 MB/s

  • 8000 page elements, 8000 elements per read = 3160 MB/s

  • 40000 page elements, 8000 elements per read = 2420 MB/s

  • 40000 page elements, 500 elements per read = 1485 MB/s (10% better than 4000 page elements)


Concurrent read+write test with pageSize=4000 (reading from data1 and writing the same elements to a second virtual array, data2, configured the same way):

#pragma omp parallel for num_threads(7)
for(size_t i=0;i<n;i+=elementsPerRead)
{
	auto vec = data1.readOnlyGetN(i,elementsPerRead);
	for(int k=0;k<elementsPerRead;k++)
	{
		if(vec[k].x != i+k)
		{
			std::cout<<"err!!"<<std::endl;
		}
	}
	data2.writeOnlySetN(i,vec);
}

  • 5 elements per read/write = 73 MB/s (PCIe overhead dominates; two transfers per iteration roughly halves the bandwidth)

  • 50 elements per read/write = 362 MB/s (x4.95)

  • 500 elements per read/write = 854 MB/s (x2.36)

  • 5000 elements per read/write = 1550 MB/s (nearly the same as the read-only test)

  • 8000 elements per read/write with 8000-element pages = 2830 MB/s


With a bigger object, a page size of 128 elements, and an array of only 128k elements:

struct Obj
{
	Obj(){}
	Obj(size_t y){x=y;}
	size_t x;
	char data[1000];
};
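
The array declarations for this case are not shown; a minimal sketch, assuming they mirror the earlier data1 declaration with only the element count and page size changed (the GPU list and the remaining parameters below are placeholders, not values from the original benchmark):

// 128k elements, 128-element pages; other arguments are assumed, mirroring the earlier declaration
VirtualMultiArray<Obj> data1(131072,gpus,128,5,{5,10,5});
VirtualMultiArray<Obj> data2(131072,gpus,128,5,{5,10,5});
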
  • 5 elements per read/write: 920 MB/s
  • 50 elements per read/write: 1950 MB/s
  • 128 elements per read/write: 2600 MB/s
  • 500 elements per read/write: 1330 MB/s
  • 1000 elements per read/write: 750 MB/s