Skip to content

A GPU accelerated find() method

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Feb 26, 2021 · 18 revisions

VirtualMultiArray has a find(T object, S objectMember, int maxIndices) method that searches for objects with same member value (same byte representation) and returns a std::vector containing their indices (but with maxIndices elements at most per gpu, if found).

  • Search value (S objectMember) needs to be in same parent T object for proper calculations. Using different parent objects for both parameters leads to errors.
  • If an array member (like struct test{ int data[100]; }) needs to be searched for exact byte representation, then it needs to be inside another struct (like struct search_layer { int data[100]; }); struct test { search_layer }; and that struct can be used for searching.
  • If no elements are found, returned vector size is 0.
  • Multiple index order is not guaranteed to be same between repeated searches. OpenCL kernel finds elements and saves results using atomic functions. This lets find() complete quicker than a reduction algorithm when there are only few indices to find.
  • Since finding 1 element has nearly same timing with finding 200 elements, it can beat std::find for big arrays, without even using RAM bandwidth(nor CPU cache bandwidth).
  • If array template parameter is plain data, like char, then searching it needs itself as member:
     char searchObj = 'X';
     std::vector<size_t> found = charArray.find(searchObj,searchObj);

this lets the find() method calculate "member offset" as 0 as expected and finds 'X' element in array.

It's not very accelerated currently but it can do this at "nearly-constant" time:

preparing virtual array...
initializing data...
benchmarking find()...
test object: 5
found object at index: 5
data search: 40MB         358ms
data: 5
----------------------
----------------------
----------------------
test object: 100005
found object at index: 100005
data search: 40MB         6ms
data: 100005
----------------------
----------------------
----------------------
test object: 200005
found object at index: 200005
data search: 40MB         6ms
data: 200005

....

test object: 7800005
found object at index: 7800005
data search: 40MB         6ms
data: 7800005
----------------------
----------------------
----------------------
test object: 7900005
found object at index: 7900005
data search: 40MB         6ms
data: 7900005
----------------------


while std::find_if performance depends on position of element:

test object: 100005
found object at index: 100005
data search: 40MB         0ms
data: 100005
----------------------
----------------------
----------------------
test object: 200005
found object at index: 200005
data search: 40MB         1ms
data: 200005

...

test object: 7600005
found object at index: 7600005
data search: 40MB         16ms
data: 7600005
----------------------
----------------------
----------------------
test object: 7700005
found object at index: 7700005
data search: 40MB         15ms
data: 7700005
----------------------

Example:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "PcieBandwidthBenchmarker.h"

// testing
#include <iostream>
#include "omp.h"

struct ArrayPassByValue
{
	char buf[3];
};

class Obj
{
public:
	Obj(){b=-1;}
	Obj(int i){b=i;}
	char a;
	int b;
	ArrayPassByValue arr;
	char tmp;
};


int main(int argC, char ** argV)
{


	std::cout<<"preparing virtual array..."<<std::endl;
	size_t n = 10000000;
	size_t p = 10000;
	GraphicsCardSupplyDepot gpu;
	VirtualMultiArray<Obj> arr(n,gpu.requestGpus(),p,3,{4,2,2});

	std::cout<<"initializing data..."<<std::endl;
	#pragma omp parallel for
	for(int i=0;i<10;i++)
	{
		arr.mappedReadWriteAccess(i*(n/10),n/10,[&](Obj * ptr){
			for(size_t j=i*(n/10);j<i*(n/10)+(n/10);j++)
			{
				Obj obj(j);
				for(int k=0;k<3;k++)
					obj.arr.buf[k]=' ';
				obj.arr.buf[j%3]='X';
				ptr[j]=obj;
			}
		},false,false,true);
	}
	std::cout<<"benchmarking find()..."<<std::endl;

	Obj searchObj;

	for(int i=0;i<80;i++)
	{
		std::chrono::milliseconds t1 =  std::chrono::duration_cast< std::chrono::milliseconds >(std::chrono::system_clock::now().time_since_epoch());
		searchObj.b=i*100000+5;
		searchObj.arr.buf[0]='X';
		searchObj.arr.buf[1]=' ';
		searchObj.arr.buf[2]=' ';

		// integer member as search value
		// find at most 100 indices that has same member value
		std::vector<size_t> found = arr.find(searchObj,searchObj.b,100);

		// buffer member as search value
		// find at most 100 indices that has same member value
		//std::vector<size_t> found = arr.find(searchObj,searchObj.arr,100);

		std::chrono::milliseconds t2 =  std::chrono::duration_cast< std::chrono::milliseconds >(std::chrono::system_clock::now().time_since_epoch());
		std::cout<<"test object: "<<i*100000+5<<std::endl;
		std::cout<<"found object at index: "<<(found.size()>0?found[0]:-1)<<std::endl;
		std::cout<<"data search: "<<n*sizeof(int)/1000000.0<<"MB         "<<(t2.count()-t1.count())<<"ms"<<std::endl;
		std::cout<<"data: "<<arr.get(found[0]).b<<std::endl;

		// integer search error check
		if(arr.get(found[0]).b!=i*100000+5)
		{
			std::cout<<"ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR"<<std::endl;
			return 0;
		}
		std::cout<<"----------------------"<<std::endl;
		std::cout<<"----------------------"<<std::endl;
		std::cout<<"----------------------"<<std::endl;

	}
	return 0;
}

Here is another example with multiple indices found per find() call:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "PcieBandwidthBenchmarker.h"

// testing
#include <iostream>
#include "omp.h"
#include <algorithm>



class Obj
{
public:
	char c;
};


int main(int argC, char ** argV)
{


	std::cout<<"preparing virtual array..."<<std::endl;
	size_t n = 1100000000ull;
	size_t p = 100000ull;
	GraphicsCardSupplyDepot gpu;
	VirtualMultiArray<Obj> arr(n,gpu.requestGpus(),p,3,{4,2,2});
	std::cout<<"initializing data..."<<std::endl;

	#pragma omp parallel for
	for(size_t i=0;i<10;i++)
	{
		arr.mappedReadWriteAccess(i*(n/10),n/10,[&](Obj * ptr){
			for(size_t j=i*(n/10);j<i*(n/10)+(n/10);j++)
			{
				Obj obj;
				obj.c=' ';

				ptr[j]=obj;
			}
		},false,false,true);
	}

	Obj searchObj;
	searchObj.c='X';
	arr.set(100,searchObj);
	arr.set(1000,searchObj);
	arr.set(10000,searchObj);
	arr.set(100000,searchObj);
	arr.set(1000000,searchObj);
	arr.set(10000000,searchObj);
	arr.set(100000000,searchObj);
	arr.set(1000000000,searchObj);
	std::cout<<"benchmarking find()..."<<std::endl;



	for(int i=0;i<80;i++)
	{
		std::chrono::milliseconds t1 =  std::chrono::duration_cast< std::chrono::milliseconds >(std::chrono::system_clock::now().time_since_epoch());



		// find at most 100 indices that have elements with c='X'
		std::vector<size_t> found = arr.find(searchObj,searchObj.c,200);


		std::chrono::milliseconds t2 =  std::chrono::duration_cast< std::chrono::milliseconds >(std::chrono::system_clock::now().time_since_epoch());

		std::cout<<"num found: "<<found.size()<<std::endl;
		if(found.size()>0)
		{
			std::cout<<"found: ";
			for(int h=0;h<found.size();h++)
			{
				std::cout<<found[h]<<" ";
			}
			std::cout<<std::endl;
		}
		std::cout<<"data search: "<<n*sizeof(char)/1000000.0<<"MB         "<<(t2.count()-t1.count())<<"ms"<<std::endl;
		std::cout<<"consumed RAM bandwidth: "<<(found.size()*sizeof(size_t) + gpu.requestGpus().size()*101*sizeof(size_t))/((t2.count()-t1.count())/1000.0)/1000000.0<<"MB/s"<<std::endl;



		std::cout<<"----------------------"<<std::endl;
		std::cout<<"----------------------"<<std::endl;
		std::cout<<"----------------------"<<std::endl;

	}
	return 0;
}

output:

num found: 8
found: 100 1000 10000 100000000 1000000000 1000000 100000 10000000 
data search: 4000MB         237ms
consumed RAM bandwidth: 0.0120193MB/s
----------------------
----------------------
----------------------
num found: 8
found: 100 1000 10000 100000000 1000000000 1000000 100000 10000000 
data search: 4000MB         237ms
consumed RAM bandwidth: 0.0120193MB/s
----------------------
----------------------
----------------------
num found: 8
found: 100 1000 10000 100000000 1000000000 1000000 100000 10000000 
data search: 4000MB         237ms
consumed RAM bandwidth: 0.0120193MB/s
----------------------
----------------------
----------------------

No performance loss even with 200 found items per find() call:

	Obj searchObj;
	searchObj.c='X';
	for(int i=0;i<25;i++)
	{
		arr.set(100+i,searchObj);
		arr.set(1000+i,searchObj);
		arr.set(10000+i,searchObj);
		arr.set(100000+i,searchObj);
		arr.set(1000000+i,searchObj);
		arr.set(10000000+i,searchObj);
		arr.set(100000000+i,searchObj);
		arr.set(1000000000+i,searchObj);
	}

output:

num found: 200
found: 104 105 106 107 100 101 102 103 108 109 110 111 112 113 114 115 120 121 122 123 116 1024 117 118 119 124 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1016 1017 1018 1019 1012 1013 1014 1015 1020 1021 1022 1023 10000 10001 10002 10003 10008 10009 10010 10011 10004 10005 10006 10007 10012 10013 10014 10015 10016 10017 10018 10019 10024 10020 10021 10022 10023 100000000 100000001 100000002 100000003 100000008 100000009 100000010 100000011 100000004 100000005 100000006 100000007 100000012 100000013 100000014 100000015 100000016 100000017 100000018 100000019 100000024 100000020 100000021 100000022 100000023 1000000000 1000000001 1000000002 1000000003 1000000008 1000000009 1000000010 1000000011 1000000004 1000000005 1000000006 1000000007 1000000012 1000000013 1000000014 1000000015 1000000016 1000000017 1000000018 1000000019 1000000024 1000000020 1000000021 1000000022 1000000023 100000 100001 100002 100003 100016 100017 100018 100019 100004 100005 100006 100007 100020 100021 100022 100023 100008 100009 100010 100011 100024 100012 100013 100014 100015 1000000 1000001 1000002 1000003 1000016 1000017 1000018 1000019 1000004 1000005 1000006 1000007 1000020 1000021 1000022 1000023 1000008 1000009 1000010 1000011 1000024 1000012 1000013 1000014 1000015 10000000 10000001 10000002 10000003 10000016 10000017 10000018 10000019 10000004 10000005 10000006 10000007 10000020 10000021 10000022 10000023 10000008 10000009 10000010 10000011 10000024 10000012 10000013 10000014 10000015 
data search: 1100MB         239ms
consumed RAM bandwidth: 0.0175721MB/s
----------------------
----------------------
----------------------
num found: 200
found: 104 105 106 107 100 101 102 103 108 109 110 111 112 113 114 1024 115 120 121 122 123 116 117 1000 118 1001 119 1002 124 1003 1004 1005 1006 1007 1008 1009 1010 1011 1016 1017 1018 1019 1012 1013 1014 1015 1020 1021 1022 1023 10016 10017 10018 10019 10024 10020 10021 10022 10023 10000 10001 10002 10003 10008 10009 10010 10011 10004 10005 10006 10007 10012 10013 10014 10015 100000000 100000001 100000002 100000003 100000008 100000009 100000010 100000011 100000004 100000005 100000006 100000007 100000012 100000013 100000014 100000015 100000016 100000017 100000018 100000019 100000024 100000020 100000021 100000022 100000023 1000000000 1000000001 1000000002 1000000003 1000000008 1000000009 1000000010 1000000011 1000000004 1000000005 1000000006 1000000007 1000000012 1000000013 1000000014 1000000015 1000000016 1000000017 1000000018 1000000019 1000000024 1000000020 1000000021 1000000022 1000000023 1000000 1000001 1000002 1000003 1000016 1000017 1000018 1000019 1000004 1000005 1000006 1000007 1000020 1000021 1000022 1000023 1000008 1000009 1000010 1000011 1000024 1000012 1000013 1000014 1000015 100000 100001 100002 100003 100016 100017 100018 100019 100004 100005 100006 100007 100020 100021 100022 100023 100008 100009 100010 100011 100024 100012 100013 100014 100015 10000000 10000001 10000002 10000003 10000016 10000017 10000018 10000019 10000004 10000005 10000006 10000007 10000020 10000021 10000022 10000023 10000008 10000009 10000010 10000011 10000024 10000012 10000013 10000014 10000015 
data search: 1100MB         239ms
consumed RAM bandwidth: 0.0175721MB/s


An Intel i7-6700K + GTX 1070 system (https://pastebin.com/fFNuR1Lq) achieves 46ms timing. This is 5.2x faster than development computer, mainly due to graphics card memory performance.