Closed
Description
Enhancement Request
Is Your Enhancement Request Related to an Issue?
At present, the Parallel Processing class uses numpy to split datasets into chunks, where
- number of chunks == throttled max threads <= specified max threads
- chunk size == (dataset size // throttled max threads) + (0 or 1)
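For intuition, here is a minimal sketch of that rule (matching what `numpy.array_split` does): each chunk gets the floor of `size / threads` elements, and the remainder is distributed one extra element per chunk from the front.

```python
# Illustration: 10 items split across 3 chunks.
# divmod(10, 3) == (3, 1), so the first 1 chunk gets an
# extra element and the sizes come out as [4, 3, 3].
dataset = list(range(10))
threads = 3

base, overflow = divmod(len(dataset), threads)
sizes = [base + (1 if i < overflow else 0) for i in range(threads)]
print(sizes)  # [4, 3, 3]
```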
Numpy is only used for calculating the chunks, which is hardly the best fit here: numpy is a huge dependency, and pulling it in just for chunking seems impractical.
This got me thinking: would it be more practical to drop numpy for a pure Python alternative, or to stick with numpy's C-backed implementation?
Additional Context
To figure this out, I profiled a pure Python solution against the numpy solution and found that, for a dataset of 10^6 entries:
profilingNP.py

```python
import time

import numpy


def profile(func):
    """Run func 100 times and report its average wall-clock time."""
    def wrapped(*args, **kwargs):
        iterations = 100
        total_time = 0
        for _ in range(iterations):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            total_time += (time.perf_counter() - start)
        avg_time = round(total_time / iterations, 10)
        print(f'{func.__name__} took on average {avg_time}s over {iterations} iterations')
        return result, avg_time
    return wrapped


dataset = list(range(10**6))
threads = 8


# numpy solution
@profile
def np():
    chunks = numpy.array_split(dataset, threads)
    return [chunk.tolist() for chunk in chunks]


# pure python solution
@profile
def pure():
    length = len(dataset)
    chunk_count = length // threads
    overflow = length % threads
    i = 0
    final = []
    while i < length:
        # The first `overflow` chunks get one extra element each
        chunk_length = chunk_count + int(overflow > 0)
        b = i + chunk_length
        final.append(dataset[i:b])
        overflow -= 1
        i = b
    return final


if __name__ == '__main__':
    npResult, npTime = np()
    pureResult, pureTime = pure()
    print(f'Pure python was {round(((npTime - pureTime) / npTime) * 100, 10)}% faster than the numpy solution')
    assert npResult == pureResult, 'There was an algorithm error'
```