Optimizing parallel processing #24

Closed
@caffeine-addictt

Description

Enhancement Request

Your issue may already be reported!
Please check out our active issues before creating one.

Is Your Enhancement Request Related to an Issue?

At present, the Parallel Processing class uses numpy to split datasets into chunks (see the sketch after this list), where

  • Number of chunks == throttled max threads <= specified max threads
  • Chunk size == (dataset size // throttled max threads) + (0 or 1)
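
For reference, a minimal sketch of that chunking behaviour, assuming the throttling simply caps the chunk count at the dataset size (the names dataset and specified_max_threads are placeholders here, not the actual class attributes):

import numpy

dataset = list(range(10))          # placeholder dataset
specified_max_threads = 8          # placeholder for the configured maximum

# throttle so we never create more chunks than there are items
throttled_max_threads = min(specified_max_threads, len(dataset))

chunks = numpy.array_split(dataset, throttled_max_threads)

# the two invariants listed above
assert len(chunks) == throttled_max_threads <= specified_max_threads
assert all(
  len(chunk) in (len(dataset) // throttled_max_threads, len(dataset) // throttled_max_threads + 1)
  for chunk in chunks
)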

Numpy is only used for calculating the chunks, which is unlikely to be the best solution here: numpy is a huge dependency, and pulling it in just for this is impractical.

This got me thinking: would it be more practical to drop numpy for a pure Python alternative, or to stick with numpy's C-backed implementation?

Additional Context

To figure this out, I profiled a pure Python solution against the numpy solution and found that, for a dataset of 10^6 entries:

profiling result

profilingNP.py

import time
import numpy

def profile(func):
  def wrapped(*args, **kwargs):
    iteration = 100
    total_time = 0

    for _ in range(iteration):
      start = time.perf_counter()
      result = func(*args, **kwargs)
      total_time += (time.perf_counter() - start)

    avg_time = round(total_time / iteration, 10)
    print(f'{func.__name__} took on average {avg_time}s over {iteration} iterations')

    return result, avg_time
  return wrapped


dataset = list(range(10**6))
threads = 8

# numpy solution
@profile
def np():
  chunks = numpy.array_split(dataset, threads)
  return [ chunk.tolist() for chunk in chunks ]

# pure python solution
@profile
def pure():
  length = len(dataset)
  chunk_count = length // threads
  overflow = length % threads

  i = 0
  final = []
  while i < length:
    chunk_length = chunk_count + int(overflow > 0)
    b = i + chunk_length

    final.append(dataset[i:b])
    overflow -= 1
    i = b

  return final


if __name__ == '__main__':
  npResult, npTime = np()
  pureResult, pureTime = pure()

  print(f'Pure python was {-1 * round(((pureTime - npTime) / npTime) * 100, 10)}% faster than the numpy solution')

  assert npResult == pureResult, 'There was an algorithm error'
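
If the pure Python route is taken, the chunking could live in a small helper like the sketch below (chunk_dataset is a hypothetical name used for illustration, not an existing function in the codebase):

def chunk_dataset(dataset, chunk_count):
  # split `dataset` into `chunk_count` chunks whose sizes differ by at most 1,
  # mirroring numpy.array_split's behaviour
  base, overflow = divmod(len(dataset), chunk_count)

  chunks = []
  i = 0
  for n in range(chunk_count):
    # the first `overflow` chunks get one extra element
    size = base + (1 if n < overflow else 0)
    chunks.append(dataset[i:i + size])
    i += size

  return chunks

For example, chunk_dataset(list(range(10)), 8) yields two chunks of 2 followed by six chunks of 1, the same shape numpy.array_split produces.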

Metadata

Labels

  • Priority: High - Task is considered higher-priority.
  • Status: WIP - Currently being worked on.
  • Type: Enhancement - Suggest an improvement for an existing feature.
