
Process cross index queries in parallel #854

Closed

Conversation

@ylogx commented Sep 12, 2016

This reduced the processing time from 10 hrs to less than 2 hrs on a machine with 24 cores and 100 GB RAM, on a dataset of size ~635,000.

[screenshot: screen shot 2017-03-07 at 1 38 29 am]

futures = []
process_chunk = functools.partial(_query_chunk, index=self)

chunks = list(self.iter_chunks())
Contributor: Better to avoid loading the entire index into memory; load the chunk inside the executor instead.


def _query_chunk(chunk, index):
""" Allow pickling """
return list(_query_chunk_gen(chunk, index))
Contributor: What is an example use of this method?

Author: Workers can't return generators, since a generator can't be pickled. This method's sole purpose in life is to allow returning results from a worker thread/process.
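
For illustration, a minimal sketch of that constraint (not code from this PR):

    import pickle

    def numbers():
        yield 1
        yield 2

    try:
        pickle.dumps(numbers())        # generators cannot be pickled
    except TypeError as err:
        print(err)                     # e.g. "cannot pickle 'generator' object"

    pickle.dumps(list(numbers()))      # materialized as a list, it pickles fine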

for chunk in self.iter_chunks():
    if chunk.shape[0] > 1:
        for sim in self[chunk]:

with concurrent.futures.ProcessPoolExecutor() as executor:
Contributor: How big is the speed-up from this parallelization?

Author: It depends on the size of the machine; for me it took a task down from 10 hrs to ~2 hrs.
What is the standard way to measure perf?

Contributor: That sounds like a great speed-up. Could you add a description of your machine, data size, and speed-up to the PR description at the top?

@tmylk (Contributor) left a review comment:

Please avoid loading the entire index into memory.

@tmylk commented Sep 15, 2016

Thanks for the PR!
Please add a CHANGELOG entry and state how much improvement this parallelization gives.

@tmylk added the difficulty easy (Easy issue: required small fix) label on Sep 24, 2016
@piskvorky (Owner) commented:
For MatrixSimilarity, each chunk is already processed using multiple threads (if BLAS is configured to use threads). So more parallelization likely won't help, and may hurt (more context switching, memory bus contention).

Could be useful for other types of indexes though (SparseMatrixSimilarity, Annoy etc).

We cannot use py3k-only features, but that shouldn't be hard to replace/backport.
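
For context, what "BLAS configured to use threads" means in practice — a minimal sketch (illustration only; the exact environment variable depends on the BLAS build):

    import os
    # Must be set before numpy is imported; OpenBLAS also honours
    # OPENBLAS_NUM_THREADS and MKL honours MKL_NUM_THREADS.
    os.environ["OMP_NUM_THREADS"] = "4"

    import numpy as np

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    # With a threaded BLAS, this single call already uses multiple cores,
    # so wrapping it in extra processes adds contention rather than speed.
    a.dot(b)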

@tmylk added the wishlist (Feature request) label on Jan 25, 2017
@ylogx force-pushed the feature/parallel_cross_similarity branch from d51c5bc to 4fa50f5 on March 6, 2017 19:48
@ylogx added 2 commits March 7, 2017 01:22:
* Iterate through the chunks instead of loading in memory
* Make work an inner method in worker
@ylogx force-pushed the feature/parallel_cross_similarity branch from 4fa50f5 to a095312 on March 6, 2017 19:52
@ylogx commented Mar 6, 2017

@tmylk Made the changes.

@tmylk commented Mar 6, 2017

Thanks for the changes. Before merging, we will need to see gensim/test/simspeed2.py benchmark results from before and after the parallelisation.

Ideally this parallelisation should be in SimilarityABC so that SparseMatrixSimilarity and WMDSimilarity could take advantage of it as well, but not MatrixSimilarity as it uses numpy.dot which is parallelised for dense matrices. I will create another improvement issue for this, unless you wish to make it as a part of this PR. It requires changes to MatrixSimilarity to not use parallelisation and to Similarity to use iter code from SimilarityABC.

@piskvorky (Owner) commented:
The feature is nice in principle, but the implementation has to be carefully tested on large data. A naive multiprocessing.map will take up all memory; I'm afraid we'll need proper input/output queueing.

            yield sim
    else:
        yield self[chunk]
import multiprocessing
Owner: Imports at the top of the file, please.

        yield self[chunk]
import multiprocessing
import functools
pool = multiprocessing.Pool()
Owner: How many workers / processes?

import functools
pool = multiprocessing.Pool()
worker = functools.partial(_query_chunk_worker, index=self)
for result in pool.map(worker, self.iter_chunks()):
@piskvorky (Owner) commented Mar 6, 2017:
This will be problematic: multiprocessing has no queue limits; it keeps feeding the input (and output) queue constantly, while it can. This will blow up the memory (queueing up all chunks) for a large enough index, slow enough workers, or slow enough consumption of results after yield.

In other words, we need to tell the process that feeds the input queue to block if there are too many tasks pending already, and likewise if there are too many results in the output queue waiting to be yielded.
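
One common way to get the blocking behaviour described here is to bound the number of in-flight tasks; a minimal sketch (illustration only, assuming a picklable top-level process_chunk function — this is not the PR's code):

    import collections
    import concurrent.futures

    def bounded_parallel_map(func, iterable, max_pending=8):
        """Yield func(item) for each item, keeping at most max_pending tasks queued."""
        pending = collections.deque()
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for item in iterable:
                pending.append(executor.submit(func, item))
                if len(pending) >= max_pending:
                    # Block on the oldest task before submitting more; this
                    # bounds both the input and the output side.
                    yield pending.popleft().result()
            while pending:
                yield pending.popleft().result()

Consumed as `for result in bounded_parallel_map(process_chunk, self.iter_chunks()): ...`, the producer stalls whenever the consumer falls behind, instead of queueing up all chunks.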

@piskvorky added the difficulty medium (Medium issue: required good gensim understanding & python skills) label and removed the difficulty easy (Easy issue: required small fix) label on Mar 18, 2017
@ludakas commented Mar 23, 2017

I am examining the requested changes and would like to check that I understand a few things correctly, and to ask for advice on the expected behaviour.

I suppose the benchmark is simspeed2.py in the gensim/test directory. It fails for me on the second test. Is it supposed to fail, or is this the bug I should correct? (It fails in the "correct" place, where the blocking should be implemented.)
Here is the simspeed2.py benchmark output for reference:
https://gist.github.com/ludakas/31e78c9ed8adeacfdf6228052e5a2b7a

Blocking: if I understand it right, the pool.map(worker, self.iter_chunks()) call in the __iter__ function (docsim.py) loads all the data, which is then processed slowly, and this fills the memory. I was reading about the Queue class in multiprocessing. Intuitively, the chunks from iter_chunks should enter the queue and be handed to a worker, which computes the result, only when needed. Is that intuition correct?
I am not sure how to translate this into code with pool.map; the examples I found were using Process.

Number of workers: this seems easy using PARALLEL_SHARDS, which is declared at the top of the file, and passing it to the Pool constructor. However, I do not understand why it is commented out. How should the pool behave if multiprocessing is disabled?

Reference from the top of the file:

try:
    import multiprocessing
    # by default, don't parallelize queries. uncomment the following line if you want that.
    # PARALLEL_SHARDS = multiprocessing.cpu_count() # use #parallel processes = #CPus
except ImportError:
    pass
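
One way to translate the queue intuition above into code, using Process and a bounded Queue instead of pool.map (a sketch only; process_chunk and the queue sizes are hypothetical, not this PR's code):

    import multiprocessing
    import threading

    def worker(in_queue, out_queue, func):
        # Pull items until the shutdown sentinel (None) arrives.
        for item in iter(in_queue.get, None):
            out_queue.put(func(item))

    def bounded_map(func, items, num_workers=4, queue_size=8):
        in_queue = multiprocessing.Queue(maxsize=queue_size)   # put() blocks when full
        out_queue = multiprocessing.Queue(maxsize=queue_size)
        workers = [
            multiprocessing.Process(target=worker, args=(in_queue, out_queue, func))
            for _ in range(num_workers)
        ]
        for w in workers:
            w.start()

        def feed():
            for item in items:
                in_queue.put(item)          # blocks when the queue is full: backpressure
            for _ in workers:
                in_queue.put(None)          # one sentinel per worker

        threading.Thread(target=feed, daemon=True).start()
        for _ in range(len(items)):         # items must be sized here, for simplicity
            yield out_queue.get()           # results arrive in completion order
        for w in workers:
            w.join()

Because both queues have a maxsize, the feeder blocks when workers are busy and the workers block when the consumer falls behind, which is the behaviour pool.map lacks.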

@tmylk commented Mar 27, 2017

@ludakas does simspeed2.py fail even in the develop branch outside of this PR?

For queue limits, see the example in MulticoreLDA.

num_workers should be a parameter passed in the constructor.

@ludakas commented Mar 29, 2017

@tmylk Yes, simspeed2.py fails on the second test even in the develop branch, same as in this PR.

Thanks for the queue limits example; however, I encountered another problem even before I got to the queues.

At the top of docsim.py, multiprocessing is imported in a try/except, with an option to uncomment a line that sets PARALLEL_SHARDS to the number of CPUs. When I uncomment it, the code fails with:
AssertionError: daemonic processes are not allowed to have children
That is, multiprocessing does not allow nested pools, which is what happens in this situation. I tried a dirty workaround of sub-classing multiprocessing.pool.Pool, as described in the Stack Overflow post below, but the code never finished (it ran for more than 30 minutes; normally simspeed2.py takes 4 minutes) — maybe my mistake.
http://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic
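
The error is easy to reproduce in isolation (a minimal repro sketch, not gensim code):

    import multiprocessing

    def inner(x):
        return x * x

    def outer(xs):
        # Pool workers are daemonic processes, and daemonic processes are not
        # allowed to spawn children, so this nested Pool raises AssertionError.
        with multiprocessing.Pool(2) as pool:
            return pool.map(inner, xs)

    if __name__ == '__main__':
        with multiprocessing.Pool(2) as pool:
            pool.map(outer, [[1, 2], [3, 4]])   # AssertionError propagates here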

I made another observation: when I keep PARALLEL_SHARDS commented out (so query_shards is sequential, the default in this branch), remove the parallelism from __iter__ (where the queue should be implemented), and keep only the plain for loop shown below, simspeed2.py runs in just 40 seconds and passes all the tests, while the parallel pool version takes 4 minutes and fails the second test. I am not sure what I am doing wrong, or whether there is already a bug.

# modified, sequential, runs in 40 seconds
for result in self.iter_chunks():
    for sim in result:
        yield sim

# original, parallel, runs in 4 minutes
pool = multiprocessing.Pool()
worker = functools.partial(_query_chunk_worker, index=self)
for result in pool.map(worker, self.iter_chunks()):
    for sim in result:
        yield sim
pool.terminate()

@tmylk commented Mar 30, 2017

@ludakas Thanks for running the benchmark. A line_profiler run is required to find out in more detail, but it is not too surprising that adding parallelism speeds up some cases and slows down others.

@piskvorky (Owner) commented:
My 2 cents: the PARALLEL_SHARDS code that is commented out is ancient and should probably just be removed. Simply uncommenting it is certainly not expected to work, or to do anything meaningful.

@tmylk commented May 2, 2017

It seems that the parallelisation doesn't pass the benchmark with PARALLEL_SHARDS commented out: 40 seconds old vs. 240 seconds new.

@tmylk closed this on May 2, 2017
Labels: difficulty medium (Medium issue: required good gensim understanding & python skills) · wishlist (Feature request)
4 participants