
Use numpy in RandomSampler #10768

Merged: 1 commit merged into apache:master on May 2, 2018

Conversation

@leezu (Contributor) commented May 1, 2018

Significant speedup for large datasets:

In [2]: %timeit current_sample(1529*8192)
12.3 s ± 721 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit np_sample(1529*8192)
641 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
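
For context, a minimal sketch of a numpy-based RandomSampler along the lines of this change (a reconstruction for illustration; the merged code may differ in details):

import numpy as np

class RandomSampler:
    """Samples indices [0, length) in random order, without replacement."""

    def __init__(self, length):
        self._length = length

    def __iter__(self):
        # np.random.permutation builds the entire shuffled index array
        # in one vectorized call instead of drawing scalars one by one,
        # which is where the speedup comes from
        return iter(np.random.permutation(self._length))

    def __len__(self):
        return self._length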

@leezu leezu requested a review from szha as a code owner May 1, 2018 19:03
@leezu leezu force-pushed the gluondatasamplerspeedup branch from 1d706f8 to 0eda17d Compare May 1, 2018 22:39
@piiswrong (Contributor)

Have you tried mx ndarray shuffle?

@leezu (Contributor, author) commented May 2, 2018

It doesn't perform well, at least not with this naive approach:

import mxnet as mx

def sample(length):
    indices = mx.nd.arange(length)
    # without out=, shuffle returns a new array and leaves
    # `indices` untouched, so the result must be reassigned
    indices = mx.nd.random.shuffle(indices)
    return (indices[i].asscalar() for i in range(indices.shape[0]))

What did you have in mind?

@asitstands (Contributor) commented May 2, 2018

Would you try this and provide the timings?

def __iter__(self):
  indices = mx.nd.arange(self._length, dtype='int32').reshape((1, self._length)) # may look weird but anyway
  mx.nd.random.shuffle(indices, out=indices)
  return iter(indices[0].asnumpy())

-------- EDIT --------

Please ignore the above; the reshaping was my mistake. It should be this. The performance of shuffle is affected by the OMP_NUM_THREADS environment variable: leave OMP_NUM_THREADS unset, or set it to the number of physical cores.

def __iter__(self):
  indices = mx.nd.arange(self._length, dtype='int32')
  mx.nd.random.shuffle(indices, out=indices)
  return iter(indices.asnumpy())

@piiswrong (Contributor)

I guess this is good enough, since we want scalars in the end anyway.

@piiswrong piiswrong merged commit 23934cf into apache:master May 2, 2018
@asitstands (Contributor)

If the performance penalty is not too large, I think using mxnet's shuffle would be preferable. Depending on an external global RNG is not a good idea.

@leezu (Contributor, author) commented May 2, 2018

Thanks @asitstands. Comparing the numpy code to the mx.nd code you provided gives the following performance on my machine:


In [3]: %timeit list(sample_mx(1529*8192))
2.17 s ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit list(sample_np(1529*8192))
1.3 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
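
(sample_mx and sample_np are not defined anywhere in the thread; a plausible reconstruction from the snippets above would be:)

import mxnet as mx
import numpy as np

def sample_mx(length):
    # shuffle in mxnet, then copy once to numpy for iteration
    indices = mx.nd.arange(length, dtype='int32')
    mx.nd.random.shuffle(indices, out=indices)
    return iter(indices.asnumpy())

def sample_np(length):
    # shuffle and iterate entirely in numpy
    indices = np.arange(length)
    np.random.shuffle(indices)
    return iter(indices)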

So relying on mx.nd.random.shuffle + asnumpy seems to add an extra second.

Regarding the RNG, our test cases set both the numpy and mxnet seeds. I believe other parts of mxnet also use numpy's random module, so it may be good to document that both seeds must be set to get deterministic behavior. If this is the only place numpy.random is used, it may be worth the extra second to stay consistent?
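
For example, a reproducible run would need both seeds set (a hypothetical snippet to illustrate the point, not code from the PR):

import mxnet as mx
import numpy as np

# RandomSampler now draws from numpy's global RNG, while mxnet
# operators draw from mxnet's own RNG, so both must be seeded
np.random.seed(42)
mx.random.seed(42)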

@leezu leezu deleted the gluondatasamplerspeedup branch May 2, 2018 22:05
@asitstands (Contributor)

Numpy's and Python's global RNGs are used here and there in mxnet. In my opinion they should all be removed someday :) They annoy people who work with subtle probabilistic reasoning. Anyway, it looks like using numpy is currently the best option. Thanks for the quick test.

@asitstands (Contributor) commented May 3, 2018

@leezu If possible, could I ask for one more test? In my experiments, mxnet's shuffle outperforms numpy's when the array size is large. Would you please test with somewhat larger arrays, with self._length larger than at least 30000~40000? The performance gain may increase as the size grows.

@leezu (Contributor, author) commented May 4, 2018

@asitstands the timings above were taken with an array size of 12525568. I have tried again with size 40000 and get the following:


In [3]: %timeit list(sample_np(40000))
2.08 ms ± 96.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [5]: %timeit list(sample_mx(40000))
3.12 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Are you taking the overhead of converting to scalars into account?

@asitstands (Contributor) commented May 4, 2018

Thanks @leezu. I hope this discussion is not bothering you too much. Here is my test code.

import time
import mxnet as mx
import numpy as np

n = 40000

# mxnet: shuffle in place, then copy to numpy for iteration
start = time.time()
for i in range(10000):
    x = mx.nd.arange(n)
    mx.nd.random.shuffle(x, out=x)
    y = iter(x.asnumpy())  # asnumpy() blocks until the shuffle has finished
end = time.time()
print("mx elapsed time: ", end - start)

# numpy: shuffle and iterate entirely in numpy
start = time.time()
for i in range(10000):
    x = np.arange(n)
    np.random.shuffle(x)
    y = iter(x)
end = time.time()
print("np elapsed time: ", end - start)

On an i7-3770K 3.50GHz, the result is

mx elapsed time:  3.1706936359405518
np elapsed time:  5.6994311809539795

On two Xeon(R) E5-2680 v4 2.40GHz CPUs, the result is

mx elapsed time:  2.679560661315918
np elapsed time:  6.299736976623535

As I increase n, the time ratio np/mx also increases. For n smaller than 15000, np has the shorter running time on the i7; for n smaller than 10000, np also outperforms mx on the Xeon. I didn't test with the gluon samplers, but I think this code should capture the difference between the shuffles of mx and np.

@leezu (Contributor, author) commented May 4, 2018

@asitstands I guess the difference between our experiments is that I used an optimized numpy from conda together with the standard mxnet PyPI build.

Using both an optimized numpy and an optimized mxnet build on an AWS p3 instance, I do observe, like you, that mxnet is faster for small sizes (40000): ~500μs vs ~800μs for numpy. For large sizes (12525568) the asnumpy() overhead is however large, and the numpy version takes just 180ms compared to 600ms with the mxnet code.

@asitstands (Contributor) commented May 5, 2018

I think conda has no special optimization for numpy's shuffle. Numpy's shuffle performs n element swaps in serial, where n is the size of the array, while mxnet's shuffle performs essentially the same number of memory copies in a way similar to a parallel radix sort (except in the MSVC build). Of course there are more details, but this is the essential difference that accounts for the performance gap. The performance of the parallel shuffle varies with the environment, but it should be faster than numpy's if the array is not too small.

asnumpy is a bulk copy of a large block of memory; its effect is not so prominent compared to the shuffle itself. The overhead of serialization by mxnet's engine is also unimportant for large arrays. With 12525568 elements, mxnet's shuffle is 5~8 times faster than numpy's in my tests, and for arrays larger than 20000 elements mxnet is always faster. I think the only way the mxnet version could be slower is low memory bandwidth in the underlying system (OS/hardware), which would increase the cost of asnumpy, but the systems I used for testing don't have especially high bandwidth. So I think my results should be the general case. I don't have access to an AWS p3. Could you run my test code there?
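
(To illustrate the serial swap-based approach described above, here is a plain-Python sketch of a Fisher-Yates style shuffle; an illustration only, not numpy's actual source:)

import random

def serial_shuffle(arr):
    # n - 1 swaps, each depending on the array state left by the
    # previous one, so the loop cannot be parallelized directly
    for i in range(len(arr) - 1, 0, -1):
        j = random.randint(0, i)  # j drawn from [0, i], inclusive
        arr[i], arr[j] = arr[j], arr[i]
    return arr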

@leezu (Contributor, author) commented May 5, 2018

On my personal computer I indeed see the same speedup of mxnet over numpy; on the other machines, the results I quoted above still stand. I guess in the end this depends a lot on the particular system and the build options of the libraries, though that is strange given your explanation of the implementation. As this code only runs once per epoch to shuffle the dataset, I believe it does not matter much whether it takes 200ms or 500ms for large datasets; it was just unbearable that it took 10s+ before.

I don't have strong feelings about changing it, though I won't propose such a change myself, given that I saw mixed results depending on the machine. If you open a PR and someone is willing to merge it, I won't mind.

@asitstands (Contributor)

I'll test in some other environments, including AWS, and make a PR if I'm sure that the performance hit is not the usual case.

@leezu (Contributor, author) commented May 5, 2018 via email

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 7, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018