
Use numpy in RandomSampler #10768

Merged: 1 commit merged into apache:master on May 2, 2018

Conversation

@leezu (Contributor) commented May 1, 2018

Significant speedup for large datasets:

In [2]: %timeit current_sample(1529*8192)
12.3 s ± 721 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit np_sample(1529*8192)
641 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
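
For context, a minimal sketch of a numpy-based RandomSampler along the lines of this change (a reconstruction for illustration; the merged code may differ in details):

import numpy as np

class RandomSampler:
    """Samples indices [0, length) in random order, without replacement."""

    def __init__(self, length):
        self._length = length

    def __iter__(self):
        # np.random.permutation builds the entire shuffled index array
        # in one vectorized call instead of drawing scalars one by one,
        # which is where the speedup comes from
        return iter(np.random.permutation(self._length))

    def __len__(self):
        return self._length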

@leezu leezu requested a review from szha as a code owner May 1, 2018 19:03
@leezu leezu force-pushed the gluondatasamplerspeedup branch from 1d706f8 to 0eda17d Compare May 1, 2018 22:39
@piiswrong (Contributor)

Have you tried mx ndarray shuffle?

@leezu (Contributor, author) commented May 2, 2018

It doesn't perform well, at least not with this naive approach:

import mxnet as mx

def sample(length):
    indices = mx.nd.arange(length)
    # without out=, shuffle returns a new array and leaves
    # `indices` untouched, so the result must be reassigned
    indices = mx.nd.random.shuffle(indices)
    return (indices[i].asscalar() for i in range(indices.shape[0]))

What did you have in mind?

@asitstands (Contributor) commented May 2, 2018

Would you try this and provide the timings?

def __iter__(self):
  indices = mx.nd.arange(self._length, dtype='int32').reshape((1, self._length)) # may look weird but anyway
  mx.nd.random.shuffle(indices, out=indices)
  return iter(indices[0].asnumpy())

-------- EDIT --------

Please ignore the above; the reshaping was my mistake. It should be this. The performance of shuffle is affected by the OMP_NUM_THREADS environment variable: leave OMP_NUM_THREADS unset, or set it to the number of physical cores.

def __iter__(self):
  indices = mx.nd.arange(self._length, dtype='int32')
  mx.nd.random.shuffle(indices, out=indices)
  return iter(indices.asnumpy())

@piiswrong (Contributor)

I guess this is good enough, since we want scalars in the end anyway.

@piiswrong piiswrong merged commit 23934cf into apache:master May 2, 2018
@asitstands (Contributor)

If the performance penalty is not too large, I think using mxnet's shuffle would be preferable. Depending on an external global RNG is not a good idea.

@leezu (Contributor, author) commented May 2, 2018

Thanks @asitstands. Comparing the numpy code to the mx.nd code you provided gives the following performance on my machine:


In [3]: %timeit list(sample_mx(1529*8192))
2.17 s ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit list(sample_np(1529*8192))
1.3 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
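
(sample_mx and sample_np are not defined anywhere in the thread; a plausible reconstruction from the snippets above would be:)

import mxnet as mx
import numpy as np

def sample_mx(length):
    # shuffle in mxnet, then copy once to numpy for iteration
    indices = mx.nd.arange(length, dtype='int32')
    mx.nd.random.shuffle(indices, out=indices)
    return iter(indices.asnumpy())

def sample_np(length):
    # shuffle and iterate entirely in numpy
    indices = np.arange(length)
    np.random.shuffle(indices)
    return iter(indices)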

So relying on mx.nd.random.shuffle + asnumpy seems to add an extra second.

Regarding the RNG, our test cases set both the numpy and mxnet seeds. I believe other parts of mxnet also use numpy's random module, so it may be good to document that both seeds must be set to get deterministic behavior. If this is the only place numpy.random is used, it may be worth the extra second to stay consistent?
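
For example, a reproducible run would need both seeds set (a hypothetical snippet to illustrate the point, not code from the PR):

import mxnet as mx
import numpy as np

# RandomSampler now draws from numpy's global RNG, while mxnet
# operators draw from mxnet's own RNG, so both must be seeded
np.random.seed(42)
mx.random.seed(42)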

@leezu leezu deleted the gluondatasamplerspeedup branch May 2, 2018 22:05
@asitstands (Contributor)

Numpy's and Python's global RNGs are used here and there in mxnet. In my opinion they should all be removed someday :) They annoy people who work with subtle probabilistic reasoning. Anyway, it looks like using numpy is currently the best option. Thanks for the quick test.

@asitstands (Contributor) commented May 3, 2018

@leezu If possible, could I ask for one more test? In my experiments, mxnet's shuffle outperforms numpy's when the array size is large. Would you please test with somewhat larger arrays, with self._length larger than at least 30000~40000? The performance gain may increase as the size grows.

@leezu (Contributor, author) commented May 4, 2018

@asitstands the timings above were taken with an array size of 12525568. I have tried again with size 40000 and get the following:


In [3]: %timeit list(sample_np(40000))
2.08 ms ± 96.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [5]: %timeit list(sample_mx(40000))
3.12 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Are you taking the overhead of converting to scalars into account?

@asitstands (Contributor) commented May 4, 2018

Thanks @leezu. I hope this discussion is not bothering you too much. Here is my test code.

import time
import mxnet as mx
import numpy as np

n = 40000

# mxnet: shuffle in place, then copy to numpy for iteration
start = time.time()
for i in range(10000):
    x = mx.nd.arange(n)
    mx.nd.random.shuffle(x, out=x)
    y = iter(x.asnumpy())  # asnumpy() blocks until the shuffle has finished
end = time.time()
print("mx elapsed time: ", end - start)

# numpy: shuffle and iterate entirely in numpy
start = time.time()
for i in range(10000):
    x = np.arange(n)
    np.random.shuffle(x)
    y = iter(x)
end = time.time()
print("np elapsed time: ", end - start)

On an i7-3770K 3.50GHz, the result is

mx elapsed time:  3.1706936359405518
np elapsed time:  5.6994311809539795

On two Xeon(R) E5-2680 v4 2.40GHz CPUs, the result is

mx elapsed time:  2.679560661315918
np elapsed time:  6.299736976623535

As I increase n, the time ratio np/mx also increases. For n smaller than 15000, np has the shorter running time on the i7; for n smaller than 10000, np also outperforms mx on the Xeon. I didn't test with the gluon samplers, but I think this code should capture the difference between the shuffles of mx and np.

@leezu (Contributor, author) commented May 4, 2018

@asitstands I guess the difference between our experiments is that I used an optimized numpy from conda together with the standard mxnet PyPI build.

Using both an optimized numpy and an optimized mxnet build on an AWS p3 instance, I do observe, like you, that mxnet is faster for small sizes (40000): ~500μs vs ~800μs for numpy. For large sizes (12525568) the asnumpy() overhead is however large, and the numpy version takes just 180ms compared to 600ms with the mxnet code.

@asitstands (Contributor) commented May 5, 2018

I think conda has no special optimization for numpy's shuffle. Numpy's shuffle performs n element swaps in serial, where n is the size of the array, while mxnet's shuffle performs essentially the same number of memory copies in a way similar to a parallel radix sort (except in the MSVC build). Of course there are more details, but this is the essential difference that accounts for the performance gap. The performance of the parallel shuffle varies with the environment, but it should be faster than numpy's if the array is not too small.

asnumpy is a bulk copy of a large block of memory; its effect is not so prominent compared to the shuffle itself. The overhead of serialization by mxnet's engine is also unimportant for large arrays. With 12525568 elements, mxnet's shuffle is 5~8 times faster than numpy's in my tests, and for arrays larger than 20000 elements mxnet is always faster. I think the only way the mxnet version could be slower is low memory bandwidth in the underlying system (OS/hardware), which would increase the cost of asnumpy, but the systems I used for testing don't have especially high bandwidth. So I think my results should be the general case. I don't have access to an AWS p3. Could you run my test code there?
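
(To illustrate the serial swap-based approach described above, here is a plain-Python sketch of a Fisher-Yates style shuffle; an illustration only, not numpy's actual source:)

import random

def serial_shuffle(arr):
    # n - 1 swaps, each depending on the array state left by the
    # previous one, so the loop cannot be parallelized directly
    for i in range(len(arr) - 1, 0, -1):
        j = random.randint(0, i)  # j drawn from [0, i], inclusive
        arr[i], arr[j] = arr[j], arr[i]
    return arr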

@leezu (Contributor, author) commented May 5, 2018

On my personal computer I indeed see the same speedup of mxnet over numpy; on the other machines, the results I quoted above still stand. I guess in the end this depends a lot on the particular system and the build options of the libraries, though that is strange given your explanation of the implementation. As this code only runs once per epoch to shuffle the dataset, I believe it does not matter much whether it takes 200ms or 500ms for large datasets; it was just unbearable that it took 10s+ before.

I don't have strong feelings about changing it, though I won't propose such a change myself, given that I saw mixed results depending on the machine. If you open a PR and someone is willing to merge it, I won't mind.

@asitstands (Contributor)

I'll test in some other environments, including AWS, and make a PR if I'm sure that the performance hit is not the usual case.

@leezu (Contributor, author) commented May 5, 2018 via email

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 7, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018