Python Polars expression shuffle does not seem to respect polars.set_random_seed #15464
Comments
The seed is not per operation, it's per process: it's used to initialize a random number generator. So if you set the seed, the program's series of random operations should give the same result (modulo the fact that multithreading means different operations can happen in a different order). Specific operations won't necessarily give the same result.
Right. But if I set two different processes to have the same global seed, then they should produce the same output. Is that not correct? What I'm saying here is that running my
I apologize for the random line
Insofar as there is a thread pool running code in parallel, you won't necessarily get identical results.
But if I pass in a seed to So does passing in a seed make it not-threaded?
No, but if I If you explicitly pass in a seed to the shuffle, you will get consistent behavior because the shuffle is getting the same seed. Setting the seed globally sets a starting seed for a global random number generator. Then, every time an API retrieves a number from the random generator (e.g. shuffle()), the state of the random number generator changes. Since multiple threads can execute in a different order, each shuffle on each column may get a different seed in different runs.
So, then I guess my confusion is that Re-reading the documentation for
I guess in my mind I interpreted
I think that verges on implementation detail that could potentially change. Probably better to document
Checks
Reproducible example
Assume this script is saved as zonk.py:
Log output
No response
Issue description
It seems like shuffle in the Expression logic does not respect the global seed set by polars.set_random_seed?
My intention with the code above is to shuffle the dataframe, take the first 50 rows, and sum each of those columns. The documentation for set_random_seed says:
My expectation, then, is that this program would output the same sums each time it is run, since the shuffling should be determined by the random seed. However, this is not what I see:
If I pass in a seed to the shuffle method directly, it is deterministic; every execution of the program outputs the same sum.
(In my case, I don't want to pass in the seed because I want to shuffle the dataset twice, but I don't want that shuffling to be exactly the same in each shuffle; I'd rather have the second shuffle continue off of the same seed that the first one was started with. It's a little more complicated than that and there are ways around it, but I noticed this problem when I was trying to implement my tests.)
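One way around this could be sketched as follows: derive a separate seed for each shuffle from a single master seed, keyed by a fixed label rather than by execution order, and pass each one to shuffle explicitly. This is only a sketch using Python's standard library; derive_seeds and the labels are hypothetical names introduced for illustration, not part of Polars.

```python
import random


def derive_seeds(master_seed, labels):
    """Return one reproducible seed per label, derived from `master_seed`.

    Seeds are assigned in the order of `labels`, which the caller fixes,
    so they do not depend on the order in which shuffles later execute.
    """
    rng = random.Random(master_seed)
    return {label: rng.getrandbits(32) for label in labels}


# Hypothetical labels for the two shuffles mentioned above.
seeds = derive_seeds(1234, ["first_shuffle", "second_shuffle"])
print(seeds)
# Each value could then be passed explicitly,
# e.g. shuffle(seed=seeds["first_shuffle"]).
```

Because every shuffle receives an explicit seed, the result should no longer depend on which shuffle happens to run first, while the two shuffles still use different seeds derived from the same starting point.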
Am I correct in assuming that the program shouldn't be coming up with different sums on subsequent runs? Or have I misunderstood something?
Side note: Interestingly, the possible sums do seem to be relatively constrained as to which values come up. That is, if I run the program 40 times (as above), I get more than 1 set of possible sums, but not 40 different sets. Likewise, the values within those sets seem to be constrained to certain values as well. Moreover, the number of sets I can generate seems to be related to the number of columns in my DataFrame, specifically to the factorial of the number of columns: with 3 columns I only see 6 different sets, with 2 I only see 2. If there is only 1 column in the DataFrame, it seems like the sum is always the same.
My intuition, then, is not that the global seed is not being respected, but rather that the order in which the columns are sorted is not adhering to the global seed. But I really don't know the internals of Polars; I just wanted to point out what I had noticed.
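The factorial pattern is consistent with the explanation given in the comments: if each column's shuffle draws its seed from one global generator, the only thing that varies between runs is which column draws first. The following is a minimal simulation of that idea using only Python's standard library, not Polars itself. The function run_once, the column names, the toy data, and the way seeds are drawn are all assumptions made for illustration and are not taken from Polars internals.

```python
import itertools
import random


def run_once(column_order, global_seed=42, n_rows=200, head=50):
    """Simulate one run in which the columns are shuffled in `column_order`.

    Hypothetical model: every shuffle pulls its own seed from a single
    process-wide generator, so the seed a column receives depends on when
    that column's shuffle happens to run.
    """
    names = sorted(column_order)
    # Toy data: each column holds a different range of integers.
    data = {name: list(range(i * 1000, i * 1000 + n_rows))
            for i, name in enumerate(names)}
    global_rng = random.Random(global_seed)  # stand-in for the global seed
    sums = {}
    for name in column_order:
        column_seed = global_rng.getrandbits(64)  # advances the global state
        values = list(data[name])
        random.Random(column_seed).shuffle(values)
        sums[name] = sum(values[:head])  # shuffle, take the head, sum
    return tuple(sums[name] for name in names)


# Every possible execution order of three column shuffles.
outcomes = {run_once(order) for order in itertools.permutations("abc")}
print(len(outcomes))  # at most 3! = 6 distinct sets of sums

# With a single column there is only one possible order.
print(len({run_once(order) for order in itertools.permutations("a")}))  # 1
```

In this model a fixed execution order always reproduces the same sums, and the number of distinct outcomes is bounded by the number of possible orderings of the columns.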
Expected behavior
Installed versions