Python Polars expression shuffle does not seem to respect polars.set_random_seed #15464

Open
2 tasks done
Swandog opened this issue Apr 3, 2024 · 8 comments
Labels
bug: Something isn't working
documentation: Improvements or additions to documentation
needs decision: Awaiting decision by a maintainer
needs triage: Awaiting prioritization by a maintainer
python: Related to Python Polars

Comments

@Swandog

Swandog commented Apr 3, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Assume this script is saved as zonk.py:

import polars as pl

n = 100
real_df = pl.DataFrame(
    {
        "a": [float(x) for x in range(0, n)] * n,
        "b": [float(x) for x in range(n, n*2)] * n,
        "c": [float(x) for x in range(n*2, n*3)] * n,
        "d": [float(x) for x in range(n*3, n*4)] * n,
    }
)

pl.set_random_seed(0)
new_df = real_df.select(pl.col(["a", "b", "c", "d"]).shuffle())
print(new_df.head(50).sum().row(0))
(for i in {1..40}; do python3 zonk.py ; done ) | sort | uniq

Log output

No response

Issue description

It seems like shuffle in the Expression logic does not respect the global seed set by polars.set_random_seed?

My intention with the code above is to shuffle the dataframe, take the first 50 rows, and sum each of those columns. The documentation for set_random_seed says:

Set the global random seed for Polars.

This random seed is used to determine things such as shuffle ordering.

My expectation, then, is that this program would output the same sums each time it is run, since the shuffling should be determined by the random seed. However, this is not what I see:

❯ (for i in {1..40}; do python3 zonk.py ; done ) | sort | uniq
(2306.0, 7581.0, 12835.0, 17523.0)
(2306.0, 7835.0, 12581.0, 17523.0)
(2581.0, 7306.0, 12835.0, 17523.0)
(2581.0, 7523.0, 12835.0, 17306.0)
(2581.0, 7835.0, 12306.0, 17523.0)
(2835.0, 7306.0, 12581.0, 17523.0)
(2835.0, 7523.0, 12581.0, 17306.0)
(2835.0, 7581.0, 12306.0, 17523.0)

If I pass in a seed to the shuffle method directly, it is deterministic; every execution of the program outputs the same sum.

(in my case, I don't want to pass in the seed because I want to shuffle the dataset twice, but I don't want that shuffling to be exactly the same in each shuffle; I'd rather have the second shuffle continue off of the same seed that the first one was started with. It's a little more complicated than that and there are ways around it, but I noticed this problem when I was trying to implement my tests).

Am I correct in assuming that the program shouldn't be coming up with different sums on subsequent runs? Or have I misunderstood something?

Side note: Interestingly, the possible sums do seem to be relatively constrained as to which values come up. That is, if I run the program 40 times (as above), I get more than 1 set of possible sums, but not 40 different sets. Likewise, the values within those sets seem to be constrained to certain values as well. Moreover, the number of sets I can generate seems to be related to the number of columns in my DataFrame, specifically to the factorial of the number of columns; with 3 columns I only see 6 different sets, with 2 I only see 2. If there is only 1 column in the DataFrame it seems like the sum is always the same.

My intuition, then, is not that the global seed is being ignored, but rather that the order in which the columns are shuffled is not adhering to the global seed. But I really don't know the internals of Polars; I just wanted to point out what I had noticed.
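The factorial pattern is what one would expect if the same per-column seeds are produced on every run but are handed to the columns in a varying order. The reported sums fit this: after subtracting the column offsets 0, 5000, 10000 and 15000, every tuple is a rearrangement of 2306, 2581, 2835 and 2523. The standard-library simulation below models that idea. It is an assumed model, not a description of Polars internals; `possible_outcomes` is a hypothetical function, and Python's `random` stands in for whatever generator Polars uses.

```python
import itertools
import random


def possible_outcomes(n_columns, n_rows=100, head=50, global_seed=0):
    """Enumerate the head sums reachable under the assumed seed-claiming model."""
    # Assumed model: the global generator emits the same seed sequence every run.
    global_rng = random.Random(global_seed)
    seeds = [global_rng.randrange(2**32) for _ in range(n_columns)]

    # Column i holds i*n_rows .. (i+1)*n_rows - 1, a simplified stand-in
    # for the frame in the report.
    columns = [list(range(i * n_rows, (i + 1) * n_rows)) for i in range(n_columns)]

    outcomes = set()
    # Assumption: scheduling decides which column claims which seed,
    # so every permutation of the seeds is a possible assignment.
    for assignment in itertools.permutations(seeds):
        sums = []
        for column, seed in zip(columns, assignment):
            shuffled = column[:]
            random.Random(seed).shuffle(shuffled)
            sums.append(sum(shuffled[:head]))
        outcomes.add(tuple(sums))
    return outcomes


for n in (1, 2, 3, 4):
    print(n, len(possible_outcomes(n)))
```

Under this model, n columns can produce at most n! distinct tuples of sums (fewer only if two seeds happen to give the same partial sum), and a single column always produces exactly one.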

Expected behavior

❯ (for i in {1..40}; do python3 zonk.py ; done ) | sort | uniq
(2306.0, 7581.0, 12835.0, 17523.0)

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             macOS-14.4-x86_64-i386-64bit
Python:               3.11.8 (main, Feb  6 2024, 21:21:21) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Swandog added the bug, needs triage, and python labels on Apr 3, 2024
@itamarst
Contributor

The seed is not per operation, it's per process: it's used to initialize a random number generator. So if you set the seed, the program's series of random operations should give the same result (modulo the fact that multithreading means different operations can happen in a different order). Specific operations won't necessarily give the same result.

deanm0000 added the documentation and needs decision labels on Jun 11, 2024
@Swandog
Author

Swandog commented Jun 13, 2024

Right. But if I set two different processes to have the same global seed, then they should produce the same output. Is that not correct?

What I'm saying here is that running my zonk.py (which always sets a global seed of 0) multiple times does not always produce the same result.

@Swandog
Author

Swandog commented Jun 13, 2024

I apologize for the random line
(for i in {1..40}; do python3 zonk.py ; done ) | sort | uniq
at the end of zonk.py in my original comment; I think I was copying it for later use and forgot to move/delete it. That's a shell command used to test zonk.py, not Python code.

@itamarst
Contributor

Insofar as there is a thread pool running code in parallel, you won't necessarily get identical results.

@Swandog
Author

Swandog commented Jun 14, 2024

But if I pass in a seed to shuffle directly, it is deterministic.

So does passing in a seed make it not-threaded?

@itamarst
Contributor

No, but if I export POLARS_MAX_THREADS=1 I start getting consistent results.

If you explicitly pass in a seed to the shuffle, you will get consistent behavior because the shuffle is getting the same seed.

Setting the seed globally sets a starting seed for a global random number generator. Then, every time an API retrieves a number from the random generator (e.g. shuffle()), the state of the random number generator changes. Since multiple threads can execute in a different order from run to run, each shuffle on each column may get a different seed in different runs.

@Swandog
Author

Swandog commented Jun 14, 2024

So, then I guess my confusion is that shuffle doesn't obtain a seed from the global seed if it isn't given one. I assume it just lets the threads obtain their seeds as they execute on each column?

Re-reading the documentation for shuffle, I guess I see my confusion:

Seed for the random number generator. If set to None (default), a random seed is generated each time the shuffle is called.

I guess in my mind I interpreted "the shuffle is called" as one operation on the whole selection, rather than many operations on the separate selected expressions. Perhaps I'm not used to Polars terminology.

@itamarst
Contributor

I think that verges on implementation detail that could potentially change. Probably better to document set_random_seed() as not giving reproducible results if you have more than one thread.


3 participants