BUG: sort_values fails because of TypeError: Cannot cast array data from dtype('<M8[ns]') to dtype('float64') according to the rule 'safe' #5648

Open
3 tasks done
samyoung-dsci opened this issue Feb 13, 2023 · 8 comments
Labels
  • bug 🦗: Something isn't working
  • External: Pull requests and issues from people who do not regularly contribute to modin
  • Needs more information ❔: Issues that require more information from the reporter

Comments

@samyoung-dsci

samyoung-dsci commented Feb 13, 2023

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd
import ray
ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
import requests

# Get the example data
r = requests.get('https://raw.githubusercontent.com/ES-Catapult/clock_plot/main/data/eden_2_houseid_324_combined_data.csv')
with open('data.csv', 'w') as f:
    f.write(r.text)
data = pd.read_csv('data.csv')
data = data.loc[8029:8033, ['datetime', 'reading_elec']].reset_index(drop=True)
data['reading_elec'] = 0
data['datetime'] = pd.to_datetime(data['datetime'])
# Try to sort the datetimes
data = data.sort_values('datetime')

Issue Description

When trying to use sort_values on a datetime column, it fails with TypeError: Cannot cast array data from dtype('<M8[ns]') to dtype('float64') according to the rule 'safe'.

This presumably has something to do with partitioning (potentially linked to #5552): while trying to create an MCVE, I found that the issue does not occur when an identical minimal dataframe is created from scratch (using data = pd.DataFrame({'datetime': ["2018-11-07 23:00:00", "2018-11-08 00:00:00", "2018-11-08 01:00:00", "2018-11-08 02:00:00", "2018-11-08 03:00:00"], 'reading_elec': [0]*5})) rather than taken as a subset of the full dataframe.

This doesn't affect all time series. For example, I get the issue with rows 2531:2535 of the above dataset, but not with 0:2534. However, I have observed it with multiple files, not just the one above.

Changing from Ray to Dask doesn't solve the issue.
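For reference, the traceback in this report ends inside np.digitize, called with datetime64 pivots. The sketch below uses NumPy only (no Modin) to exercise that call in isolation. The sample timestamps mirror the five rows above; the two pivot values are made up for illustration, and the int64 view and np.searchsorted variants are just ways to show that the same bucketing is well defined for datetimes. They are not a claim about how Modin handles or should handle this.

```python
import numpy as np

# Five hourly timestamps, matching the rows in the minimal dataframe above.
values = np.array(
    [
        "2018-11-07T23:00",
        "2018-11-08T00:00",
        "2018-11-08T01:00",
        "2018-11-08T02:00",
        "2018-11-08T03:00",
    ],
    dtype="datetime64[ns]",
)

# Hypothetical pivots, chosen only for illustration.
pivots = np.array(["2018-11-08T00:30", "2018-11-08T02:30"], dtype="datetime64[ns]")

# The call from the traceback: record whether it raises instead of assuming.
try:
    np.digitize(values, pivots)
    digitize_error = None
except TypeError as exc:
    digitize_error = str(exc)
print("np.digitize on datetime64:", digitize_error or "no error")

# Same bucketing on the underlying int64 nanosecond counts.
buckets_from_ints = np.digitize(values.view("int64"), pivots.view("int64"))

# np.searchsorted accepts datetime64 directly; side="right" matches
# np.digitize's default (right=False) for increasing bins.
buckets_from_search = np.searchsorted(pivots, values, side="right")

print(buckets_from_ints, buckets_from_search)
```

If the first call raises the same TypeError as in the report, that points at the datetime64 pivots reaching np.digitize rather than at the data itself.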

Expected Behavior

The dataframe should end up sorted by datetime (without the exception being raised). Interestingly, this is what happens if you create the smaller dataframe from scratch, as shown below:

import modin.pandas as pd
import ray
ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

# Build the same five rows from scratch instead of slicing the full CSV
data = pd.DataFrame({'datetime': ["2018-11-07 23:00:00", "2018-11-08 00:00:00", "2018-11-08 01:00:00", "2018-11-08 02:00:00", "2018-11-08 03:00:00"], 'reading_elec': [0]*5})
data['datetime'] = pd.to_datetime(data['datetime'])
# Try to sort the datetimes
data = data.sort_values('datetime')

Error Logs

---------------------------------------------------------------------------
RayTaskError(TypeError)                   Traceback (most recent call last)
Cell In[142], line 1
----> 1 data = data.sort_values('datetime')
      2 print(data)

File c:\Users\samuel.young\AppData\Local\Continuum\anaconda3\envs\default\lib\site-packages\modin\logging\logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File c:\Users\samuel.young\AppData\Local\Continuum\anaconda3\envs\default\lib\site-packages\modin\pandas\base.py:2911, in BasePandasDataset.sort_values(self, by, axis, ascending, inplace, kind, na_position, ignore_index, key)
   2909 ascending = validate_ascending(ascending)
   2910 if axis == 0:
-> 2911     result = self._query_compiler.sort_rows_by_column_values(
   2912         by,
   2913         ascending=ascending,
   2914         kind=kind,
...
    groupby_col = np.digitize(cols_to_digitize.squeeze(), pivots)
  File "<__array_function__ internals>", line 200, in digitize
  File "c:\Users\samuel.young\AppData\Local\Continuum\anaconda3\envs\default\lib\site-packages\numpy\lib\function_base.py", line 5604, in digitize
    mono = _monotonicity(bins)
TypeError: Cannot cast array data from dtype('<M8[ns]') to dtype('float64') according to the rule 'safe'

Installed Versions

INSTALLED VERSIONS

commit : 9068fbc
python : 3.10.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

Modin dependencies

modin : 0.18.1
ray : 2.2.0
dask : 2023.2.0
distributed : 2023.2.0
hdk : None

pandas dependencies

pandas : 1.5.3
numpy : 1.24.1
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.8.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.1.0
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@samyoung-dsci samyoung-dsci added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Feb 13, 2023
@pyrito
Collaborator

pyrito commented Feb 13, 2023

@samyoung-dsci thanks for opening the issue! I wasn't able to reproduce this bug on the latest master of modin on my machine. I tried reproducing with the same dependencies as well. I'm not sure if this is because I'm on a Mac. @vnlitvinov could you try reproducing this bug on your machine?

@pyrito pyrito added Needs more information ❔ Issues that require more information from the reporter and removed Triage 🩹 Issues that need triage labels Feb 13, 2023
@samyoung-dsci
Author

I've tried it again with a fresh conda environment and the main branch of Modin, and can confirm that I still get the issue.

@pyrito
Collaborator

pyrito commented Feb 14, 2023

@modin-project/modin-contributors @modin-project/modin-core is anyone else able to reproduce this on their machines?

@mvashishtha
Collaborator

@samyoung-dsci @pyrito unfortunately I can't reproduce the error. My steps were:

  1. conda create --name=5648-reproduce python=3.10.8
  2. conda activate 5648-reproduce
  3. pip install modin[ray]==0.18.1
  4. pip install ipython
  5. in ipython, run the script.

I've also run the script 100 times in a loop to see whether I could catch a nondeterministic error.
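A repeated-run check like this can be sketched as follows. This is only an illustration, not the exact loop used here: count_failures is a made-up helper name, and script.py stands in for wherever the reproducer is saved. Each run starts a fresh interpreter, so engine start-up happens every time.

```python
import subprocess
import sys


def count_failures(command, runs=100):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # A new process per run, so no state is shared between attempts.
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode != 0:
            failures += 1
    return failures


# Example call (script.py is a placeholder for the saved reproducer):
# print(count_failures([sys.executable, "script.py"], runs=100))
```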

I'm on:

  • macOS Monterey
  • Version 12.4
  • MacBook Pro (16-inch, 2019)
  • processor: 2.3 GHz 8-Core Intel Core i9
  • memory: 16 GB 2667 MHz DDR4

@dchigarev
Collaborator

I've tried both Linux and Windows and have only been able to reproduce the failure on Windows.

Steps I used:

  1. conda create -n issue-5648 python=3.8
  2. conda activate issue-5648
  3. cd modin_latest_master
  4. pip install .[ray]
  5. python script.py

Here is my conda list output:

issue-5648 env
# packages in environment at C:\Users\dchigare\Miniconda3\envs\issue-5648:
#
# Name                    Version                   Build  Channel
aiohttp                   3.8.4                    pypi_0    pypi
aiohttp-cors              0.7.0                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
ansicon                   1.89.0                   pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
attrs                     22.2.0                   pypi_0    pypi
blessed                   1.20.0                   pypi_0    pypi
ca-certificates           2023.01.10           haa95532_0
cachetools                5.3.0                    pypi_0    pypi
certifi                   2022.12.7        py38haa95532_0
charset-normalizer        3.0.1                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
colorful                  0.5.5                    pypi_0    pypi
distlib                   0.3.6                    pypi_0    pypi
filelock                  3.9.0                    pypi_0    pypi
frozenlist                1.3.3                    pypi_0    pypi
fsspec                    2023.1.0                 pypi_0    pypi
google-api-core           2.11.0                   pypi_0    pypi
google-auth               2.16.0                   pypi_0    pypi
googleapis-common-protos  1.58.0                   pypi_0    pypi
gpustat                   1.0.0                    pypi_0    pypi
grpcio                    1.51.1                   pypi_0    pypi
idna                      3.4                      pypi_0    pypi
importlib-resources       5.10.2                   pypi_0    pypi
jinxed                    1.2.0                    pypi_0    pypi
jsonschema                4.17.3                   pypi_0    pypi
libffi                    3.4.2                hd77b12b_6
modin                     0.17.0+168.g75c82c88          pypi_0    pypi
msgpack                   1.0.4                    pypi_0    pypi
multidict                 6.0.4                    pypi_0    pypi
numpy                     1.24.2                   pypi_0    pypi
nvidia-ml-py              11.495.46                pypi_0    pypi
opencensus                0.11.1                   pypi_0    pypi
opencensus-context        0.1.3                    pypi_0    pypi
openssl                   1.1.1t               h2bbff1b_0
packaging                 23.0                     pypi_0    pypi
pandas                    1.5.3                    pypi_0    pypi
pip                       22.3.1           py38haa95532_0
pkgutil-resolve-name      1.3.10                   pypi_0    pypi
platformdirs              3.0.0                    pypi_0    pypi
prometheus-client         0.13.1                   pypi_0    pypi
protobuf                  4.21.12                  pypi_0    pypi
psutil                    5.9.4                    pypi_0    pypi
py-spy                    0.3.14                   pypi_0    pypi
pyarrow                   11.0.0                   pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pydantic                  1.10.4                   pypi_0    pypi
pyrsistent                0.19.3                   pypi_0    pypi
python                    3.8.16               h6244533_2
python-dateutil           2.8.2                    pypi_0    pypi
pytz                      2022.7.1                 pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
ray                       2.2.0                    pypi_0    pypi
requests                  2.28.2                   pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
setuptools                65.6.3           py38haa95532_0
six                       1.16.0                   pypi_0    pypi
smart-open                6.3.0                    pypi_0    pypi
sqlite                    3.40.1               h2bbff1b_0
typing-extensions         4.4.0                    pypi_0    pypi
urllib3                   1.26.14                  pypi_0    pypi
vc                        14.2                 h21ff451_1
virtualenv                20.19.0                  pypi_0    pypi
vs2015_runtime            14.27.29016          h5e58377_2
wcwidth                   0.2.6                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0
wincertstore              0.2              py38haa95532_2
yarl                      1.8.2                    pypi_0    pypi
zipp                      3.13.0                   pypi_0    pypi

And pd.show_versions()

spoiler
INSTALLED VERSIONS
------------------
commit           : 75c82c88ac538fdd4235d0d6f1cea323cd6bbb3a
python           : 3.8.16.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22000
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Europe.1252

Modin dependencies
------------------
modin            : 0.17.0+168.g75c82c88
ray              : 2.2.0
dask             : None
distributed      : None
hdk              : None

pandas dependencies
-------------------
pandas           : 1.5.3
numpy            : 1.24.2
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 65.6.3
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.1.0
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

@pyrito
Collaborator

pyrito commented Feb 15, 2023

@dchigarev can you take a look at this issue since you're able to reproduce it on Windows?

@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023
@vnlitvinov
Collaborator

@dchigarev @samyoung-dsci would it be possible for either of you to post the output of python -m modin.config from an environment where this issue occurs? I'm guessing it has something to do with the exact number of workers we use or something similar, if partitioning is to blame.

Here's sample output from my laptop:

python -m modin.config
MODIN_ASV_DATASIZE_CONFIG: Allows to override default size of data (shapes).
        Provide a string
        Current value: None
MODIN_ASV_USE_IMPL: Allows to select a library that we will use for testing performance.
        Provide a string (valid examples are: modin, pandas)
        Current value: modin
MODIN_ASYNC_READ_MODE: It does not wait for the end of reading information from the source.

Can break situations when reading occurs in a context, when exiting
from which the source is deleted.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_BENCHMARK_MODE: Whether or not to perform computations synchronously.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
AWS_ACCESS_KEY_ID: Set to AWS_ACCESS_KEY_ID when running mock S3 tests for Modin in GitHub CI.
        Provide a case-insensitive string
        Current value: foobar_key
AWS_SECRET_ACCESS_KEY: Set to AWS_SECRET_ACCESS_KEY when running mock S3 tests for Modin in GitHub CI.
        Provide a case-insensitive string
        Current value: foobar_secret
MODIN_CPUS: How many CPU cores to use during initialization of the Modin engine.
        Provide an integer value
        Current value: 16
MODIN_LOG_RPYC: Whether to gather RPyC logs (applicable for remote context).
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_TRACE_RPYC: Whether to trace RPyC calls (applicable for remote context).
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_USE_CALCITE: Whether to use Calcite for OmniSci queries execution.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: True
MODIN_ENGINE: Distribution engine to run queries by.
        Provide a case-insensitive string (valid examples are: Ray, Dask, Python, Native, Unidist)
        Current value: Ray
MODIN_EXPERIMENTAL_GROUPBY: Set to true to use Modin's experimental group by implementation.

Experimental groupby is implemented using a range-partitioning technique,
note that it may not always work better than the original Modin's TreeReduce
and FullAxis implementations. For more information visit the according section
of Modin's documentation: TODO: add a link to the section once it's written.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_EXPERIMENTAL_NUMPY_API: Set to true to use Modin's experimental NumPy API.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_GITHUB_CI: Set to true when running Modin in GitHub CI.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_GPUS: How may GPU devices to utilize across the whole distribution.
        Provide an integer value
        Current value: None
MODIN_HDK_FRAGMENT_SIZE: How big a fragment in HDK should be when creating a table (in rows).
        Provide an integer value
        Current value: None
MODIN_HDK_LAUNCH_PARAMETERS: Additional command line options for the HDK engine.

Please visit OmniSci documentation for the description of available parameters:
https://docs.omnisci.com/installation-and-configuration/config-parameters#configuration-parameters-for-omniscidb
        Provide a sequence of KEY=VALUE values separated by comma (Example: 'KEY1=VALUE1,KEY2=VALUE2,KEY3=VALUE3')
        Current value: {'enable_union': 1, 'enable_columnar_output': 1, 'enable_lazy_fetch': 0, 'null_div_by_zero': 1, 'enable_watchdog': 0, 'enable_thrift_logs': 0}
MODIN_DEBUG: Force Modin engine to be "Python" unless specified by $MODIN_ENGINE.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_EXPERIMENTAL: Whether to Turn on experimental features.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_RAY_CLUSTER: Whether Modin is running on pre-initialized Ray cluster.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_LOG_FILE_SIZE: Max size of logs (in MBs) to store per Modin job.
        Provide an integer value
        Current value: 10
MODIN_LOG_MEMORY_INTERVAL: Interval (in seconds) to profile memory utilization for logging.
        Provide an integer value
        Current value: 5
MODIN_LOG_MODE: Set ``LogMode`` value if users want to opt-in.
        Provide a string (valid examples are: enable, disable, enable_api_only)
        Current value: disable
MODIN_MEMORY: How much memory (in bytes) give to an execution engine.

Notes
-----
* In Ray case: the amount of memory to start the Plasma object store with.
* In Dask case: the amount of memory that is given to each worker depending on CPUs used.
        Provide an integer value
        Current value: None
MODIN_MIN_PARTITION_SIZE: Minimum number of rows/columns in a single pandas partition split.

Once a partition for a pandas dataframe has more than this many elements,
Modin adds another partition.
        Provide an integer value
        Current value: 32
MODIN_NPARTITIONS: How many partitions to use for a Modin DataFrame (along each axis).
        Provide an integer value
        Current value: 16
MODIN_OMNISCI_FRAGMENT_SIZE: How big a fragment in OmniSci should be when creating a table (in rows).
        Provide an integer value
        Current value: None
MODIN_OMNISCI_LAUNCH_PARAMETERS: Additional command line options for the OmniSci engine.

Please visit OmniSci documentation for the description of available parameters:
https://docs.omnisci.com/installation-and-configuration/config-parameters#configuration-parameters-for-omniscidb
        Provide a sequence of KEY=VALUE values separated by comma (Example: 'KEY1=VALUE1,KEY2=VALUE2,KEY3=VALUE3')
        Current value: {'enable_union': 1, 'enable_columnar_output': 1, 'enable_lazy_fetch': 0, 'null_div_by_zero': 1, 'enable_watchdog': 0, 'enable_thrift_logs': 0}
MODIN_PERSISTENT_PICKLE: Whether serialization should be persistent.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_PROGRESS_BAR: Whether or not to show the progress bar.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_REDIS_ADDRESS: Redis address to connect to when running in Ray cluster.
        Provide a string
        Current value: None
MODIN_REDIS_PASSWORD: What password to use for connecting to Redis.
        Provide a string
        Current value: e79ba1c2610cdea4a2ca2f6ec677ee48926b8996cb8d3c972dd28a73f2f60e18
MODIN_READ_SQL_ENGINE: Engine to run `read_sql`.
        Provide a case-insensitive string (valid examples are: Pandas, Connectorx)
        Current value: Pandas
MODIN_SOCKS_PROXY: SOCKS proxy address if it is needed for SSH to work.
        Provide a string
        Current value: None
MODIN_STORAGE_FORMAT: Engine to run on a single node of distribution.
        Provide a case-insensitive string (valid examples are: Pandas, Hdk, Pyarrow, Cudf)
        Current value: Pandas
MODIN_TEST_DATASET_SIZE: Dataset size for running some tests.
        Provide a case-insensitive string (valid examples are: Small, Normal, Big)
        Current value: None
MODIN_TEST_RAY_CLIENT: Set to true to start and connect Ray client before a testing session starts.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_TEST_READ_FROM_POSTGRES: Set to true to test reading from Postgres.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_TEST_READ_FROM_SQL_SERVER: Set to true to test reading from SQL server.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_TEST_TRACK_FILE_LEAKS: Whether to track for open file handles leakage during testing.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False

@dchigarev
Collaborator

dchigarev commented Jun 21, 2023

@vnlitvinov the error seems to be reproducible only if CpuCount is set to 8. Here's a script that should reproduce the error on any machine (tried on both Windows and Linux):

(I've also been able to reproduce the problem on the latest master be98fe6 with pandas 2.0.2)

script
import modin.pandas as pd
import ray
import modin.config as cfg

cfg.CpuCount.put(8)

ray.init(num_cpus=cfg.CpuCount.get(), runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
import requests

# Get the example data
r = requests.get('https://raw.githubusercontent.com/ES-Catapult/clock_plot/main/data/eden_2_houseid_324_combined_data.csv')
with open('data.csv', 'w') as f:
    f.write(r.text)
data = pd.read_csv('data.csv')
data = data.loc[8029:8033, ['datetime', 'reading_elec']].reset_index(drop=True)
data['reading_elec'] = 0
data['datetime'] = pd.to_datetime(data['datetime'])
# Try to sort the datetimes
data = data.sort_values('datetime')
print(data)
python -m modin.config
MODIN_ASV_DATASIZE_CONFIG: Allows to override default size of data (shapes).
        Provide a string
        Current value: None
MODIN_ASV_USE_IMPL: Allows to select a library that we will use for testing performance.
        Provide a string (valid examples are: modin, pandas)
        Current value: modin
MODIN_ASYNC_READ_MODE: It does not wait for the end of reading information from the source.

Can break situations when reading occurs in a context, when exiting
from which the source is deleted.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_BENCHMARK_MODE: Whether or not to perform computations synchronously.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
AWS_ACCESS_KEY_ID: Set to AWS_ACCESS_KEY_ID when running mock S3 tests for Modin in GitHub CI.
        Provide a case-insensitive string
        Current value: foobar_key
AWS_SECRET_ACCESS_KEY: Set to AWS_SECRET_ACCESS_KEY when running mock S3 tests for Modin in GitHub CI.
        Provide a case-insensitive string
        Current value: foobar_secret
MODIN_CPUS: How many CPU cores to use during initialization of the Modin engine.
        Provide an integer value
        Current value: 8
MODIN_LOG_RPYC: Whether to gather RPyC logs (applicable for remote context).
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_TRACE_RPYC: Whether to trace RPyC calls (applicable for remote context).
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_USE_CALCITE: Whether to use Calcite for OmniSci queries execution.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: True
MODIN_ENGINE: Distribution engine to run queries by.
        Provide a case-insensitive string (valid examples are: Ray, Dask, Python, Native, Unidist)
        Current value: Ray
MODIN_EXPERIMENTAL_GROUPBY: Set to true to use Modin's experimental group by implementation.

Experimental groupby is implemented using a range-partitioning technique,
note that it may not always work better than the original Modin's TreeReduce
and FullAxis implementations. For more information visit the according section
of Modin's documentation: TODO: add a link to the section once it's written.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_EXPERIMENTAL_NUMPY_API: Set to true to use Modin's experimental NumPy API.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_GITHUB_CI: Set to true when running Modin in GitHub CI.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_GPUS: How may GPU devices to utilize across the whole distribution.
        Provide an integer value
        Current value: None
MODIN_HDK_FRAGMENT_SIZE: How big a fragment in HDK should be when creating a table (in rows).
        Provide an integer value
        Current value: None
MODIN_HDK_LAUNCH_PARAMETERS: Additional command line options for the HDK engine.

Please visit OmniSci documentation for the description of available parameters:
https://docs.omnisci.com/installation-and-configuration/config-parameters#configuration-parameters-for-omniscidb
        Provide a sequence of KEY=VALUE values separated by comma (Example: 'KEY1=VALUE1,KEY2=VALUE2,KEY3=VALUE3')
        Current value: {'enable_union': 1, 'enable_columnar_output': 1, 'enable_lazy_fetch': 0, 'null_div_by_zero': 1, 'enable_watchdog': 0, 'enable_thrift_logs': 0}
MODIN_DEBUG: Force Modin engine to be "Python" unless specified by $MODIN_ENGINE.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_EXPERIMENTAL: Whether to Turn on experimental features.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_RAY_CLUSTER: Whether Modin is running on pre-initialized Ray cluster.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: None
MODIN_LOG_FILE_SIZE: Max size of logs (in MBs) to store per Modin job.
        Provide an integer value
        Current value: 10
MODIN_LOG_MEMORY_INTERVAL: Interval (in seconds) to profile memory utilization for logging.
        Provide an integer value
        Current value: 5
MODIN_LOG_MODE: Set ``LogMode`` value if users want to opt-in.
        Provide a string (valid examples are: enable, disable, enable_api_only)
        Current value: disable
MODIN_MEMORY: How much memory (in bytes) give to an execution engine.

Notes
-----
* In Ray case: the amount of memory to start the Plasma object store with.
* In Dask case: the amount of memory that is given to each worker depending on CPUs used.
        Provide an integer value
        Current value: None
MODIN_MIN_PARTITION_SIZE: Minimum number of rows/columns in a single pandas partition split.

Once a partition for a pandas dataframe has more than this many elements,
Modin adds another partition.
        Provide an integer value
        Current value: 32
MODIN_NPARTITIONS: How many partitions to use for a Modin DataFrame (along each axis).
        Provide an integer value
        Current value: 8
MODIN_OMNISCI_FRAGMENT_SIZE: How big a fragment in OmniSci should be when creating a table (in rows).
        Provide an integer value
        Current value: None
MODIN_OMNISCI_LAUNCH_PARAMETERS: Additional command line options for the OmniSci engine.

Please visit OmniSci documentation for the description of available parameters:
https://docs.omnisci.com/installation-and-configuration/config-parameters#configuration-parameters-for-omniscidb
        Provide a sequence of KEY=VALUE values separated by comma (Example: 'KEY1=VALUE1,KEY2=VALUE2,KEY3=VALUE3')
        Current value: {'enable_union': 1, 'enable_columnar_output': 1, 'enable_lazy_fetch': 0, 'null_div_by_zero': 1, 'enable_watchdog': 0, 'enable_thrift_logs': 0}
MODIN_PERSISTENT_PICKLE: Whether serialization should be persistent.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_PROGRESS_BAR: Whether or not to show the progress bar.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_REDIS_ADDRESS: Redis address to connect to when running in Ray cluster.
        Provide a string
        Current value: None
MODIN_REDIS_PASSWORD: What password to use for connecting to Redis.
        Provide a string
        Current value: a8dba2b876ff8fd950cbdfcb56f3dd8537e02f7b82a0b23292608622eb03a7b1
MODIN_READ_SQL_ENGINE: Engine to run `read_sql`.
        Provide a case-insensitive string (valid examples are: Pandas, Connectorx)
        Current value: Pandas
MODIN_SOCKS_PROXY: SOCKS proxy address if it is needed for SSH to work.
        Provide a string
        Current value: None
MODIN_STORAGE_FORMAT: Engine to run on a single node of distribution.
        Provide a case-insensitive string (valid examples are: Pandas, Hdk, Pyarrow, Cudf)
        Current value: Pandas
MODIN_TEST_DATASET_SIZE: Dataset size for running some tests.
        Provide a case-insensitive string (valid examples are: Small, Normal, Big)
        Current value: None
MODIN_TEST_RAY_CLIENT: Set to true to start and connect Ray client before a testing session starts.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_TEST_READ_FROM_POSTGRES: Set to true to test reading from Postgres.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_TEST_READ_FROM_SQL_SERVER: Set to true to test reading from SQL server.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
MODIN_TEST_TRACK_FILE_LEAKS: Whether to track for open file handles leakage during testing.
        Provide a boolean flag (any of 'true', 'yes' or '1' in case insensitive manner is considered positive)
        Current value: False
pd.show_versions()

UserWarning: Setuptools is replacing distutils.

INSTALLED VERSIONS

commit : be98fe6
python : 3.8.16.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Europe.1252

Modin dependencies

modin : 0.17.0+434.gbe98fe6c
ray : 2.5.0
dask : None
distributed : None
hdk : None

pandas dependencies

pandas : 2.0.2
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

conda list
# Name                    Version                   Build  Channel
aiohttp                   3.8.4                    pypi_0    pypi
aiohttp-cors              0.7.0                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
ansicon                   1.89.0                   pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
attrs                     23.1.0                   pypi_0    pypi
blessed                   1.20.0                   pypi_0    pypi
ca-certificates           2023.05.30           haa95532_0
cachetools                5.3.1                    pypi_0    pypi
certifi                   2023.5.7                 pypi_0    pypi
charset-normalizer        3.1.0                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
colorful                  0.5.5                    pypi_0    pypi
distlib                   0.3.6                    pypi_0    pypi
filelock                  3.12.2                   pypi_0    pypi
frozenlist                1.3.3                    pypi_0    pypi
fsspec                    2023.6.0                 pypi_0    pypi
google-api-core           2.11.1                   pypi_0    pypi
google-auth               2.20.0                   pypi_0    pypi
googleapis-common-protos  1.59.1                   pypi_0    pypi
gpustat                   1.1                      pypi_0    pypi
grpcio                    1.51.3                   pypi_0    pypi
idna                      3.4                      pypi_0    pypi
importlib-resources       5.12.0                   pypi_0    pypi
jinxed                    1.2.0                    pypi_0    pypi
jsonschema                4.17.3                   pypi_0    pypi
libffi                    3.4.4                hd77b12b_0
modin                     0.17.0+434.gbe98fe6c          pypi_0    pypi
msgpack                   1.0.5                    pypi_0    pypi
multidict                 6.0.4                    pypi_0    pypi
numpy                     1.24.3                   pypi_0    pypi
nvidia-ml-py              11.525.112               pypi_0    pypi
opencensus                0.11.2                   pypi_0    pypi
opencensus-context        0.1.3                    pypi_0    pypi
openssl                   3.0.8                h2bbff1b_0
packaging                 23.1                     pypi_0    pypi
pandas                    2.0.2                    pypi_0    pypi
pip                       23.1.2           py38haa95532_0
pkgutil-resolve-name      1.3.10                   pypi_0    pypi
platformdirs              3.7.0                    pypi_0    pypi
prometheus-client         0.17.0                   pypi_0    pypi
protobuf                  4.23.3                   pypi_0    pypi
psutil                    5.9.5                    pypi_0    pypi
py-spy                    0.3.14                   pypi_0    pypi
pyarrow                   12.0.1                   pypi_0    pypi
pyasn1                    0.5.0                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
pydantic                  1.10.9                   pypi_0    pypi
pyrsistent                0.19.3                   pypi_0    pypi
python                    3.8.16               h1aa4202_4
python-dateutil           2.8.2                    pypi_0    pypi
pytz                      2023.3                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
ray                       2.5.0                    pypi_0    pypi
requests                  2.31.0                   pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
setuptools                67.8.0           py38haa95532_0
six                       1.16.0                   pypi_0    pypi
smart-open                6.3.0                    pypi_0    pypi
sqlite                    3.41.2               h2bbff1b_0
typing-extensions         4.6.3                    pypi_0    pypi
tzdata                    2023.3                   pypi_0    pypi
urllib3                   1.26.16                  pypi_0    pypi
vc                        14.2                 h21ff451_1
virtualenv                20.21.0                  pypi_0    pypi
vs2015_runtime            14.27.29016          h5e58377_2
wcwidth                   0.2.6                    pypi_0    pypi
wheel                     0.38.4           py38haa95532_0
yarl                      1.9.2                    pypi_0    pypi
zipp                      3.15.0                   pypi_0    pypi
