
Conversation

ArjunJagdale (Contributor) commented Oct 25, 2025

Summary

Fixes #7818

This PR resolves the `ValueError: Unable to avoid copy while creating an array` error that occurs when using `train_test_split` with the `stratify_by_column` parameter on NumPy 2.0+.

Changes

  • Wrapped the stratify column array access with `np.asarray()` in `arrow_dataset.py`
  • This allows NumPy 2.0 to make a copy when the Arrow array is non-contiguous in memory
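The change boils down to the pattern below. This is a minimal sketch, not the actual code in `arrow_dataset.py`; the function name is illustrative:

```python
import numpy as np

def to_numpy_labels(column):
    # Before: np.array(column, copy=False) -- on NumPy 2.0+ this means
    # "never copy" and raises ValueError when the underlying buffer is
    # non-contiguous and a copy is unavoidable.
    # After: np.asarray copies only when it must, on any NumPy version.
    return np.asarray(column)

print(to_numpy_labels([1, 0, 1]))  # [1 0 1]
```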

Testing

  • ✅ Tested with NumPy 2.3.4 - stratified splits work correctly
  • ✅ Tested with NumPy 1.26.4 - backward compatibility maintained
  • ✅ Verified class balance is preserved in stratified splits
  • ✅ Non-stratified splits continue to work as expected

NumPy 2.0 changed the behavior of the `copy=False` parameter to be stricter. When `train_test_split` converted Arrow arrays to NumPy format for stratification, it triggered this error for non-contiguous arrays. Using `np.asarray()` allows copying when necessary, which is the recommended migration path per NumPy 2.0 documentation.
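The stricter semantics can be demonstrated in isolation (assuming only that NumPy is installed; converting a Python list always requires a copy, which is what makes `copy=False` raise on 2.x):

```python
import numpy as np

data = [0, 1, 0, 1]

# np.asarray copies only when necessary -- safe on all NumPy versions.
labels = np.asarray(data)
print(labels)  # [0 1 0 1]

# On NumPy 2.x, copy=False means "never copy" and raises ValueError
# whenever a copy cannot be avoided.
if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
    try:
        np.array(data, copy=False)
    except ValueError as err:
        print("NumPy 2.x raises:", err)
```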
ArjunJagdale (Contributor, Author) commented Oct 25, 2025

Also, I have run some tests on real datasets:

  • ✅ IMDB dataset with NumPy 1.26.4 and 2.3.4
  • ✅ Rotten Tomatoes dataset with NumPy 1.26.4 and 2.3.4
  • ✅ Artificial datasets with ClassLabel features

Results:

  • Stratified splits work correctly in both NumPy versions
  • Class balance is perfectly maintained (e.g., Rotten Tomatoes: 426:426 train, 107:107 test)
  • Non-stratified splits continue to work as expected
  • Backward compatibility with NumPy 1.x confirmed
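For reference, the class-balance numbers above can be verified with `np.bincount` (a sketch using the Rotten Tomatoes counts reported in these results):

```python
import numpy as np

# Labels mirroring the reported split: 426 per class in train, 107 in test.
train_labels = np.array([0] * 426 + [1] * 426)
test_labels = np.array([0] * 107 + [1] * 107)

# bincount gives per-class sample counts, so equal entries mean balance.
print(np.bincount(train_labels))  # [426 426]
print(np.bincount(test_labels))   # [107 107]
```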

Below are the raw logs from testing:

(venv) F:\Python\Machine learning\datasets>pip install "numpy<2.0"
Collecting numpy<2.0
  Using cached numpy-1.26.4-cp311-cp311-win_amd64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp311-cp311-win_amd64.whl (15.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.4
    Uninstalling numpy-2.3.4:
      Successfully uninstalled numpy-2.3.4
Successfully installed numpy-1.26.4

[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip

(venv) F:\Python\Machine learning\datasets>python test_fix.py
NumPy version: 1.26.4
============================================================

[Test 1] Testing with IMDB dataset...
README.md: 7.81kB [00:00, 7.78MB/s]
F:\Python\Machine learning\datasets\venv\Lib\site-packages\huggingface_hub\file_download.py:143: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\arjun\.cache\huggingface\hub\datasets--imdb. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
train-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████| 21.0M/21.0M [00:06<00:00, 3.08MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
test-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████| 20.5M/20.5M [00:07<00:00, 2.56MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
unsupervised-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████| 42.0M/42.0M [00:14<00:00, 2.84MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 267643.40 examples/s]
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 324697.85 examples/s]
Generating unsupervised split: 100%|██████████████████████████████████████████████████████████████████████| 50000/50000 [00:00<00:00, 289202.11 examples/s]
Loaded 1000 samples
✅ IMDB SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
})
Train class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Test class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

============================================================

[Test 2] Testing with Rotten Tomatoes dataset...
README.md: 7.46kB [00:00, ?B/s]
F:\Python\Machine learning\datasets\venv\Lib\site-packages\huggingface_hub\file_download.py:143: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\arjun\.cache\huggingface\hub\datasets--rotten_tomatoes. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
train.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 699k/699k [00:00<00:00, 3.46MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
validation.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 90.0k/90.0k [00:00<00:00, 6.80MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
test.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 92.2k/92.2k [00:00<00:00, 5.85MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████████████████████| 8530/8530 [00:00<00:00, 856082.82 examples/s]
Generating validation split: 100%|██████████████████████████████████████████████████████████████████████████| 1066/1066 [00:00<00:00, 531075.91 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1066/1066 [00:00<?, ? examples/s]
Loaded 1066 samples
✅ Rotten Tomatoes SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 852
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 214
    })
})

Train: class_0=426, class_1=426
Test:  class_0=107, class_1=107

============================================================

[Test 3] Testing without stratification (sanity check)...
✅ Non-stratified split SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 80
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 20
    })
})

============================================================
All tests completed!

Upgrading numpy for >= 2

(venv) F:\Python\Machine learning\datasets>pip install "numpy>=2.0"
Collecting numpy>=2.0
  Using cached numpy-2.3.4-cp311-cp311-win_amd64.whl.metadata (60 kB)
Using cached numpy-2.3.4-cp311-cp311-win_amd64.whl (13.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
Successfully installed numpy-2.3.4

[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip

(venv) F:\Python\Machine learning\datasets>python test_fix.py
NumPy version: 2.3.4
============================================================

[Test 1] Testing with IMDB dataset...
Loaded 1000 samples
✅ IMDB SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
})
Train class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Test class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

============================================================

[Test 2] Testing with Rotten Tomatoes dataset...
Loaded 1066 samples
✅ Rotten Tomatoes SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 852
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 214
    })
})

Train: class_0=426, class_1=426
Test:  class_0=107, class_1=107

============================================================

[Test 3] Testing without stratification (sanity check)...
✅ Non-stratified split SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 80
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 20
    })
})

============================================================
All tests completed!

ArjunJagdale (Contributor, Author) commented Oct 25, 2025

test_fix.py

Here is the file I used for testing, @lhoestq.

ArjunJagdale changed the title from "Fix argument passing in stratified shuffle split" to "resolves the ValueError: Unable to avoid copy while creating an array" on Oct 25, 2025


Development

Successfully merging this pull request may close this issue: train_test_split and stratify breaks with Numpy 2.0