
Conversation

ArjunJagdale (Contributor) commented Oct 25, 2025

Summary

Fixes #7818

This PR resolves the `ValueError: Unable to avoid copy while creating an array` error that occurs when using `train_test_split` with the `stratify_by_column` parameter on NumPy 2.0+.

Changes

  • Wrapped the stratify column array access with `np.asarray()` in `arrow_dataset.py`
  • This allows NumPy 2.0 to make a copy when the Arrow array is non-contiguous in memory
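The change boils down to the pattern below. This is a minimal sketch, not the actual code in `arrow_dataset.py`; the function name is illustrative:

```python
import numpy as np

def to_numpy_labels(column):
    # Before: np.array(column, copy=False) -- on NumPy 2.0+ this means
    # "never copy" and raises ValueError when the underlying buffer is
    # non-contiguous and a copy is unavoidable.
    # After: np.asarray copies only when it must, on any NumPy version.
    return np.asarray(column)

print(to_numpy_labels([1, 0, 1]))  # [1 0 1]
```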

Testing

  • ✅ Tested with NumPy 2.3.4 - stratified splits work correctly
  • ✅ Tested with NumPy 1.26.4 - backward compatibility maintained
  • ✅ Verified class balance is preserved in stratified splits
  • ✅ Non-stratified splits continue to work as expected

NumPy 2.0 changed the behavior of the `copy=False` parameter to be stricter. When `train_test_split` converted Arrow arrays to NumPy format for stratification, it triggered this error for non-contiguous arrays. Using `np.asarray()` allows copying when necessary, which is the recommended migration path per NumPy 2.0 documentation.
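The stricter semantics can be demonstrated in isolation (assuming only that NumPy is installed; converting a Python list always requires a copy, which is what makes `copy=False` raise on 2.x):

```python
import numpy as np

data = [0, 1, 0, 1]

# np.asarray copies only when necessary -- safe on all NumPy versions.
labels = np.asarray(data)
print(labels)  # [0 1 0 1]

# On NumPy 2.x, copy=False means "never copy" and raises ValueError
# whenever a copy cannot be avoided.
if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
    try:
        np.array(data, copy=False)
    except ValueError as err:
        print("NumPy 2.x raises:", err)
```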
ArjunJagdale (Contributor, Author) commented Oct 25, 2025

Also, I have run some tests on real datasets:

  • ✅ IMDB dataset with NumPy 1.26.4 and 2.3.4
  • ✅ Rotten Tomatoes dataset with NumPy 1.26.4 and 2.3.4
  • ✅ Artificial datasets with ClassLabel features

Results:

  • Stratified splits work correctly in both NumPy versions
  • Class balance is perfectly maintained (e.g., Rotten Tomatoes: 426:426 train, 107:107 test)
  • Non-stratified splits continue to work as expected
  • Backward compatibility with NumPy 1.x confirmed
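For reference, the class-balance numbers above can be verified with `np.bincount` (a sketch using the Rotten Tomatoes counts reported in these results):

```python
import numpy as np

# Labels mirroring the reported split: 426 per class in train, 107 in test.
train_labels = np.array([0] * 426 + [1] * 426)
test_labels = np.array([0] * 107 + [1] * 107)

# bincount gives per-class sample counts, so equal entries mean balance.
print(np.bincount(train_labels))  # [426 426]
print(np.bincount(test_labels))   # [107 107]
```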

Below are the raw logs from testing:

(venv) F:\Python\Machine learning\datasets>pip install "numpy<2.0"
Collecting numpy<2.0
  Using cached numpy-1.26.4-cp311-cp311-win_amd64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp311-cp311-win_amd64.whl (15.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.4
    Uninstalling numpy-2.3.4:
      Successfully uninstalled numpy-2.3.4
Successfully installed numpy-1.26.4

[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip

(venv) F:\Python\Machine learning\datasets>python test_fix.py
NumPy version: 1.26.4
============================================================

[Test 1] Testing with IMDB dataset...
README.md: 7.81kB [00:00, 7.78MB/s]
F:\Python\Machine learning\datasets\venv\Lib\site-packages\huggingface_hub\file_download.py:143: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\arjun\.cache\huggingface\hub\datasets--imdb. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
train-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████| 21.0M/21.0M [00:06<00:00, 3.08MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
test-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████| 20.5M/20.5M [00:07<00:00, 2.56MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
unsupervised-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████| 42.0M/42.0M [00:14<00:00, 2.84MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 267643.40 examples/s]
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 324697.85 examples/s]
Generating unsupervised split: 100%|██████████████████████████████████████████████████████████████████████| 50000/50000 [00:00<00:00, 289202.11 examples/s]
Loaded 1000 samples
✅ IMDB SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
})
Train class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Test class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

============================================================

[Test 2] Testing with Rotten Tomatoes dataset...
README.md: 7.46kB [00:00, ?B/s]
F:\Python\Machine learning\datasets\venv\Lib\site-packages\huggingface_hub\file_download.py:143: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\arjun\.cache\huggingface\hub\datasets--rotten_tomatoes. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
train.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 699k/699k [00:00<00:00, 3.46MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
validation.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 90.0k/90.0k [00:00<00:00, 6.80MB/s]
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
test.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 92.2k/92.2k [00:00<00:00, 5.85MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████████████████████| 8530/8530 [00:00<00:00, 856082.82 examples/s]
Generating validation split: 100%|██████████████████████████████████████████████████████████████████████████| 1066/1066 [00:00<00:00, 531075.91 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1066/1066 [00:00<?, ? examples/s]
Loaded 1066 samples
✅ Rotten Tomatoes SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 852
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 214
    })
})

Train: class_0=426, class_1=426
Test:  class_0=107, class_1=107

============================================================

[Test 3] Testing without stratification (sanity check)...
✅ Non-stratified split SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 80
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 20
    })
})

============================================================
All tests completed!

Upgrading numpy for >= 2

(venv) F:\Python\Machine learning\datasets>pip install "numpy>=2.0"
Collecting numpy>=2.0
  Using cached numpy-2.3.4-cp311-cp311-win_amd64.whl.metadata (60 kB)
Using cached numpy-2.3.4-cp311-cp311-win_amd64.whl (13.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
Successfully installed numpy-2.3.4

[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip

(venv) F:\Python\Machine learning\datasets>python test_fix.py
NumPy version: 2.3.4
============================================================

[Test 1] Testing with IMDB dataset...
Loaded 1000 samples
✅ IMDB SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
})
Train class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Test class distribution: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

============================================================

[Test 2] Testing with Rotten Tomatoes dataset...
Loaded 1066 samples
✅ Rotten Tomatoes SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 852
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 214
    })
})

Train: class_0=426, class_1=426
Test:  class_0=107, class_1=107

============================================================

[Test 3] Testing without stratification (sanity check)...
✅ Non-stratified split SUCCESS!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 80
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 20
    })
})

============================================================
All tests completed!

ArjunJagdale (Contributor, Author) commented Oct 25, 2025

test_fix.py

Here is the file I used for testing, @lhoestq.

ArjunJagdale changed the title from "Fix argument passing in stratified shuffle split" to "resolves the ValueError: Unable to avoid copy while creating an array" on Oct 25, 2025


Development

Successfully merging this pull request may close this issue: train_test_split and stratify breaks with Numpy 2.0