Description
Describe the bug
When load_dataset loads Arrow files that each hold a large number of large 2D arrays (2,000 arrays of size 1000 × 1152 in this case), it fails with the error OSError: Invalid flatbuffers message.
When each file stores only 300 arrays of this size (1000 × 1152), the files load correctly.
Storing 2,000 arrays per file produces about 100 files of about 5-6 GB each. Storing 300 arrays per file produces about 600 files, which is too many.
Steps to reproduce the bug
error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[2], line 4
1 from datasets import Dataset
2 from datasets import load_dataset
----> 4 real_dataset = load_dataset("arrow", data_files='tensorData/real_ResidueTensor/*', split="train")#.with_format("torch") # , split="train"
5 # sim_dataset = load_dataset("arrow", data_files='tensorData/sim_ResidueTensor/*', split="train").with_format("torch")
6 real_dataset
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/load.py:2151, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
2148 return builder_instance.as_streaming_dataset(split=split)
2150 # Download and prepare data
-> 2151 builder_instance.download_and_prepare(
2152 download_config=download_config,
2153 download_mode=download_mode,
2154 verification_mode=verification_mode,
2155 num_proc=num_proc,
2156 storage_options=storage_options,
2157 )
2159 # Build dataset for splits
2160 keep_in_memory = (
2161 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
2162 )
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/builder.py:924, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
922 if num_proc is not None:
923 prepare_split_kwargs["num_proc"] = num_proc
--> 924 self._download_and_prepare(
925 dl_manager=dl_manager,
926 verification_mode=verification_mode,
927 **prepare_split_kwargs,
928 **download_and_prepare_kwargs,
929 )
930 # Sync info
931 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/builder.py:978, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
976 split_dict = SplitDict(dataset_name=self.dataset_name)
977 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 978 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
980 # Checksums verification
981 if verification_mode == VerificationMode.ALL_CHECKS and dl_manager.record_checksums:
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/packaged_modules/arrow/arrow.py:47, in Arrow._split_generators(self, dl_manager)
45 with open(file, "rb") as f:
46 try:
---> 47 reader = pa.ipc.open_stream(f)
48 except pa.lib.ArrowInvalid:
49 reader = pa.ipc.open_file(f)
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.py:190, in open_stream(source, options, memory_pool)
171 def open_stream(source, *, options=None, memory_pool=None):
172 """
173 Create reader for Arrow streaming format.
174
(...)
188 A reader for the given source
189 """
--> 190 return RecordBatchStreamReader(source, options=options,
191 memory_pool=memory_pool)
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.py:52, in RecordBatchStreamReader.__init__(self, source, options, memory_pool)
50 def __init__(self, source, *, options=None, memory_pool=None):
51 options = _ensure_default_ipc_read_options(options)
---> 52 self._open(source, options=options, memory_pool=memory_pool)
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.pxi:1006, in pyarrow.lib._RecordBatchStreamReader._open()
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
OSError: Invalid flatbuffers message.
reproduce: This is only an example. The real 2D matrices are outputs of the ESM large model, and the matrix size used here is approximate.
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather
from datasets import load_dataset

# 2,000 random arrays standing in for the real ESM outputs
random_arrays_list = [np.random.rand(1000, 1152) for _ in range(2000)]
table = pa.Table.from_pydict({
    'tensor': [tensor.tolist() for tensor in random_arrays_list]
})
feather.write_feather(table, 'test.arrow')

# Fails with OSError: Invalid flatbuffers message
dataset = load_dataset("arrow", data_files='test.arrow', split="train")
Expected behavior
load_dataset should load the dataset normally, as feather.read_feather does:
import pyarrow.feather as feather
feather.read_feather('tensorData/real_ResidueTensor/real_tensor_1.arrow')
In addition, load_dataset("parquet", data_files='test.arrow', split="train") works fine.
Environment info
- datasets version: 3.2.0
- Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- huggingface_hub version: 0.26.5
- PyArrow version: 18.1.0
- Pandas version: 2.2.3
- fsspec version: 2024.9.0