Add HF_ prefix to env var MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES (#2409)
* add HF_ prefix to env var MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES

* update tests

* style
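
For users of the library, the practical effect of this rename is that the in-memory size threshold is now read from `HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`. A minimal sketch of how the renamed variable is picked up after this change (a value of `0` disables in-memory copying, per the docstrings in the diffs below); this is an illustration, not part of the commit:

```python
import os

# Must be set before importing `datasets`: the value is read once,
# at import time, in src/datasets/config.py (see the config.py diff below).
os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "0"  # 0 disables in-memory copying

import datasets

print(datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES)  # -> 0.0
```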

lhoestq authored May 27, 2021
1 parent 1ef6fb5 commit aba604e
Showing 7 changed files with 27 additions and 25 deletions.
6 changes: 3 additions & 3 deletions src/datasets/arrow_dataset.py
@@ -654,10 +654,10 @@ def load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] =
             Instance of the remote filesystem used to download the files from.
         keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the
             dataset will be copied in-memory if its size is smaller than
-            `datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be
+            `datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be
             disabled (i.e., the dataset will not be loaded in memory) by setting to ``0`` either the configuration
-            option ``datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the
-            environment variable ``MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
+            option ``datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the
+            environment variable ``HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
         Returns:
             :class:`Dataset` or :class:`DatasetDict`.
4 changes: 2 additions & 2 deletions src/datasets/config.py
@@ -145,8 +145,8 @@

 # In-memory
 DEFAULT_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = 250 * 2 ** 20  # 250 MiB
-MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = float(
-    os.environ.get("MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", DEFAULT_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES)
+HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = float(
+    os.environ.get("HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", DEFAULT_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES)
 )

 # File names
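
Because the environment variable is converted with `float()` exactly once at import time, changing the environment afterwards has no effect; the documented higher-precedence path is to set the config attribute directly. A small sketch of that pattern (the 100 MiB threshold is chosen arbitrarily for illustration):

```python
import datasets

# Overrides whatever HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES said at import time
# (the config attribute has higher precedence than the environment variable).
datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = 100 * 2 ** 20  # 100 MiB
```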
6 changes: 3 additions & 3 deletions src/datasets/dataset_dict.py
@@ -687,10 +687,10 @@ def load_from_disk(dataset_dict_path: str, fs=None, keep_in_memory: Optional[boo
             Instance of the remote filesystem used to download the files from.
         keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the
             dataset will be copied in-memory if its size is smaller than
-            `datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be
+            `datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be
             disabled (i.e., the dataset will not be loaded in memory) by setting to ``0`` either the configuration
-            option ``datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the environment
-            variable ``MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
+            option ``datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the environment
+            variable ``HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
         Returns:
             :class:`DatasetDict`
12 changes: 6 additions & 6 deletions src/datasets/load.py
@@ -684,10 +684,10 @@ def load_dataset(
         ignore_verifications (:obj:`bool`, default ``False``): Ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/...).
         keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the
             dataset will be copied in-memory if its size is smaller than
-            `datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be disabled
+            `datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be disabled
             (i.e., the dataset will not be loaded in memory) by setting to ``0`` either the configuration option
-            ``datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the environment variable
-            ``MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
+            ``datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the environment variable
+            ``HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
         save_infos (:obj:`bool`, default ``False``): Save the dataset information (checksums/size/splits/...).
         script_version (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
@@ -777,10 +777,10 @@ def load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] =
             Instance of of the remote filesystem used to download the files from.
         keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the
             dataset will be copied in-memory if its size is smaller than
-            `datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be disabled
+            `datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES` (default `250 MiB`). This behavior can be disabled
             (i.e., the dataset will not be loaded in memory) by setting to ``0`` either the configuration option
-            ``datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the environment variable
-            ``MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
+            ``datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (higher precedence) or the environment variable
+            ``HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`` (lower precedence).
         Returns:
             ``datasets.Dataset`` or ``datasets.DatasetDict``
8 changes: 4 additions & 4 deletions src/datasets/utils/info_utils.py
@@ -85,15 +85,15 @@ def get_size_checksum_dict(path: str) -> dict:


 def is_small_dataset(dataset_size):
-    """Check if `dataset_size` is smaller than `config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`.
+    """Check if `dataset_size` is smaller than `config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`.
     Args:
         dataset_size (int): Dataset size in bytes.
     Returns:
-        bool: Whether `dataset_size` is smaller than `config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`.
+        bool: Whether `dataset_size` is smaller than `config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`.
     """
-    if dataset_size and config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES:
-        return dataset_size < config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
+    if dataset_size and config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES:
+        return dataset_size < config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
     else:
         return False
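
A quick sketch of how the renamed setting flows into `is_small_dataset` (the sizes are chosen only to bracket the 250 MiB default):

```python
from datasets import config
from datasets.utils.info_utils import is_small_dataset

config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = 250 * 2 ** 20  # the default threshold

print(is_small_dataset(100 * 2 ** 20))  # True: below the threshold
print(is_small_dataset(400 * 2 ** 20))  # False: not smaller than the threshold
print(is_small_dataset(None))           # False: an unknown size is never treated as small
```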
8 changes: 5 additions & 3 deletions tests/test_info_utils.py
@@ -7,7 +7,7 @@
 @pytest.fixture(params=[None, 0, 100 * 2 ** 20, 900 * 2 ** 20])
 def env_max_in_memory_dataset_size(request, monkeypatch):
     if request.param:
-        monkeypatch.setenv("MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", request.param)
+        monkeypatch.setenv("HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", request.param)


 @pytest.mark.parametrize("dataset_size", [None, 400 * 2 ** 20, 600 * 2 ** 20])
@@ -16,9 +16,11 @@ def test_is_small_dataset(
     dataset_size, config_max_in_memory_dataset_size, env_max_in_memory_dataset_size, monkeypatch
 ):
     if config_max_in_memory_dataset_size != "default":
-        monkeypatch.setattr(datasets.config, "MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", config_max_in_memory_dataset_size)
+        monkeypatch.setattr(
+            datasets.config, "HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", config_max_in_memory_dataset_size
+        )

-    max_in_memory_dataset_size = datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
+    max_in_memory_dataset_size = datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
     if config_max_in_memory_dataset_size == "default":
         if env_max_in_memory_dataset_size:
             assert max_in_memory_dataset_size == env_max_in_memory_dataset_size
8 changes: 4 additions & 4 deletions tests/test_load.py
@@ -233,9 +233,9 @@ def test_load_dataset_local_with_default_in_memory(
     current_dataset_size = 148
     if max_in_memory_dataset_size == "default":
         # default = 250 * 2 ** 20
-        max_in_memory_dataset_size = datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
+        max_in_memory_dataset_size = datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
     else:
-        monkeypatch.setattr(datasets.config, "MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", max_in_memory_dataset_size)
+        monkeypatch.setattr(datasets.config, "HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", max_in_memory_dataset_size)
     if max_in_memory_dataset_size:
         expected_in_memory = current_dataset_size < max_in_memory_dataset_size
     else:
@@ -253,9 +253,9 @@ def test_load_from_disk_with_default_in_memory(
     current_dataset_size = 512  # arrow file size = 512, in-memory dataset size = 148
     if max_in_memory_dataset_size == "default":
         # default = 250 * 2 ** 20
-        max_in_memory_dataset_size = datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
+        max_in_memory_dataset_size = datasets.config.HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES
     else:
-        monkeypatch.setattr(datasets.config, "MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", max_in_memory_dataset_size)
+        monkeypatch.setattr(datasets.config, "HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", max_in_memory_dataset_size)
     if max_in_memory_dataset_size:
         expected_in_memory = current_dataset_size < max_in_memory_dataset_size
     else:

1 comment on commit aba604e

@github-actions

PyArrow==1.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.023902 / 0.011353 (0.012549) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017906 / 0.011008 (0.006898) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.049160 / 0.038508 (0.010652) |
| read_batch_unformated after write_array2d | 0.037268 / 0.023109 (0.014159) |
| read_batch_unformated after write_flattened_sequence | 0.349009 / 0.275898 (0.073111) |
| read_batch_unformated after write_nested_sequence | 0.390024 / 0.323480 (0.066545) |
| read_col_formatted_as_numpy after write_array2d | 0.010895 / 0.007986 (0.002910) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005463 / 0.004328 (0.001135) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011706 / 0.004250 (0.007455) |
| read_col_unformated after write_array2d | 0.050060 / 0.037052 (0.013008) |
| read_col_unformated after write_flattened_sequence | 0.354008 / 0.258489 (0.095519) |
| read_col_unformated after write_nested_sequence | 0.394136 / 0.293841 (0.100295) |
| read_formatted_as_numpy after write_array2d | 0.173234 / 0.128546 (0.044687) |
| read_formatted_as_numpy after write_flattened_sequence | 0.140051 / 0.075646 (0.064405) |
| read_formatted_as_numpy after write_nested_sequence | 0.438056 / 0.419271 (0.018784) |
| read_unformated after write_array2d | 0.675395 / 0.043533 (0.631862) |
| read_unformated after write_flattened_sequence | 0.351709 / 0.255139 (0.096570) |
| read_unformated after write_nested_sequence | 0.379705 / 0.283200 (0.096506) |
| write_array2d | 4.265217 / 0.141683 (4.123534) |
| write_flattened_sequence | 1.797664 / 1.452155 (0.345509) |
| write_nested_sequence | 1.841281 / 1.492716 (0.348564) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.008878 / 0.018006 (-0.009128) |
| get_batch_of_1024_rows | 0.512539 / 0.000490 (0.512050) |
| get_first_row | 0.000264 / 0.000200 (0.000064) |
| get_last_row | 0.000053 / 0.000054 (-0.000002) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.041730 / 0.037411 (0.004318) |
| shard | 0.027576 / 0.014526 (0.013051) |
| shuffle | 0.029206 / 0.176557 (-0.147351) |
| sort | 0.051546 / 0.737135 (-0.685590) |
| train_test_split | 0.029387 / 0.296338 (-0.266952) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.453625 / 0.215209 (0.238416) |
| read 50000 | 4.544140 / 2.077655 (2.466485) |
| read_batch 50000 10 | 2.050299 / 1.504120 (0.546179) |
| read_batch 50000 100 | 1.722582 / 1.541195 (0.181387) |
| read_batch 50000 1000 | 1.698021 / 1.468490 (0.229530) |
| read_formatted numpy 5000 | 7.278083 / 4.584777 (2.693306) |
| read_formatted pandas 5000 | 6.382591 / 3.745712 (2.636879) |
| read_formatted tensorflow 5000 | 8.780307 / 5.269862 (3.510445) |
| read_formatted torch 5000 | 7.776589 / 4.565676 (3.210913) |
| read_formatted_batch numpy 5000 10 | 0.711141 / 0.424275 (0.286866) |
| read_formatted_batch numpy 5000 1000 | 0.009974 / 0.007607 (0.002367) |
| shuffled read 5000 | 0.590807 / 0.226044 (0.364763) |
| shuffled read 50000 | 5.898077 / 2.268929 (3.629148) |
| shuffled read_batch 50000 10 | 2.652001 / 55.444624 (-52.792624) |
| shuffled read_batch 50000 100 | 2.043807 / 6.876477 (-4.832670) |
| shuffled read_batch 50000 1000 | 2.214050 / 2.142072 (0.071977) |
| shuffled read_formatted numpy 5000 | 7.475281 / 4.805227 (2.670054) |
| shuffled read_formatted_batch numpy 5000 10 | 4.967593 / 6.500664 (-1.533071) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.445732 / 0.075469 (7.370263) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.832325 / 1.841788 (8.990538) |
| map fast-tokenizer batched | 12.667087 / 8.074308 (4.592779) |
| map identity | 38.474513 / 10.191392 (28.283121) |
| map identity batched | 0.919750 / 0.680424 (0.239326) |
| map no-op batched | 0.560646 / 0.534201 (0.026445) |
| map no-op batched numpy | 0.799170 / 0.579283 (0.219887) |
| map no-op batched pandas | 0.658268 / 0.434364 (0.223904) |
| map no-op batched pytorch | 0.710692 / 0.540337 (0.170355) |
| map no-op batched tensorflow | 1.510923 / 1.386936 (0.123987) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.024711 / 0.011353 (0.013358) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017223 / 0.011008 (0.006215) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.046850 / 0.038508 (0.008342) |
| read_batch_unformated after write_array2d | 0.034384 / 0.023109 (0.011275) |
| read_batch_unformated after write_flattened_sequence | 0.301296 / 0.275898 (0.025398) |
| read_batch_unformated after write_nested_sequence | 0.345677 / 0.323480 (0.022197) |
| read_col_formatted_as_numpy after write_array2d | 0.011053 / 0.007986 (0.003067) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004999 / 0.004328 (0.000670) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010747 / 0.004250 (0.006497) |
| read_col_unformated after write_array2d | 0.051282 / 0.037052 (0.014229) |
| read_col_unformated after write_flattened_sequence | 0.298792 / 0.258489 (0.040303) |
| read_col_unformated after write_nested_sequence | 0.343628 / 0.293841 (0.049787) |
| read_formatted_as_numpy after write_array2d | 0.172677 / 0.128546 (0.044131) |
| read_formatted_as_numpy after write_flattened_sequence | 0.135311 / 0.075646 (0.059665) |
| read_formatted_as_numpy after write_nested_sequence | 0.419621 / 0.419271 (0.000350) |
| read_unformated after write_array2d | 0.408980 / 0.043533 (0.365447) |
| read_unformated after write_flattened_sequence | 0.295151 / 0.255139 (0.040012) |
| read_unformated after write_nested_sequence | 0.328618 / 0.283200 (0.045419) |
| write_array2d | 1.584994 / 0.141683 (1.443311) |
| write_flattened_sequence | 1.666463 / 1.452155 (0.214308) |
| write_nested_sequence | 1.801562 / 1.492716 (0.308846) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.008413 / 0.018006 (-0.009593) |
| get_batch_of_1024_rows | 0.491025 / 0.000490 (0.490535) |
| get_first_row | 0.000475 / 0.000200 (0.000275) |
| get_last_row | 0.000050 / 0.000054 (-0.000005) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034294 / 0.037411 (-0.003117) |
| shard | 0.024557 / 0.014526 (0.010031) |
| shuffle | 0.024234 / 0.176557 (-0.152323) |
| sort | 0.041699 / 0.737135 (-0.695437) |
| train_test_split | 0.027186 / 0.296338 (-0.269153) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.444168 / 0.215209 (0.228959) |
| read 50000 | 4.602771 / 2.077655 (2.525117) |
| read_batch 50000 10 | 2.139921 / 1.504120 (0.635801) |
| read_batch 50000 100 | 1.942112 / 1.541195 (0.400917) |
| read_batch 50000 1000 | 1.890469 / 1.468490 (0.421978) |
| read_formatted numpy 5000 | 6.838058 / 4.584777 (2.253281) |
| read_formatted pandas 5000 | 6.033525 / 3.745712 (2.287813) |
| read_formatted tensorflow 5000 | 8.614037 / 5.269862 (3.344175) |
| read_formatted torch 5000 | 7.502540 / 4.565676 (2.936864) |
| read_formatted_batch numpy 5000 10 | 0.734435 / 0.424275 (0.310159) |
| read_formatted_batch numpy 5000 1000 | 0.011302 / 0.007607 (0.003695) |
| shuffled read 5000 | 0.633966 / 0.226044 (0.407921) |
| shuffled read 50000 | 6.196067 / 2.268929 (3.927139) |
| shuffled read_batch 50000 10 | 2.785862 / 55.444624 (-52.658763) |
| shuffled read_batch 50000 100 | 2.143996 / 6.876477 (-4.732481) |
| shuffled read_batch 50000 1000 | 2.213010 / 2.142072 (0.070938) |
| shuffled read_formatted numpy 5000 | 7.202346 / 4.805227 (2.397119) |
| shuffled read_formatted_batch numpy 5000 10 | 5.467262 / 6.500664 (-1.033402) |
| shuffled read_formatted_batch numpy 5000 1000 | 4.798703 / 0.075469 (4.723234) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.735949 / 1.841788 (8.894161) |
| map fast-tokenizer batched | 12.408478 / 8.074308 (4.334170) |
| map identity | 35.142762 / 10.191392 (24.951370) |
| map identity batched | 0.701181 / 0.680424 (0.020757) |
| map no-op batched | 0.571423 / 0.534201 (0.037222) |
| map no-op batched numpy | 0.767021 / 0.579283 (0.187738) |
| map no-op batched pandas | 0.621249 / 0.434364 (0.186886) |
| map no-op batched pytorch | 0.711086 / 0.540337 (0.170748) |
| map no-op batched tensorflow | 1.480574 / 1.386936 (0.093638) |
