
Commit c0765de

Initial support of user-provided datasets (#164)
1 parent d8ad679 commit c0765de

4 files changed: +67 -15 lines


README.md (+1 -1)
```diff
@@ -100,6 +100,6 @@ flowchart TB
 - [Benchmarks Runner](sklbench/runner/README.md)
 - [Report Generator](sklbench/report/README.md)
 - [Benchmarks](sklbench/benchmarks/README.md)
-- [Data Processing](sklbench/datasets/README.md)
+- [Data Processing and Storage](sklbench/datasets/README.md)
 - [Emulators](sklbench/emulators/README.md)
 - [Developer Guide](docs/README.md)
```

sklbench/datasets/README.md (+41 -10)
````diff
@@ -1,4 +1,4 @@
-# Data Handling in Benchmarks
+# Data Processing and Storage in Benchmarks
 
 Data handling steps:
 1. Load data:
@@ -7,6 +7,14 @@ Data handling steps:
 2. Split data into subsets if requested
 3. Convert to requested form (data type, format, order, etc.)
 
+Existing data sources:
+- Synthetic data from sklearn
+- OpenML datasets
+- Custom loaders for named datasets
+- User-provided datasets in compatible format
+
+## Data Caching
+
 There are two levels of caching with corresponding directories: `raw cache` for files downloaded from external sources, and just `cache` for files applicable for fast-loading in benchmarks.
 
 Each dataset has few associated files in usual `cache`: data component files (`x`, `y`, `weights`, etc.) and JSON file with dataset properties (number of classes, clusters, default split arguments).
@@ -21,16 +29,39 @@ data_cache/
 ```
 
 Cached file formats:
-| Format | File extension | Associated Python types |
-| --- | --- | --- |
-| [Parquet](https://parquet.apache.org) | `.parq` | pandas.DataFrame |
-| Numpy uncompressed binary dense data | `.npz` | numpy.ndarray, pandas.Series |
-| Numpy uncompressed binary CSR data | `.csr.npz` | scipy.sparse.csr_matrix |
+| Format | File extension | Associated Python types | Comment |
+| --- | --- | --- | --- |
+| [Parquet](https://parquet.apache.org) | `.parq` | pandas.DataFrame | |
+| Numpy uncompressed binary dense data | `.npz` | numpy.ndarray, pandas.Series | Data is stored under the `arr_0` name |
+| Numpy uncompressed binary CSR data | `.csr.npz` | scipy.sparse.csr_matrix | Data is stored under the `data`, `indices` and `indptr` names |
 
-Existing data sources:
-- Synthetic data from sklearn
-- OpenML datasets
-- Custom loaders for named datasets
+## How to Modify a Dataset for Compatibility with Scikit-learn_bench
+
+To reuse an existing dataset in scikit-learn_bench, convert its file(s) into a format compatible with the dataset cache loader.
+
+A cached dataset consists of a few files:
+- a `{dataset name}.json` file which stores required and optional dataset information
+- `{dataset name}_{data component name}.{data component extension}` files which store dataset components (data, labels, etc.)
+
+Example of `{dataset name}.json`:
+```json
+{"n_classes": 2, "default_split": {"test_size": 0.2, "random_state": 11}}
+```
+
+The `n_classes` property in a dataset info file is *required* for classification datasets.
+
+Currently, `x` (data) and `y` (labels) are the only supported and *required* data components.
+
+A scikit-learn_bench-compatible dataset should be stored in `data:cache_directory` (`${PWD}/data_cache` or `{repository root}/data_cache` by default).
+
+You can specify the created compatible dataset in config files by name, the same way as datasets explicitly registered in scikit-learn_bench:
+```json
+{
+    "data": {
+        "dataset": "{dataset name}"
+    }
+}
+```
 
 ---
 [Documentation tree](../../README.md#-documentation)
````
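The new README text above pins down everything needed to hand-build a compatible dataset. As a minimal sketch (the dataset name and the synthetic arrays are hypothetical; the file naming and the `arr_0` convention follow the table and text in the diff):

```python
import json
import os

import numpy as np
from sklearn.datasets import make_classification

data_cache = "data_cache"    # default `data:cache_directory`
dataset_name = "my_dataset"  # hypothetical name, referenced later in configs

os.makedirs(data_cache, exist_ok=True)

# Any binary classification data works here; synthetic data keeps it self-contained.
x, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Dense components are cached as uncompressed `.npz`; numpy.savez stores a
# single positional array under the `arr_0` name, matching the formats table.
np.savez(os.path.join(data_cache, f"{dataset_name}_x.npz"), x)
np.savez(os.path.join(data_cache, f"{dataset_name}_y.npz"), y)

# Dataset info file; `n_classes` is *required* for classification datasets.
info = {"n_classes": 2, "default_split": {"test_size": 0.2, "random_state": 11}}
with open(os.path.join(data_cache, f"{dataset_name}.json"), "w") as f:
    json.dump(info, f)
```

A config can then reference the dataset by name, `{"data": {"dataset": "my_dataset"}}`, exactly as in the JSON snippet above.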

sklbench/datasets/__init__.py (+12 -3)
```diff
@@ -22,6 +22,7 @@
 from ..utils.custom_types import BenchCase
 from .loaders import (
     dataset_loading_functions,
+    load_custom_data,
     load_openml_data,
     load_sklearn_synthetic_data,
 )
@@ -47,9 +48,17 @@ def load_data(bench_case: BenchCase) -> Tuple[Dict, Dict]:
     dataset = get_bench_case_value(bench_case, "data:dataset")
     if dataset is not None:
         dataset_params = get_bench_case_value(bench_case, "data:dataset_kwargs", dict())
-        return dataset_loading_functions[dataset](
-            **common_kwargs, preproc_kwargs=preproc_kwargs, dataset_params=dataset_params
-        )
+        if dataset in dataset_loading_functions:
+            # registered dataset loading branch
+            return dataset_loading_functions[dataset](
+                **common_kwargs,
+                preproc_kwargs=preproc_kwargs,
+                dataset_params=dataset_params,
+            )
+        else:
+            # user-provided dataset loading branch
+            return load_custom_data(**common_kwargs, preproc_kwargs=preproc_kwargs)
+
     # load by source
     source = get_bench_case_value(bench_case, "data:source")
     if source is not None:
```
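The behavioral change in this hunk: a dataset name missing from `dataset_loading_functions` no longer raises a `KeyError`; it falls through to `load_custom_data`, which reads that name from the data cache. A small hypothetical probe of the dispatch (the dataset name is made up):

```python
from sklbench.datasets.loaders import dataset_loading_functions

dataset = "my_dataset"  # hypothetical, not in the registry
if dataset in dataset_loading_functions:
    print(f"'{dataset}': registered loader, dataset_params supported")
else:
    print(f"'{dataset}': loaded from cache files via load_custom_data")
```

Note that the user-provided branch does not forward `dataset_params`, so `data:dataset_kwargs` is effectively ignored for custom datasets.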

sklbench/datasets/loaders.py (+13 -1)
```diff
@@ -29,7 +29,7 @@
     make_regression,
 )
 
-from .common import cache, preprocess
+from .common import cache, load_data_description, load_data_from_cache, preprocess
 from .downloaders import (
     download_and_read_csv,
     download_kaggle_files,
@@ -84,6 +84,18 @@ def load_sklearn_synthetic_data(
     return {"x": x, "y": y}, data_desc
 
 
+@preprocess
+def load_custom_data(
+    data_name: str,
+    data_cache: str,
+    raw_data_cache: str,
+):
+    """Function to load data specified by user and stored in format compatible with scikit-learn_bench cache"""
+    return load_data_from_cache(data_cache, data_name), load_data_description(
+        data_cache, data_name
+    )
+
+
 """
 Classification datasets
 """
```
