Commit 742f1f1

Ravin Kohli: [ADD] Subsampling Dataset (#398)
Authored and committed by Github Actions
1 parent f002c93 · commit 742f1f1

34 files changed: +490 -352 lines

development/_modules/autoPyTorch/api/tabular_classification.html

Lines changed: 24 additions & 10 deletions
@@ -125,7 +125,8 @@ Source code for autoPyTorch.api.tabular_classification
 )
 from autoPyTorch.data.tabular_validator import TabularInputValidator
 from autoPyTorch.data.utils import (
-    get_dataset_compression_mapping
+    DatasetCompressionSpec,
+    get_dataset_compression_mapping,
 )
 from autoPyTorch.datasets.base_dataset import BaseDatasetPropertiesType
 from autoPyTorch.datasets.resampling_strategy import (
@@ -279,7 +280,7 @@ Source code for autoPyTorch.api.tabular_classification
         resampling_strategy: Optional[ResamplingStrategies] = None,
         resampling_strategy_args: Optional[Dict[str, Any]] = None,
         dataset_name: Optional[str] = None,
-        dataset_compression: Optional[Mapping[str, Any]] = None,
+        dataset_compression: Optional[DatasetCompressionSpec] = None,
     ) -> Tuple[TabularDataset, TabularInputValidator]:
         """
         Returns an object of `TabularDataset` and an object of
@@ -303,6 +304,10 @@ Source code for autoPyTorch.api.tabular_classification
                 in ```datasets/resampling_strategy.py```.
             dataset_name (Optional[str]):
                 name of the dataset, used as experiment name.
+            dataset_compression (Optional[DatasetCompressionSpec]):
+                specifications for dataset compression. For more info check
+                documentation for `BaseTask.get_dataset`.
+
         Returns:
             TabularDataset:
                 the dataset object.
@@ -509,14 +514,23 @@ Source code for autoPyTorch.api.tabular_classification
                 listed in ``"methods"`` will not be performed.

             **methods**
-            We currently provide the following methods for reducing the dataset size.
-            These can be provided in a list and are performed in the order as given.
-            * ``"precision"`` - We reduce floating point precision as follows:
-                * ``np.float128 -> np.float64``
-                * ``np.float96 -> np.float64``
-                * ``np.float64 -> np.float32``
-                * pandas dataframes are reduced using the downcast option of `pd.to_numeric`
-                  to the lowest possible precision.
+            We currently provide the following methods for reducing the dataset size.
+            These can be provided in a list and are performed in the order as given.
+            * ``"precision"`` -
+                We reduce floating point precision as follows:
+                * ``np.float128 -> np.float64``
+                * ``np.float96 -> np.float64``
+                * ``np.float64 -> np.float32``
+                * pandas dataframes are reduced using the downcast option of `pd.to_numeric`
+                  to the lowest possible precision.
+            * ``subsample`` -
+                We subsample data such that it **fits directly into
+                the memory allocation** ``memory_allocation * memory_limit``.
+                Therefore, this should likely be the last method listed in
+                ``"methods"``.
+                Subsampling takes into account classification labels and stratifies
+                accordingly. We guarantee that at least one occurrence of each
+                label is included in the sampled set.

             Returns:
                 self
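The new ``subsample`` method slots into the ``dataset_compression`` spec documented above: a mapping with a ``memory_allocation`` budget and an ordered ``methods`` list. The sketch below is illustrative only and not part of this commit; the key names and method names follow the docstring, while the dataset, the time limits, the 0.1 budget, and passing the spec straight to ``search()`` are assumptions.

import sklearn.datasets
import sklearn.model_selection

from autoPyTorch.api.tabular_classification import TabularClassificationTask

# Hypothetical end-to-end usage of the compression spec described in the docstring above.
X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

api = TabularClassificationTask()
api.search(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    optimize_metric="accuracy",
    total_walltime_limit=300,
    func_eval_time_limit_secs=50,
    # Reduce floating point precision first, then subsample so the data fits
    # into memory_allocation * memory_limit; per the docstring, subsampling is
    # stratified and keeps at least one occurrence of every label.
    dataset_compression={
        "memory_allocation": 0.1,               # example value, fraction of the memory limit
        "methods": ["precision", "subsample"],  # applied in this order
    },
)

For the ``"precision"`` step on pandas input, ``pd.to_numeric(..., downcast="float")`` lands on ``float32`` when the values fit, which is exactly the "lowest possible precision" behaviour the bullet list above describes.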

development/_modules/autoPyTorch/api/tabular_regression.html

Lines changed: 23 additions & 10 deletions
@@ -125,7 +125,8 @@ Source code for autoPyTorch.api.tabular_regression
 )
 from autoPyTorch.data.tabular_validator import TabularInputValidator
 from autoPyTorch.data.utils import (
-    get_dataset_compression_mapping
+    DatasetCompressionSpec,
+    get_dataset_compression_mapping,
 )
 from autoPyTorch.datasets.base_dataset import BaseDatasetPropertiesType
 from autoPyTorch.datasets.resampling_strategy import (
@@ -280,7 +281,7 @@ Source code for autoPyTorch.api.tabular_regression
         resampling_strategy: Optional[ResamplingStrategies] = None,
         resampling_strategy_args: Optional[Dict[str, Any]] = None,
         dataset_name: Optional[str] = None,
-        dataset_compression: Optional[Mapping[str, Any]] = None,
+        dataset_compression: Optional[DatasetCompressionSpec] = None,
     ) -> Tuple[TabularDataset, TabularInputValidator]:
         """
         Returns an object of `TabularDataset` and an object of
@@ -304,6 +305,9 @@ Source code for autoPyTorch.api.tabular_regression
                 in ```datasets/resampling_strategy.py```.
             dataset_name (Optional[str]):
                 name of the dataset, used as experiment name.
+            dataset_compression (Optional[DatasetCompressionSpec]):
+                specifications for dataset compression. For more info check
+                documentation for `BaseTask.get_dataset`.
         Returns:
             TabularDataset:
                 the dataset object.
@@ -510,14 +514,23 @@ Source code for autoPyTorch.api.tabular_regression
                 listed in ``"methods"`` will not be performed.

             **methods**
-            We currently provide the following methods for reducing the dataset size.
-            These can be provided in a list and are performed in the order as given.
-            * ``"precision"`` - We reduce floating point precision as follows:
-                * ``np.float128 -> np.float64``
-                * ``np.float96 -> np.float64``
-                * ``np.float64 -> np.float32``
-                * pandas dataframes are reduced using the downcast option of `pd.to_numeric`
-                  to the lowest possible precision.
+            We currently provide the following methods for reducing the dataset size.
+            These can be provided in a list and are performed in the order as given.
+            * ``"precision"`` -
+                We reduce floating point precision as follows:
+                * ``np.float128 -> np.float64``
+                * ``np.float96 -> np.float64``
+                * ``np.float64 -> np.float32``
+                * pandas dataframes are reduced using the downcast option of `pd.to_numeric`
+                  to the lowest possible precision.
+            * ``subsample`` -
+                We subsample data such that it **fits directly into
+                the memory allocation** ``memory_allocation * memory_limit``.
+                Therefore, this should likely be the last method listed in
+                ``"methods"``.
+                Subsampling takes into account classification labels and stratifies
+                accordingly. We guarantee that at least one occurrence of each
+                label is included in the sampled set.

             Returns:
                 self
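The regression API receives the same ``subsample`` description as the classification API above. To make the stratification guarantee concrete, here is a minimal, self-contained sketch of label-aware subsampling that always keeps at least one occurrence of every label; it illustrates the documented behaviour and is not autoPyTorch's internal implementation (which additionally stratifies the remaining draw by label).

import numpy as np


def subsample(X: np.ndarray, y: np.ndarray, n_samples: int, seed: int = 0):
    """Toy subsampler: guarantee one row per label, then fill the remaining budget."""
    rng = np.random.default_rng(seed)
    labels = np.unique(y)
    if n_samples < len(labels):
        raise ValueError("n_samples must allow at least one row per label")

    # Keep one randomly chosen occurrence of every label ...
    keep = [rng.choice(np.flatnonzero(y == label)) for label in labels]

    # ... then fill the rest of the budget from the leftover rows. (A uniform
    # draw keeps the sketch short; the library stratifies this part as well.)
    remaining = np.setdiff1d(np.arange(len(y)), keep)
    extra = rng.choice(remaining, size=n_samples - len(keep), replace=False)

    idx = np.concatenate([keep, extra])
    return X[idx], y[idx]


X = np.arange(40, dtype=np.float64).reshape(20, 2)
y = np.array([0] * 18 + [1, 2])     # heavily imbalanced labels
X_small, y_small = subsample(X, y, n_samples=5)
print(np.unique(y_small))           # [0 1 2] -- every label survives the subsample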

development/_sources/examples/20_basics/example_image_classification.rst.txt

Lines changed: 10 additions & 13 deletions
@@ -85,26 +85,23 @@ Image Classification
     Pipeline Random Config:
     ________________________________________
     Configuration(values={
-      'image_augmenter:GaussianBlur:use_augmenter': False,
+      'image_augmenter:GaussianBlur:sigma_min': 1.2329755725391824,
+      'image_augmenter:GaussianBlur:sigma_offset': 2.17995589356565,
+      'image_augmenter:GaussianBlur:use_augmenter': True,
       'image_augmenter:GaussianNoise:use_augmenter': False,
-      'image_augmenter:RandomAffine:rotate': 237,
-      'image_augmenter:RandomAffine:scale_offset': 0.32734086551151986,
-      'image_augmenter:RandomAffine:shear': 43,
-      'image_augmenter:RandomAffine:translate_percent_offset': 0.2150833219469362,
-      'image_augmenter:RandomAffine:use_augmenter': True,
-      'image_augmenter:RandomCutout:p': 0.6425251463645631,
-      'image_augmenter:RandomCutout:use_augmenter': True,
-      'image_augmenter:Resize:use_augmenter': False,
-      'image_augmenter:ZeroPadAndCrop:percent': 0.2638607299100123,
-      'normalizer:__choice__': 'ImageNormalizer',
+      'image_augmenter:RandomAffine:use_augmenter': False,
+      'image_augmenter:RandomCutout:use_augmenter': False,
+      'image_augmenter:Resize:use_augmenter': True,
+      'image_augmenter:ZeroPadAndCrop:percent': 0.33852145254374955,
+      'normalizer:__choice__': 'NoNormalizer',
     })

     Fitting the pipeline...
     ________________________________________
     ImageClassificationPipeline
     ________________________________________
     0-) normalizer:
-        ImageNormalizer
+        NoNormalizer

     1-) preprocessing:
         EarlyPreprocessing
@@ -176,7 +173,7 @@ Image Classification

 .. rst-class:: sphx-glr-timing

-   **Total running time of the script:** ( 0 minutes 8.801 seconds)
+   **Total running time of the script:** ( 0 minutes 6.608 seconds)


 .. _sphx_glr_download_examples_20_basics_example_image_classification.py:

development/_sources/examples/20_basics/example_tabular_classification.rst.txt

Lines changed: 11 additions & 16 deletions
@@ -134,7 +134,7 @@ Search for an ensemble of machine learning algorithms
 .. code-block:: none


-    <autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7fcb49398130>
+    <autoPyTorch.api.tabular_classification.TabularClassificationTask object at 0x7f2712518fd0>


@@ -165,26 +165,21 @@ Print the final ensemble performance

 .. code-block:: none

-    {'accuracy': 0.861271676300578}
+    {'accuracy': 0.8670520231213873}
     |    | Preprocessing | Estimator | Weight |
     |---:|:--------------|:----------|-------:|
-    |  0 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.24 |
-    |  1 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,NoFeaturePreprocessing | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.16 |
-    |  2 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.14 |
-    |  3 | None | CBLearner | 0.1 |
-    |  4 | None | SVMLearner | 0.08 |
-    |  5 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.06 |
-    |  6 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.06 |
-    |  7 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,SRC | embedding,MLPBackbone,FullyConnectedHead,nn.Sequential | 0.04 |
-    |  8 | None | LGBMLearner | 0.04 |
-    |  9 | None | RFLearner | 0.04 |
-    | 10 | None | KNNLearner | 0.04 |
+    |  0 | None | CBLearner | 0.32 |
+    |  1 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.2 |
+    |  2 | SimpleImputer,Variance Threshold,NoCoalescer,OneHotEncoder,StandardScaler,SRC | embedding,MLPBackbone,FullyConnectedHead,nn.Sequential | 0.2 |
+    |  3 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,NoScaler,KitchenSink | embedding,ResNetBackbone,FullyConnectedHead,nn.Sequential | 0.12 |
+    |  4 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.08 |
+    |  5 | SimpleImputer,Variance Threshold,MinorityCoalescer,OneHotEncoder,QuantileTransformer,KitchenSink | no embedding,ShapedMLPBackbone,FullyConnectedHead,nn.Sequential | 0.08 |
     autoPyTorch results:
     Dataset name: Australian
     Optimisation Metric: accuracy
     Best validation score: 0.8713450292397661
-    Number of target algorithm runs: 20
-    Number of successful target algorithm runs: 18
+    Number of target algorithm runs: 21
+    Number of successful target algorithm runs: 19
     Number of crashed target algorithm runs: 0
     Number of target algorithms that exceeded the time limit: 2
     Number of target algorithms that exceeded the memory limit: 0
@@ -196,7 +191,7 @@ Print the final ensemble performance

 .. rst-class:: sphx-glr-timing

-   **Total running time of the script:** ( 5 minutes 32.091 seconds)
+   **Total running time of the script:** ( 5 minutes 20.869 seconds)


 .. _sphx_glr_download_examples_20_basics_example_tabular_classification.py:
