Commit 989511c
Merge pull request Azure#40 from rastala/master
Update automl readme
2 parents: 2bdd131 + d5c247b

File tree: 1 file changed

automl/README.md: 75 additions & 45 deletions
@@ -1,24 +1,52 @@
 # Table of Contents
-1. [Auto ML Introduction](#introduction)
-2. [Running samples in a Local Conda environment](#localconda)
-3. [Auto ML SDK Sample Notebooks](#samples)
-4. [Documentation](#documentation)
-5. [Running using python command](#pythoncommand)
-6. [Troubleshooting](#troubleshooting)
+1. [Automated ML Introduction](#introduction)
+1. [Running samples in Azure Notebooks](#jupyter)
+1. [Running samples in a Local Conda environment](#localconda)
+1. [Automated ML SDK Sample Notebooks](#samples)
+1. [Documentation](#documentation)
+1. [Running using python command](#pythoncommand)
+1. [Troubleshooting](#troubleshooting)
+
+<a name="introduction"></a>
+# Automated ML introduction
+Automated machine learning (automated ML) builds high-quality machine learning models for you by automating model and hyperparameter selection. Bring a labelled dataset that you want to build a model for, and automated ML will give you a high-quality machine learning model that you can use for predictions.
 
-# Auto ML Introduction <a name="introduction"></a>
-AutoML builds high quality Machine Learning models for you by automating model and hyperparameter selection. Bring a labelled dataset that you want to build a model for, AutoML will give you a high quality machine learning model that you can use for predictions.
 
 If you are new to data science, AutoML will help you get jumpstarted by simplifying machine learning model building: it takes care of model and hyperparameter selection and, in one step, creates a high-quality trained model for you to use.
 
 If you are an experienced data scientist, AutoML will increase your productivity by intelligently performing model and hyperparameter selection for your training, generating high-quality models much more quickly than manually specifying several parameter combinations and running training jobs. AutoML provides visibility and access to all the training jobs and the performance characteristics of the models, to help you further tune the pipeline if you desire.
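For orientation, here is a minimal sketch of what driving an automated ML run looks like in code. Nothing below is part of this diff; the `AutoMLConfig` class and its parameters are assumptions based on the `azureml.train.automl` package these notebooks use, and names may differ across SDK versions:

```python
# Hedged sketch: submit an automated ML classification run. AutoMLConfig and
# its parameters are assumed from the azureml.train.automl package; names may
# differ across SDK versions.
from azureml.core import Experiment, Workspace
from azureml.train.automl import AutoMLConfig
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)     # small labelled demo dataset

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             iterations=10,
                             n_cross_validations=5,
                             X=X,
                             y=y)

ws = Workspace.from_config()            # connection info from 00.configuration.ipynb
experiment = Experiment(ws, 'automl-readme-demo')
run = experiment.submit(automl_config, show_output=True)
```

`Workspace.from_config()` picks up the connection details written by the 00.configuration notebook described below.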
 
+<a name="jupyter"></a>
+## Running samples in Azure Notebooks - Jupyter based notebooks in the Azure cloud
+
+1. [![Azure Notebooks](https://notebooks.azure.com/launch.png)](https://aka.ms/aml-clone-azure-notebooks)
+   [Import the sample notebooks](https://aka.ms/aml-clone-azure-notebooks) into Azure Notebooks.
+1. Follow the instructions in the [00.configuration](00.configuration.ipynb) notebook to create and connect to a workspace.
+1. Open one of the sample notebooks.
+
+**Make sure the Azure Notebook kernel is set to `Python 3.6`** when you open a notebook.
+
+![set kernel to Python 3.6](../images/python36.png)
 
-# Running samples in a Local Conda environment <a name="localconda"></a>
+<a name="localconda"></a>
+## Running samples in a Local Conda environment
 
-You can run these notebooks in Azure Notebooks without any extra installation. To run these notebook on your own notebook server, use these installation instructions.
+To run these notebooks on your own notebook server, use these installation instructions.
+
+The instructions below will install everything you need and then start a Jupyter notebook. To start your Jupyter notebook manually, use:
+
+```
+conda activate azure_automl
+jupyter notebook
+```
+
+or on Mac:
+
+```
+source activate azure_automl
+jupyter notebook
+```
 
-It is best if you create a new conda environment locally to try this SDK, so it doesn't mess up with your existing Python environment.
 
 ### 1. Install mini-conda from [here](https://conda.io/miniconda.html), choose Python 3.7 or higher.
 - **Note**: if you already have conda installed, you can keep using it but it should be version 4.4.10 or later (as shown by: conda -V). If you have a previous version installed, you can update it using the command: conda update conda.
@@ -48,19 +76,19 @@ bash automl_setup_mac.sh
 cd to the **automl** folder where the sample notebooks were extracted and then run:
 
 ```
-automl_setup_linux.sh
+bash automl_setup_linux.sh
 ```
 
 ### 4. Running configuration.ipynb
 - Before running any samples, you next need to run the configuration notebook. Click on the 00.configuration.ipynb notebook.
-- Please make sure you use the Python [conda env:azure_automl] kernel when running this notebook.
 - Execute the cells in the notebook to register the Machine Learning Services resource provider and create a workspace. (*instructions in notebook*)
 
 ### 5. Running Samples
 - Please make sure you use the Python [conda env:azure_automl] kernel when trying the sample notebooks.
 - Follow the instructions in the individual notebooks to explore the various features in AutoML.
 
-# Auto ML SDK Sample Notebooks <a name="samples"></a>
+<a name="samples"></a>
+# Automated ML SDK Sample Notebooks
 - [00.configuration.ipynb](00.configuration.ipynb)
   - Register Machine Learning Services Resource Provider
   - Create new Azure ML Workspace
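Since the other samples depend on 00.configuration.ipynb, a hedged sketch of the workspace setup it performs may help; the `Workspace` API is assumed from `azureml.core`, and every name and ID below is a placeholder:

```python
# Hedged sketch of the workspace setup performed by 00.configuration.ipynb.
# Workspace.create/write_config are assumed from azureml.core; all names and
# IDs are placeholders, not values from this repo.
from azureml.core import Workspace

ws = Workspace.create(name='myworkspace',
                      subscription_id='<subscription-id>',
                      resource_group='myresourcegroup',
                      location='eastus2',
                      exist_ok=True)   # reuse the workspace if it already exists
ws.write_config()                      # cache connection info for Workspace.from_config()
```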
@@ -87,7 +115,7 @@ automl_setup_linux.sh
 
 - [03b.auto-ml-remote-batchai.ipynb](03b.auto-ml-remote-batchai.ipynb)
   - Dataset: scikit-learn's [digit dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits)
-  - Example of using Auto ML for classification using a remote Batch AI compute for training
+  - Example of using automated ML for classification with a remote Batch AI compute for training
   - Parallel execution of iterations
   - Async tracking of progress
   - Cancelling individual iterations or entire run
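For reference, the digit dataset cited in this hunk ships with scikit-learn and loads in one call:

```python
# Load the digits dataset used by the 03b sample (bundled with scikit-learn).
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(X.shape, y.shape)   # (1797, 64) (1797,)
```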
@@ -143,20 +171,17 @@ automl_setup_linux.sh
 - [13.auto-ml-dataprep.ipynb](13.auto-ml-dataprep.ipynb)
   - Using DataPrep for reading data
 
-- [14a.auto-ml-classification-ensemble.ipynb](14a.auto-ml-classification-ensemble.ipynb)
-  - Classification with ensembling
-
-- [14b.auto-ml-regression-ensemble.ipynb](14b.auto-ml-regression-ensemble.ipynb)
-  - Regression with ensembling
-
-# Documentation <a name="documentation"></a>
+<a name="documentation"></a>
+# Documentation
 ## Table of Contents
-1. [Auto ML Settings ](#automlsettings)
-2. [Cross validation split options](#cvsplits)
-3. [Get Data Syntax](#getdata)
-4. [Data pre-processing and featurization](#preprocessing)
+1. [Automated ML Settings](#automlsettings)
+1. [Cross validation split options](#cvsplits)
+1. [Get Data Syntax](#getdata)
+1. [Data pre-processing and featurization](#preprocessing)
+
+<a name="automlsettings"></a>
+## Automated ML Settings
 
-## Auto ML Settings <a name="automlsettings"></a>
 |Property|Description|Default|
 |-|-|-|
 |**primary_metric**|This is the metric that you want to optimize.<br><br> Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i><br><br> Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i><br><i>normalized_root_mean_squared_log_error</i>|Classification: accuracy <br><br> Regression: spearman_correlation|
@@ -170,7 +195,8 @@ automl_setup_linux.sh
 |**exit_score**|*double* value indicating the target for *primary_metric*.<br>Once the target is surpassed, the run terminates.|None|
 |**blacklist_algos**|*Array* of *strings* indicating pipelines to ignore for Auto ML.<br><br> Allowed values for **Classification**<br><i>LogisticRegression</i><br><i>SGDClassifierWrapper</i><br><i>NBWrapper</i><br><i>BernoulliNB</i><br><i>SVCWrapper</i><br><i>LinearSVMWrapper</i><br><i>KNeighborsClassifier</i><br><i>DecisionTreeClassifier</i><br><i>RandomForestClassifier</i><br><i>ExtraTreesClassifier</i><br><i>gradient boosting</i><br><i>LightGBMClassifier</i><br><br>Allowed values for **Regression**<br><i>ElasticNet</i><br><i>GradientBoostingRegressor</i><br><i>DecisionTreeRegressor</i><br><i>KNeighborsRegressor</i><br><i>LassoLars</i><br><i>SGDRegressor</i><br><i>RandomForestRegressor</i><br><i>ExtraTreesRegressor</i>|None|
 
-## Cross validation split options <a name="cvsplits"></a>
+<a name="cvsplits"></a>
+## Cross validation split options
 ### K-Folds Cross Validation
 Use the *n_cross_validations* setting to specify the number of cross validations. The training data set will be randomly split into *n_cross_validations* folds of equal size. During each cross validation round, one of the folds will be used for validation of the model trained on the remaining folds. This process repeats for *n_cross_validations* rounds until each fold is used once as the validation set. Finally, the average scores across all *n_cross_validations* rounds will be reported, and the corresponding model will be retrained on the whole training data set.
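Tying the settings table and the K-folds option together, a hedged sketch of passing these settings; the `AutoMLConfig` constructor is assumed from `azureml.train.automl` and the values are illustrative only:

```python
# Hedged sketch: settings from the table above, plus K-folds cross validation.
# AutoMLConfig is assumed from azureml.train.automl; values are illustrative.
import numpy as np
from azureml.train.automl import AutoMLConfig

X_train = np.random.rand(100, 5)   # stand-in training data
y_train = np.random.rand(100)

config = AutoMLConfig(task='regression',
                      primary_metric='spearman_correlation',
                      iterations=20,
                      n_cross_validations=5,          # K-folds CV described above
                      exit_score=0.95,                # terminate once surpassed
                      blacklist_algos=['LassoLars'],  # pipelines to ignore
                      X=X_train,
                      y=y_train)
```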
 
@@ -180,7 +206,8 @@ Use *validation_size* to specify the percentage of the training data set that sh
 ### Custom train and validation set
 You can specify separate train and validation sets either through get_data() or directly to the fit method.
 
-## get_data() syntax <a name="getdata"></a>
+<a name="getdata"></a>
+## get_data() syntax
 The *get_data()* function can be used to return a dictionary with these values:
 
 |Key|Type|Dependency|Mutually Exclusive with|Description|
@@ -196,21 +223,23 @@ The *get_data()* function can be used to return a dictionary with these values:
 |columns|Array of strings|data_train||*Optional* Whitelist of columns to use for features|
 |cv_splits_indices|Array of integers|data_train||*Optional* List of indexes to split the data for cross validation|
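A hedged sketch of a get_data() script built from the key table above; the file name and column name are hypothetical placeholders:

```python
# Hedged sketch of a get_data() script, using the X/y keys from the table above.
# "train.csv" and the "label" column are hypothetical placeholders.
import pandas as pd

def get_data():
    df = pd.read_csv('train.csv')
    X = df.drop(columns=['label']).values   # feature matrix
    y = df['label'].values                  # target vector
    return {'X': X, 'y': y}
```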
 
-## Data pre-processing and featurization <a name="preprocessing"></a>
-If you use "preprocess=True", the following data preprocessing steps are performed automatically for you:
-### 1. Dropping high cardinality or no variance features
-- Features with no useful information are dropped from training and validation sets. These include features with all values missing, same value across all rows or with extremely high cardinality (e.g., hashes, IDs or GUIDs).
-### 2. Missing value imputation
-- For numerical features, missing values are imputed with average of values in the column.
-- For categorical features, missing values are imputed with most frequent value.
-### 3. Generating additional features
-- For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.
-- For Text features: Term frequency based on bi-grams and tri-grams, Count vectorizer.
-### 4. Transformations and encodings
-- Numeric features with very few unique values are transformed into categorical features.
-- Depending on cardinality of categorical features label encoding or (hashing) one-hot encoding is performed.
-
-# Running using python command <a name="pythoncommand"></a>
+<a name="preprocessing"></a>
+## Data pre-processing and featurization
+If you use `preprocess=True`, the following data preprocessing steps are performed automatically for you (a short sketch follows this list):
+
+1. Dropping high-cardinality or no-variance features
+   - Features with no useful information are dropped from training and validation sets. These include features with all values missing, the same value across all rows, or extremely high cardinality (e.g., hashes, IDs, or GUIDs).
+2. Missing value imputation
+   - For numerical features, missing values are imputed with the average of the values in the column.
+   - For categorical features, missing values are imputed with the most frequent value.
+3. Generating additional features
+   - For DateTime features: year, month, day, day of week, day of year, quarter, week of the year, hour, minute, second.
+   - For text features: term frequency based on bi-grams and tri-grams, count vectorizer.
+4. Transformations and encodings
+   - Numeric features with very few unique values are transformed into categorical features.
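As referenced above, a hedged sketch of enabling these steps; the `preprocess` flag is assumed to be an `AutoMLConfig` parameter in this SDK:

```python
# Hedged sketch: turn on the automatic preprocessing steps listed above.
# The preprocess flag is assumed from the era's AutoMLConfig.
import numpy as np
from azureml.train.automl import AutoMLConfig

X = np.random.rand(200, 10)          # stand-in features
y = np.random.randint(2, size=200)   # stand-in binary labels

config = AutoMLConfig(task='classification',
                      preprocess=True,   # drop, impute, featurize, encode as listed
                      iterations=10,
                      X=X,
                      y=y)
```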
+
+<a name="pythoncommand"></a>
+# Running using python command
 Jupyter notebook provides a File / Download as / Python (.py) option for saving the notebook as a Python file.
 You can then run this file using the python command.
 However, on Windows the file needs to be modified before it can be run.
@@ -220,7 +249,8 @@ The following condition must be added to the main code in the file:
 
 The main code of the file must be indented so that it is under this condition.
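The condition itself is elided from this hunk's context. Assuming it is Python's standard entry-point guard (which Windows needs so that worker processes spawned during training can re-import the file safely), the modified file would look like:

```python
# Sketch of the modified file layout described above. Treating the elided
# "condition" as Python's standard entry-point guard is an assumption.
def main():
    # the converted notebook's code goes here, indented under the guard
    print('running automated ML training')

if __name__ == '__main__':
    main()
```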
 
-# Troubleshooting <a name="troubleshooting"></a>
+<a name="troubleshooting"></a>
+# Troubleshooting
 ## Iterations fail and the log contains "MemoryError"
 This can be caused by insufficient memory on the DSVM. AutoML loads all training data into memory, so the available memory should be larger than the training data size.
 If you are using a remote DSVM, memory is needed for each concurrent iteration. The concurrent_iterations setting specifies the maximum number of concurrent iterations. For example, if the training data size is 8 GB and concurrent_iterations is set to 10, at least 80 GB of memory is required.
