
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, I had the opportunity to build and optimize an Azure ML pipeline using the Python SDK and a custom Scikit-learn Logistic Regression model. I optimized the hyperparameters of this model using HyperDrive. Then, I used Azure AutoML to find an optimal model on the same dataset, so that I could compare the results of the two methods.

Below you can see an image illustrating the main steps I followed during the project.

Diagram

Step 1: Set up the training script, create a TabularDataset from the provided data, and evaluate it with the custom-coded Scikit-learn logistic regression model.

Step 2: Create a Jupyter Notebook and use HyperDrive to find the best hyperparameters for the logistic regression model.

Step 3: Load the same dataset in the Notebook with TabularDatasetFactory and use AutoML to find another optimized model.

Step 4: Compare the results of the two methods and write a research report, i.e. this README file.
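
As an illustration of Steps 1 and 3, the dataset can be loaded as a TabularDataset with TabularDatasetFactory. This is a minimal sketch; the URL placeholder and variable names are assumptions, not the exact values used in the project.

from azureml.data.dataset_factory import TabularDatasetFactory

# Placeholder for the location of the bank marketing CSV used in the project
data_path = "<URL or path to the bank marketing CSV>"
ds = TabularDatasetFactory.from_delimited_files(path=data_path)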


Summary

This dataset contains marketing data about individuals. The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a bank term deposit (column y).

The best performing model was the HyperDrive model with ID HD_fda34223-a94c-456b-8bf7-52e84aa1d17e_14. It was produced by the Scikit-learn pipeline and had an accuracy of 0.91760. In contrast, the AutoML model with ID AutoML_ee4a685e-34f2-4031-a4f9-fe96ff33836c_13 had an accuracy of 0.91618, and the algorithm it used was VotingEnsemble.

Scikit-learn Pipeline

In this section, I explain the pipeline architecture, including the data, the hyperparameter tuning and the classification algorithm.

Parameter sampler

I specified the parameter sampler as such:

from azureml.train.hyperdrive import RandomParameterSampling, choice

# Discrete search space for the inverse regularization strength (C)
# and the maximum number of solver iterations (max_iter)
ps = RandomParameterSampling(
    {
        '--C' : choice(0.001, 0.01, 0.1, 1, 10, 20, 50, 100, 200, 500, 1000),
        '--max_iter': choice(50, 100, 200, 300)
    }
)

I chose discrete values with choice for both parameters, C and max_iter.

C is the inverse of the regularization strength (smaller values mean stronger regularization), while max_iter is the maximum number of iterations allowed for the solver to converge.
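
These two hyperparameters map directly to the arguments of the Scikit-learn estimator trained in train.py. A minimal sketch, assuming the parsed command-line arguments are available as args and the split data as x_train and y_train:

from sklearn.linear_model import LogisticRegression

# Train the logistic regression model with the sampled hyperparameters
model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)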

RandomParameterSampling is one of the available samplers, and I chose it because it is the fastest option and supports early termination of low-performing runs. If budget were not an issue, we could use GridParameterSampling to exhaustively search over the search space, or BayesianParameterSampling to explore the hyperparameter space more intelligently.
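
For comparison, a minimal sketch of the two alternative samplers over the same search space (note that Bayesian sampling does not support early termination policies):

from azureml.train.hyperdrive import GridParameterSampling, BayesianParameterSampling, choice

# Same discrete search space as above
search_space = {
    '--C' : choice(0.001, 0.01, 0.1, 1, 10, 20, 50, 100, 200, 500, 1000),
    '--max_iter': choice(50, 100, 200, 300)
}

grid_ps = GridParameterSampling(search_space)     # exhaustive search over the discrete grid
bayes_ps = BayesianParameterSampling(search_space)  # picks new samples based on previous results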

Early stopping policy

An early stopping policy is used to automatically terminate poorly performing runs thus improving computational efficiency. I chose the BanditPolicy which I specified as follows:

from azureml.train.hyperdrive import BanditPolicy
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

evaluation_interval: This is optional and represents the frequency for applying the policy. Each time the training script logs the primary metric counts as one interval.

slack_factor: The amount of slack allowed with respect to the best performing training run. This factor specifies the slack as a ratio.

Any run whose primary metric does not fall within the slack factor (or slack amount) of the best performing run is terminated. This means that, with this policy, the best performing runs will execute until they finish, which is the reason I chose it.
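
Concretely, with accuracy as the primary metric and slack_factor=0.1, a run is terminated when its reported accuracy falls below best_so_far / (1 + 0.1); for example, if the best run so far reports 0.91, any run below roughly 0.83 at the same evaluation point is stopped. The sampler and the policy are then passed to the HyperDrive configuration. The sketch below is an assumption of how that wiring looks; est stands for the estimator/ScriptRunConfig that runs train.py, and the metric name must match what the training script logs:

from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

# Assumed wiring of the sampler and early stopping policy into HyperDrive
hyperdrive_config = HyperDriveConfig(
    estimator=est,
    hyperparameter_sampling=ps,
    policy=policy,
    primary_metric_name='Accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4
)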

AutoML

Model and hyperparameters generated by AutoML.

I defined the following configuration for the AutoML run:

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    compute_target=compute_target,
    experiment_timeout_minutes=15,
    task='classification',
    primary_metric='accuracy',
    training_data=ds,
    label_column_name='y',
    enable_onnx_compatible_models=True,
    n_cross_validations=2)

experiment_timeout_minutes=15

This is an exit criterion used to define how long, in minutes, the experiment should continue to run. To help avoid experiment timeout failures, I used the minimum of 15 minutes.

task='classification'

This defines the experiment type which in this case is classification.

primary_metric='accuracy'

I chose accuracy as the primary metric.

enable_onnx_compatible_models=True

I chose to enable enforcing ONNX-compatible models. The Open Neural Network Exchange (ONNX) is an open standard, created by Microsoft and a community of partners, for representing machine learning models. More info here.

n_cross_validations=2

This parameter sets how many cross-validations to perform, based on the same number of folds (subsets of the training data). Since a single validation split could result in overfitting, I chose 2 folds for cross-validation; the reported metrics are therefore the average of the 2 validation metrics.
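
After defining this configuration, the experiment is submitted and the best model retrieved. The sketch below is a minimal outline, assuming an Experiment object named exp; because ONNX compatibility is enabled, the best model can also be saved in ONNX format:

from azureml.automl.runtime.onnx_convert import OnnxConverter

# Submit the AutoML experiment and wait for it to finish
automl_run = exp.submit(automl_config, show_output=True)
automl_run.wait_for_completion()

# Retrieve the best run and its fitted model
best_run, fitted_model = automl_run.get_output()

# Retrieve and save the ONNX representation of the best model
best_run_onnx, onnx_model = automl_run.get_output(return_onnx_model=True)
OnnxConverter.save_onnx_model(onnx_model, file_path='./best_automl_model.onnx')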


Pipeline comparison

Comparison of the two models and their performance: differences in accuracy and architecture.

HyperDrive Model
    ID            HD_fda34223-a94c-456b-8bf7-52e84aa1d17e_14
    Accuracy      0.9176024279210926

AutoML Model
    ID            AutoML_ee4a685e-34f2-4031-a4f9-fe96ff33836c_13
    Accuracy      0.916176024279211
    AUC_weighted  0.9469939634729121
    Algorithm     VotingEnsemble

The difference in accuracy between the two models is rather small. Although the HyperDrive model performed slightly better in terms of accuracy, I consider the AutoML model to be the better one because of its AUC_weighted metric of 0.9469939634729121, which is better suited to the highly imbalanced data we have here. With more time to run, AutoML would likely produce an even better model, and it would handle all the necessary computations, trainings, validations, etc. without requiring manual intervention. This is the key difference from the Scikit-learn Logistic Regression pipeline, where we have to make every adjustment ourselves and arrive at a final model after much trial and error.


Future work

Some areas of improvement for future experiments and how these improvements might help the model

  • Our data is highly imbalanced:

Highly imbalanced data

Class imbalance is a very common issue in classification problems in machine learning. Imbalanced data negatively impacts the model's accuracy, because it is easy for the model to appear very accurate simply by predicting the majority class, while its accuracy on the minority class can be very poor. This means that relying on a simple metric like accuracy to judge how good our model is can be misleading.

There are many ways to deal with imbalanced data. These include using:

  1. A different metric; for example, AUC_weighted, which is better suited to imbalanced data
  2. A different algorithm
  3. Random under-sampling of the majority class
  4. Random over-sampling of the minority class
  5. The imbalanced-learn package (see the sketch below)

There are many other methods as well, but I will not go into much detail here as it is out of scope.

In conclusion, the high data imbalance is something that can be addressed in a future run, leading to a clear improvement of the model.
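
As an illustration of options 4 and 5 above, a minimal sketch of random over-sampling with the imbalanced-learn package; x and y here are assumed to be the cleaned feature dataframe and label series from train.py:

from imblearn.over_sampling import RandomOverSampler

# Over-sample the minority class so both classes are equally represented
ros = RandomOverSampler(random_state=42)
x_resampled, y_resampled = ros.fit_resample(x, y)
# x_resampled and y_resampled would then replace x and y during training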

  • Another factor that I would improve is n_cross_validations. As cross-validation is the process of taking many subsets of the full training data and training a model on each subset, a higher number of cross-validations makes the reported metrics more reliable, since they depend less on any single split. However, a higher number also raises computation time (and therefore cost), so there must be a balance between the two factors.

    Note: If I increased n_cross_validations, I would also have to increase experiment_timeout_minutes, as the current setting of 15 minutes would not be enough.
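
A sketch of how those two settings might be adjusted together in a future run; the values below are illustrative assumptions, not tested settings, and the primary metric is switched to AUC_weighted in line with the imbalance discussion above:

# Illustrative future configuration (assumed values)
automl_config = AutoMLConfig(
    compute_target=compute_target,
    experiment_timeout_minutes=60,    # assumed: must grow along with n_cross_validations
    task='classification',
    primary_metric='AUC_weighted',    # better suited to the imbalanced data
    training_data=ds,
    label_column_name='y',
    enable_onnx_compatible_models=True,
    n_cross_validations=5)            # assumed value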


Proof of cluster clean up

Cluster Clean Up
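
For reference, the clean-up itself amounts to deleting the compute cluster once all runs have finished. A minimal sketch, assuming the cluster object is named compute_target:

# Delete the AmlCompute cluster to stop incurring costs
compute_target.delete()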


Citation

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014


References