This project demonstrates how to get started with MLflow for machine learning projects, focusing on tracking experiments, logging models, and utilizing the MLflow Model Registry. We will be working with two separate projects:
- Iris Dataset Linear Regression: This project uses the Iris dataset to train a linear regression model and logs the experiment using MLflow.
- House Price Prediction: This project uses the California Housing dataset to train a model (currently a placeholder) and logs the experiment using MLflow.
The following steps outline the MLflow workflow used in this project:
- All MLflow operations are directed to the tracking server.
  ```python
  # Set the tracking URI to store experiment data
  mlflow.set_tracking_uri(uri="http://127.0.0.1:5000")
  ```
- Experiments organize your runs. All runs under a specific experiment will be grouped together in the MLflow UI.
  ```python
  # Create an experiment
  mlflow.set_experiment("MLFLOW Quickstart")
  ```
- Each iteration of training your model (e.g., with different hyperparameters) should be logged as a "run". The `with mlflow.start_run():` block ensures that the run is properly started and ended.

  ```python
  # Start a new run
  with mlflow.start_run():
      ...  # MLflow logging operations
  ```
- Record the parameters used for your model training.
  ```python
  # Log parameters
  mlflow.log_params(params)
  ```
- Log the evaluation metrics of your model, such as accuracy or MSE.
  ```python
  # Log metrics
  mlflow.log_metric("accuracy", accuracy)
  ```
- Add custom tags to your run for better organization and searchability in the MLflow UI.
  ```python
  # Set tags
  mlflow.set_tag("Training Info", "Basic LR model for iris data")
  ```
- The model signature defines the input and output schema of your model, which is useful for deployment and validation.
  ```python
  # Infer signature
  from mlflow.models import infer_signature

  signature = infer_signature(X_train, lr.predict(X_train))
  ```
- Save your trained model to MLflow. This makes it available for later loading and deployment. The `registered_model_name` argument registers the model in the MLflow Model Registry.

  ```python
  # Log the model
  model_info = mlflow.sklearn.log_model(model, "model", signature=signature)
  ```
- You can load a model logged with MLflow for making predictions.
  ```python
  # Load the logged model and make predictions
  loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
  predictions = loaded_model.predict(X_test)
  ```
- The Model Registry provides versioning, aliasing, and centralized management of your models. The `registered_model_name` parameter in `mlflow.sklearn.log_model` automatically registers the model. You can then load models by name and version from the registry.

  ```python
  # Model registry
  model_name = "tracking-quickstart"
  model_version = "latest"  # or a specific version number
  model_uri = f"models:/{model_name}/{model_version}"
  model = mlflow.sklearn.load_model(model_uri)
  ```
```python
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Load the iris dataset
iris = load_iris()
X = iris.data[:, :2]  # we only take the first two features
y = iris.target

# Set up MLflow tracking
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("iris-linear-regression")

# Start a new run
with mlflow.start_run():
    # Log parameters
    mlflow.log_params({"param1": "value1", "param2": "value2"})

    # Train the model
    model = LinearRegression()
    model.fit(X, y)

    # Log metrics (hardcoded placeholder value for demonstration)
    mlflow.log_metric("accuracy", 0.9)

    # Infer the model signature from a sample input/output pair
    model_input = [[1, 2]]
    model_output = model.predict(model_input)
    signature = infer_signature(model_input, model_output)

    # Log the model with its signature
    mlflow.sklearn.log_model(model, "model", signature=signature)

    # Set tags
    mlflow.set_tag("dataset", "iris")
```

The house-price-predict notebook follows a similar workflow, but with the California Housing dataset. The notebook demonstrates hyperparameter tuning with GridSearchCV and logging the best model using MLflow.
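The iris script above logs a hardcoded accuracy of 0.9 rather than a computed metric. A minimal sketch of how a real metric could be computed instead, using scikit-learn's `r2_score` (since `LinearRegression` produces continuous outputs); the train/test split and metric choice here are illustrative assumptions, not part of the original script:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same features as the script above: first two iris columns
iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# Hold out a test set so the metric reflects generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# R-squared on held-out data; this value could replace the constant 0.9
r2 = r2_score(y_test, model.predict(X_test))
```

The resulting `r2` value could then be passed to `mlflow.log_metric("r2", r2)` inside the run.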
- The script loads the California Housing dataset and prepares it for training.
- The `hyperparameter_tuning` function uses `GridSearchCV` to find the best hyperparameters for a `RandomForestRegressor`.
- Inside the `mlflow.start_run()` block, the best hyperparameters found by `GridSearchCV` and the Mean Squared Error (MSE) on the test set are logged.

  ```python
  with mlflow.start_run():
      # ... perform hyperparameter tuning ...
      mlflow.log_param("best_n_estimators", grid_search.best_params_["n_estimators"])
      mlflow.log_metric("mse", mse)
      # ...
  ```
- The best model from `GridSearchCV` is logged and registered in the MLflow Model Registry.

  ```python
  mlflow.sklearn.log_model(
      best_model,
      "model",
      registered_model_name="Best Randomforest Model",
      signature=signature,
  )
  ```
- `mlflow.set_tracking_uri`: sets the tracking URI for storing experiment data
- `mlflow.set_experiment`: sets the experiment name
- `mlflow.start_run`: starts a new run
- `mlflow.log_params`: logs parameters
- `mlflow.log_metric`: logs metrics
- `mlflow.set_tag`: sets tags
- `mlflow.models.infer_signature`: infers the signature of the model
- `mlflow.sklearn.log_model`: logs the model
Note: This is a basic example to demonstrate the MLflow workflow. You may need to modify the code to suit your specific use case.