sdk/python/using-mlflow/train-and-log/logging_and_customizing

{ "cells": [ { "cell_type": "markdown", "id": "74cba1ed", "metadata": {}, "source": [ "# Logging models with MLflow\n", "\n", "The following notebook explains the differences between an artifact and a model in MLflow and how to transition from one to the other. It also contains working examples to start logging models as MLflow models, depending on the characteristics of the assets you are dealing with and how they are supported in MLflow." ] }, { "cell_type": "markdown", "id": "fb4571f3", "metadata": {}, "source": [ "## What's the difference between an artifact and a model?\n", "\n", "Any file generated (and captured) from an experiment's run or job is an artifact. It may represent a model serialized as a Pickle file, the weights of a PyTorch or TensorFlow model, or even a text file containing the coefficients of a linear regression. Other artifacts may have nothing to do with the model itself, but they can contain configuration to run the model, pre-processing information, sample data, etc. As you can see, an artifact can come in any format. You can log artifacts in MLflow in the same way you log a file with Azure ML SDK v1 (`Run.log_file`).\n", "\n", "```python\n", "mlflow.log_artifact()\n", "```\n", "\n", "On the other hand, a model in MLflow is also an artifact, as it matches the definition given above. However, we make stronger assumptions about this type of artifact. Such assumptions allow us to create a clear contract between the saved artifacts and what they mean. When you log your models as artifacts (simple files), you need to know what the model builder meant for each of them in order to know how to load them. When you log your models as a Model entity, you should be able to tell what it is based on the contract we just mentioned.\n", "\n", "MLflow adopts the MLModel format as a way to create this contract between the artifacts and what they represent. 
The MLModel format stores assets in a folder, where there is a particular file named `MLModel`. This file is the single source of truth about how a model can be loaded and used.\n", "\n", "![](https://miro.medium.com/max/325/1*SASE6235jPLuzqrHtj_Fzg.png)\n", "\n", "Considering the variety of machine learning frameworks available, MLflow introduced the concept of a flavor, which indicates what to expect for a given model created with a given framework. For instance, TensorFlow has its own flavor in MLflow, which specifies how a TensorFlow model should be persisted and loaded. Because each flavor defines how its models are persisted and loaded, the MLModel format doesn't enforce a single serialization mechanism that all models need to support. This decision allows each flavor to use the methods that provide the best performance or best support according to its best practices, without compromising compatibility with the standard." ] }, { "cell_type": "markdown", "id": "dea8f7f6", "metadata": {}, "source": [ "## How do I start logging models instead of artifacts?\n", "\n", "If you are working on a new project, we encourage you to adopt the MLflow MLModel format to log your models instead of relying on logging artifacts. Models in the MLflow format get a better experience in Azure ML, such as no-code deployments on our managed inference services.\n", "\n", "To show the different strategies, let's consider the [Heart Disease Data Set](https://archive.ics.uci.edu/ml/datasets/heart+disease). To simplify the modeling part, we are going to use only an XGBoost-based model and explore different ways to get the model logged as needed." 
] }, { "cell_type": "markdown", "id": "f5a4d2ef", "metadata": {}, "source": [ "### Prerequisites to run this notebook" ] }, { "cell_type": "code", "execution_count": null, "id": "fccfe0c7", "metadata": {}, "outputs": [], "source": [ "# Ensure you have the dependencies for this notebook\n", "%pip install -r logging_and_customizing_models.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "c384f488", "metadata": {}, "outputs": [], "source": [ "import mlflow\n", "\n", "mlflow.set_experiment(\"heart-disease-classifier\")" ] }, { "cell_type": "markdown", "id": "1fae6e41", "metadata": {}, "source": [ "Reading the data:" ] }, { "cell_type": "code", "execution_count": null, "id": "e2c361e6", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "file_url = \"https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv\"\n", "df = pd.read_csv(file_url)\n", "df[\"thal\"] = df[\"thal\"].astype(\"category\").cat.codes" ] }, { "cell_type": "markdown", "id": "9df947eb", "metadata": {}, "source": [ "Let's do a basic train-test split:" ] }, { "cell_type": "code", "execution_count": null, "id": "e94b3d25", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    df.drop(\"target\", axis=1), df[\"target\"], test_size=0.3\n", ")" ] }, { "cell_type": "markdown", "id": "e288761d", "metadata": {}, "source": [ "### Logging models using `autolog()`\n", "\n", "One of the simplest ways to start using this approach is MLflow's autolog functionality. Autolog lets MLflow instruct the framework you are using to log all the metrics, parameters, artifacts and models that the framework considers relevant. By default, most models will be logged if autolog is enabled. Some flavors may decide not to do that in specific situations. 
For instance, the PySpark flavor won't log models if they exceed a certain size.\n", "\n", "You can turn on autologging by using the autolog method for the framework you are using. For instance, for XGBoost models:\n", "\n", "```python\n", "mlflow.xgboost.autolog()\n", "```\n", "\n", "Our training would look like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "f215ea43", "metadata": {}, "outputs": [], "source": [ "import mlflow\n", "from xgboost import XGBClassifier\n", "from sklearn.metrics import accuracy_score\n", "\n", "with mlflow.start_run():\n", "\n", "    mlflow.xgboost.autolog()\n", "\n", "    model = XGBClassifier(use_label_encoder=False, eval_metric=\"logloss\")\n", "    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)\n", "    y_pred = model.predict(X_test)\n", "\n", "    accuracy = accuracy_score(y_test, y_pred)\n", "    mlflow.log_metric(\"accuracy\", accuracy)" ] }, { "cell_type": "markdown", "id": "180a729b", "metadata": {}, "source": [ "> Note: Even with autolog, you may still need to track some metrics that are important to you. For instance, `accuracy` is not tracked automatically by XGBoost, but we can log it manually. The model, parameters and other metrics were tracked correctly though." ] }, { "cell_type": "markdown", "id": "1e1037fd", "metadata": {}, "source": [ "Sometimes autolog can't guess the best way to log a model, and you need to do it yourself. For those cases, you can tell autolog to log everything but the model. You do that using the parameter `log_models=False` of the `autolog()` function." ] }, { "cell_type": "markdown", "id": "0a3ba624", "metadata": {}, "source": [ "### Logging models supported by MLflow\n", "\n", "If you need to log the models in a particular way, then you can use the method `log_model` to log the models as you need to. 
Usually, you will log the model in this way when:\n", "\n", "* You need to indicate `pip` packages or dependencies different from the ones that are automatically detected.\n", "* You need to indicate a `conda` environment different from the default one.\n", "* Your model uses a signature different from the one inferred. This is especially important when you deal with tensor inputs, where the signature needs specific shapes.\n", "* You want to include input examples.\n", "* You want to include specific artifacts in the package that will be needed.\n", "* The default behaviour of autolog doesn't fit your purpose for some other reason.\n", "\n", "To log a model, you use the `log_model` method of the flavor you are working with. For instance, the following code logs a model for an XGBoost classifier:" ] }, { "cell_type": "code", "execution_count": null, "id": "ba4faceb", "metadata": {}, "outputs": [], "source": [ "import mlflow\n", "from xgboost import XGBClassifier\n", "from sklearn.metrics import accuracy_score\n", "from mlflow.models import infer_signature\n", "\n", "with mlflow.start_run():\n", "\n", "    mlflow.xgboost.autolog(log_models=False)\n", "\n", "    model = XGBClassifier(use_label_encoder=False, eval_metric=\"logloss\")\n", "    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)\n", "    y_pred = model.predict(X_test)\n", "\n", "    accuracy = accuracy_score(y_test, y_pred)\n", "    mlflow.log_metric(\"accuracy\", accuracy)\n", "\n", "    signature = infer_signature(X_test, y_test)\n", "    mlflow.xgboost.log_model(model, \"classifier\", signature=signature)" ] }, { "cell_type": "markdown", "id": "3bfad4c4", "metadata": {}, "source": [ "If you need to indicate a custom environment with packages, you can use:\n", "\n", "```python\n", "from mlflow.utils.environment import _mlflow_conda_env\n", "\n", "custom_packages = _mlflow_conda_env(\n", "    additional_conda_deps=None,\n", "    additional_pip_deps=[\"xgboost==1.5.2\"],\n", "    additional_conda_channels=None,\n", 
")\n", "\n", "mlflow.xgboost.log_model(model, \"classifier\", conda_env=custom_packages, signature=signature)\n", "```" ] }, { "cell_type": "markdown", "id": "56174ee7", "metadata": {}, "source": [ "### Logging custom models\n", "\n", "MLflow supports a wide variety of frameworks, including FastAI, MXNet Gluon, PyTorch, TensorFlow, XGBoost, CatBoost, h2o, Keras, LightGBM, MLeap, ONNX, Prophet, spaCy, Spark MLLib, Scikit-Learn, and Statsmodels. However, there may be times when you need to change how a flavor works or even log a model that uses custom objects that are not part of any particular framework. For those cases, you may need to create a custom model flavor.\n", "\n", "Another typical case where you need to create a custom model is when you need to go beyond the `predict` function available in the framework you are using. For instance, people often implement forecasting routines using Scikit-learn. However, Scikit-learn doesn't have a `forecast` function. If you want to enable this `forecast` function in your model, you may need to create a custom model flavor." ] }, { "cell_type": "markdown", "id": "0af2a148", "metadata": {}, "source": [ "#### Logging custom models that are serializable\n", "\n", "A Python object is serializable when it can be stored in the file system as a file. At runtime, the object can be materialized from that file, and all the values, properties and methods available when it was saved will be restored." ] }, { "cell_type": "markdown", "id": "a582a310", "metadata": {}, "source": [ "> **Note:** If your model implements the Scikit-learn API, then you can use the Scikit-learn flavor to log the model. For instance, if your model used to be (or can be) persisted in Pickle format and the object has methods `predict` and `predict_proba` (at least), then you can use `mlflow.sklearn.log_model` to log it inside an MLflow run." 
] }, { "cell_type": "markdown", "id": "46c3d1ea", "metadata": {}, "source": [ "For this type of model, MLflow introduces a flavor called `pyfunc` (standing for Python function). This flavor allows you to log any object you want as a model, as long as it satisfies two conditions:\n", "\n", "- The Python object inherits from `mlflow.pyfunc.PythonModel`.\n", "- You implement the method `predict` (at least).\n", "\n" ] }, { "cell_type": "markdown", "id": "468e4d0b", "metadata": {}, "source": [ "The following sample wraps a model created with XGBoost to make it behave differently from the default implementation of the XGBoost flavor (it returns the probabilities instead of the classes):" ] }, { "cell_type": "code", "execution_count": null, "id": "70d372e9", "metadata": {}, "outputs": [], "source": [ "from mlflow.pyfunc import PythonModel, PythonModelContext\n", "\n", "\n", "class ModelWrapper(PythonModel):\n", "    def __init__(self, model):\n", "        self._model = model\n", "\n", "    def predict(self, context: PythonModelContext, data):\n", "        # You don't have to keep the semantic meaning of `predict`. You can use here model.recommend(), model.forecast(), etc\n", "        return self._model.predict_proba(data)\n", "\n", "    # You can even add extra functions if you need to. Since the model is serialized,\n", "    # all of them will be available when you load your model back.\n", "    def predict_batch(self, data):\n", "        pass" ] }, { "cell_type": "markdown", "id": "d4edea9a", "metadata": {}, "source": [ "Then, a custom model can be logged in the run like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "85e82ae2", "metadata": {}, "outputs": [], "source": [ "import mlflow\n", "from xgboost import XGBClassifier\n", "from sklearn.metrics import accuracy_score\n", "from mlflow.models import infer_signature\n", "\n", "with mlflow.start_run():\n", "    mlflow.xgboost.autolog(log_models=False)\n", "\n", "    model = XGBClassifier(use_label_encoder=False, eval_metric=\"logloss\")\n", "    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)\n", "    y_probs = model.predict_proba(X_test)\n", "\n", "    accuracy = accuracy_score(y_test, y_probs.argmax(axis=1))\n", "    mlflow.log_metric(\"accuracy\", accuracy)\n", "\n", "    signature = infer_signature(X_test, y_probs)\n", "    mlflow.pyfunc.log_model(\n", "        \"classifier\", python_model=ModelWrapper(model), signature=signature\n", "    )" ] }, { "cell_type": "markdown", "id": "bc19d4a9", "metadata": {}, "source": [ "> Important: note how the `infer_signature` method now uses `y_probs` to infer the signature. Our target column has the target class, but our model now returns two probabilities per row, one for each class." ] }, { "cell_type": "markdown", "id": "ef0c8777", "metadata": {}, "source": [ "#### Logging custom models that are not serializable\n", "\n", "A model is not serializable when it cannot be persisted as a Pickle file. This includes models that hold references to code that can't be serialized, that do not support serialization, or that provide a more efficient way to be persisted on disk.\n", "\n", "In this case, you are required to use a different method to persist the artifacts that you need for your model to run. 
Then, MLflow will snapshot all these artifacts and package them for you. You have two different ways to do this, depending on your preferences:\n", "\n", "##### Option 1: Use artifacts with the `PythonModel` object\n", "\n", "Use this if you want to retain the state of your model's properties. For instance, in a recommender system you might want to store the number of elements to recommend to any user as a parameter. Here, you will implement a model wrapper as you did in the previous section, but in this case you will use `artifacts` to tell MLflow about extra files that you want to include for loading the model state.\n", "\n", "To log a custom model using artifacts, you can do something as follows:\n", "\n", "\n", "```python\n", "model_path = 'xgb.model'\n", "model.save_model(model_path)\n", "\n", "mlflow.pyfunc.log_model(\"classifier\",\n", "                        python_model=ModelWrapper(),\n", "                        artifacts={ 'model': model_path },\n", "                        signature=signature)\n", "```\n", "\n", "A few things to notice:\n", "\n", "* The model was saved using the `save_model` method of the framework used (it's not saved as a pickle).\n", "* `ModelWrapper()` is the model wrapper, but the model is not passed as a parameter to the constructor.\n", "* A new parameter, `artifacts`, is indicated. It is a dictionary of `key`, `path` pairs, where `key` is the name of the artifact and `path` is the local path in the file system where the artifact is stored.\n", "\n", "The corresponding model wrapper then would look as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "285987d4", "metadata": {}, "outputs": [], "source": [ "from mlflow.pyfunc import PythonModel, PythonModelContext\n", "\n", "\n", "class ModelWrapper(PythonModel):\n", "    def load_context(self, context: PythonModelContext):\n", "        from xgboost import XGBClassifier\n", "\n", "        # Rebuild the classifier and restore its state from the packaged artifact\n", "        self._model = XGBClassifier(use_label_encoder=False, eval_metric=\"logloss\")\n", "        self._model.load_model(context.artifacts[\"model\"])\n", "\n", "    def predict(self, context: PythonModelContext, data):\n", "        return self._model.predict_proba(data)" ] }, { "cell_type": "markdown", "id": "11260936", "metadata": {}, "source": [ "The complete training routine would look as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "7c44066e", "metadata": {}, "outputs": [], "source": [ "import mlflow\n", "from xgboost import XGBClassifier\n", "from sklearn.metrics import accuracy_score\n", "from mlflow.models import infer_signature\n", "\n", "with mlflow.start_run():\n", "    mlflow.xgboost.autolog(log_models=False)\n", "\n", "    model = XGBClassifier(use_label_encoder=False, eval_metric=\"logloss\")\n", "    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)\n", "    y_probs = model.predict_proba(X_test)\n", "\n", "    accuracy = accuracy_score(y_test, y_probs.argmax(axis=1))\n", "    mlflow.log_metric(\"accuracy\", accuracy)\n", "\n", "    model_path = \"xgb.model\"\n", "    model.save_model(model_path)\n", "\n", "    signature = infer_signature(X_test, y_probs)\n", "    mlflow.pyfunc.log_model(\n", "        \"classifier\",\n", "        python_model=ModelWrapper(),\n", "        artifacts={\"model\": model_path},\n", "        signature=signature,\n", "    )" ] }, { "cell_type": "markdown", "id": "dac09e38", "metadata": {}, "source": [ "##### Option 2: Use a loader module\n", "\n", "Sometimes your model logic is complex and there are several source code files being used to make your model work. This would be the case, for instance, when you have a Python library for your model. 
In this scenario, you want to package the library along with your model so it can move from one place to another as a single piece.\n", "\n", "MLflow supports this kind of model too by allowing you to specify arbitrary source code to package along with the model.\n", "\n", "```python\n", "model_path = 'xgb.model'\n", "model.save_model(model_path)\n", "\n", "mlflow.pyfunc.log_model(\"classifier\",\n", "                        data_path=model_path,\n", "                        code_path=['loader_module.py'],\n", "                        loader_module='loader_module',\n", "                        signature=signature)\n", "```\n", "\n", "A few things to notice:\n", "\n", "* The model was saved using the `save_model` method of the framework used (it's not saved as a pickle).\n", "* A new parameter, `data_path`, was added pointing to the location of the model's artifacts. This can be a folder or a file. Whatever is in that folder or file will be packaged with the model.\n", "* A new parameter, `code_path`, was added pointing to the location where the source code is placed. Each entry can be a folder or a single file. Whatever is in that folder or file will be packaged with the model.\n", "* The parameter `loader_module` represents a namespace in Python where the code to load the model is placed. 
You are required to implement in this namespace a function called `_load_pyfunc(data_path: str)` that receives the path of the artifacts (the value you passed as `data_path`) and returns an object with a method `predict` (at least).\n", "\n", "The corresponding `loader_module.py` implementation would be:" ] }, { "cell_type": "code", "execution_count": null, "id": "be6eb65c", "metadata": {}, "outputs": [], "source": [ "%%writefile loader_module.py\n", "\n", "class MyModel():\n", "    def __init__(self, model):\n", "        self._model = model\n", "\n", "    def predict(self, data):\n", "        return self._model.predict_proba(data)\n", "\n", "def _load_pyfunc(data_path: str):\n", "    import os\n", "    # Imported here so the module is self-contained when MLflow loads it\n", "    from xgboost import XGBClassifier\n", "\n", "    model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')\n", "    model.load_model(os.path.abspath(data_path))\n", "\n", "    return MyModel(model)" ] }, { "cell_type": "markdown", "id": "3ff32ee0", "metadata": {}, "source": [ "Note here that:\n", "\n", "* The model `MyModel` doesn't inherit from `PythonModel` as before, but it has a `predict` function.\n", "* The model's source code is in a file. This can be any source code you want. If your project has a folder `src`, it is a great candidate.\n", "* We added a function `_load_pyfunc`, which returns an instance of the model class." 
] }, { "cell_type": "markdown", "id": "70da136d", "metadata": {}, "source": [ "The complete training code would look as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "17102d65", "metadata": {}, "outputs": [], "source": [ "import mlflow\n", "from xgboost import XGBClassifier\n", "from sklearn.metrics import accuracy_score\n", "from mlflow.models import infer_signature\n", "\n", "with mlflow.start_run():\n", "    mlflow.xgboost.autolog(log_models=False)\n", "\n", "    model = XGBClassifier(use_label_encoder=False, eval_metric=\"logloss\")\n", "    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)\n", "    y_probs = model.predict_proba(X_test)\n", "\n", "    accuracy = accuracy_score(y_test, y_probs.argmax(axis=1))\n", "    mlflow.log_metric(\"accuracy\", accuracy)\n", "\n", "    model_path = \"xgb.model\"\n", "    model.save_model(model_path)\n", "\n", "    signature = infer_signature(X_test, y_probs)\n", "    mlflow.pyfunc.log_model(\n", "        \"classifier\",\n", "        data_path=model_path,\n", "        code_path=[\"loader_module.py\"],\n", "        loader_module=\"loader_module\",\n", "        signature=signature,\n", "    )" ] } ], "metadata": { "interpreter": { "hash": "115d980ba0b06ac33dd7478fa3d990f21efd59bb027628063689062dcf356ae9" }, "kernelspec": { "display_name": "Python 3.10 - SDK V2", "language": "python", "name": "python310-sdkv2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]" } }, "nbformat": 4, "nbformat_minor": 5 }
