Databricks MLOps basic pipeline

Luis Garcia Fuentes
4 min read · Jul 14, 2023

An introduction to MLOps in Databricks for data scientists who are used to coding locally.

Performing machine learning in Databricks is not so different from working on your local computer. However, it can be confusing the first time you try it. This article aims to provide a 101 on the topic.

What libraries to use

If you have large amounts of data, it is advised that you write native PySpark and use the MLlib library for ML models. MLlib can be thought of as the sklearn equivalent written in PySpark: the code itself handles multi-node computing, and the library comes prepackaged with popular machine-learning algorithms.

However, if each batch is relatively small (below a million records, or small enough to fit in the RAM of a single node), it is advised to use sklearn directly in Databricks for performance reasons. Furthermore, some algorithms implemented in sklearn are still not available in MLlib.
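To give a feel for the equivalence, here is a minimal MLlib sketch that mirrors the familiar sklearn fit/predict pattern. The DataFrames and column names are hypothetical:

```python
# Minimal MLlib pipeline sketch (train_df / test_df and column names are placeholders)
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)          # train_df is a Spark DataFrame
predictions = model.transform(test_df)  # adds a "prediction" column
```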

Training a model

In order to train a model, you will need to either load your training data into the Databricks File System (DBFS) or query it out of a table saved in the data lake connected to your Databricks environment.

Depending on how you store your data, you will have to use different functions to read it, as sketched below:

Load data from DBFS
Load data from the Data lake
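A hedged sketch of both options follows; the file path and table name are placeholders, and `spark` is the session Databricks provides in every notebook:

```python
# Load data from DBFS (e.g. a CSV uploaded through the UI) -- path is hypothetical
df_dbfs = spark.read.csv("dbfs:/FileStore/training_data.csv",
                         header=True, inferSchema=True)

# Load data from a table in the data lake / metastore -- table name is hypothetical
df_lake = spark.table("analytics.customer_features")

# If you plan to train with sklearn, convert to pandas once the data fits in memory
pdf = df_lake.toPandas()
```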

Once your data is read, the traditional ML steps follow: (1) clean your data, (2) vectorize your data for ML ingestion, (3) do a train-test split, (4) train a model, and (5) evaluate the model on the test set.
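As a rough sklearn sketch of those steps, assuming the pandas DataFrame `pdf` loaded above and a hypothetical `churned` target column:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

pdf = pdf.dropna()                                  # (1) clean
X = pd.get_dummies(pdf.drop(columns=["churned"]))   # (2) vectorize / encode features
y = pdf["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # (3) split

model = RandomForestClassifier().fit(X_train, y_train)    # (4) train
print(accuracy_score(y_test, model.predict(X_test)))      # (5) evaluate on test data
```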

Where things get interesting is that Databricks has a very helpful MLOps functionality based on the MLflow library.

MLflow library

The library automates documentation related to model creation and provides an easy-to-navigate model registry, so developed models can be used as part of batch processes, behind API calls, or by other data scientists experimenting in other notebooks.

During the experimentation process, the tracking capabilities allow for easy comparison of multiple models with different parameters. However, you can also use the library simply to save your model for later use.
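A sketch of what tracking can look like, reusing the sklearn data from above; the parameter grid and metric name are arbitrary choices for illustration:

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier

# Each run logs its parameter and metric, so runs can be compared in the MLflow UI
for n_estimators in [50, 100, 200]:
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n_estimators)
        model = RandomForestClassifier(n_estimators=n_estimators).fit(X_train, y_train)
        mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
```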

Saving a model is a simple process:
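A minimal sketch, assuming a fitted sklearn model and a hypothetical registry name:

```python
import mlflow.sklearn

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn_classifier",  # hypothetical registry name
    )
```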

Once a model is saved, you will be able to see it in the model registry menu of Databricks, alongside every other model registered in the environment.

For each model, you will be able to see all the versions that exist. Each time you re-train the model, a new version will be created.

Once a model is registered, it can be called via an API, via a notebook scheduled to run as part of a workflow, or in a data science exploration notebook on an as-needed basis. For the last two scenarios, code like the one shown below is required.
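A hedged sketch of loading a registered model back from the registry; the model name, version, and input DataFrame are placeholders:

```python
import mlflow.pyfunc

# Load version 1 of the registered model and score a batch of new data
model = mlflow.pyfunc.load_model("models:/churn_classifier/1")
predictions = model.predict(new_data_pdf)  # new_data_pdf: pandas DataFrame of features
```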

Databricks Production Philosophy

Finally, Databricks advises promoting code, not models, into the production environment. This means that the code that trains a model, as well as the data used to train it, is what needs to be tested and deployed into production. Thanks to MLflow, this can be achieved without additional effort.
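In practice, that can be as simple as keeping the training logic in a notebook or script that re-runs end to end and registers a new model version each time it is executed. A hedged sketch, reusing the hypothetical table and model names from earlier:

```python
# Hypothetical training entry point that can be promoted across environments
# and scheduled as a Databricks workflow; each run registers a new model version.
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

def train_and_register(table_name="analytics.customer_features",
                       model_name="churn_classifier"):
    pdf = spark.table(table_name).toPandas()
    X, y = pdf.drop(columns=["churned"]), pdf["churned"]
    model = RandomForestClassifier().fit(X, y)
    with mlflow.start_run():
        mlflow.sklearn.log_model(model, artifact_path="model",
                                 registered_model_name=model_name)

train_and_register()
```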
