MLflow: A Professional Way to Keep Track of Your Experiments
The Issue That No One Talks About (But Everyone Has)
You spend weeks working on a machine learning model. You change the hyperparameters. You try out different algorithms. You run 50, 100, or even 200 experiments. Then disaster strikes: you can’t remember which set of parameters gave you the best results. Your laptop is littered with notebooks. Your model files have cryptic names like model_v3_lr0.01_bs64_acc0.82.h5. Your team members don’t know which version of the model is actually being used.
Welcome to the nightmare that is the chaos of tracking experiments.
This is what most data scientists and ML engineers deal with every day. It’s possible to manage machine learning experiments without the right tools, but it’s very painful, like trying to keep a detailed lab notebook while wearing oven mitts.
The best part is that most people don’t know that this problem can be fixed. And what is the answer? MLflow. An open-source platform that makes your messy experiment management into a system that is easy to use and repeat.
This guide will teach you everything you need to know about MLflow, including how it works, why it’s important, and most importantly, how to use it to keep track of your ML experiments like a pro.
Here’s a visual representation of the iterative machine learning lifecycle, highlighting the key steps from setting business goals to monitoring deployed models.

What is MLflow, anyway?
MLflow is a free platform that helps you manage the entire machine learning lifecycle. Think of it as a central place where all the parts of your ML projects come together and work well together.
Created by the team at Databricks, MLflow is one of the most widely used experiment-tracking tools among data science teams around the world. It makes it easy to organize, track, and deploy your models, whether you’re working on your own or with a team of 50 data scientists.
What makes MLflow so great? It doesn’t care what language or framework you use. MLflow offers APIs for Python, R, and Java, and integrates cleanly with frameworks like TensorFlow, PyTorch, scikit-learn, and others.
The Four Pillars of MLflow and Why They Are Important
There are four main parts to MLflow, and each one solves a different problem in the ML lifecycle:
1. Tracking: Your own journal for experiments
MLflow Tracking is where the magic happens. This part keeps track of everything you need to repeat your experiments:
Parameters: Your hyperparameters, like the learning rate, batch size, number of layers, and so on.
Metrics: Measures of performance (accuracy, precision, recall, F1 score, and loss values)
Artifacts: Any files your run produces, like saved models, plots, datasets, and images
Source code version: The exact code that produced each run
MLflow automatically collects all of this information instead of you having to write it down in a spreadsheet or make hundreds of file variations. In MLflow, every training run makes a “run,” which is a timestamped snapshot of your experiment and all of its data.
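For example, a plot can be attached to a run with a single call. Here’s a minimal sketch (the run name, the plotted values, and the file name are placeholders for illustration):
import mlflow
import matplotlib.pyplot as plt

with mlflow.start_run(run_name="artifact_demo"):
    # Build a simple plot; in a real run these values would come from training
    fig, ax = plt.subplots()
    ax.plot([0.61, 0.72, 0.79, 0.83])
    ax.set_xlabel("epoch")
    ax.set_ylabel("accuracy")

    # Store the figure as an artifact alongside the run's parameters and metrics
    mlflow.log_figure(fig, "accuracy_curve.png")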
2. Projects: How to Make Your Code Work Again
MLflow Projects is basically a standard way to put your ML code together. It says, “Hey, here’s how to run my project, what environment it needs, and what the entry points are.”
Think about giving your project to a coworker and having them run it exactly how you wanted the first time, with no problems setting it up. That’s what Projects is all about.
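As a rough sketch of the idea, assuming your project directory contains an MLproject file that defines a main entry point with an alpha parameter (both names are hypothetical), a colleague could launch it like this:
import mlflow

# Run a local project directory (a Git URL works too) with explicit parameters.
# MLflow resolves the environment and entry point from the MLproject file.
mlflow.run(
    uri=".",
    entry_point="main",
    parameters={"alpha": 0.5},
)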
3. Models: The Universal Format for Packaging
MLflow Models gives your trained models a standard wrapper. This is a big deal because models come in a lot of different formats, like TensorFlow SavedModel directories, PyTorch .pth files, and scikit-learn pickles.
MLflow says, “I don’t care what format your model is in. I’ll package it so that anyone can load and use it anywhere.”
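In practice, any logged model can be loaded back through the generic pyfunc interface, no matter which framework trained it. A minimal sketch (the run ID here is a placeholder, and X_test is reused from a typical train/test split):
import mlflow.pyfunc

# Load a model logged in an earlier run; "abc123" is a placeholder run ID
model = mlflow.pyfunc.load_model("runs:/abc123/model")

# The predict call looks the same whether the underlying model is sklearn, PyTorch, or XGBoost
predictions = model.predict(X_test)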
4. Model Registry: Your Main Place for Versioning
The Model Registry is the place where models that have been registered are stored. In short, it’s version control for your ML models. You can see all the versions of a model, move models from one stage to another (Development → Staging → Production), and keep track of each model version’s whole lifecycle.
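Here’s a rough sketch of what that looks like in code, assuming a model was logged in an earlier run. The run ID, model name, and stage are placeholders, and newer MLflow releases favor model aliases over stages:
import mlflow
from mlflow.tracking import MlflowClient

# Register a logged model under a name (versions 1, 2, 3, ... are created automatically)
result = mlflow.register_model("runs:/abc123/model", "churn_classifier")

# Promote a specific version to Staging
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier",
    version=result.version,
    stage="Staging",
)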
Here’s an infographic summarising the four main components of the MLflow platform:

How MLflow Tracking Works (Without All the Boring Terms)
Let’s say you’re using different hyperparameters to train a random forest model. This is what MLflow does:
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
# Load the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)
# Start a run in MLflow
with mlflow.start_run():
    # Log your parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log your metrics
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)

    # Log the model
    mlflow.sklearn.log_model(model, "model")

That’s all. MLflow automatically records everything and keeps it safe. No spreadsheets. No naming files by hand. Just clean, organized experiment data.
The Game-Changer: Autologging
This is where MLflow starts to seem almost magical. You can turn on autologging for well-known frameworks. You don’t have to write any extra code for MLflow to automatically log parameters, metrics, and models.
import mlflow
# Turn on autologging for your framework
mlflow.autolog()
# Now just train like normal
model.fit(X_train, y_train)
# Everything is automatically recorded!

TensorFlow/Keras, PyTorch, scikit-learn, XGBoost, LightGBM, Spark MLlib, and many more frameworks are supported.
This is huge because it makes things easier. You go from “I need to remember to log this” to “it just happens.” Most modern data scientists are already using these frameworks, so autologging feels like a cheat code.
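You can also flip the switch per framework and then inspect what autologging captured. A small sketch reusing the iris train/test split from earlier (mlflow.last_active_run() requires a reasonably recent MLflow release):
import mlflow
from sklearn.ensemble import RandomForestClassifier

mlflow.sklearn.autolog()  # framework-specific switch instead of the global mlflow.autolog()

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)  # parameters, metrics, and the model are logged automatically

# Peek at what was captured
run = mlflow.last_active_run()
print(run.info.run_id, run.data.params.get("n_estimators"))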
MLflow vs. Other Tools for Keeping Track of Experiments (The Real Talk)
There are other tools on the market, such as Weights & Biases, Neptune, Comet, and more. So how does MLflow compare?
Here’s a quick table comparing MLflow to other popular experiment tracking tools:
| Feature | MLflow (Open-Source) | Weights & Biases | Neptune.ai | Comet ML |
| --- | --- | --- | --- | --- |
| Cost | Free | Paid (Free tier) | Paid | Paid |
| Framework Agnostic | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Self-Hosted Option | ✅ Yes | No | No | No |
| Managed Cloud Option | Databricks | ✅ Yes | ✅ Yes | ✅ Yes |
| Community Support | Strong | Strong | Good | Good |
| Advanced Team Features | Basic (Needs setup) | ✅ Yes | ✅ Yes | ✅ Yes |
| User Management | Manual setup | Built-in | Built-in | Built-in |
| Model Registry | ✅ Yes | Limited | Limited | Limited |
MLflow’s Strong Points:
Completely free and open-source: There are no hidden costs or “enterprise pricing surprises.”
Framework agnostic: It works with any ML framework at all.
Easy to set up: You can start tracking experiments in a few minutes.
Strong community: The open-source community is helpful and active.
Self-hosted option: You are in charge of all of your data.
Where MLflow Gets Hard:
Team features: The open-source version lacks the advanced collaboration features you get with tools like Weights & Biases
Managed hosting: You have to set up your own tracking server for your team to use.
User management: Custom implementation is needed for enterprise features.
For one data scientist or a small group? There is no better tool than MLflow. For large enterprises that need advanced user management and hosted solutions, Databricks offers managed MLflow (which is paid, but worth it).
Success Stories with MLflow in the Real World
Let’s talk about businesses that are using MLflow to solve real problems:
Databricks: Running Thousands of Tests
Databricks, the company that made MLflow, uses it to run thousands of ML experiments at the same time with many different teams.
Result: A 40% reduction in model development time
Impact: Teams spent less time running experiments and more time making models that worked better.
Uber: Growing to Teams Around the World
Uber used MLflow to keep track of hundreds of machine learning models used across different parts of the business, covering everything from recommendation systems to prediction engines.
Metrics: 50% faster model iteration cycles
Benefit: A big drop in deployment mistakes across global teams
Comcast: Making the Customer Experience Better
The big telecom company used MLflow to make ML workflows more consistent and create models that better predict when customers will leave.
Achievement: 25% better accuracy in predictive models
Bonus: New projects can be deployed faster
Zoom: How to Scale Up Quickly During the Pandemic
When the world went remote, Zoom used MLflow to handle ML models for improving video quality and adding AI-powered features.
Focus: Making sure that models are deployed the same way all over the world
Result: Rapid iteration on communication features without sacrificing quality
These aren’t hypothetical scenarios. These are real businesses getting real value from MLflow.
Setting Up MLflow Is Easier Than You Think
Option 1: Set up locally (start here)
pip install mlflow
mlflow ui

That’s all there is to it. Open http://localhost:5000 in your browser and you can start logging experiments right away.
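Once the UI is up, point your training script at it and give the experiment a name. The tracking URI and experiment name below are just examples:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # the local server started by `mlflow ui`
mlflow.set_experiment("iris_classification_v1")   # created automatically if it doesn't exist

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.95)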
Option 2: Remote Tracking Server (for groups)
You need a central server when you work with a group. Here is how to set it up:
# On the machine that runs your server
mlflow server --backend-store-uri postgresql://user:pass@host/db \
    --default-artifact-root s3://bucket/path \
    --host 0.0.0.0

You will need:
A metadata database, like PostgreSQL, MySQL, or SQLite
Artifacts stored in the cloud, like S3
A server or cloud instance to run it on
Option 3: Managed MLflow (No Headaches)
Databricks provides fully managed MLflow. You don’t have to set anything up. You just use it. This is something to think about if you want production-grade reliability without having to do DevOps work, but it costs money.
A Real-Life Example: The Right Way to Do Hyperparameter Tuning
Say you’re tuning a model and trying out dozens of different hyperparameter combinations. Without MLflow it would be a mess. With MLflow you can stay organized.
Here’s how to use nested runs to keep your tuning experiments in order:
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import random
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)
# Make a parent run for the whole tuning experiment
with mlflow.start_run(run_name="hyperparameter_tuning") as parent_run:
mlflow.set_tag("experiment_type", "hyperparameter_tuning")
# Try out different hyperparameters
for depth in range(5, 20):
for estimators in [50, 100, 200]:
# For each combination, make a nested child run
with mlflow.start_run(run_name=f"depth_{depth}_est_{estimators}", nested=True):
model = RandomForestClassifier(
max_depth=depth,
n_estimators=estimators
)
model.fit(X_train, y_train)
# Keep track of parameters and metrics
mlflow.log_param("max_depth", depth)
mlflow.log_param("n_estimators", estimators)
mlflow.log_metric("accuracy", model.score(X_test, y_test))In the MLflow UI, you can now see one clean parent run with child runs underneath it. You have structure instead of 16 random runs all over the place.
Here’s an infographic demonstrating the MLflow experiment tracking workflow, from training to deployment, including how nested runs help organize hyperparameter tuning.

Advanced features that set professionals apart from amateurs
Search Runs with SQL-Like Syntax
You can query your experiments in MLflow like you would a database:
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Get all runs with an accuracy of more than 0.9
runs = client.search_runs(
    experiment_ids=["0"],
    filter_string="metrics.accuracy > 0.9"
)

# Find runs that match a specific parameter value
lr_runs = client.search_runs(
    experiment_ids=["0"],
    filter_string="params.learning_rate = '0.01'"
)

This is a game-changer for finding your best models without having to click through hundreds of runs by hand.
Model Signatures and Examples of Input
You can log model signatures that tell you what the input and output formats should be:
from mlflow.models import ModelSignature
from mlflow.types import Schema, ColSpec
import numpy as np

signature = ModelSignature(
    inputs=Schema([ColSpec("double", "feature1"), ColSpec("double", "feature2")]),
    outputs=Schema([ColSpec("double")])
)

mlflow.sklearn.log_model(
    model,
    "model",
    signature=signature,
    input_example=np.array([[1.0, 2.0]])
)

If someone tries to use your model later, they will immediately know what inputs it expects. No more guessing games.
Seeing the Differences in Runs
You can look at several runs next to each other in the MLflow UI. You can see which parameters improved performance, find patterns, and use data to decide which model to put into production.
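If you’d rather compare runs in code, mlflow.search_runs returns a pandas DataFrame you can sort and slice. A quick sketch (recent MLflow versions accept experiment_names, older ones take experiment_ids; the experiment name here is a placeholder):
import mlflow

# Pull every run of an experiment into a DataFrame, best accuracy first
runs_df = mlflow.search_runs(
    experiment_names=["iris_classification_v1"],
    order_by=["metrics.accuracy DESC"],
)
print(runs_df[["run_id", "params.max_depth", "params.n_estimators", "metrics.accuracy"]].head())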
The Real Cost: Things You Might Not Know
MLflow is free, but let’s be honest about how much it costs in total:
If you’re hosting yourself:
Costs for server infrastructure (a small AWS instance costs about $30 a month)
Taking care of databases like PostgreSQL, MySQL, etc.
S3 or cloud storage for things
Time to set up and DevOps work
Putting in place security and authentication
The total cost is $50 to $200 per month for a small team, plus your own work.
If you use Databricks’ Managed MLflow:
Starting at about $0.30 per hour for each DBU (Databricks Unit)
Costs that are easier to plan for
Included professional help
No problems with DevOps
For data scientists working alone? The free version of MLflow, installed on your own machine, is genuinely free. For teams? Plan for infrastructure costs.
MLflow Best Practices: Learn From Others’ Mistakes
1. Use a consistent tagging plan
Always use tags to sort your runs:
mlflow.set_tag("model_type", "classification")
mlflow.set_tag("dataset", "iris")
mlflow.set_tag("team", "analytics")
mlflow.set_tag("environment", "development")You’ll be glad you did this later when you can use these tags to sort runs.
2. Use names for experiments that mean something
Not this: experiment, exp_1, exp_2
Do this: iris_classification_v1, customer_churn_baseline, sentiment_analysis_bert
3. Keep track of your data versions
Don’t just write down the parameters for your model. Write down the version of the dataset you used:
mlflow.log_param("dataset_version", "2.1.0")
mlflow.log_param("data_split_date", "2024-11-01")4. Keep a record of your runs
When you start a run, use the description parameter:
with mlflow.start_run(description="Testing a new way to do feature engineering"):
    # your code here

5. Use automation when you can
Instead of writing everything down by hand, use mlflow.autolog(). Fewer mistakes and less code to keep up with.
How to Avoid Common Mistakes People Make
Mistake #1: Not setting up a tracking server
Logging experiments locally is fine for personal projects, but teams need a shared remote tracking server. Otherwise, each person ends up with their own separate experiment history.
Fix: Set up a tracking server as soon as more than one person works on the project.
Mistake #2: Logging Too Much
Logging 1,000 metrics on every run slows down your tracking server and buries the signal in noise. Be selective.
Fix: Keep track of the metrics that are important for making decisions.
Mistake #3: Not logging dataset information
You logged a model with great accuracy, but you forgot which version of the dataset you used. Now you can’t reproduce it.
Fix: Always keep track of the dataset, the steps you took to preprocess it, and any changes you made to the data.
Mistake #4: Not paying attention to the Model Registry
MLflow Tracking is great, but you’re not getting the full value if you don’t use Model Registry for deployment.
Solution: Move working models to Model Registry and use it as your main source of truth for production models.
MLflow for Different ML Frameworks (Quick Reference)
MLflow autologging works with:
| Framework | Key Features Supported |
| --- | --- |
| scikit-learn | ✅ Full: all parameters, metrics, and models |
| TensorFlow/Keras | ✅ Full: training metrics, model weights, and callbacks |
| PyTorch | ✅ Full: training loops and checkpoints |
| XGBoost | ✅ Full: feature importance and early stopping |
| LightGBM | ✅ Full: metrics, parameters, and models |
| Spark MLlib | ✅ Full: pipelines and models |
| Statsmodels | ✅ Full: statistical parameters and results |
| Hugging Face | ✅ Partial: training metrics (experimental) |
All you have to do is turn on autologging, and your framework of choice will log automatically.
The MLflow UI: Your Dashboard for Experiments
You get the following when you open the MLflow UI:
Experiment list: All of your experiments in one place
Run comparison: Compare metrics, parameters, and plots side by side
Metric charts: Trends over time that you can see
Artifact browser: See all of your saved files
Run details: every piece of metadata, artifact, metric, and parameter
The UI turns raw data from experiments into useful information. You don’t have to write any code to see right away which hyperparameters made things work better.
Here’s an example of an MLflow experiment tracking dashboard, allowing for visual comparison of multiple model runs to understand how different parameters impact performance.

Using MLflow to Deploy Models (From Testing to Production)
After you find the best model with MLflow Tracking, you can:
Put it in the Model Registry
Promote it in stages: Development, Staging, and Production
Use MLflow Models to serve it as a REST API.
Put it in a Docker container
# Serve your model as a REST API
mlflow models serve -m "models:/my_model/production" --port 5000

# You can now send requests to http://localhost:5000/invocations

This complete workflow, from tracking experiments to deploying them in production, is what makes MLflow so useful for real businesses.
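Once the model is served, you can query it over HTTP. A minimal sketch assuming MLflow 2.x’s JSON scoring format; the column names are placeholders and must match whatever your model was trained on:
import requests

payload = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3", "feature4"],  # placeholder column names
        "data": [[5.1, 3.5, 1.4, 0.2]],
    }
}

response = requests.post(
    "http://localhost:5000/invocations",
    json=payload,
)
print(response.json())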
MLflow for Generative AI and LLMs (The Next Big Thing)
MLflow’s support for LLM tracking and prompt engineering is one of the most exciting new things that has happened.
You can now do the following with MLflow:
Keep an eye on changes to prompts and how well they work
Keep track of LLM responses and quality metrics
Look at different LLM setups side by side.
Yes, really, keep track of your prompts.
This is a big deal for anyone who uses ChatGPT, GPT-4, Llama, or any other LLM. Prompt engineering follows the same rules for tracking experiments.
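The LLM-specific APIs are still evolving, but even the core tracking calls work for prompt experiments. A hedged sketch where the model name, prompt, response, and quality score are all placeholders:
import mlflow

with mlflow.start_run(run_name="prompt_experiment"):
    prompt = "Summarize the following support ticket in one sentence: {ticket}"

    mlflow.log_param("llm_model", "gpt-4")          # which model you called (placeholder)
    mlflow.log_param("temperature", 0.2)
    mlflow.log_text(prompt, "prompt_template.txt")  # keep the exact prompt as an artifact

    # In a real workflow, the response and score come from your LLM call and evaluation logic
    response = "Customer reports login failures after the latest app update."
    mlflow.log_text(response, "sample_response.txt")
    mlflow.log_metric("response_quality", 0.87)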
When Things Go Wrong, Here’s How to Fix Them
Issue: Experiments aren’t being recorded
Fix: Check that MLflow is imported and that your training code runs inside mlflow.start_run().
Issue: The MLflow UI won’t load
Fix: Make sure the MLflow server is running: mlflow ui
Issue: Artifacts aren’t being saved
Fix: Check your artifact storage settings and permissions.
Issue: Connection to the remote tracking server is refused
Fix: Check that the MLFLOW_TRACKING_URI environment variable is set correctly.
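A quick way to sanity-check your setup from Python (search_experiments requires MLflow 2.x; older releases use list_experiments instead):
import mlflow
from mlflow.tracking import MlflowClient

print(mlflow.get_tracking_uri())    # should point at the server you expect

client = MlflowClient()
print(client.search_experiments())  # fails fast if the server is unreachable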
FAQs
Q1: Do I really need MLflow if I’m the only one working?
A: For a quick one-off project? Probably not. But MLflow starts paying off as soon as you run even a handful of experiments, and once you get used to it, you’ll want it everywhere.
Q2: Is MLflow just for data scientists?
A: No. Everyone benefits: ML engineers, DevOps people, and product managers. Anyone working on ML projects can use MLflow.
Q3: Is it possible to switch from MLflow to a different tool later?
A: Yes. MLflow keeps data in a standard way. If you need to, you can export your runs and move them to another place (but you probably won’t want to).
Q4: What kind of storage do I need for MLflow?
A: It depends on what you’re logging. A normal run could be between 1 and 10 MB. 1,000 runs equal 1 to 10 GB. If you use S3 or cloud storage, plan your budget accordingly.
Q5: Is it safe enough to use open-source MLflow in production?
A: For internal teams? Yes. For highly sensitive data? You might want managed MLflow with enterprise-grade security features.


