How to Log Your Data with MLflow
By Jack Chang, January 2025
Setting up the MLflow server locally is straightforward. Run the following command:
mlflow server --host 127.0.0.1 --port 8080
Then set the tracking URI in your code:
mlflow.set_tracking_uri("http://127.0.0.1:8080")
For more advanced configuration, see the MLflow documentation.
In this article, we use the California Housing dataset (CC BY license). However, you can apply the same principles to log and track any dataset of your choice.
For more information on the California housing dataset, see this document.
mlflow.data.dataset.Dataset
Before getting into dataset logging, tracking, and retrieval, it's important to understand the concept of datasets in MLflow. MLflow provides the mlflow.data.dataset.Dataset
object, which represents datasets used with MLflow Tracking.
class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)
This class comes with several important properties:
- source (required): the data source of your dataset, given as an mlflow.data.dataset_source.DatasetSource object.
- digest (the fingerprint of your dataset) and name (the name of your dataset), both of which can be set via constructor parameters.
- schema and profile, which describe the dataset's structure and statistical properties.
- Information about the dataset's source, such as where it is stored.
You can easily convert a dataset to a dictionary using to_dict() or to a JSON string using to_json().
Support for popular dataset formats
MLflow makes it easy to work with different types of datasets through specialized classes that extend the core mlflow.data.dataset.Dataset. At the time of writing, these are some of the dataset classes supported by MLflow:
- pandas: mlflow.data.pandas_dataset.PandasDataset
- NumPy: mlflow.data.numpy_dataset.NumpyDataset
- Spark: mlflow.data.spark_dataset.SparkDataset
- Hugging Face: mlflow.data.huggingface_dataset.HuggingFaceDataset
- TensorFlow: mlflow.data.tensorflow_dataset.TensorFlowDataset
- Evaluation datasets: mlflow.data.evaluation_dataset.EvaluationDataset
All these classes come with a convenient mlflow.data.from_* API for loading datasets directly into MLflow. This makes it easy to create and manage datasets, regardless of their underlying format.
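For example, here is a minimal sketch of the NumPy variant. The arrays and the 'synthetic_array' name are made up for illustration; only from_numpy() and its targets/name parameters come from the MLflow API.

import numpy as np
import mlflow.data

# Toy in-memory arrays standing in for real features and targets
features = np.random.rand(100, 4)
targets = np.random.rand(100)

numpy_dataset = mlflow.data.from_numpy(features, targets=targets, name='synthetic_array')
print(numpy_dataset.digest)  # fingerprint computed from the array contents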
mlflow.data.dataset_source.DatasetSource
The mlflow.data.dataset_source.DatasetSource class is used to represent the origin of a dataset in MLflow. When you create an mlflow.data.dataset.Dataset object, the source parameter can be specified either as a string (e.g., a file path or URL) or as an instance of the mlflow.data.dataset_source.DatasetSource class.
class mlflow.data.dataset_source.DatasetSource
If a string is passed as the source, MLflow internally calls the resolve_dataset_source function. This function iterates through a predefined list of data sources and DatasetSource classes to determine the most appropriate source type. However, MLflow's ability to accurately resolve the source is limited, especially when the candidate_sources argument (a list of possible sources) is set to None, which is the default.
In cases where the DatasetSource class cannot resolve the raw source, an MLflow exception is raised. As a best practice, I recommend explicitly creating and passing an instance of the mlflow.data.dataset_source.DatasetSource class to define the origin of the dataset.
class HTTPDatasetSource(DatasetSource)
class DeltaDatasetSource(DatasetSource)
class FileSystemDatasetSource(DatasetSource)
class HuggingFaceDatasetSource(DatasetSource)
class SparkDatasetSource(DatasetSource)
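As a minimal sketch of the difference between implicit resolution and an explicit source (the DataFrame and URL below are placeholders, not this article's dataset):

import pandas as pd
import mlflow.data
from mlflow.data.http_dataset_source import HTTPDatasetSource

df = pd.DataFrame({'x': [1, 2, 3]})  # toy DataFrame for illustration

# Passing a raw string leaves the source type up to resolve_dataset_source;
# this can raise an MlflowException if no DatasetSource candidate matches.
implicit = mlflow.data.from_pandas(df, source='https://example.com/data.csv')

# Constructing the DatasetSource explicitly removes the guesswork.
explicit = mlflow.data.from_pandas(df, source=HTTPDatasetSource(url='https://example.com/data.csv'))
print(type(explicit.source).__name__)  # HTTPDatasetSource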
One of the most straightforward ways to log a dataset in MLflow is the mlflow.log_input() API. It accepts any dataset that conforms to mlflow.data.dataset.Dataset, which can be very useful when managing large-scale experiments.
Step-by-Step Guide
First, let's download the California Housing dataset and convert it to a pandas.DataFrame for easier manipulation. Here, we create a DataFrame that combines the feature data (california_data) with the target data (california_target).
from sklearn.datasets import fetch_california_housing
import pandas as pd

california_housing = fetch_california_housing()
california_data: pd.DataFrame = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
california_target: pd.DataFrame = pd.DataFrame(california_housing.target, columns=['Target'])
california_housing_df: pd.DataFrame = pd.concat([california_data, california_target], axis=1)
To log the dataset with meaningful metadata, we define several attributes, such as the URL of the data source, the name of the dataset, and the target column. These will provide useful context when retrieving the dataset later.
If we dig into the fetch_california_housing source code, we can see that the data is downloaded from https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz.
from mlflow.data.dataset_source import DatasetSource
from mlflow.data.http_dataset_source import HTTPDatasetSource

dataset_source_url: str = 'https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
dataset_source: DatasetSource = HTTPDatasetSource(url=dataset_source_url)
dataset_name: str = 'California Housing Dataset'
dataset_target: str = 'Target'
dataset_tags = {
    'description': california_housing.DESCR,
}
Once the data and metadata are defined, we can convert the pandas.DataFrame into an mlflow.data.Dataset object.
import mlflow
from mlflow.data.pandas_dataset import PandasDataset

dataset: PandasDataset = mlflow.data.from_pandas(
    df=california_housing_df, source=dataset_source, targets=dataset_target, name=dataset_name
)

print(f'Dataset name: {dataset.name}')
print(f'Dataset digest: {dataset.digest}')
print(f'Dataset source: {dataset.source}')
print(f'Dataset schema: {dataset.schema}')
print(f'Dataset profile: {dataset.profile}')
print(f'Dataset targets: {dataset.targets}')
print(f'Dataset predictions: {dataset.predictions}')
print(dataset.df.head())
Example output:
Dataset name: California Housing Dataset
Dataset digest: 55270605
Dataset source:
Dataset schema: ['MedInc': double (required), 'HouseAge': double (required), 'AveRooms': double (required), 'AveBedrms': double (required), 'Population': double (required), 'AveOccup': double (required), 'Latitude': double (required), 'Longitude': double (required), 'Target': double (required)]
Dataset profile: {'num_rows': 20640, 'num_elements': 185760}
Dataset targets: Target
Dataset predictions: None
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Target
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
Note that you can even convert the dataset into a dictionary to access additional fields such as source_type:
for k, v in dataset.to_dict().items():
    print(f"{k}: {v}")
name: California Housing Dataset
digest: 55270605
source: {"url": "https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"}
source_type: http
schema: {"mlflow_colspec": [{"type": "double", "name": "MedInc", "required": true}, {"type": "double", "name": "HouseAge", "required": true}, {"type": "double", "name": "AveRooms", "required": true}, {"type": "double", "name": "AveBedrms", "required": true}, {"type": "double", "name": "Population", "required": true}, {"type": "double", "name": "AveOccup", "required": true}, {"type": "double", "name": "Latitude", "required": true}, {"type": "double", "name": "Longitude", "required": true}, {"type": "double", "name": "Target", "required": true}]}
profile: {"num_rows": 20640, "num_elements": 185760}
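Similarly, to_json() serializes the same metadata (not the underlying data) as a single JSON string, which is convenient for persisting or transmitting it outside MLflow:

# Serialize the dataset metadata to a JSON string
dataset_json: str = dataset.to_json()
print(dataset_json)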
Now that our dataset is ready, it's time to log it to an MLflow run. This captures the dataset's metadata and makes it part of the experiment record for future reference.
with mlflow.start_run():
    mlflow.log_input(dataset=dataset, context='training', tags=dataset_tags)
🏃 View run sassy-jay-279 at: http://127.0.0.1:8080/#/experiments/0/runs/5ef16e2e81bf40068c68ce536121538c
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/0
Let's explore the dataset in the MLflow UI (http://127.0.0.1:8080). You will find your dataset listed under the default experiment. In the Datasets used section, you can view the dataset's context, in this case marked as used for training. Additionally, all relevant fields and properties of the dataset are displayed.
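As a closing sketch, here is one way to read the logged metadata back programmatically. The run ID is the one printed above, and the run.inputs.dataset_inputs accessors assume a recent MLflow version:

import mlflow

run_id = '5ef16e2e81bf40068c68ce536121538c'  # taken from the run output above
run = mlflow.get_run(run_id)

# Each DatasetInput pairs the dataset entity with its context tags
for dataset_input in run.inputs.dataset_inputs:
    print(f'Name: {dataset_input.dataset.name}')
    print(f'Digest: {dataset_input.dataset.digest}')
    print(f'Source type: {dataset_input.dataset.source_type}')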
Congratulations! You've logged your first dataset!