
Processing Large Data with Dask and Scikit-Learn


Getting Started

Dask is a Python library for parallel computing, especially useful when handling large datasets or building efficient, scalable applications such as machine learning systems. Among its most prominent advantages is Dask's integration with the existing Python ecosystem, including support for large-scale data processing alongside scikit-learn modules in a parallel workflow. This article outlines how to use Dask to process data, even under limited memory constraints.

Step-by-Step Walkthrough

Although it is not truly massive, the California housing dataset is large enough to make a good choice for a gentle, illustrative example of how we can combine Dask and scikit-learn to process data at a reasonable scale.

Dask provides a dataframe module that mirrors many aspects of the pandas DataFrame while managing large datasets that may not fit comfortably in memory. We will use this Dask DataFrame structure to load our data from a CSV file hosted in a GitHub repository, as follows:

import dask.dataframe as dd

url = "..."  # URL of the California housing CSV
df = dd.read_csv(url)

df.head()

Overview of the California housing data

An important note here: if you want to see the "shape" of the data, that is, the number of rows and columns, the process is a little trickier than simply reading df.shape. Instead, you should do something like this:

num_rows = df.shape[0].compute()
num_cols = df.shape[1]
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

Output:

Number of rows: 20640
Number of columns: 10

Note that we used Dask's compute() explicitly to obtain the number of rows, but not the number of columns. The dataset's metadata allows Dask to find the number of columns (features) instantly, while determining the number of rows can be expensive in a distributed setting: hence the explicit compute() call, which triggers the actual computation.

Cleaning data is usually a preliminary step before building a machine learning model or estimator. Since the main focus of this article is to show how Dask can be used to process data, let's clean and prepare the dataset before moving on to that part.

One common step in data processing is dealing with missing values. With Dask, the process is as seamless as if we were using pandas. For example, the code below removes rows that contain missing values in any of their attributes:

df = df.dropna()

num_rows = df.shape[0].compute()
num_cols = df.shape[1]
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

The dataset is now reduced by more than 200 instances, leaving a total of 20433 rows.

Next, we can scale the numeric features in the dataset by applying scikit-learn's StandardScaler, or any other suitable scaling method:

from sklearn.preprocessing import StandardScaler

numeric_df = df.select_dtypes(include=["number"])
X_pd = numeric_df.drop("median_house_value", axis=1).compute()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_pd)

Importantly, note that when we chain data transformations in Dask, such as dropping rows containing missing values followed by dropping the target column "median_house_value", we should add compute() at the end of the chained operations. This is because transformations in Dask are evaluated lazily: once compute() is called, the result of the chained transformations is materialized as an in-memory pandas DataFrame. (Dask depends on pandas, so you won't need to explicitly import the pandas library in your code unless you directly call a pandas-specific function.)

What if we want to train a machine learning model? Then we must also extract the target variable "median_house_value" and apply the same principle to convert it to a pandas object:

y = df["median_house_value"]
y_pd = y.compute()

From here on, splitting the dataset into training and test sets, training a model like RandomForestRegressor, and checking its error on the test set is fully identical to the traditional workflow using pandas and scikit-learn. Since tree-based models are insensitive to feature scaling, you can use either the unscaled features (X_pd) or the scaled ones (X_scaled). Below we continue with the scaled features obtained above:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Use the scaled feature matrix produced earlier
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_pd, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}")

Output:

Wrapping Up

Dask and scikit-learn can be used together to combine flexible, parallel data processing with familiar machine learning tools, for example, to efficiently build models on large-scale data. This article demonstrated how to load, clean, prepare, and transform data with Dask alongside scikit-learn's standard in-memory tools, all while keeping the pipeline efficient when dealing with large data.

Iván Palomares Carrascosa is a leader, author, and adviser in AI, machine learning, deep learning and LLMs. He trains and guides others in integrating AI into the real world.
