Ray or dask? A practical guide for data scientists


Image by author | Ideogram
As data scientists, we often work with large datasets or complex models that take a long time to run. To save time, we can run jobs in parallel across many cores or many machines. Two popular Python libraries for this are Dask and Ray. Both help accelerate data processing and model training, but they are suited to different types of jobs.
In this article, we will explain what Ray and Dask are and when to choose each.
What Are Dask and Ray?
Dask is a library for working with large datasets. It is designed to feel familiar to users of pandas, NumPy, or scikit-learn. Dask breaks data and jobs into small parts and runs them in parallel. This makes it a natural fit for data scientists who want to scale their data analysis without learning new concepts.
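To see how Dask breaks jobs into small parts and runs them in parallel, here is a minimal sketch using `dask.delayed` (assuming Dask is installed, e.g. `pip install "dask[complete]"`; the `square` function is just an illustrative stand-in):

```python
import dask

@dask.delayed
def square(x):
    # A stand-in for any expensive computation
    return x * x

# Build a task graph lazily; nothing runs yet
tasks = [square(i) for i in range(4)]
total = dask.delayed(sum)(tasks)

# Trigger parallel execution of the whole graph
result = total.compute()
print(result)  # 0 + 1 + 4 + 9 = 14
```

The key idea is lazy evaluation: Dask records the computation as a task graph and only executes it, in parallel, when you call `.compute()`.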
Ray is a general-purpose tool that helps you build and run distributed applications. It is especially strong in machine learning and AI workloads.
Ray also has additional libraries built on top of it, such as:
- Ray Tune for hyperparameter tuning in machine learning
- Ray Train for training models across many GPUs
- Ray Serve for deploying models as web services
Ray is a good choice if you want to build a distributed machine learning system or deploy AI applications that need to coordinate complex tasks at scale.
Feature Comparison
Here is a comparison of Dask and Ray based on their primary attributes:

| Feature | Dask | Ray |
|---|---|---|
| Core abstractions | DataFrames, arrays, delayed tasks | Distributed tasks, actors |
| Best for | Data processing, machine learning pipelines | Distributed training, serving, and running machine learning systems |
| Ease of use | Familiar to pandas/NumPy users | More boilerplate, more control |
| Ecosystem | Integrates with scikit-learn, XGBoost | Built-in libraries: Tune, Serve, RLlib |
| Scalability | Very good for batch processing | Excellent, with more control and flexibility |
| Scheduling | Work-stealing task scheduler | Decentralized, actor-based scheduler |
| Cluster management | Native, or via Kubernetes, YARN | Ray Dashboard, Kubernetes, AWS, GCP |
| Community / maturity | Older, mature, widely adopted | Fast growing, strong machine learning support |
When Should You Use Each?
Choose Dask if you:
- Already use pandas/NumPy and want to scale them
- Process tabular data or arrays
- Run batch ETL or feature engineering
- Need lazy evaluation of dataframe or array operations
Choose Ray if you:
- Need to run many independent Python tasks in parallel
- Want to build machine learning pipelines, serve models, or manage long-running tasks
- Need microservice-like scaling for individual tasks
Ecosystem Tools
Both libraries provide or support a set of tools covering the data science lifecycle, but with different emphases:

| Task | Dask | Ray |
|---|---|---|
| DataFrames | dask.dataframe | Modin (runs on Ray or Dask) |
| Arrays | dask.array | No native support; rely on NumPy |
| Hyperparameter tuning | Manual or with dask-ml | Ray Tune (advanced features) |
| Machine learning | dask-ml | Ray Train, Ray Tune, Ray AIR |
| Model serving | Flask / FastAPI setup | Ray Serve |
| Reinforcement learning | Not supported | RLlib |
| Dashboard | Built in, very detailed | Built in, simplified |
Real-World Scenarios
// Cleaning large datasets and feature engineering
Use Dask.
Why? Dask integrates well with pandas and NumPy. Many data teams already use these tools. If your data is too big to fit in memory, Dask can split it into smaller partitions and process them in parallel. This helps with tasks such as cleaning data and creating new features.
For example:
import dask.dataframe as dd
import numpy as np
df = dd.read_csv('s3://data/large-dataset-*.csv')
df = df[df['amount'] > 100]
df['log_amount'] = df['amount'].map_partitions(np.log)
df.to_parquet('s3://processed/output/')
This code reads many CSV files from an S3 bucket in parallel using Dask. It filters the rows where the amount column is larger than 100, applies a log transform, and saves the result as Parquet files.
// Parallel hyperparameter tuning for machine learning models
Use Ray.
Why? Ray Tune is good at trying different settings when training machine learning models. It integrates with tools like PyTorch and XGBoost, and it can stop bad runs early to save time.
For example:
from ray import tune
from ray.tune.schedulers import ASHAScheduler
def train_fn(config):
    # Model training logic here
    ...

tune.run(
    train_fn,
    config={"lr": tune.grid_search([0.01, 0.001, 0.0001])},
    scheduler=ASHAScheduler(metric="accuracy", mode="max")
)
This code defines a training function and uses Ray Tune to test different learning rates automatically. It schedules the runs and picks the best configuration using the ASHA scheduler.
// Distributed computation on large arrays
Use Dask.
Why? Dask arrays help when working with large numerical datasets. They split the array into blocks and process the blocks in parallel.
For example:
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x.mean(axis=0).compute()
This code creates a large random array split into chunks that can be processed in parallel. It then computes the mean of each column using Dask.
// Building a scalable machine learning service
Use Ray.
Why? Ray is not just for model training but also for serving and managing live models. With Ray Serve, you can deploy models to production, run them in parallel, and scale individual components.
For example:
from ray import serve

@serve.deployment
class ModelDeployment:
    def __init__(self):
        # load_model() stands in for your own model-loading logic
        self.model = load_model()

    def __call__(self, request_body):
        data = request_body
        return self.model.predict([data])[0]

serve.run(ModelDeployment.bind())
This code defines a class that loads a machine learning model and exposes it as an API using Ray Serve. The class receives a request, makes a prediction with the model, and returns the result.
Final Recommendations

| Use case | Recommended tool |
|---|---|
| Scalable data analysis (pandas-style) | Dask |
| Distributed training of large models | Ray |
| Hyperparameter optimization | Ray |
| Out-of-core DataFrame computation | Dask |
| Real-time machine learning serving | Ray |
| Custom pipelines with high concurrency | Ray |
| Integration with the PyData stack | Dask |
Wrapping Up
Ray and Dask are both tools that help data scientists process large amounts of data and run programs faster. Ray is better suited to jobs that need a lot of flexibility, such as distributed machine learning projects. Dask is a good fit if you want to work with large datasets using tools similar to pandas or NumPy.
Which one you choose depends on what your project requires and the type of data you have. It is a good idea to try both on small examples to see which one fits your job.
Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master's degree in Computer Science from the University of Liverpool.



