
Ray: Distributed Computing for All, Part 2

This is the second installment in my two-part series on the Ray library, a Python framework created by Anyscale for distributed computing. Part 1 covers how to parallelize CPU-intensive Python tasks on your local PC by distributing the workload across all available cores, resulting in significant performance improvements. I will leave a link to Part 1 at the end of this article.

This part deals with a similar theme, except that we take distributed Python workloads to the next level by using Ray to parallelize them across multiple cloud server clusters.

If you have come this far without reading Part 1, the TL;DR is that Ray is an open-source distributed computing framework designed to make it easy to scale Python programs from laptop to cluster with only minor code changes. That alone should hopefully be enough to pique your interest. In my tests on my desktop PC, I took a straightforward Python program that counts prime numbers and reduced its runtime roughly tenfold by adding four lines of code.
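For context, the task itself is simple: a trial-division primality test applied across a range of numbers. Here is a minimal serial sketch of that idea (illustrative only; the actual code used in this series appears further down):

```python
import math

def is_prime(n: int) -> bool:
    # Trial division up to sqrt(n), checking odd divisors only.
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True

# Count primes below 100 as a quick sanity check.
print(sum(is_prime(n) for n in range(1, 100)))  # → 25
```

The parallel version later in the article simply splits the search range into chunks and farms each chunk out as a Ray task.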

Where can you run Ray clusters?

Ray clusters can be deployed on the following:

  • AWS and GCP, although unofficial integrations also exist for other providers, such as Azure
  • Anyscale, a fully managed platform built by the creators of Ray
  • Kubernetes, via the official KubeRay project

What is required

To follow along, you'll need a few things set up beforehand. I will use AWS for my demo, as I have an existing account there; however, I expect setup on other cloud providers and platforms to be very similar. You will need:

  • Authentication set up via your chosen provider's cloud CLI
  • A default VPC and at least one associated public subnet that assigns publicly accessible IP addresses
  • An SSH key pair (.pem file) downloaded to your local system so that Ray (and you) can connect to the nodes in your cluster
  • Sufficient service quotas to satisfy the requested number of nodes and vCPUs in any cluster you set up

If you want to test your Ray code locally before submitting it to the cluster, you'll also need to install the Ray library. We can do that using pip.

$ pip install ray

I will run everything from the WSL2 Ubuntu shell on my Windows desktop.

To ensure that Ray is installed correctly, you should be able to use its command line interface. In a terminal window, type the following command.

$ ray --help

Usage: ray [OPTIONS] COMMAND [ARGS]...

Options:
  --logging-level TEXT   The logging level threshold, choices=['debug',
                         'info', 'warning', 'error', 'critical'],
                         default='info'
  --logging-format TEXT  The logging format.
                         default="%(asctime)s\t%(levelname)s
                         %(filename)s:%(lineno)s -- %(message)s"
  --version              Show the version and exit.
  --help                 Show this message and exit.

Commands:
  attach               Create or attach to a SSH session to a Ray cluster.
  check-open-ports     Check open ports in the local Ray cluster.
  cluster-dump         Get log data from one or more nodes.
...
...
...

If you don't see this, something went wrong, and you should double-check the output of your install command.

Assuming everything is OK, we are ready to go.

One last important point, though. Creating resources such as compute clusters in a cloud provider like AWS will incur costs, so it is important to keep this in mind. The good news is that Ray has a built-in command that will tear down any infrastructure it creates, but to be safe, you should double-check that no unused and potentially costly resources are left running by mistake.

Our Python code example

The first step is to modify our existing Ray code from Part 1 to run on a cluster. Here is the original code for your reference. Remember that we are trying to count the number of prime numbers within a certain range of numbers.

import math
import time

# -----------------------------------------
# Change No. 1
# -----------------------------------------
import ray
ray.init()

def is_prime(n: int) -> bool:
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    r = int(math.isqrt(n)) + 1
    for i in range(3, r, 2):
        if n % i == 0:
            return False
    return True

# -----------------------------------------
# Change No. 2
# -----------------------------------------
@ray.remote(num_cpus=1)  # pure-Python loop → 1 CPU per task
def count_primes(a: int, b: int) -> int:
    c = 0
    for n in range(a, b):
        if is_prime(n):
            c += 1
    return c

if __name__ == "__main__":
    A, B = 10_000_000, 20_000_000
    total_cpus = int(ray.cluster_resources().get("CPU", 1))

    # Start "chunky"; we can sweep this later
    chunks = max(4, total_cpus * 2)
    step = (B - A) // chunks

    print(f"nodes={len(ray.nodes())}, CPUs~{total_cpus}, chunks={chunks}")
    t0 = time.time()
    refs = []
    for i in range(chunks):
        s = A + i * step
        e = s + step if i < chunks - 1 else B
        # -----------------------------------------
        # Change No. 3
        # -----------------------------------------
        refs.append(count_primes.remote(s, e))

    # -----------------------------------------
    # Change No. 4
    # -----------------------------------------
    total = sum(ray.get(refs))

    print(f"total={total}, time={time.time() - t0:.2f}s")
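The chunking logic in the main block splits the interval [A, B) into contiguous half-open sub-ranges, with the final chunk absorbing any remainder. Here is a standalone sanity check of that arithmetic (the CPU count is assumed for illustration; on a real cluster it comes from `ray.cluster_resources()`):

```python
# Verify the chunk boundaries cover [A, B) exactly, with no gaps or overlaps.
A, B = 10_000_000, 20_000_000
total_cpus = 32                      # assumed value for this check
chunks = max(4, total_cpus * 2)      # same "chunky" heuristic as above
step = (B - A) // chunks

ranges = []
for i in range(chunks):
    s = A + i * step
    e = s + step if i < chunks - 1 else B
    ranges.append((s, e))

assert ranges[0][0] == A and ranges[-1][1] == B
assert all(ranges[i][1] == ranges[i + 1][0] for i in range(chunks - 1))
print(f"chunks={chunks}, step={step}, covered={sum(e - s for s, e in ranges)}")
```

Creating roughly twice as many chunks as CPUs keeps all cores busy even when some chunks finish faster than others.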

What changes are needed to run it on a cluster? The answer: just one small change is necessary.

Change 

ray.init() 

to

ray.init(address="auto")

That's one of Ray's beauties: the same code runs virtually unchanged on your local PC or anywhere else you care to run it, including large, multi-server clusters.

Setting up our cluster

In the cloud, a Ray cluster consists of a head node and one or more worker nodes. In AWS, all of these nodes are EC2 instances. Ray clusters can be fixed-size, or they can scale up and down automatically based on the resources requested by the applications running on the cluster. The head node is started first, and the worker nodes are configured with its address to form the cluster. If auto-scaling is enabled, worker nodes are added or removed automatically based on application load, with idle workers terminated after a user-specified timeout (5 minutes by default).

Ray uses YAML files to configure clusters. A YAML file is a plain text file with a human-readable, indentation-based syntax that is commonly used for program configuration.

Here is the YAML file I will use to set up my cluster. I found that the closest EC2 instance to my desktop PC, in terms of CPU core count and performance, was the c7g.8xlarge. For simplicity, the head node uses the same instance type as the workers, but you can mix and match different EC2 types if you like.

cluster_name: ray_test

provider:
  type: aws
  region: eu-west-1
  availability_zone: eu-west-1a

auth:
  # For Amazon Linux AMIs the SSH user is 'ec2-user'.
  # If you switch to an Ubuntu AMI, change this to 'ubuntu'.
  ssh_user: ec2-user
  ssh_private_key: ~/.ssh/ray-autoscaler_eu-west-1.pem

max_workers: 10
idle_timeout_minutes: 10

head_node_type: head_node

available_node_types:
  head_node:
    node_config:
      InstanceType: c7g.8xlarge
      ImageId: ami-06687e45b21b1fca9
      KeyName: ray-autoscaler_eu-west-1

  worker_node:
    min_workers: 5
    max_workers: 5
    node_config:
      InstanceType: c7g.8xlarge
      ImageId: ami-06687e45b21b1fca9
      KeyName: ray-autoscaler_eu-west-1
      InstanceMarketOptions:
        MarketType: spot

# =========================
# Setup commands (run on head + workers)
# =========================
setup_commands:
  - |
    set -euo pipefail

    have_cmd() { command -v "$1" >/dev/null 2>&1; }
    have_pip_py() {
      python3 -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec("pip") else 1)'
    }

    # 1) Ensure Python 3 is present
    if ! have_cmd python3; then
      if have_cmd dnf; then
        sudo dnf install -y python3
      elif have_cmd yum; then
        sudo yum install -y python3
      elif have_cmd apt-get; then
        sudo apt-get update -y
        sudo apt-get install -y python3
      else
        echo "No supported package manager found to install python3." >&2
        exit 1
      fi
    fi

    # 2) Ensure pip exists
    if ! have_pip_py; then
      python3 -m ensurepip --upgrade >/dev/null 2>&1 || true
    fi
    if ! have_pip_py; then
      if have_cmd dnf; then
        sudo dnf install -y python3-pip || true
      elif have_cmd yum; then
        sudo yum install -y python3-pip || true
      elif have_cmd apt-get; then
        sudo apt-get update -y || true
        sudo apt-get install -y python3-pip || true
      fi
    fi
    if ! have_pip_py; then
      curl -fsS https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py
      python3 /tmp/get-pip.py
    fi

    # 3) Upgrade packaging tools and install Ray
    python3 -m pip install -U pip setuptools wheel
    python3 -m pip install -U "ray[default]"

Here is a brief description of each key section of the YAML file.

cluster_name: Assigns a name to the cluster, allowing Ray to track and manage 
it separately from others.

provider:  Specifies which cloud to use (AWS here), along with the region and 
availability zone for launching instances.

auth:  Defines how Ray connects to instances over SSH - the user name and the 
private key used for authentication.

max_workers:  Sets the maximum number of worker nodes Ray can scale up to when 
more compute is needed.

idle_timeout_minutes:  Tells Ray how long to wait before automatically terminating 
idle worker nodes.

available_node_types:  Describes the different node types (head and workers), their 
instance sizes, AMI images, and scaling limits.

head_node_type:  Identifies which of the node types acts as the cluster's controller
(the head node).

setup_commands:  Lists shell commands that run once on each node when it's first 
created, typically to install software or set up the environment.
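Given the node types above, a quick back-of-the-envelope capacity check, assuming a c7g.8xlarge provides 32 vCPUs (verify the count for your instance type in the EC2 documentation):

```python
# Rough cluster capacity implied by the YAML: 1 head + 5 workers,
# each an assumed 32-vCPU c7g.8xlarge instance.
VCPUS_PER_NODE = 32          # assumption; check your instance type
head_nodes = 1
worker_nodes = 5             # min_workers == max_workers == 5

total_nodes = head_nodes + worker_nodes
total_vcpus = total_nodes * VCPUS_PER_NODE
print(f"nodes={total_nodes}, vCPUs={total_vcpus}")  # → nodes=6, vCPUs=192
```

Because the head node also advertises CPUs by default, the cluster's schedulable CPU pool includes it alongside the five workers.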

To start creating a cluster, run this ray command from the terminal.

$ ray up -y ray_test.yaml

Ray will do its thing, create all the necessary infrastructure, and after a few minutes, you should see something like this in your terminal window.

...
...
...
Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.0.9.248:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://10.0.9.248:8265' ray job submit --working-dir . -- python my_script.py

  See the Ray Jobs documentation
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    10.0.9.248:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.
Shared connection to 108.130.38.255 closed.
  New status: up-to-date

Useful commands:
  To terminate the cluster:
    ray down /mnt/c/Users/thoma/ray_test.yaml

  To retrieve the IP address of the cluster head:
    ray get-head-ip /mnt/c/Users/thoma/ray_test.yaml

  To port-forward the cluster's Ray Dashboard to the local machine:
    ray dashboard /mnt/c/Users/thoma/ray_test.yaml

  To submit a job to the cluster, port-forward the Ray Dashboard in another terminal and run:
    ray job submit --address http://localhost:8265 --working-dir . -- python my_script.py

  To connect to a terminal on the cluster head for debugging:
    ray attach /mnt/c/Users/thoma/ray_test.yaml

  To monitor autoscaling:
    ray exec /mnt/c/Users/thoma/ray_test.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'

Executing a Ray function on a cluster

At this stage, the cluster has been created, and we are ready to send our Ray function to it. To give the cluster something more substantial to work with, I increased the range of the prime search in my code from 10,000,000–20,000,000 to 10,000,000–60,000,000. On my local desktop, Ray ran this in 18 seconds.

I waited a short time for all cluster nodes to fully boot, then ran the code on the cluster with this command.

$ ray exec ray_test.yaml 'python3 ~/ray_test.py'

Here is my output.

(base) tom@tpr-desktop:/mnt/c/Users/thoma$ ray exec ray_test2.yaml 'python3 ~/primes_ray.py'
2025-11-01 13:44:22,983 INFO util.py:389 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 52.213.155.130
Warning: Permanently added '52.213.155.130' (ED25519) to the list of known hosts.
2025-11-01 13:44:26,469 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.5.86:6379...
2025-11-01 13:44:26,477 INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at 
nodes=6, CPUs~192, chunks=384
(autoscaler +2s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +2s) No available node types can fulfill resource requests {'CPU': 1.0}*160. Add suitable node types to this cluster to resolve this issue.
total=2897536, time=5.71s
Shared connection to 52.213.155.130 closed.

As you can see, the run on the cluster took just over 5 seconds, so the cluster did the same job in less than a third of the time it took on my local PC. Not too shabby.
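As a back-of-the-envelope check of the gain (runtimes are from my measurements; the local vCPU count is an assumption about my desktop, matching the 32-vCPU instance type chosen earlier):

```python
local_s, cluster_s = 18.0, 5.71          # measured runtimes
local_vcpus, cluster_vcpus = 32, 192     # local count assumed for illustration

speedup = local_s / cluster_s
# Efficiency compares total CPU-seconds consumed in each setting.
efficiency = (local_s * local_vcpus) / (cluster_s * cluster_vcpus)
print(f"speedup ≈ {speedup:.2f}x, parallel efficiency ≈ {efficiency:.0%}")
```

An efficiency well below 100% is expected here: the job is short, so scheduling overhead, network round-trips, and uneven chunk sizes eat into the theoretical 6x gain.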

When you are done with your cluster, use the following Ray command to tear it down.

$ ray down -y ray_test.yaml

As I mentioned before, you should always double-check your account afterwards to make sure this command worked as expected.

Summary

This article, the second in a two-part series, showed how to run CPU-intensive Python code on cloud-based clusters using the Ray library. By spreading the workload across all available vCPUs, Ray delivered significantly faster runtimes.

I explained and demonstrated how to define a cluster using a YAML file and how to use the Ray command line interface to create the cluster and submit code to it.

Using AWS as an example platform, I took the Ray Python code that had been running on my local PC and ran it, almost unchanged, on a 6-node EC2 cluster. This delivered a significant (roughly 3x) runtime improvement.

Finally, I showed how to use the ray command line tool to tear down the AWS cluster infrastructure created by Ray.

If you haven't read my first article in this series, click the link below to check it out.

Please note that, other than being a part-time user of their services, I am not affiliated with Anyscale, AWS, or any other organization mentioned in this article.
