Machine Learning

AI on Multiple GPUs: Understanding the Host and Device Paradigm

This article is part of a series about distributed AI on multiple GPUs:

  • Part 1: Understanding the Host and Device Paradigm (This article)
  • Part 2: Point-to-Point and Combined Operations (coming soon)
  • Part 3: How GPUs Interact (coming soon)
  • Part 4: Gradient Collection & Distributed Data Parallelism (DDP) (coming soon)
  • Part 5: ZeRO (coming soon)
  • Part 6: Tensor Parallelism (coming soon)

Introduction

This guide explains the basic concepts of how the CPU and a discrete graphics card (GPU) work together. It is a high-level presentation designed to help you build a mental model of the host-device paradigm. We will focus mainly on NVIDIA GPUs, which are the most widely used for offloading AI work.

For integrated GPUs, such as those found in Apple Silicon chips, the architecture is slightly different and will not be covered in this post.

The Big Picture: Host and Device

The most important concept to grasp is the relationship between the Host and the Device.

  • Host: This is your CPU. It runs the program and executes your Python script line by line. The Host is the commander; it handles all the logic and tells the Device what to do.
  • Device: This is your GPU. It is a powerful but specialized coprocessor designed for highly parallel computing. The Device is an accelerator; it does nothing until the Host gives it a job.

Your program always starts on the CPU. When you want the GPU to perform a task, such as multiplying two large matrices, the CPU sends instructions and data to the GPU.

CPU-GPU interaction

The host communicates with the device through a queuing system.

  1. The CPU issues commands: Your script, running on the CPU, encounters a line of code intended for the GPU (e.g. tensor.to('cuda')).
  2. The command is enqueued: The CPU does not wait. It simply places this command in a special GPU to-do list called a CUDA stream – more on this in the next section.
  3. The CPU moves on: The CPU does not wait for the GPU to actually complete the work; the host moves on to the next line of your script. This is called asynchronous execution, and it is the key to achieving high performance. While the GPU is busy crunching numbers, the CPU can work on other tasks, such as preparing the next batch of data.
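The host/stream relationship can be sketched with a toy model in plain Python. This is a simplification, not real CUDA: the queue stands in for a CUDA stream, and the worker thread plays the role of the GPU draining it.

```python
import queue
import threading
import time

stream = queue.Queue()  # stands in for a CUDA stream (an ordered to-do list)
results = []

def device_worker():
    # "GPU": pops commands off the stream and executes them in order
    while True:
        task = stream.get()
        if task is None:  # sentinel: no more work
            break
        time.sleep(0.05)          # pretend the kernel takes a while
        results.append(task * 2)  # pretend computation

worker = threading.Thread(target=device_worker)
worker.start()

# "CPU": enqueue three commands; this returns almost immediately,
# long before the work itself is done (asynchronous execution)
start = time.perf_counter()
for x in [1, 2, 3]:
    stream.put(x)
enqueue_time = time.perf_counter() - start

stream.put(None)  # tell the worker we are done
worker.join()     # like torch.cuda.synchronize(): wait for the queue to drain

print(enqueue_time < 0.05)  # enqueueing is far cheaper than the work itself
print(results)              # tasks ran in submission order: [2, 4, 6]
```

Note how the host's loop finishes in microseconds while the "device" is still working – exactly the decoupling that asynchronous execution buys you.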

CUDA Streams

A CUDA stream is an ordered queue of GPU tasks. Tasks submitted to a single stream execute in order, one after another. However, tasks in different streams can run concurrently – the GPU can handle multiple independent tasks at the same time.

By default, all PyTorch GPU operations are enqueued on the current stream (usually the default stream, which is created automatically). This is simple and predictable: every task waits for the previous one to finish before starting. In most code, you never notice it. But it leaves performance on the table if you have work that could overlap.

Multiple Streams: Concurrency

A classic use case for multiple streams is overlapping computation with data transfer. While the GPU is processing batch N, you can simultaneously copy batch N+1 from CPU RAM to GPU VRAM:

Stream 0 (compute): [process batch 0]────[process batch 1]───
Stream 1 (data):   ────[copy batch 1]────[copy batch 2]───

This pipeline is possible because computation and data transfer occur on separate hardware units within the GPU, allowing for true parallelism. In PyTorch, you create streams and enqueue work on them with context managers:

import torch

# Assumes model, current_batch (already on the GPU), and next_batch_cpu
# (a pinned CPU tensor) are defined elsewhere
compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()

with torch.cuda.stream(transfer_stream):
    # Enqueue the transfer on transfer_stream
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)

with torch.cuda.stream(compute_stream):
    # This runs concurrently with the transfer above
    output = model(current_batch)

Note the non_blocking=True flag on .to(). Without it, the transfer blocks the CPU thread even if you intended it to run in parallel. For the copy to be truly asynchronous, the source CPU tensor must also live in pinned (page-locked) memory.

Synchronization Between Streams

Since streams are independent, you need to express dependencies between them explicitly when one depends on another. A blunt instrument is:

torch.cuda.synchronize()  # waits for ALL streams on the device to finish

A more surgical technique uses CUDA events. An event marks a point in a stream, and other streams can wait on it without stalling the CPU thread:

event = torch.cuda.Event()

with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()  # mark: transfer is done

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)  # don't start until transfer completes
    output = model(next_batch)

This is more efficient than stream.synchronize() because it only stalls the dependent stream on the GPU side – the CPU thread remains free to continue its work.

In everyday PyTorch training code you rarely need to manage streams manually. But features like DataLoader(pin_memory=True) and data prefetching rely heavily on this machinery under the hood. Understanding streams helps you see why those settings exist and gives you the tools to diagnose subtle performance issues when they arise.
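To make the prefetching idea concrete, here is a minimal sketch of a loader wrapper that copies the next batch on a side stream while the current one is being processed. This is an illustration under stated assumptions (CUDA available, the wrapped loader yields pinned CPU tensors, and the name CUDAPrefetcher is our own), not a library API:

```python
import torch

class CUDAPrefetcher:
    """Sketch: overlap host-to-device copies with compute via a side stream.

    Assumes CUDA is available and `loader` yields pinned CPU tensors.
    """
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()  # side stream for copies
        self._preload()

    def _preload(self):
        # Enqueue the copy of the next batch on the side stream
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            self.next_batch = batch.to('cuda', non_blocking=True)

    def __iter__(self):
        while self.next_batch is not None:
            # Make the default (compute) stream wait for the pending copy
            torch.cuda.current_stream().wait_stream(self.stream)
            batch = self.next_batch
            self._preload()  # immediately start copying the batch after this one
            yield batch
```

This is essentially the double-buffering pattern from the pipeline diagram above: compute on stream 0, copy on stream 1, with wait_stream() providing the cross-stream dependency.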

PyTorch Tensors

PyTorch is a powerful framework that abstracts a lot of information, but this abstraction can sometimes hide what's going on under the hood.

When you create a PyTorch tensor, it has two parts: metadata (such as its shape and data type) and the actual numeric data. So if you run something like t = torch.randn(100, 100, device=device), the tensor's metadata is stored in the host's RAM, while its data is stored in the GPU's VRAM.

This distinction is important. If you run print(t.shape), the CPU can access this information quickly because the metadata is already in its RAM. But what happens when you run print(t) and ask for the actual values?
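A minimal sketch of the metadata/data split (falling back to the CPU when no GPU is present, so the snippet runs anywhere; on a CPU-only machine both accesses are local, of course):

```python
import torch

# Use the GPU when present; otherwise fall back to the CPU so the sketch runs
device = 'cuda' if torch.cuda.is_available() else 'cpu'

t = torch.randn(100, 100, device=device)

# Metadata lives in host RAM: these calls are cheap and never touch the device
print(t.shape)  # torch.Size([100, 100])
print(t.dtype)  # torch.float32

# Data lives on the device: .item() must copy the value back to host RAM,
# which forces the CPU to block until the GPU has finished (a sync point)
total = t.sum().item()
print(isinstance(total, float))  # True
```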

Host Device Synchronization

Accessing GPU data from the CPU can cause a Host-Device Synchronization, a common performance bottleneck. This happens whenever the CPU needs a result from the GPU that is not yet available in the CPU's RAM.

For example, consider a line print(gpu_tensor) which prints a tensor that was computed by the GPU. The CPU cannot print the tensor's values until the GPU has completed all the calculations that produce the final result. When the script reaches this line, the CPU is forced to block, which means it stops and waits for the GPU to finish. Only after the GPU finishes its work and copies the data from its VRAM to the CPU's RAM can the CPU continue.

As another example, what is the difference between torch.randn(100, 100).to(device) and torch.randn(100, 100, device=device)? The first is inefficient because it creates the data on the CPU and then transfers it to the GPU. The second is more efficient because it creates the tensor directly on the GPU; the CPU only sends the creation command.
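The two forms side by side (a sketch; the performance gap only materializes on a real GPU, so the snippet falls back to CPU purely to stay runnable):

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Inefficient on a GPU: allocate and fill in CPU RAM, then copy over PCIe
a = torch.randn(100, 100).to(device)

# Efficient: the CPU only enqueues a "create" command; the GPU fills its own
# memory and no host-to-device copy is needed
b = torch.randn(100, 100, device=device)

print(a.shape == b.shape)  # True: same result, very different journey
```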

These synchronization points can significantly affect performance. Effective GPU programming means minimizing them so that both the Host and the Device stay as busy as possible. After all, you want your GPUs to go brrrr.

Image by author: generated by ChatGPT

Scaling Up: Distributed Computing and Ranks

Training large models, such as Large-scale Language Models (LLMs), often requires more computing power than a single GPU can provide. Linking work to multiple GPUs brings you into the world of distributed computing.

In this context, a new and important concept emerges: the rank.

  • Each rank is a CPU process assigned to a single device (GPU) and given a unique ID. If you run the training script on two GPUs, you launch two processes: one with rank=0 and another with rank=1.

This means that you launch two separate instances of your Python script. On a single machine with multiple GPUs (one node), these processes run on the same CPU but remain independent, sharing no memory or state. Rank 0 commands its assigned GPU (cuda:0), while rank 1 commands the other GPU (cuda:1). Although both ranks run the same code, you can use the rank ID to assign different tasks to each GPU, such as making each one process a different part of the data (we'll see examples of this in the next post in this series).
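The partitioning logic each rank would run can be sketched without launching real processes. This is a toy illustration: in practice each rank is a separate process and gets its ID from the launcher (e.g. torchrun) rather than from a loop.

```python
# Each rank runs the same code but uses its ID to pick a different data shard.
# Here we simulate both ranks in one loop; in real distributed training each
# iteration would be a separate process driving its own GPU.
data = list(range(8))   # pretend dataset of 8 samples
world_size = 2          # two GPUs -> two ranks

shards = {}
for rank in range(world_size):
    # Round-robin split: rank 0 takes even indices, rank 1 takes odd ones
    shards[rank] = data[rank::world_size]

print(shards[0])  # [0, 2, 4, 6]
print(shards[1])  # [1, 3, 5, 7]
# Together the shards cover the whole dataset with no overlap
```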

Conclusion

Congratulations on reading to the end! In this post, you learned about:

  • Host/Device relationship
  • Asynchronous execution
  • CUDA Streams and how they enable simultaneous GPU work
  • Host-Device Synchronization

In the next blog post, we'll dive deeper into Point-to-Point and Collective Operations, which enable multiple GPUs to coordinate complex workflows like distributed neural network training.
