
Pipelining AI/ML Training Workloads with CUDA Streams

This is the ninth post in our series on performance profiling and optimization, aimed at emphasizing the critical role that runtime analysis and optimization play in machine learning development. Throughout the series we have reviewed a wide variety of tools and techniques for analyzing and optimizing the runtime performance of PyTorch-based AI/ML models. Our goal is twofold:

  1. To emphasize the importance of routinely profiling and optimizing AI/ML workloads.
  2. To demonstrate the accessibility of a wide variety of tools and techniques for analyzing and optimizing AI/ML runtime performance. You do not need to be a CUDA expert to improve your model's performance and reduce compute costs.

In this post we explore the use of CUDA streams, a powerful feature of the CUDA programming model that offers a sophisticated way to parallelize GPU operations. Although we usually model our AI/ML training workload as a single computation graph G running on the GPU, there are situations in which the graph can be decomposed into two distinct subgraphs G1 and G2 such that G = G2 ∘ G1 (i.e., the output of G1 feeds G2). In such cases, CUDA streams enable "pipelining" the graph, i.e., scheduling our training step so that G1 runs (on input batch N+1) in parallel with G2 (on the output that G1 produced for batch N). This technique is especially useful when:

  • neither subgraph fully utilizes the GPU when running on its own, and
  • the two subgraphs have similar computational cost (i.e., neither one dominates the runtime).
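
In code, the pattern looks roughly as follows. This is a schematic sketch of our own (the names pipelined_steps, g1, and g2 are illustrative, not taken from the scripts below), and it falls back to plain serialized execution when no GPU is available:

```python
import torch

def pipelined_steps(g1, g2, batches):
    """Run g2 on batch N's features while g1 processes batch N+1."""
    if not torch.cuda.is_available():
        # no GPU: plain serialized execution of G = g2(g1(x))
        return [g2(g1(x)) for x in batches]

    device = torch.device("cuda")
    s1 = torch.cuda.Stream()  # stream for the first subgraph, G1
    s2 = torch.cuda.Stream()  # stream for the second subgraph, G2
    outputs, features = [], None
    for x in list(batches) + [None]:  # one extra step to drain the pipeline
        if features is not None:
            with torch.cuda.stream(s2):
                s2.wait_stream(s1)  # wait until g1's output is ready
                outputs.append(g2(features))
        if x is not None:
            with torch.cuda.stream(s1):
                features = g1(x.to(device, non_blocking=True))
                # guard the memory of `features` while s2 consumes it
                features.record_stream(s2)
    torch.cuda.synchronize()
    return [out.cpu() for out in outputs]

# toy subgraphs: G = g2 ∘ g1
out = pipelined_steps(lambda x: x + 1, lambda x: x * 2,
                      [torch.ones(2), torch.zeros(2)])
```

On a CUDA device, the g1 call for batch N+1 is enqueued on s1 while s2 is still busy with g2 for batch N; this overlap is exactly what the rest of this post exploits.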

We will examine two common scenarios in which this kind of pipelining can be applied:

  1. Partial model training or fine-tuning:
    It is common to freeze a pre-trained model backbone (e.g., a feature encoder) and train only the model head (e.g., a decoder). Since the frozen backbone does not depend on gradients from the head, the two can be executed in parallel.
  2. Offloading data preprocessing to the GPU:
    A common remedy for bottlenecks in the input data pipeline (also known as GPU starvation) is to offload data preprocessing to the GPU. While folding the preprocessing into the model graph already improves performance, additional gains can be obtained by running it on a separate CUDA stream, in parallel with the model execution, provided that the preprocessing cost is not negligible compared with the model compute.
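
As a simple illustration of the second scenario, consider moving a normalization step out of the CPU input pipeline and onto the GPU (the helper names and normalization constants below are our own, for illustration only):

```python
import torch

# CPU path: normalization runs in the DataLoader workers
def cpu_preprocess(batch: torch.Tensor) -> torch.Tensor:
    return (batch.float() / 255.0 - 0.5) / 0.5

# GPU path: ship the raw uint8 batch to the device and normalize there
def gpu_preprocess(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    batch = batch.to(device, non_blocking=True)
    return (batch.float() / 255.0 - 0.5) / 0.5

raw = torch.randint(0, 256, (8, 3, 32, 32), dtype=torch.uint8)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# both paths produce the same values; the GPU path frees up the CPU
assert torch.allclose(cpu_preprocess(raw), gpu_preprocess(raw, device).cpu())
```

Beyond relieving the CPU, shipping uint8 tensors instead of float32 also cuts the host-to-device traffic by 4x. Part 2 below takes this a step further by running such transforms on their own CUDA stream.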

To facilitate our discussion, we will define two toy training scripts and measure their training throughput under different conditions. The experiments were run on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) using a PyTorch (2.6) Deep Learning AMI (DLAMI).

Please note: the code snippets we share are intended for demonstration purposes only; please do not rely on their correctness or robustness. The impact of using CUDA streams varies greatly with model architecture and system configuration. We encourage you to run your own profiling and evaluation before integrating CUDA streams (or any other method referred to in this series) into your workflow.

Part 1: Pipelining an Encoder-Decoder Model

Our first use case involves a CNN-based image segmentation model consisting of a fixed (pre-trained) encoder and a trainable decoder. Since the encoder weights are frozen and do not undergo backpropagation, the encoder's forward pass can be executed independently of the decoder's training step. In this section we assess the impact of pipelining these two stages using CUDA streams.

A Toy Training Experiment

We begin by defining a simple CNN-based image encoder and its corresponding decoder.

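
The code block that belongs here did not survive; the following is a minimal sketch of what the definitions might look like. The exact layer counts and channel widths are our own assumptions; the rest of the script only requires that img_size, num_classes, encoder, and decoder be defined.

```python
import torch
import torch.nn as nn

img_size = 256
num_classes = 10

# A simple convolutional encoder that downsamples the input image
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 2x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 4x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 8x downsample
    nn.ReLU(inplace=True),
)

# A decoder that upsamples the features back to a per-pixel class map
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),
)
```

The decoder's output has shape (batch, num_classes, img_size, img_size), matching the per-pixel labels produced by the dataset below.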

Next, we create a dataset of random images and per-pixel segmentation maps.

import torch
from torch.utils.data import DataLoader
from torchvision.datasets.vision import VisionDataset

# A dataset with random images and per-pixel labels
class FakeDataset(VisionDataset):
    def __init__(self):
        super().__init__(root=None)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                            dtype=torch.uint8)

        # create a random label map
        target = torch.randint(0, num_classes, (img_size, img_size))

        return img, target

    def __len__(self):
        return self.size

train_set = FakeDataset()

train_loader = DataLoader(
    dataset=train_set,
    batch_size=8,
    num_workers=8,
    pin_memory=True  # required for truly asynchronous (non_blocking) copies
)

Finally, we define the loss function, optimizer, and training loop. Note that we freeze the encoder weights and train only the decoder.

import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(decoder.parameters())

# Freeze the encoder weights
encoder.requires_grad_(False)
encoder.eval().to(device)

decoder.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True).float()
    labels = data[1].to(device=device, non_blocking=True)
    optimizer.zero_grad()
    with torch.no_grad():
        features = encoder(inputs)
    output = decoder(features)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

Our baseline training script achieves an average throughput of 83 steps per second, with GPU utilization of roughly 85%.

Pipelining the Model Execution with CUDA Streams

In the revised version of the training loop shown below, we create two CUDA streams: one for running the encoder and one for training the decoder. In each iteration, we perform two operations in parallel:

  1. Train the decoder using the image features and labels of batch N.
  2. Run the encoder on input batch N+1 to produce its image features.

encoder_stream = torch.cuda.Stream()
decoder_stream = torch.cuda.Stream()

# initialize the features to None
features = None

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device, non_blocking=True).float()
    labels_next = data[1].to(device, non_blocking=True)

    if features is not None:
        with torch.cuda.stream(decoder_stream):
            decoder_stream.wait_stream(encoder_stream)

            optimizer.zero_grad()
            output = decoder(features)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

    with torch.cuda.stream(encoder_stream):
        with torch.no_grad():
            features = encoder(inputs)
        # Mark `features` as used on decoder_stream so that its
        # memory is not reused before the decoder has consumed it
        features.record_stream(decoder_stream)

    labels = labels_next

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This change yields an average throughput of 91 steps per second, representing a 9.6% speedup. This is a meaningful improvement, especially considering that our baseline already had high GPU utilization (85%).

The Impact of Workload Properties on Pipelining

The performance gains from pipelining with CUDA streams depend heavily on the details of the training workload and the runtime environment. If the encoder is much larger than the decoder (or vice versa), pipelining may provide only a small benefit or even hurt performance. On the other hand, when the GPU is underutilized, pipelining tends to deliver the greatest gains.
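
A back-of-the-envelope model makes this concrete: if the two subgraphs take t1 and t2 seconds per step, perfect overlap reduces the step time from t1 + t2 to max(t1, t2). The helper below (our own, with made-up timings) computes the resulting upper bound on the speedup:

```python
def ideal_pipeline_speedup(t1: float, t2: float) -> float:
    # perfect overlap: step time drops from t1 + t2 to max(t1, t2)
    return (t1 + t2) / max(t1, t2)

# balanced subgraphs: up to 2x
print(ideal_pipeline_speedup(5.0, 5.0))  # 2.0

# one subgraph dominates: barely above 1x
print(ideal_pipeline_speedup(9.0, 1.0))  # ~1.11
```

In practice the gain is smaller, since the two streams contend for the same SMs and memory bandwidth; the 9.6% speedup reported above is well below the 2x ideal.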

To illustrate this dependency, we reran the experiment with a number of different batch sizes. The results are shown below:

The impact of pipelining with CUDA streams on throughput (by the author)

As the batch size increases, the gains from pipelining diminish. This is likely because larger batch sizes naturally lead to higher (and more efficient) GPU utilization, leaving less room for improvement.

Part 2: Offloading Data Augmentation to the GPU

In this section we apply CUDA streams to the acceleration of data augmentation. In previous posts (e.g., here and here) we have studied the problem of bottlenecks in the input data pipeline from a number of different angles and reviewed techniques for mitigating them. A common cause of such bottlenecks is CPU resource exhaustion, where the CPU cannot keep up with the computational demands of the input pipeline. The result is GPU starvation, a situation in which the expensive GPU sits idle while waiting for data to arrive.

One effective solution is to offload heavy data processing to the GPU. We will demonstrate this approach and then take it one step further by performing the augmentations on a dedicated CUDA stream, enabling them to overlap with model training.

A Toy Image Classification Experiment

We begin by defining a simple CNN-based image classification model:

import torch
import torch.nn as nn

img_size = 256
num_classes = 10
model = nn.Sequential(
    # Start with 256x256 image
    nn.Conv2d(3, 16, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=2, stride=2),  # 2x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=2, stride=2),  # 4x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=2, stride=2),  # 8x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, kernel_size=2, stride=2),  # 16x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, kernel_size=2, stride=2),  # 32x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 1024, kernel_size=2, stride=2),  # 64x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(1024, 2048, kernel_size=2, stride=2),  # 128X downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(2048, 4096, kernel_size=2, stride=2),  # 256X
    nn.Flatten(),
    nn.Linear(4096, num_classes)
)

Next, we create a dataset with a relatively heavy augmentation pipeline, designed to create a severe bottleneck in the input data pipeline:

import random
from torch.utils.data import DataLoader
import torchvision.transforms.v2 as T
from torchvision.datasets.vision import VisionDataset
import torchvision.transforms.v2.functional as F
import torchvision.ops as ops

# A dataset with random images and labels
class FakeDataset(VisionDataset):
    def __init__(self, transform = None):
        super().__init__(root=None, transform=transform)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                           dtype=torch.uint8)
        # create a random label
        target = torch.randint(0, num_classes, (1, ))

        if self.transform:
            # Apply transformations
            img = self.transform(img)

        return img, target

    def __len__(self):
        return self.size

augmentations = T.Compose([
    T.ToDtype(torch.float32),
    T.RandomCrop(img_size//2),
    T.Resize(img_size),
    T.RandomRotation(degrees=45.0),
    T.GaussianBlur(kernel_size=7),
    T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
])

train_set = FakeDataset(transform=augmentations)

train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8,
    pin_memory=True  # required for truly asynchronous (non_blocking) copies
)

Finally, we define the loss function, optimizer, and training loop:

import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())

model.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

Running this baseline script results in a throughput of 20.41 steps per second and GPU utilization of only 42%. The heavy data augmentation pipeline exhausts the CPU, leading to GPU starvation. See our previous posts for more on identifying and addressing bottlenecks in the input data pipeline.

Offloading the Augmentations to the GPU

To address the bottleneck in the input data pipeline, we move the augmentations onto the GPU.

The first step is to define custom versions of the random crop and random rotation transforms that apply a different randomization to each sample in the batch. This is necessary because the built-in torchvision transforms apply the same randomization to the entire batch, in contrast to the per-sample randomness we get when each sample is transformed individually on the CPU.

We implement the BatchRandomCrop transform using the roi_align operator.

class BatchRandomCrop(T.Transform):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size

    def transform(self, img: torch.Tensor, params: dict):
        batch_size, _, original_height, original_width = img.shape
        device = img.device
        max_top = original_height - self.output_size
        max_left = original_width - self.output_size

        # Generate random top and left coords for each image in the batch
        random_top = torch.randint(0, max_top + 1, (batch_size,),
                                   device=device, dtype=torch.float32)
        random_left = torch.randint(0, max_left + 1, (batch_size,),
                                    device=device, dtype=torch.float32)

        image_indices = torch.arange(batch_size, device=device,
                                     dtype=torch.float32)

        boxes = torch.stack([
            image_indices,
            random_left,
            random_top,
            random_left + self.output_size,
            random_top + self.output_size
        ], dim=1)

        cropped_batch = ops.roi_align(
            img,
            boxes,
            output_size=self.output_size
        )
        return cropped_batch 

We implement the BatchRandomRotation transform by iterating over all of the images in the batch and applying a random rotation to each one. Note that this implementation is not vectorized; a fully batched implementation would perform better but would require significantly more effort.

class BatchRandomRotation(T.Transform):
    def __init__(self, degrees):
        super().__init__()
        self.degrees = degrees

    def transform(self, inpt: torch.Tensor, params: dict):
        # split the batch into a list of individual images
        images = list(torch.unbind(inpt, dim=0))

        augmented_images = []
        for img_tensor in images:
            # generate a random angle
            angle = random.uniform(-self.degrees, self.degrees)

            # apply the rotation to the single image
            transformed_img = F.rotate(
                img_tensor,
                angle=angle
            )
            augmented_images.append(transformed_img)

        # stack the transformed images
        return torch.stack(augmented_images, dim=0)
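
For reference, a vectorized per-sample rotation is possible using affine_grid and grid_sample. The sketch below is our own (not from the original script) and ignores details such as fill values and interpolation modes; it simply illustrates how per-sample angles can be applied in a single batched operation:

```python
import math
import torch
import torch.nn.functional as NF  # aliased to avoid clashing with torchvision's F

def batch_random_rotate(imgs: torch.Tensor, degrees: float) -> torch.Tensor:
    # imgs: (B, C, H, W); each sample gets its own random angle
    b = imgs.shape[0]
    angles = (torch.rand(b, device=imgs.device) * 2 - 1) * math.radians(degrees)
    cos, sin = torch.cos(angles), torch.sin(angles)
    zeros = torch.zeros_like(cos)
    # per-sample 2x3 rotation matrices (rotation about the image center)
    theta = torch.stack([
        torch.stack([cos, -sin, zeros], dim=1),
        torch.stack([sin, cos, zeros], dim=1),
    ], dim=1)  # shape (B, 2, 3)
    grid = NF.affine_grid(theta, list(imgs.shape), align_corners=False)
    return NF.grid_sample(imgs, grid, align_corners=False)
```

This avoids the Python-level loop over the batch entirely, at the cost of slightly different boundary behavior than torchvision's rotate.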

Next, we define a batch_transform pipeline that mimics the per-sample augmentation pipeline defined above:

batch_transform = T.Compose([
    T.ToDtype(torch.float32),
    BatchRandomCrop(img_size//2),
    T.Resize(img_size),
    BatchRandomRotation(degrees=45.0),
    T.GaussianBlur(kernel_size=7),
    T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
]) 

Finally, we reset the dataset (without the CPU transforms) and update the training loop to apply the new batch_transform on the GPU:

train_set = FakeDataset(transform=None)

train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8,
    pin_memory=True  # required for truly asynchronous (non_blocking) copies
)

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()
    
    # apply augmentations
    inputs = batch_transform(inputs)
    
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This updated training script increases throughput to 35.22 steps per second, a 72.57% improvement over the baseline.

Pipelining the Augmentations with CUDA Streams

Next, we enhance the training script to use two separate CUDA streams: one for the data transforms and one for the model execution. In each iteration of the loop, we perform the same two operations in parallel:

  1. Train the model on the augmented batch N.
  2. Perform the GPU-based augmentations on batch N+1.

transform_stream = torch.cuda.Stream()
model_stream = torch.cuda.Stream()

# initialize the transformed value to None
transformed = None

for idx, data in enumerate(train_loader):
    inputs = data[0]
    labels_next = data[1]

    if transformed is not None:
        with torch.cuda.stream(model_stream):
            labels = labels.to(device, non_blocking=True).squeeze()
            model_stream.wait_stream(transform_stream)
            optimizer.zero_grad()
            output = model(transformed)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

    with torch.cuda.stream(transform_stream):
        inputs = inputs.to(device, non_blocking=True)
        transformed = batch_transform(inputs)
        # Mark `transformed` as used on model_stream so that its
        # memory is not reused before the model has consumed it
        transformed.record_stream(model_stream)

    labels = labels_next

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This further improves throughput to 38.82 steps per second, a 10.2% gain over the serialized solution and 90% faster than the original baseline.

The Impact of Workload Properties on Pipelining

As we saw in Part 1, the benefit of pipelining with CUDA streams varies with the details of the workload. In the table below, we report the results for several different batch sizes:

The impact of pipelining with CUDA streams on throughput (by the author)

As the batch size increases, GPU utilization rises and the model execution becomes more efficient. At the same time, the gains from pipelining decrease. This is consistent with the fact that larger batch sizes already drive high GPU utilization, leaving little idle time to exploit through overlap.

Summary

When it comes to running AI/ML workloads, every millisecond counts. In this post we explored the impact of pipelining an AI/ML training step with CUDA streams in two common scenarios: partial model training and offloading data augmentations to the GPU. In both cases, the pipelined solution outperformed the serialized implementation, although the size of the improvement varied considerably with batch size.

As we have emphasized throughout this post, the expected impact of CUDA streams can vary greatly from one AI/ML workload to another. For example, in scenarios where the GPU is already fully utilized, the overhead of using CUDA streams may actually degrade runtime performance. We strongly recommend testing this method on your own workloads before adopting it.
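
One lightweight way to validate overlap on your own workload is to capture a trace with torch.profiler and inspect the timelines. The snippet below is a minimal sketch (the matmul workload is a stand-in for your training step); it profiles CUDA activity when a GPU is present and falls back to CPU-only profiling otherwise:

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(64, 64)
with profile(activities=activities) as prof:
    for _ in range(3):
        x = x @ x  # stand-in for a training step

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Exporting the trace with prof.export_chrome_trace(...) and opening it in a trace viewer lets you see whether kernels on the two streams actually run concurrently.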

We hope you find the technique described in this post useful. For more tips, tricks, and techniques for profiling and optimizing AI/ML workflows, be sure to check out the other posts in this series.
