Efficient Metric Collection in PyTorch: Avoiding the Pitfalls of TorchMetrics

Metric collection is an essential part of every machine learning project, enabling us to track model performance and monitor training progress. Ideally, metrics should be collected and computed without introducing any additional overhead to the training process. However, just like other components of the training loop, inefficient metric computation can introduce unnecessary overhead, increase training-step times, and inflate training costs.
This post is the seventh in our series on performance profiling and optimization in PyTorch. The series aims to emphasize the critical role of performance analysis and optimization in machine learning development. Each post focuses on a different stage of the training pipeline, demonstrating practical tools and techniques for analyzing and boosting resource utilization and runtime efficiency.
In this post, we focus on metric collection. We will demonstrate how a naïve implementation of metric collection can negatively impact runtime performance, and explore tools and techniques for analyzing and optimizing it.
To implement our metric collection, we will use TorchMetrics, a popular library designed to simplify and standardize metric computation in PyTorch. Our goals will be to:
- Demonstrate the runtime overhead caused by a naïve implementation of metric collection.
- Use PyTorch Profiler to identify the performance bottlenecks introduced by metric computation.
- Demonstrate optimization techniques to reduce the overhead of metric collection.
To facilitate our discussion, we will define a toy PyTorch model and assess how metric collection can impact its runtime performance. We will run our experiments on an NVIDIA A40 GPU, with a PyTorch 2.5.1 Docker image and TorchMetrics 1.6.1.
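As a quick, hypothetical sanity check of this environment (the expected values in the comments reflect our setup only and are assumptions, not requirements), something like the following can be run:
import torch
import torchmetrics

# Hypothetical environment check; expected values reflect our setup only
print(torch.__version__)              # expected: 2.5.1
print(torchmetrics.__version__)       # expected: 1.6.1
print(torch.cuda.get_device_name(0))  # expected: NVIDIA A40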
It is important to note that the behavior of metric collection can vary greatly depending on the hardware, the runtime environment, and the model architecture. The code snippets provided in this post are intended for demonstration purposes only. Please do not interpret our mention of any tool or technique as an endorsement of its use.
Toy ResNet Model
In the code block below, we define a simple image classification model with a ResNet-18 backbone.
import time
import torch
import torchvision
device = "cuda"
model = torchvision.models.resnet18().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
Next, we define a randomly generated dataset which we will use to train our toy model.
from torch.utils.data import Dataset, DataLoader
# A dataset with random images and labels
class FakeDataset(Dataset):
def __len__(self):
return 100000000
def __getitem__(self, index):
rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
label = torch.tensor(data=index % 1000, dtype=torch.int64)
return rand_image, label
train_set = FakeDataset()
batch_size = 128
num_workers = 12
train_loader = DataLoader(
dataset=train_set,
batch_size=batch_size,
num_workers=num_workers,
pin_memory=True
)
We define a collection of standard metrics from TorchMetrics, along with a control flag to enable or disable the metric calculation.
from torchmetrics import (
MeanMetric,
Accuracy,
Precision,
Recall,
F1Score,
)
# toggle to enable/disable metric collection
capture_metrics = False
if capture_metrics:
metrics = {
"avg_loss": MeanMetric(),
"accuracy": Accuracy(task="multiclass", num_classes=1000),
"precision": Precision(task="multiclass", num_classes=1000),
"recall": Recall(task="multiclass", num_classes=1000),
"f1_score": F1Score(task="multiclass", num_classes=1000),
}
# Move all metrics to the device
metrics = {name: metric.to(device) for name, metric in metrics.items()}
Next, we define a PyTorch Profiler instance, along with a control flag that allows us to enable or disable profiling. For a detailed tutorial on using PyTorch Profiler, please refer to the first post in this series.
from torch import profiler
# toggle to enable/disable profiling
enable_profiler = True
if enable_profiler:
prof = profiler.profile(
schedule=profiler.schedule(wait=10, warmup=2, active=3, repeat=1),
on_trace_ready=profiler.tensorboard_trace_handler("./logs/"),
profile_memory=True,
with_stack=True
)
prof.start()
Lastly, we define a standard training step:
model.train()
t0 = time.perf_counter()
total_time = 0
count = 0
for idx, (data, target) in enumerate(train_loader):
data = data.to(device, non_blocking=True)
target = target.to(device, non_blocking=True)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
if capture_metrics:
# update metrics
metrics["avg_loss"].update(loss)
for name, metric in metrics.items():
if name != "avg_loss":
metric.update(output, target)
if (idx + 1) % 100 == 0:
# compute metrics
metric_results = {
name: metric.compute().item()
for name, metric in metrics.items()
}
# print metrics
print(f"Step {idx + 1}: {metric_results}")
# reset metrics
for metric in metrics.values():
metric.reset()
elif (idx + 1) % 100 == 0:
# print last loss value
print(f"Step {idx + 1}: Loss = {loss.item():.4f}")
batch_time = time.perf_counter() - t0
t0 = time.perf_counter()
if idx > 10: # skip first steps
total_time += batch_time
count += 1
if enable_profiler:
prof.step()
if idx > 200:
break
if enable_profiler:
prof.stop()
avg_time = total_time/count
print(f'Average step time: {avg_time}')
print(f'Throughput: {batch_size/avg_time:.2f} images/sec')
Metric Collection Overhead
To measure the impact of metric collection on training step time, we ran our training script both with and without metric calculation. The results are summarized in the table below.
Our naïve metric collection resulted in a nearly 10% drop in runtime performance!! While metric collection is essential to machine learning development, it usually involves relatively simple mathematical operations and hardly warrants such a significant overhead. What is going on?!!
Identifying Performance Issues with PyTorch Profiler
To better understand the source of the performance degradation, we reran the training script with the PyTorch Profiler enabled. The resultant trace is shown below:

The trace reveals recurring "cudaStreamSynchronize" operations that coincide with noticeable drops in GPU utilization. These types of "CPU-GPU sync" events were discussed in an earlier post in this series. In a typical training step, the CPU and GPU work in parallel: the CPU manages tasks such as data transfers to the GPU and kernel loading, while the GPU executes the model on the input data and updates its weights. Ideally, we would like to minimize the points of synchronization between the CPU and GPU in order to maximize performance. Here, however, we can see that the metric collection has triggered a sync event by performing a CPU-to-GPU data copy. This requires the CPU to suspend its processing until the GPU catches up, which, in turn, causes the GPU to sit idle while waiting for the CPU to resume loading the subsequent kernel operations. The bottom line is that these sync points lead to inefficient utilization of both the CPU and the GPU. Our metric collection implementation adds eight such sync events to each training step.
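The following standalone sketch (not part of our training script, and assuming a CUDA device is available) illustrates the mechanism: GPU kernels are launched asynchronously, but copying a Python scalar from pageable CPU memory onto the GPU forces the host to synchronize with the stream before it can continue.
import time
import torch

# Queue a batch of GPU work; these kernel launches return immediately
x = torch.randn(4096, 4096, device="cuda")
for _ in range(50):
    y = x @ x

# Copying a Python scalar to the GPU triggers a stream synchronization,
# so the CPU is blocked here until all of the queued kernels have finished
t0 = time.perf_counter()
w = torch.as_tensor(1.0, device="cuda")
print(f"scalar copy returned after {time.perf_counter() - t0:.3f} seconds")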
A closer examination of the trace shows that the sync events come from updating the device-resident TorchMetrics metrics. This alone may not be enough to pinpoint the root cause, so we will go a step further and use the torch.profiler.record_function utility to identify the exact offending line of code.
Profiling with record_function
To identify the exact source of the sync event, we extend the MeanMetric class and override its update method with record_function context blocks. This approach allows us to profile individual operations within the method and identify the performance bottlenecks.
class ProfileMeanMetric(MeanMetric):
def update(self, value, weight = 1.0):
# broadcast weight to value shape
with profiler.record_function("process value"):
if not isinstance(value, torch.Tensor):
value = torch.as_tensor(value, dtype=self.dtype,
device=self.device)
with profiler.record_function("process weight"):
if weight is not None and not isinstance(weight, torch.Tensor):
weight = torch.as_tensor(weight, dtype=self.dtype,
device=self.device)
with profiler.record_function("broadcast weight"):
weight = torch.broadcast_to(weight, value.shape)
with profiler.record_function("cast_and_nan_check"):
value, weight = self._cast_and_nan_check_input(value, weight)
if value.numel() == 0:
return
with profiler.record_function("update value"):
self.mean_value += (value * weight).sum()
with profiler.record_function("update weight"):
self.weight += weight.sum()
We then update our avg_loss metric to use the newly created ProfileMeanMetric and rerun the training script.
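The swap itself might look something like the following sketch (the exact wiring depends on how the metrics dictionary was constructed in your script):
# Hypothetical replacement of the avg_loss metric with the profiling variant
metrics["avg_loss"] = ProfileMeanMetric().to(device)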

The updated trace reveals that the sync event comes from the following line:
weight = torch.as_tensor(weight, dtype=self.dtype, device=self.device)
This operation converts the default scalar value weight=1.0 into a PyTorch tensor and places it on the GPU. The sync event occurs because this action triggers a CPU-to-GPU data copy, which requires the CPU to wait until the GPU has processed the copied value.
Optimization 1: Specify the weight value
Now that we have found the source of the problem, we can overcome it easily by specifying a weight value in our update call. This prevents the runtime from converting the default scalar weight=1.0 into a tensor on the GPU, thus avoiding the sync event:
# update metrics
if capture_metrics:
metrics["avg_loss"].update(loss, weight=torch.ones_like(loss))
Rerunning the script after applying this change reveals that we have succeeded in eliminating the initial sync event… only to uncover a new one, this time coming from the _cast_and_nan_check_input function:

Profiling with record_function – Part 2
To explore our new sync event, we extend our custom metric with additional profiling probes and rerun our script.
class ProfileMeanMetric(MeanMetric):
def update(self, value, weight = 1.0):
# broadcast weight to value shape
with profiler.record_function("process value"):
if not isinstance(value, torch.Tensor):
value = torch.as_tensor(value, dtype=self.dtype,
device=self.device)
with profiler.record_function("process weight"):
if weight is not None and not isinstance(weight, torch.Tensor):
weight = torch.as_tensor(weight, dtype=self.dtype,
device=self.device)
with profiler.record_function("broadcast weight"):
weight = torch.broadcast_to(weight, value.shape)
with profiler.record_function("cast_and_nan_check"):
value, weight = self._cast_and_nan_check_input(value, weight)
if value.numel() == 0:
return
with profiler.record_function("update value"):
self.mean_value += (value * weight).sum()
with profiler.record_function("update weight"):
self.weight += weight.sum()
def _cast_and_nan_check_input(self, x, weight = None):
"""Convert input ``x`` to a tensor and check for Nans."""
with profiler.record_function("process x"):
if not isinstance(x, torch.Tensor):
x = torch.as_tensor(x, dtype=self.dtype,
device=self.device)
with profiler.record_function("process weight"):
if weight is not None and not isinstance(weight, torch.Tensor):
weight = torch.as_tensor(weight, dtype=self.dtype,
device=self.device)
nans = torch.isnan(x)
if weight is not None:
nans_weight = torch.isnan(weight)
else:
nans_weight = torch.zeros_like(nans).bool()
weight = torch.ones_like(x)
with profiler.record_function("any nans"):
anynans = nans.any() or nans_weight.any()
with profiler.record_function("process nans"):
if anynans:
if self.nan_strategy == "error":
raise RuntimeError("Encountered `nan` values in tensor")
if self.nan_strategy in ("ignore", "warn"):
if self.nan_strategy == "warn":
print("Encountered `nan` values in tensor."
" Will be removed.")
x = x[~(nans | nans_weight)]
weight = weight[~(nans | nans_weight)]
else:
if not isinstance(self.nan_strategy, float):
raise ValueError(f"`nan_strategy` shall be float"
f" but you pass {self.nan_strategy}")
x[nans | nans_weight] = self.nan_strategy
weight[nans | nans_weight] = self.nan_strategy
with profiler.record_function("return value"):
retval = x.to(self.dtype), weight.to(self.dtype)
return retval
The resultant trace is captured below:

The trace points directly to the offending line:
anynans = nans.any() or nans_weight.any()
This operation checks for NaN values in the input tensors, but it introduces a costly CPU-GPU sync event because the operation involves copying data from the GPU back to the CPU.
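A minimal, standalone sketch (assuming a CUDA device) of why this line forces a synchronization: evaluating a GPU tensor in a Python boolean context, as the or operator does here, invokes Tensor.__bool__, which must copy the result back to the host and wait for the GPU to produce it.
import torch

nans = torch.isnan(torch.randn(1000, device="cuda"))
nans_weight = torch.zeros_like(nans)

# Python's `or` needs a concrete True/False, so nans.any() (a GPU tensor)
# is converted via Tensor.__bool__, triggering a GPU->CPU copy and a sync
anynans = nans.any() or nans_weight.any()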
Upon closer inspection of the TorchMetrics BaseAggregator class, we find several options for handling NaN value updates, all of which pass through the offending line of code. However, for our use case – calculating an average loss metric – this check is unnecessary and does not justify the runtime performance penalty.
Optimization 2: Disable NaN value checks
To eliminate the overhead, we propose disabling the NaN value checks by overriding the _cast_and_nan_check_input function. Instead of a static override, we implement a dynamic solution that can be applied easily to any descendant of the BaseAggregator class.
from torchmetrics.aggregation import BaseAggregator
def suppress_nan_check(MetricClass):
assert issubclass(MetricClass, BaseAggregator), MetricClass
class DisableNanCheck(MetricClass):
def _cast_and_nan_check_input(self, x, weight=None):
if not isinstance(x, torch.Tensor):
x = torch.as_tensor(x, dtype=self.dtype,
device=self.device)
if weight is not None and not isinstance(weight, torch.Tensor):
weight = torch.as_tensor(weight, dtype=self.dtype,
device=self.device)
if weight is None:
weight = torch.ones_like(x)
return x.to(self.dtype), weight.to(self.dtype)
return DisableNanCheck
NoNanMeanMetric = suppress_nan_check(MeanMetric)
metrics["avg_loss"] = NoNanMeanMetric().to(device)
Post-Optimization Results: Success
After applying the two optimizations – specifying the weight value and disabling the NaN checks – we find the step time and GPU utilization to match those of the baseline experiment. In addition, the resultant PyTorch Profiler trace shows that all of the "cudaStreamSynchronize" events that were associated with the metric collection have been eliminated. With a few small changes, we have reduced the cost of training by ~10% without any change to the behavior of the metric collection.
In the next section we will explore an additional metric collection optimization.
Example 2: Optimizing Metric Device Placement
In the previous section, the metric values resided on the GPU, making it logical to store and compute the metrics on the GPU as well. However, in scenarios where the values we wish to aggregate reside on the CPU, it may be preferable to store the metrics on the CPU to avoid unnecessary device transfers.
In the code block below, we modify our script to calculate the average step time using a MeanMetric on the CPU. This change has no impact on the runtime performance of our training step:
avg_time = NoNanMeanMetric()
t0 = time.perf_counter()
for idx, (data, target) in enumerate(train_loader):
# move data to device
data = data.to(device, non_blocking=True)
target = target.to(device, non_blocking=True)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
if capture_metrics:
metrics["avg_loss"].update(loss)
for name, metric in metrics.items():
if name != "avg_loss":
metric.update(output, target)
if (idx + 1) % 100 == 0:
# compute metrics
metric_results = {
name: metric.compute().item()
for name, metric in metrics.items()
}
# print metrics
print(f"Step {idx + 1}: {metric_results}")
# reset metrics
for metric in metrics.values():
metric.reset()
elif (idx + 1) % 100 == 0:
# print last loss value
print(f"Step {idx + 1}: Loss = {loss.item():.4f}")
batch_time = time.perf_counter() - t0
t0 = time.perf_counter()
if idx > 10: # skip first steps
avg_time.update(batch_time)
if enable_profiler:
prof.step()
if idx > 200:
break
if enable_profiler:
prof.stop()
avg_time = avg_time.compute().item()
print(f'Average step time: {avg_time}')
print(f'Throughput: {batch_size/avg_time:.2f} images/sec')
The problem arises when we attempt to extend our script to support distributed training. To demonstrate the problem, we modified our model definition to use DistributedDataParallel (DDP):
# toggle to enable/disable ddp
use_ddp = True
if use_ddp:
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("nccl", rank=0, world_size=1)
torch.cuda.set_device(0)
model = DDP(torchvision.models.resnet18().to(device))
else:
model = torchvision.models.resnet18().to(device)
# insert training loop
# append to end of the script:
if use_ddp:
# destroy the process group
dist.destroy_process_group()
The DDP conversion results in the following error:
RuntimeError: No backend type associated with device type cpu
By default, metrics in a distributed setting are programmed to synchronize across all of the devices in use. However, the synchronization backend used by DDP does not support metrics stored on the CPU.
One way to solve this is to disable the cross-device metric synchronization:
avg_time = NoNanMeanMetric(sync_on_compute=False)
In our case, where we are measuring the average step time, this solution is acceptable. However, in some cases the metric synchronization is essential, and we may have no choice but to move the metric onto the GPU:
avg_time = NoNanMeanMetric().to(device)
Unfortunately, this gives rise to a new CPU-GPU sync event coming from the update function.

This sync event should hardly come as a surprise – after all, we are updating a GPU-resident metric with a value residing on the CPU, which should require a memory copy. However, in the case of a scalar metric, this data transfer can be avoided entirely with a simple optimization.
Optimization 3: Perform metric updates with tensors instead of scalars
The solution is straightforward: instead of updating the metric with a float value, we convert the value to a tensor before calling update:
batch_time = torch.as_tensor(batch_time)
avg_time.update(batch_time, torch.ones_like(batch_time))
This minor change bypasses the problematic line of code, eliminates the sync event, and restores the step time to the baseline performance.
At first glance, this result may seem surprising: we would expect that updating a GPU-resident metric with a CPU tensor should still require a memory copy. However, PyTorch optimizes operations on scalar tensors with a dedicated kernel that performs the addition without an explicit data transfer, thus avoiding the expensive sync event.
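A minimal sketch (assuming a CUDA device) of the underlying behavior: a zero-dimensional CPU tensor can participate in a GPU operation as a scalar kernel argument, so no explicit host-to-device copy or stream synchronization is needed.
import torch

gpu_acc = torch.zeros((), device="cuda")   # GPU-resident accumulator
cpu_value = torch.as_tensor(0.123)         # 0-dim tensor on the CPU

# The CPU scalar tensor is passed to the kernel as an argument rather than
# being copied into GPU memory, so this update does not block the CPU
gpu_acc += cpu_value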
Summary
In this post, we have seen how a naïve use of TorchMetrics can introduce CPU-GPU synchronization events and significantly degrade PyTorch training performance. Using PyTorch Profiler, we identified the lines of code responsible for these sync events and applied targeted optimizations to eliminate them:
- Explicitly specify a weight tensor when calling the MeanMetric.update function instead of relying on the default value.
- Disable the NaN checks in the base Aggregator class, or replace them with a more efficient alternative.
- Carefully evaluate the device placement of each metric to minimize unnecessary transfers.
- Disable cross-device metric synchronization when it is not required.
- When a metric resides on a GPU, convert floating-point scalars into tensors before passing them to the update function to avoid the implicit synchronization.
We have opened a pull request on the TorchMetrics GitHub project addressing some of the issues discussed in this post. Please feel free to contribute your own improvements and optimizations!