
Optimizing Data Transfer in Batched AI/ML Inference Workloads

This post is a follow-up to our previous post on optimizing data transfer in AI/ML workloads, in which we demonstrated the use of NVIDIA Nsight™ Systems (nsys) to study and solve a common data-loading bottleneck — scenarios in which the GPU sits idle while waiting for input data from the CPU. In this post we turn our attention to data traveling in the opposite direction, from the GPU device to the CPU host. Specifically, we address AI/ML workloads in which the output returned by the model is comparatively large. Typical examples include: 1) running a scene segmentation model (per-pixel labeling) on batches of high-resolution images, and 2) extracting high-dimensional feature embeddings of input sequences using an encoder model (e.g., to build a vector database). Both examples involve running a model on a batch of inputs and then copying the output tensor from the GPU to the CPU for further processing, storage, and/or network communication.
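
To make this pattern concrete, below is a minimal sketch (not the script used later in this post) of the device-to-host copy that both examples share. The model and tensor shapes are placeholders chosen purely for illustration.

import torch

# placeholder segmentation-style model and input batch (illustrative only)
model = torch.nn.Conv2d(3, 21, kernel_size=1).to("cuda").eval()
batch = torch.randn(8, 3, 512, 512, device="cuda")

with torch.inference_mode():
    logits = model(batch)      # large per-pixel output resides on the GPU
    logits_cpu = logits.cpu()  # the device-to-host copy this post focuses on
    # ...post-process, store, or send logits_cpu over the network...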

GPU-to-CPU memory copies of model output typically receive far less attention in optimization tutorials than the CPU-to-GPU copies that feed the model (e.g., see here). Yet their potential impact on model efficiency and runtime cost can be just as harmful. Moreover, while CPU-to-GPU data-loading optimizations are well documented and easy to apply, optimizing the data copy in the opposite direction requires additional manual work.

In this post we will apply the same strategy we used in our previous post: we will define a toy model and use the nsys profiler to identify and resolve performance bottlenecks. We will run our experiments on an Amazon EC2 g6e.2xlarge instance (with an NVIDIA L40S GPU) running the AWS Deep Learning (Ubuntu 24.04) AMI with PyTorch (2.8), the nsys-cli profiler (version 2025.6.1), and the NVIDIA Tools Extension (NVTX) library.

Disclaimer

The code we share is intended for demonstration purposes; please do not rely on its accuracy or correctness. Please do not interpret our use of any library, tool, or platform as an endorsement of its use. The impact of the optimizations we cover can vary greatly based on the details of the model and the runtime environment. Please be sure to evaluate their effect on your own use case before adopting them.

Many thanks to Yitzhak Levi and Gilad Wasserman for their contributions to this post.

A Toy PyTorch Model

We introduce a batched inference script that performs image segmentation on a synthetic dataset using a DeepLabV3 model with a ResNet-50 backbone. The model outputs are copied to the CPU for post-processing and storage. We wrap the different parts of the inference step with color-coded nvtx annotations:

import time, torch, nvtx
from torch.utils.data import Dataset, DataLoader
from torch.cuda import profiler
from torchvision.models.segmentation import deeplabv3_resnet50

DEVICE = "cuda"
WARMUP_STEPS = 10
PROFILE_STEPS = 3
COOLDOWN_STEPS = 1
TOTAL_STEPS = WARMUP_STEPS + PROFILE_STEPS + COOLDOWN_STEPS
BATCH_SIZE = 64
TOTAL_SAMPLES = TOTAL_STEPS * BATCH_SIZE
IMG_SIZE = 512
N_CLASSES = 21
NUM_WORKERS = 8
ASYNC_DATALOAD = True


# A synthetic Dataset with random images
class FakeDataset(Dataset):

    def __len__(self):
        return TOTAL_SAMPLES

    def __getitem__(self, index):
        img = torch.randn((3, IMG_SIZE, IMG_SIZE))
        return img

# utility class for prefetching data to GPU
class DataPrefetcher:
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_batch = None
        self.preload()

    def preload(self):
        try:
            data = next(self.loader)
            with torch.cuda.stream(self.stream):
                next_data = data.to(DEVICE, non_blocking=ASYNC_DATALOAD)
            self.next_batch = next_data
        except StopIteration:  # end of dataset
            self.next_batch = None

    def __iter__(self):
        return self

    def __next__(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        data = self.next_batch
        self.preload()
        return data

model = deeplabv3_resnet50(weights_backbone=None).to(DEVICE).eval()

data_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    pin_memory=ASYNC_DATALOAD
)

data_iter = DataPrefetcher(data_loader)

def synchronize_all():
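    # wait for all outstanding GPU work to complete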
    torch.cuda.synchronize() 

def to_cpu(output):
    return output.cpu()

def process_output(batch_id, logits):
    # do some post processing on output
    with open('/dev/null', 'wb') as f:
        f.write(logits.numpy().tobytes())

with torch.inference_mode():
    for i in range(TOTAL_STEPS):
        if i == WARMUP_STEPS:
            synchronize_all()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            synchronize_all()
            profiler.stop()
            end_time = time.perf_counter()

        with nvtx.annotate(f"Batch {i}", color="blue"):
            with nvtx.annotate("get batch", color="red"):
                batch = next(data_iter)
            with nvtx.annotate("compute", color="green"):
                output = model(batch)
            with nvtx.annotate("copy to CPU", color="yellow"):
                output_cpu = to_cpu(output['out'])
            with nvtx.annotate("process output", color="cyan"):
                process_output(i, output_cpu)

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")

Note that the script includes all of the CPU-to-GPU data-loading optimizations discussed in our previous post (multi-worker data loading, pinned memory, non-blocking copies, and a dedicated prefetch stream).

We use the following command to capture an nsys profiler trace:

nsys profile \
  --capture-range=cudaProfilerApi \
  --trace=cuda,nvtx,osrt \
  --output=baseline \
  python batch_infer.py

This produces a baseline.nsys-rep trace file, which we copy to our development machine for analysis.

To measure throughput, we increase the number of steps to 100. The resulting throughput of our baseline experiment is 0.45 steps per second. In the following sections we will use the nsys profiler trace to iteratively improve this result.

Baseline Performance Analysis

The image below shows the nsys profiler trace of our baseline experiment:

Baseline Nsight Systems Profiler Trace (by Author)

In the GPU section we see the following repeating pattern:

  1. A block of kernel compute (light blue) running for 520 milliseconds.
  2. A small host-to-device memory-copy block (green) that runs in parallel with the kernel compute. This overlap is achieved using the optimizations discussed in our previous post.
  3. A device-to-host memory-copy block (red) that takes ~750 milliseconds.
  4. A long period (~940 milliseconds) of GPU idle time (white space) between every two steps.

Looking at the NVTX bar of the CPU section, we can see that the white space aligns precisely with the "process output" block (in cyan). In our initial implementation, both the model execution and the output-storage operation run sequentially in the same process. This results in significant GPU idle time, as the CPU waits for the storage operation to return before feeding the GPU the next batch.

Optimization 1: Multi-Worker Output Processing

The first step we take is to run the output-storage operation in parallel worker processes. This is similar to the step we took in our previous post, where we moved the input-batch preparation sequence to dedicated workers. However, whereas there we were able to enable multi-process data loading simply by setting the num_workers argument of the DataLoader class to a non-zero value, implementing multi-worker output processing requires a manual implementation. Here we opt for a simple solution for demonstration purposes. This should be adapted to your own needs and design choices.

PyTorch Multiprocessing

We employ a producer-consumer strategy using PyTorch's built-in multiprocessing package, torch.multiprocessing. We define a queue for storing output batches and multiple consumer workers that process the batches in the queue. We modify our inference loop to push the output buffers onto the output queue. We also update the synchronize_all() utility to drain the queue, and add a cleanup sequence at the end of the script.

The following code block contains our initial implementation. As we will see in the following sections, it will require some tuning to reach peak performance.

import torch.multiprocessing as mp

POSTPROC_WORKERS = 8 # tune for optimal throughput

output_queue = mp.JoinableQueue(maxsize=POSTPROC_WORKERS)
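# maxsize bounds the number of in-flight batches: put() blocks when full,
# applying backpressure on the inference loop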

def output_worker(in_q):
    while True:
        item = in_q.get()
        if item is None: break  # signal to shut down
        batch_id, batch_preds = item
        process_output(batch_id, batch_preds)
        in_q.task_done()

processes = []
for _ in range(POSTPROC_WORKERS):
    p = mp.Process(target=output_worker, args=(output_queue,))
    p.start()
    processes.append(p)

def synchronize_all():
    torch.cuda.synchronize() 
    output_queue.join() # drain queue


with torch.inference_mode():
    for i in range(TOTAL_STEPS):
        if i == WARMUP_STEPS:
            synchronize_all()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            synchronize_all()
            profiler.stop()
            end_time = time.perf_counter()

        with nvtx.annotate(f"Batch {i}", color="blue"):
            with nvtx.annotate("get batch", color="red"):
                batch = next(data_iter)
            with nvtx.annotate("compute", color="green"):
                output = model(batch)
            with nvtx.annotate("copy to CPU", color="yellow"):
                output_cpu = to_cpu(output['out'])
            with nvtx.annotate("queue output", color="cyan"):
                output_queue.put((i, output_cpu))


total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")
# cleanup: signal the workers to shut down and wait for them to exit
for _ in range(POSTPROC_WORKERS):
    output_queue.put(None)
for p in processes:
    p.join()

The multi-worker output-processing optimization results in a throughput of 0.71 steps per second — a 58% increase over our baseline result.

Rerunning the nsys command produces the following profiler trace:

Multi-Worker Nsight Systems Profiler Timeline (by Author)

We can see that the white-space block has shrunk considerably (from ~940 milliseconds to ~50). If we were to zoom in on the remaining white space, we would find that it aligns with a "munmap" operation. In our previous post, a similar finding motivated our asynchronous data-copy optimization. This time, however, we take an intermediate step of optimizing memory allocation in the form of a pre-allocated pool of buffers.

Optimization 2: Pre-Allocated Buffer Pool

To reduce the overhead of allocating and managing a new CPU tensor on every iteration, we initialize a pool of pre-allocated tensors in shared memory and define a second queue for managing their use.

Our updated code appears below:

shape = (BATCH_SIZE, N_CLASSES, IMG_SIZE, IMG_SIZE)
buffer_pool = [torch.empty(shape).share_memory_() 
               for _ in range(POSTPROC_WORKERS)]
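# one shared-memory CPU buffer per post-processing worker, reused across steps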

buf_queue = mp.Queue()
for i in range(POSTPROC_WORKERS):
    buf_queue.put(i)

def output_worker(buffer_pool, in_q, buf_q):
    while True:
        item = in_q.get()
        if item is None: break  # signal to shut down
        batch_id, buf_id = item
        process_output(batch_id, buffer_pool[buf_id])
        buf_q.put(buf_id)
        in_q.task_done()

processes = []
for _ in range(POSTPROC_WORKERS):
    p = mp.Process(target=output_worker,
                   args=(buffer_pool,output_queue,buf_queue))
    p.start()
    processes.append(p)

def to_cpu(output):
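    # grab a free pre-allocated buffer from the pool and copy the model output into it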
    buf_id = buf_queue.get()
    output_cpu = buffer_pool[buf_id]
    output_cpu.copy_(output)
    return output_cpu, buf_id

with torch.inference_mode():
    for i in range(TOTAL_STEPS):
        if i == WARMUP_STEPS:
            synchronize_all()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            synchronize_all()
            profiler.stop()
            end_time = time.perf_counter()

        with nvtx.annotate(f"Batch {i}", color="blue"):
            with nvtx.annotate("get batch", color="red"):
                batch = next(data_iter)
            with nvtx.annotate("compute", color="green"):
                output = model(batch)
            with nvtx.annotate("copy to CPU", color="yellow"):
                output_cpu, buf_id = to_cpu(output['out'])
            with nvtx.annotate("queue output", color="cyan"):
                output_queue.put((i, buf_id))

Following these changes, the inference throughput jumps to 1.51 steps per second — more than a 2X speedup over our previous result.

The new profiler trace appears below:

Buffer Pool Nsight Systems Timeline (by Author)

Not only has the white space all but disappeared, but the CUDA DtoH memory operation (in red) has dropped from ~750 milliseconds to ~110. Presumably, the large GPU-to-CPU data copy involved a significant amount of memory management that we have eliminated by using a dedicated buffer pool.

Despite the significant improvement, zooming in reveals that roughly 0.5 milliseconds of white space remain, caused by the synchronization of the GPU-to-CPU copy command — as long as the copy has not completed, the CPU does not launch the kernel computation of the next batch.

Optimization 3: Asynchronous Data Copy

Our third optimization is to make the device-to-host copy asynchronous. As before, we will find that applying this change is considerably harder than in the CPU-to-GPU direction.

The first step is to pass non_blocking=True to the GPU-to-CPU copy command.

def to_cpu(output):
    buf_id = buf_queue.get()
    output_cpu = buffer_pool[buf_id]
    output_cpu.copy_(output, non_blocking=True)
    return output_cpu, buf_id

However, as we saw in our previous post, this change will have no noticeable impact unless we convert our tensors to use pinned memory:

shape = (BATCH_SIZE, N_CLASSES, IMG_SIZE, IMG_SIZE)
buffer_pool = [torch.empty(shape, pin_memory=True).share_memory_() 
               for _ in range(POSTPROC_WORKERS)]

Worse still, if we apply only these two changes to our script, throughput will increase but the output may be corrupted (e.g., see here). We need an event-based mechanism for identifying when each GPU-to-CPU copy has completed, so that we can safely proceed with processing the output data. (Note that this was not required when making the CPU-to-GPU copy asynchronous. Because a single GPU stream processes commands in order, the kernel computation starts only once the copy has completed. Synchronization was only required once a second stream was introduced.)

To implement the notification mechanism, we define a pool of CUDA events along with an additional queue for managing their use. We also define a monitor thread that watches the state of the events in the queue and populates the output queue once the copies have completed.

import threading, queue

event_pool = [torch.cuda.Event() for _ in range(POSTPROC_WORKERS)]
event_queue = queue.Queue()
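# one reusable CUDA event per output buffer; the event queue feeds the monitor thread below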

def event_monitor(event_pool, event_queue, output_queue):
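    # block until each recorded copy event fires, then hand the batch to the output queue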
    while True:
        item = event_queue.get()
        if item is None: break
        batch_id, buf_idx = item
        event_pool[buf_idx].synchronize()
        output_queue.put((batch_id, buf_idx))
        event_queue.task_done()

monitor = threading.Thread(target=event_monitor,
                           args=(event_pool, event_queue, output_queue))
monitor.start()

The updated inference sequence consists of the following steps:

  1. Retrieve an input batch that has been prefetched to the GPU.
  2. Run the model on the input batch to produce an output tensor on the GPU.
  3. Request a free CPU buffer from the buffer queue and use it to initiate an asynchronous data copy. Record an event that will fire when the copy has completed and push it onto the event queue.
  4. The monitor thread waits for the event to fire and then pushes the output tensor onto the output queue for processing.
  5. A worker process pulls the output tensor from the queue and saves it to disk. It then releases the buffer back to the buffer queue.

The updated code appears below.

def synchronize_all():
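    # wait for GPU work, pending copy events, and queued post-processing to drain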
    torch.cuda.synchronize()
    event_queue.join()
    output_queue.join()


with torch.inference_mode():
    for i in range(TOTAL_STEPS):
        if i == WARMUP_STEPS:
            synchronize_all()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            synchronize_all()
            profiler.stop()
            end_time = time.perf_counter()

        with nvtx.annotate(f"Batch {i}", color="blue"):
            with nvtx.annotate("get batch", color="red"):
                batch = next(data_iter)
            with nvtx.annotate("compute", color="green"):
                output = model(batch)
            with nvtx.annotate("copy to CPU", color="yellow"):
                output_cpu, buf_id = to_cpu(output['out'])
            with nvtx.annotate("queue CUDA event", color="cyan"):
                event_pool[buf_id].record()
                event_queue.put((i, buf_id))

total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")
# cleanup: shut down the monitor thread and the post-processing workers
event_queue.put(None)
for _ in range(POSTPROC_WORKERS):
    output_queue.put(None)
monitor.join()
for p in processes:
    p.join()

The resulting throughput is 1.55 steps per second.

The new profiler trace appears below:

Async Data Transfer Nsight Systems Timeline (by Author)

In the NVTX bar of the CPU section we can see all of the operations in the inference loop bunched together on the left — meaning they all run quickly and asynchronously. We also see the event synchronization calls (in light green) running on the dedicated monitor thread. In the GPU section we see that the kernel computation begins immediately after the device-to-host copy has completed.

Our final optimization will focus on improving the parallelism between kernel compute and memory operations on the GPU.

Optimization 4: Pipelining Using CUDA Streams

As in our previous post, we wish to take advantage of the GPU's independent engines for memory copy (DMA) and kernel compute (SMs). We do this by assigning the memory copy to a dedicated CUDA stream:

egress_stream = torch.cuda.Stream()
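# dedicated stream for device-to-host copies, allowing overlap with compute on the default stream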

with torch.inference_mode():
    for i in range(TOTAL_STEPS):
        if i == WARMUP_STEPS:
            synchronize_all()
            start_time = time.perf_counter()
            profiler.start()
        elif i == WARMUP_STEPS + PROFILE_STEPS:
            synchronize_all()
            profiler.stop()
            end_time = time.perf_counter()

        with nvtx.annotate(f"Batch {i}", color="blue"):
            with nvtx.annotate("get batch", color="red"):
                batch = next(data_iter)
            with nvtx.annotate("compute", color="green"):
                output = model(batch)
            
            # on separate stream
            with torch.cuda.stream(egress_stream):
                # wait for default stream to complete compute
                egress_stream.wait_stream(torch.cuda.default_stream())
                with nvtx.annotate("copy to CPU", color="yellow"):
                    output_cpu, buf_id = to_cpu(output['out'])
                with nvtx.annotate("queue CUDA event", color="cyan"):
                    event_pool[buf_id].record(egress_stream)
                    event_queue.put((i, buf_id))

This results in a throughput of 1.85 steps per second — a further 19.3% improvement over our previous experiment.

The final profiler trace appears below:

Pipelined Nsight Systems Profiler Timeline (by Author)

In the GPU section we see a continuous block of kernel compute (light blue) with both the host-to-device (green) and device-to-host (purple) copies running in parallel. Our inference loop is now compute-bound, meaning we have exhausted the practical opportunities for optimizing data transfer.

Results

We summarize our results in the following table:

Experiment Results (by Author)
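
For convenience, the throughput numbers reported throughout the post are collected below (speedups relative to the baseline are approximate):

Experiment                        Throughput (steps/sec)   Speedup vs. baseline
Baseline                          0.45                     1.0X
Multi-worker output processing    0.71                     ~1.6X
Pre-allocated buffer pool         1.51                     ~3.4X
Asynchronous data copy            1.55                     ~3.4X
CUDA stream pipelining            1.85                     ~4.1X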

Using the nsys profiler we were able to increase throughput by more than 4X. Naturally, the impact of the optimizations we discussed will vary based on the details of the model and the runtime environment.

Summary

This concludes the second part of our series of posts on optimizing data transfer in AI/ML workloads. The first part focused on host-to-device copies and the second on device-to-host copies. When applied naively, data transfer in either direction can lead to significant performance bottlenecks, resulting in GPU starvation and increased runtime costs. Using the Nsight Systems profiler, we demonstrated how to identify and resolve these issues and increased our runtime efficiency.

Although the optimization of both directions involved similar steps, the implementation details differed greatly. While optimizing the CPU-to-GPU data transfer is well supported by PyTorch's data-loading APIs and required only minor changes to the inference loop, optimizing the GPU-to-CPU direction required considerably more software engineering. Importantly, the solutions we presented in this post were chosen for demonstration purposes. Your own solution may differ greatly depending on your project requirements and design preferences.

Having covered both CPU-to-GPU and GPU-to-CPU data copies, we now turn our attention to GPU-to-GPU transfers: stay tuned for an upcoming post on optimizing data transfer between GPUs in distributed training workloads.
