Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume 43.6% of the forward pass and 32% of end-to-end training time. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for up to 47% of total execution time. Researchers from UC Berkeley's UCCL project have released mKernel, a library of persistent CUDA kernels that fuses intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.
The Problem: Host-Driven Communication
The standard model for multi-GPU communication is host-driven: the CPU runs the control path and calls into a library such as NCCL or NVSHMEM. The library issues collective operations (AllReduce, AllGather, etc.) across the GPUs. Compute and communication run on separate CUDA streams and hand off at kernel boundaries.
The research team identifies two problems with this approach:
(1) CPUs are not scaling with GPU compute. A GB300 NVL72 rack combines 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s of FP8/FP6 and 1.44 EFLOP/s of FP4 Tensor Core performance, plus 130 TB/s of aggregate intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead (a cudaLaunchKernel call, a CPU-side "all writes done" check, an inter-stream event) shows up directly as pipeline bubbles.
(2) Host-driven systems overlap compute and communication only at coarse granularity. Fine-grained overlap at the tile or chunk level is not feasible from the host side.
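To put the "pipeline bubble" point in numbers, the short calculation below multiplies a stall duration by the rack's quoted peak throughput. It is only an upper-bound sketch: the 5-microsecond stall is an assumed value chosen for illustration, and real kernels rarely run at peak.

```python
# Peak figures quoted in the text for a GB300 NVL72 rack.
RACK_FP8_FLOPS = 720e15   # 720 PFLOP/s FP8/FP6
GPUS_PER_RACK = 72


def flops_forgone(stall_seconds, peak_flops):
    """Peak work that cannot be issued while the hardware idles."""
    return stall_seconds * peak_flops


# Assumption for illustration only: one 5-microsecond host-side stall.
STALL_S = 5e-6

rack_loss = flops_forgone(STALL_S, RACK_FP8_FLOPS)
per_gpu_loss = rack_loss / GPUS_PER_RACK

print(f"rack-wide FP8 work forgone: {rack_loss:.2e} FLOP")
print(f"per-GPU FP8 work forgone:   {per_gpu_loss:.2e} FLOP")
```

Under these assumptions a single stall costs on the order of 10^10 FLOP per GPU, which is why stalls that were negligible on slower hardware now matter.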
The alternative is GPU-driven communication: the GPU itself triggers the transfers, and communication is fused into the same kernel as compute. Most existing fused-kernel libraries operate within a single node or a single GPU. mKernel targets the multi-node case.
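The benefit of tile-level overlap can be illustrated with a simple two-stage pipeline model. This is a back-of-the-envelope sketch, not a description of mKernel's scheduler: it assumes one producer kernel whose output feeds one collective, equal-sized tiles, and made-up timings.

```python
def host_driven_time(compute_ms, comm_ms, launch_overhead_ms):
    """Kernel-boundary handoff: all compute, then a host launch, then all comm."""
    return compute_ms + launch_overhead_ms + comm_ms


def fused_time(compute_ms, comm_ms, num_tiles):
    """Tile-level pipeline: tile i is transferred while tile i+1 is computed."""
    if num_tiles < 1:
        raise ValueError("num_tiles must be at least 1")
    c = compute_ms / num_tiles   # compute time per tile
    m = comm_ms / num_tiles      # transfer time per tile
    # The first tile's compute cannot be hidden, the slower stage then sets
    # the pace, and the last tile's transfer drains at the end.
    return c + (num_tiles - 1) * max(c, m) + m


# Assumed timings for illustration only (milliseconds).
COMPUTE_MS, COMM_MS, LAUNCH_MS = 1.0, 0.8, 0.01

print("host-driven:", host_driven_time(COMPUTE_MS, COMM_MS, LAUNCH_MS))
for tiles in (1, 8, 64):
    print(f"fused, {tiles:>2} tiles:", round(fused_time(COMPUTE_MS, COMM_MS, tiles), 4))
```

With one tile the fused model degenerates to the serial case; as the tile count grows, total time approaches the longer of the two stages.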
What mKernel Does
mKernel is a library of persistent CUDA kernels. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.
- Multi-GPU + multi-node, in one kernel: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.
- Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.
- Persistent kernel with SM specialization: CTAs assign themselves roles: compute, intra-comm, inter-send, inter-reduce. The number of SMs dedicated to each role can be tuned per problem shape.
- GPU-driven networking built on libibverbs: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and to support a range of network devices.
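The SM-specialization idea can be sketched as a mapping from CTA index to role. The code below is a host-side illustration in Python, not mKernel's actual assignment logic: the contiguous-range layout, the `assign_roles` helper, and the 112/8/8/4 split are all assumptions made for the example. Only the four role names come from the project description.

```python
from collections import Counter

# Role names as listed in the project description.
ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def assign_roles(num_ctas, sm_budget):
    """Map each CTA index to a role using contiguous index ranges.

    sm_budget maps a role name to the number of CTAs (one per SM in a
    persistent kernel) reserved for it. Hypothetical helper.
    """
    unknown = set(sm_budget) - set(ROLES)
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    if sum(sm_budget.values()) != num_ctas:
        raise ValueError("role budget must account for every CTA")
    roles = []
    for role in ROLES:
        roles.extend([role] * sm_budget.get(role, 0))
    return roles


# Assumed split for a 132-SM GPU; the real split is a tuning parameter.
budget = {"compute": 112, "intra-comm": 8, "inter-send": 8, "inter-reduce": 4}
roles = assign_roles(132, budget)

print(Counter(roles))
print("CTA 0 ->", roles[0], "| CTA 131 ->", roles[-1])
```

Inside a CUDA kernel the equivalent decision would typically be a branch on the block index at kernel entry, after which each CTA loops over work for its role until the kernel finishes.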
The Five Fused Kernels
| Kernel | What it fuses | Description |
|---|---|---|
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of A. While ranks gather peer shards over NVLink/RDMA, the local GEMM consumes tiles as soon as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces the partial outputs across all ranks in a single launch. Output tiles are pushed into the reduction tree as soon as they are produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they arrive, with no round-trip through a staging buffer. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV shard around the ring while the local FlashAttention consumes the previously received shard. Compute and ring send/recv run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |
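The Ring Attention row can be made concrete with a small simulation. The sketch below is a pure-Python model of the general technique (ring rotation of KV shards combined with online-softmax accumulation), not mKernel's kernel. It assumes one query vector per rank, unscaled dot-product scores, and no causal mask. It also runs the "compute" and "communicate" phases one after the other, whereas the fused kernel is described as overlapping them.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_shard(q, shard, state):
    """Fold one KV shard into the running (max, denominator, accumulator)."""
    m, l, acc = state
    for k, v in shard:
        s = dot(q, k)
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # rescale everything accumulated so far
        p = math.exp(s - m_new)
        l = l * scale + p
        acc = [a * scale + p * x for a, x in zip(acc, v)]
        m = m_new
    return m, l, acc


def ring_attention(queries, kv_shards):
    world = len(queries)
    dim_v = len(kv_shards[0][0][1])
    states = [(float("-inf"), 0.0, [0.0] * dim_v) for _ in range(world)]
    held = list(kv_shards)            # held[r]: shard currently on rank r
    for _ in range(world):
        # Compute: every rank consumes the shard it currently holds.
        states = [attend_shard(queries[r], held[r], states[r]) for r in range(world)]
        # Communicate: every shard moves to the next rank in the ring.
        held = [held[(r - 1) % world] for r in range(world)]
    return [[a / l for a in acc] for (_, l, acc) in states]


def full_attention(q, all_kv):
    """Reference: softmax attention over the whole sequence at once."""
    scores = [dot(q, k) for k, _ in all_kv]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim_v = len(all_kv[0][1])
    return [sum(wi * v[j] for wi, (_, v) in zip(w, all_kv)) / z for j in range(dim_v)]


# Toy data: 4 ranks, 2 key/value pairs per shard, 2-dimensional vectors.
queries = [[0.1 * (r + 1), -0.2 * r] for r in range(4)]
kv_shards = [[([0.3 * r, 0.1 * i], [float(r), float(i)]) for i in range(2)]
             for r in range(4)]

ring_out = ring_attention(queries, kv_shards)
all_kv = [kv for shard in kv_shards for kv in shard]
ref_out = [full_attention(q, all_kv) for q in queries]
max_err = max(abs(a - b) for x, y in zip(ring_out, ref_out) for a, b in zip(x, y))
print("max abs difference vs. full attention:", max_err)
```

After as many steps as there are ranks, each rank has seen every shard exactly once, so the ring result matches full attention up to floating-point rounding.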
Evaluation Setup
The research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in their inter-node fabric:
| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scale is ongoing.
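As a quick sanity check on the testbed table, the snippet below totals the nominal NIC line rate per node. These are line rates taken from the table, not measured throughput, and the even per-GPU split is an assumption.

```python
GPUS_PER_NODE = 8

# (NICs per node, Gb/s per NIC), as listed in the testbed table.
testbeds = {"AWS EFA": (16, 200), "ConnectX-7": (8, 400)}


def aggregate_gbps(nics_per_node, gbps_per_nic):
    """Nominal inter-node line rate per node, in Gb/s."""
    return nics_per_node * gbps_per_nic


for name, (nics, rate) in testbeds.items():
    total = aggregate_gbps(nics, rate)
    print(f"{name}: {total} Gb/s per node "
          f"({total / 8:.0f} GB/s), {total / GPUS_PER_NODE:.0f} Gb/s per GPU")
```

Both testbeds come out at the same nominal 3200 Gb/s per node, which fits the statement that they differ only in the inter-node fabric and transport.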
Backends and Requirements
mKernel supports two network backends:
| Backend | Macro | Transport | Where it runs |
|---|---|---|---|
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |
Both backends share the same host-side API and the same GPU kernels. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Requirements: NVIDIA Hopper GPUs (the default build targets sm_90a), CUDA 12.9, and Python with PyTorch. The CX7 backend requires the libibverbs development headers and libraries. The EFA backend requires an AWS EFA installation with libfabric, libibverbs, efadv, and the EFA headers, located under EFA_HOME=/opt/amazon/efa by default.
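The backend choice comes down to a compile-time macro, which a build script could select as sketched below. This is a hypothetical helper, not part of mKernel's build system: the `backend_flags` function, the `-arch=sm_90a` placement, and the `include` subdirectory under `EFA_HOME` are assumptions. Only the two macros and the default `EFA_HOME` path come from the project description.

```python
import os

# Macros as listed in the backend table.
BACKEND_MACROS = {
    "cx7": "-DINTERNODE_BACKEND_IBVERBS",
    "efa": "-DINTERNODE_BACKEND_EFA",
}


def backend_flags(backend, env=None):
    """Return illustrative compiler flags for one backend (hypothetical helper)."""
    env = os.environ if env is None else env
    if backend not in BACKEND_MACROS:
        raise ValueError(f"unknown backend: {backend!r}")
    flags = ["-arch=sm_90a", BACKEND_MACROS[backend]]
    if backend == "efa":
        # Default location stated in the requirements; include path is assumed.
        efa_home = env.get("EFA_HOME", "/opt/amazon/efa")
        flags.append(f"-I{efa_home}/include")
    return flags


print(backend_flags("cx7", env={}))
print(backend_flags("efa", env={}))
```

Consult the repository's own build instructions for the actual flags and targets; the point here is only that the same sources compile against either transport.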
Key Takeaways
- mKernel fuses intra-node NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel.
- Communication overhead reaches up to 47% of execution time in MoE models, per the cited production data.
- Five kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.
- GPU-initiated RDMA is driven directly through libibverbs, with no dependency on NCCL or NVSHMEM.
- It currently requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell support is on the roadmap.



