GPU Software Stacks in AI: Optimization Methods in CUDA, ROCm, Triton, and TensorRT

Deep-learning performance hinges on how well a compiler stack maps tensor programs onto GPU execution: thread/block scheduling, memory movement, and tensor-core instructions. Across CUDA, ROCm, Triton, and TensorRT, it is the combination of kernel fusion, data layout, precision, and runtime scheduling that actually moves the needle in production.
What actually determines performance on a modern GPU
Across vendors, the same levers recur:
- Fusion and compilation: fuse kernels to avoid round trips to HBM, especially long producer → consumer chains. TensorRT and cuDNN expose "runtime fusion engines" that specialize exactly this for attention and pointwise/reduction blocks.
- Data layout and scheduling: align tile shapes to tensor-core / WGMMA / WMMA fragment layouts; avoid shared-memory bank conflicts and partition camping. CUTLASS documents warp-level GEMM tiling for both Tensor Cores and CUDA cores.
- Precision and quantization: FP16/BF16/FP8 for training and inference; INT8/INT4 (calibrated or QAT) for inference. TensorRT's kernel and tactic selection changes under these types.
- Graph capture and runtime scheduling: capture execution graphs to amortize launch overhead; dynamically fuse common subgraphs (e.g., attention). cuDNN 9 added graph-API support for its fusion engines.
- Autotuning: search over tile sizes, warp counts, and pipeline depth per architecture/SKU. Triton and CUTLASS expose explicit autotuning hooks; TensorRT performs tactic selection at build time.
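To see why fusion is the first lever, count bytes. A minimal sketch in pure Python (illustrative element counts and dtype sizes, not measured numbers) comparing HBM traffic for an unfused versus fused elementwise chain:

```python
# Illustrative: DRAM traffic for y = relu(a * x + b) over n fp16 elements.
# Unfused: three kernels (mul, add, relu), each reading and writing n elements.
# Fused: one kernel reads x once and writes y once.

def traffic_bytes(n_elems: int, dtype_bytes: int = 2) -> dict:
    unfused = 3 * (2 * n_elems * dtype_bytes)  # 3 kernels x (read + write)
    fused = 2 * n_elems * dtype_bytes          # single read + single write
    return {"unfused": unfused, "fused": fused, "ratio": unfused / fused}

print(traffic_bytes(1 << 20))  # ratio is 3.0: fusion cuts DRAM traffic 3x
```

The same accounting generalizes to longer producer → consumer chains, which is why runtime fusion engines target them first.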
With that lens, here is how each stack implements the levers above.
CUDA: NVCC/PTXAS, cuDNN, CUTLASS, and CUDA Graphs
Compilation path. Device code goes through NVCC to PTX, then PTXAS lowers PTX to SASS (architecture-specific machine code). Controlling optimization requires flags at both stages; for device-side kernels, pass options through -Xptxas. Engineers often miss that a bare -O3 touches host code only.
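A minimal build invocation illustrating the two-stage flag split (the source filename and -arch value are placeholders; the flags themselves are standard nvcc/ptxas options):

```shell
# Two-stage optimization: -O3 optimizes host code; device SASS is ptxas's job.
# -Xptxas -O3 optimizes the device code; -v reports registers/smem per kernel.
# gemm_kernel.cu and sm_90a stand in for your source file and target arch.
nvcc -O3 -arch=sm_90a -Xptxas -O3,-v gemm_kernel.cu -o gemm_kernel

# For device-level debugging, disable device optimization instead:
nvcc -G -arch=sm_90a gemm_kernel.cu -o gemm_kernel_debug
```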
Kernel generation and libraries.
- CUTLASS provides GEMM/convolution templates that implement warp-level tiling, tensor-core MMA pipelines, and shared-memory iterators with swizzling to avoid bank conflicts, including Hopper's WGMMA path.
- cuDNN 9 introduced runtime fusion engines (notably for attention blocks), graph-API integration for those engines, and updated fused kernels, reducing dispatch overhead.
Performance levers.
- Moving from unfused cuDNN ops to its fused attention engines often cuts kernel count and global memory traffic; combined with CUDA Graphs, this reduces CPU launch bottlenecks in short, repetitive sequences.
- On Hopper/Blackwell, aligning tiles to WGMMA/tensor-core shapes is decisive; CUTLASS materials show how mis-sized tiles leave the tensor-core pipeline underutilized.
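A back-of-envelope tile check can be scripted before touching CUTLASS. The sketch below (pure Python; the MMA fragment shape and shared-memory budget are illustrative Hopper-class numbers, not queried from hardware) verifies that a candidate CTA tile divides evenly into fragments and fits in shared memory:

```python
# Sanity-check a GEMM CTA tile (BM x BN x BK) against an MMA fragment
# shape and a shared-memory budget. All constants are illustrative.

MMA_M, MMA_N, MMA_K = 64, 8, 16   # WGMMA-like fragment shape (assumed)
SMEM_BUDGET = 228 * 1024          # Hopper-class smem per SM, in bytes

def tile_ok(bm: int, bn: int, bk: int, stages: int = 3, dtype_bytes: int = 2):
    aligned = bm % MMA_M == 0 and bn % MMA_N == 0 and bk % MMA_K == 0
    # A (bm x bk) and B (bk x bn) tiles, multi-buffered across pipeline stages.
    smem = stages * (bm * bk + bk * bn) * dtype_bytes
    return aligned and smem <= SMEM_BUDGET, smem

print(tile_ok(128, 128, 64))  # (True, 98304): aligned and well under budget
print(tile_ok(100, 128, 64))  # misaligned in M, so the first field is False
```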
When CUDA is the right tool. You need maximum control over instruction selection, occupancy, and shared-memory choreography, or you are extending kernels beyond what libraries cover while staying on NVIDIA GPUs.
ROCm: the HIP/Clang toolchain, rocBLAS/MIOpen, and the 6.x series
Compilation path. ROCm uses Clang/LLVM to compile HIP (CUDA-like) code down to GCN/RDNA ISA. The 6.x series has focused on performance streamlining and framework coverage; release notes track component-level versioning and hardware/OS support.
Libraries and kernels.
- rocBLAS and MIOpen provide tuned GEMM/convolution primitives with per-architecture algorithm selection, analogous to cuBLAS/cuDNN. The consolidated changelog highlights ongoing tuning work across these libraries.
- Recent ROCm releases improve Triton enablement on AMD GPUs, allowing Python-level kernel authoring while lowering through LLVM to AMD backends.
Performance levers.
- On AMD GPUs, matching LDS (shared memory) banking and vectorized global loads to matrix tile shapes matters as much as smem bank alignment does on NVIDIA. Fused attention implementations in the frameworks, plus library autotuning in rocBLAS/MIOpen, usually close most of the gap to handwritten kernels, driver permitting. Release documentation indicates continued performance work across the 6.x releases.
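Bank-conflict reasoning is plain modular arithmetic. The sketch below (pure Python; assumes 32 banks of 4-byte words, a common configuration for both LDS and NVIDIA shared memory) counts worst-case collisions for a strided column access:

```python
# Worst-case shared-memory/LDS bank conflicts for a column access:
# lane i touches word i * row_stride_words of a row-major tile.
NUM_BANKS = 32  # 32 banks of 4-byte words (assumed configuration)

def max_conflicts(row_stride_words: int, lanes: int = 32) -> int:
    banks = [(lane * row_stride_words) % NUM_BANKS for lane in range(lanes)]
    return max(banks.count(b) for b in set(banks))

print(max_conflicts(32))  # 32: every lane hits the same bank (serialized)
print(max_conflicts(33))  # 1: padding each row by one word removes conflicts
```

Padding the tile row from 32 to 33 words is the classic fix the count illustrates.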
When ROCm is the right tool. You need first-party support and performance on AMD accelerators, where HIP portability from CUDA-style code is already available along with a clean LLVM toolchain.
Triton: a DSL and compiler for custom kernels
Compilation path. Triton is a Python-embedded DSL that lowers through LLVM; it handles vectorization, memory coalescing, and register allocation while giving explicit control over block sizes and program IDs. The build documentation shows the LLVM dependency and customization points; NVIDIA developer posts discuss tuning Triton for newer architectures.
Key tuning knobs.
- Pick tile sizes, num_warps, and num_stages; use boundary masks to handle ragged edges without scalar fallback code; use shared-memory staging and software pipelining to overlap global loads with compute.
- Triton's design automates parts of CUDA-level optimization while leaving block-level choices to the author; the original announcement describes exactly that division of concerns.
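The masking knob can be illustrated without a GPU. In Triton, out-of-range lanes are disabled by a boolean mask passed to loads and stores rather than handled by a scalar tail loop; a plain-Python emulation of that semantics (illustrative only, not Triton's implementation):

```python
# Emulate one Triton-style "program" computing a BLOCK-wide tile of a
# vector add; a mask covers the ragged final block instead of a tail loop.
BLOCK = 8

def vector_add_block(x, y, pid):
    offs = [pid * BLOCK + i for i in range(BLOCK)]
    mask = [o < len(x) for o in offs]   # plays the role of tl.load's mask
    return [x[o] + y[o] for o, m in zip(offs, mask) if m]

x = list(range(10))
y = [1] * 10
out = []
for pid in range((len(x) + BLOCK - 1) // BLOCK):  # the launch grid
    out += vector_add_block(x, y, pid)
print(out)  # [1, 2, ..., 10]: the last program handled only 2 valid lanes
```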
Performance levers.
- Triton shines where you need a fused, bespoke kernel that no library provides (e.g., a custom attention variant); vendor tuning of the Triton backend for new architectures keeps shrinking the penalty versus CUTLASS-style GEMMs.
When Triton is the right tool. You want near-CUDA performance for custom ops without writing SASS/WMMA, and you value Python-first iteration with autotuning.
TensorRT (and TensorRT-LLM): build-time graph optimization for inference
Compilation path. TensorRT ingests ONNX or framework graphs and emits a hardware-specific engine. At build time it performs layer/tensor fusion, precision calibration (INT8, FP8/FP16), and kernel tactic selection; the best-practices documentation describes these build stages. TensorRT-LLM extends this with LLM-specific runtime optimizations.
Key optimizations.
- Graph-level: constant folding, concat elision, conv-activation fusion, fused attention.
- Precision: post-training calibration (entropy/percentile/MSE) with per-tensor scales, plus SmoothQuant-style workflows.
- Runtime: paged KV cache, in-flight batching, and scheduling for speculative decoding and multi-GPU execution (see the TensorRT-LLM documentation).
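A toy paged KV cache makes the runtime idea concrete (pure Python; the block size and API are invented for illustration, not TensorRT-LLM's actual interface): sequences borrow fixed-size blocks from a shared pool instead of reserving max-sequence-length memory up front.

```python
# Toy paged KV cache: each sequence maps fixed-size logical blocks onto
# physical blocks drawn from a shared pool, growing one block at a time.
BLOCK_TOKENS = 16

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.table = {}                      # seq_id -> (block ids, n_tokens)

    def append_token(self, seq_id: int) -> None:
        blocks, n = self.table.get(seq_id, ([], 0))
        if n % BLOCK_TOKENS == 0:            # last block full: map a new one
            blocks = blocks + [self.free.pop()]
        self.table[seq_id] = (blocks, n + 1)

    def release(self, seq_id: int) -> None:  # finished request returns blocks
        blocks, _ = self.table.pop(seq_id)
        self.free.extend(blocks)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token(seq_id=0)
print(len(cache.table[0][0]))  # 3 blocks for 40 tokens, not a max-length slab
```

Because freed blocks return to the pool immediately, short and long requests can share memory, which is what makes in-flight batching dense.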
Performance levers.
- The biggest wins typically come from three things: end-to-end INT8 (or FP8 on Hopper/Blackwell where supported), eliminating framework overhead via a single engine, and fused attention. TensorRT's builder bakes per-architecture tactics into the engine, avoiding per-launch selection overhead at serving time.
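Percentile calibration, one of the PTQ modes mentioned earlier, is easy to sketch (pure Python; TensorRT's actual calibrators are internal, so the function names and the 99.9 default here are illustrative):

```python
# Percentile PTQ: derive a symmetric INT8 scale from the 99.9th percentile
# of |x| on calibration data, clipping rare outliers rather than letting
# them stretch the scale and crush resolution for typical values.

def percentile_scale(values, pct=99.9):
    mags = sorted(abs(v) for v in values)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100.0))
    return mags[idx] / 127.0

def quantize(v, scale):
    return max(-127, min(127, round(v / scale)))

data = [0.01 * i for i in range(-1000, 1001)] + [100.0]  # one wild outlier
s = percentile_scale(data)
print(quantize(10.0, s), quantize(100.0, s))  # 127 127: the outlier clips
```

Had the scale been set from the raw max (100.0), the bulk of the data would occupy only a few INT8 codes; clipping the outlier preserves resolution where the distribution lives.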
When TensorRT is the right tool. Production inference on NVIDIA GPUs, where you can build engines ahead of time and benefit from precision calibration and graph fusion.
Practical playbook: choosing and tuning the stack
- Training vs. inference.
- Training or experimental ops → CUDA + CUTLASS (NVIDIA) or ROCm + rocBLAS/MIOpen (AMD); Triton for custom ops.
- Production inference on NVIDIA → TensorRT/TensorRT-LLM for global graph-level gains.
- Target architecture-native instructions.
- On NVIDIA Hopper/Blackwell, verify that tile shapes map onto WGMMA/WMMA sizes; CUTLASS materials show how warp-level GEMM is organized.
- On AMD, tune LDS usage and vector width to the CU datapaths; leverage ROCm 6.x autotuning and Triton-on-ROCm for custom ops.
- Fuse first, then quantize.
- Kernel/graph fusion reduces memory traffic; quantization reduces bandwidth and increases math density. TensorRT's builder fusions plus INT8/FP8 typically compound the benefits.
- Use graph capture for short, repetitive sequences.
- CUDA Graphs combined with cuDNN attention engines amortize launch costs in autoregressive decoding.
- Treat compiler flags as first-class.
- On CUDA, remember the device-side flags: for example, -Xptxas -O3,-v (and -Xptxas -O0 when debugging). Host-only -O3 is not enough.
Michal Sutter holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistics, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.



