
Breaking the Hardware Barrier: Software FP8 for Older GPUs

As deep learning models grow and datasets expand, practitioners face an increasingly common bottleneck: GPU memory bandwidth. While high-end hardware offers FP8 precision to accelerate training and inference, most data scientists and ML engineers work with older GPUs that lack this capability.

This gap in the ecosystem is what inspired me to build Feather, an open-source library that uses a software-based approach to bring FP8-like performance improvements to widely available hardware. I created this tool to make efficient deep learning more accessible to the wider ML community, and I welcome contributions.

Notation & Abbreviations

  • FPX: X-bit floating-point number
  • UX: X-bit unsigned integer
  • GPU: Graphics processing unit
  • SRAM: Static RAM (on-chip GPU cache)
  • HBM: High-bandwidth memory (GPU VRAM)
  • GEMV: General Matrix-Vector multiplication

Inspiration

FP8 computation has proven to work well in the deep learning community [1]; however, only recent hardware architectures (such as Ada and Blackwell) support it, limiting its benefits to practitioners and researchers with the latest GPUs. I personally have an `Nvidia RTX 3050 6GB Laptop GPU`, which unfortunately does not support FP8 at the hardware level.

Inspired by software-based workarounds such as software rendering (accelerated rendering on machines that lack native hardware acceleration for gaming), this article proposes a solution that can exploit the power of FP8 data types on older hardware.


Pack FP8 & FP16 in FP32 containers

Inspired by bit-packing tricks from performance engineering, the article presents an algorithm that packs two FP16 values or four FP8 values into a single FP32. This lets two or four values occupy the space of one, cutting memory traffic in half or to a quarter, at the cost of a small amount of precision.

One might argue that we are doing unnecessary work: “pack -> load -> unpack -> compute.” However, consider typical deep learning workloads; these tasks are often memory-bound rather than compute-bound. This is the same bottleneck that FlashAttention addresses; however, FlashAttention uses tiling to keep data in fast SRAM, while Feather compresses data to reduce memory traffic.


GPU Memory Hierarchy

GPU memory hierarchy & bandwidth chart (adapted from FlashAttention). (Note: the figures shown do not represent RTX 3050 cards.)

Consider the figure above. SRAM is the fastest-accessible GPU memory tier and has the highest bandwidth (registers aside), but it is limited to only about 20MB. HBM can be viewed as the GPU's VRAM, with approximately 1/7 of SRAM's bandwidth.

GPU cores are fast enough to finish their calculations quickly, but they spend most of their time sitting idle, waiting for data loads and write-backs to complete. This is what memory-bound means: the bottleneck is not the computation, but the data transfer between the memory hierarchy and the GPU.


Low-Precision, High-Bandwidth Types

Most of the time, the values involved in a computation are confined to a range around zero due to normalization. Engineers have therefore developed lower-precision types such as FP8 and FP16, which allow for higher effective bandwidth. One may wonder how reducing precision increases bandwidth: for the same number of bytes transferred, we load two FP16 values or four FP8 values in the space of one FP32. We trade accuracy for higher effective bandwidth to speed up memory-bound operations.

Hardware Level Support

Like the AVX-512 instructions, which are supported on only a limited set of hardware platforms, FP8 and FP16 instructions and registers are also gated by hardware and are only available on recent architectures. If you are on an RTX 30- or RTX 20-series GPU from Nvidia, you cannot take advantage of low-precision FP8 at the hardware level. This is exactly the problem Feather sets out to solve.


Packing Method

Using bitwise operators, one can easily pack two FP16 values into an FP32. The algorithm is described below.

Packing FP16

  • Cast the FP32 inputs to FP16; this step can be done easily using numpy's astype function.
  • Reinterpret them as U16 and then widen to U32; this leaves the upper 16 bits as 0 and the lower 16 bits as the actual FP16 bit pattern.
  • Shift one of them left by 16 using the bitwise LSHIFT operator, and combine the two with the bitwise OR operator.
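The steps above can be sketched in NumPy (a minimal illustration; `pack_fp16` is a hypothetical helper name, not necessarily Feather's actual API):

```python
import numpy as np

def pack_fp16(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pack two FP32 arrays into one U32 array via FP16 truncation."""
    # Cast FP32 -> FP16, reinterpret the bits as U16, then widen to U32.
    a_bits = a.astype(np.float16).view(np.uint16).astype(np.uint32)
    b_bits = b.astype(np.float16).view(np.uint16).astype(np.uint32)
    # Place `b` in the upper 16 bits and `a` in the lower 16 bits.
    return (b_bits << 16) | a_bits

x = np.array([1.5, -2.25], dtype=np.float32)
y = np.array([0.5, 3.0], dtype=np.float32)
packed = pack_fp16(x, y)  # one uint32 word per pair of inputs
```

Note that the `view` calls reinterpret the raw bits rather than converting values; only the initial `astype(np.float16)` loses precision.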

Unpacking FP16

  • Extract the lower 16 bits using the bitwise AND operator with the mask 0xFFFF.
  • Extract the upper 16 bits using the RSHIFT operator by 16, optionally followed by a bitwise AND with the mask 0xFFFF.
  • Reinterpret both U16 values as FP16, and upcast to FP32 if needed.
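A NumPy sketch of the reverse direction (again, `unpack_fp16` is a hypothetical helper name used for illustration):

```python
import numpy as np

def unpack_fp16(packed: np.ndarray):
    """Split each packed U32 word back into its two FP16 halves."""
    lo = (packed & np.uint32(0xFFFF)).astype(np.uint16).view(np.float16)
    hi = (packed >> np.uint32(16)).astype(np.uint16).view(np.float16)
    # Upcast to FP32 when downstream math needs full precision.
    return lo.astype(np.float32), hi.astype(np.float32)

# One packed word: 0x3800 (FP16 0.5) in the upper half, 0x3E00 (FP16 1.5) lower.
packed = np.array([(0x3800 << 16) | 0x3E00], dtype=np.uint32)
lo, hi = unpack_fp16(packed)
```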

Packing FP8

FP8 has two widely used formats: E5M2 and E4M3. One cannot reuse the exact algorithm used to pack two FP16s into an FP32, because the CPU does not natively support FP8 types, though it does support FP16 (half precision); this is why np.float8 does not exist.

Casting FP16 to FP8-E5M2 is straightforward, as seen in the figure, because both have the same number of exponent bits and differ only in their mantissa.

Packing FP8-E5M2

  • Cast the FP32 inputs to FP16; this can be done easily using numpy's astype function, or take the input as FP16 directly.
  • Reinterpret as U16 and RSHIFT by 8 to keep only the upper 8 bits (the E5M2 byte).
  • Do this for every FP32 or FP16 value.
  • Now LSHIFT the four bytes by 0, 8, 16, and 24 bits respectively, and combine them with the bitwise OR operator.
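The E5M2 steps above can be sketched as follows (a minimal NumPy illustration; the helper name is hypothetical):

```python
import numpy as np

def pack_fp8_e5m2(a, b, c, d):
    """Pack four FP32 arrays into one U32 array of FP8-E5M2 bytes."""
    def to_e5m2_byte(x):
        # FP16 and E5M2 share the 5-bit exponent; keeping the upper 8 bits
        # of the FP16 pattern truncates the mantissa from 10 bits to 2.
        return (x.astype(np.float16).view(np.uint16) >> 8).astype(np.uint32)
    # Place the four bytes at bit offsets 0, 8, 16 and 24.
    return (to_e5m2_byte(a)
            | (to_e5m2_byte(b) << 8)
            | (to_e5m2_byte(c) << 16)
            | (to_e5m2_byte(d) << 24))

ones = np.ones(1, dtype=np.float32)  # FP16 1.0 is 0x3C00, so its E5M2 byte is 0x3C
packed = pack_fp8_e5m2(ones, ones, ones, ones)
```

This truncation rounds toward zero; production code might add round-to-nearest logic before dropping the lower mantissa bits.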

Again, unpacking should be straightforward; it is the exact reverse of packing.

Packing FP8-E4M3 is not as simple as packing FP16 or FP8-E5M2, due to the difference in exponent bits.

Instead of implementing the conversion from scratch, the library uses the ml_dtypes library, which already implements these casts.

The ml_dtypes library provides NumPy support for commonly used FP8 formats, including E5M2 and E4M3 casting. Using the same astype function, we can cast just as we do for FP16 types. The packing algorithm is exactly the same as for FP16, so I skip it here.
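A brief sketch of the E4M3 path, assuming ml_dtypes is installed (`pip install ml_dtypes`); the byte-packing layout mirrors the E5M2 case:

```python
import numpy as np
import ml_dtypes  # provides NumPy FP8 dtypes such as float8_e4m3fn and float8_e5m2

x = np.array([0.5, 1.5, -2.0, 3.0], dtype=np.float32)
# The same astype call used for FP16 also works for FP8-E4M3.
fp8 = x.astype(ml_dtypes.float8_e4m3fn)
# Reinterpret each FP8 value as a raw byte, then pack four bytes per U32.
raw = fp8.view(np.uint8).astype(np.uint32)
packed = raw[0::4] | (raw[1::4] << 8) | (raw[2::4] << 16) | (raw[3::4] << 24)
```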


Triton GPU Kernels

After packing, we need an algorithm (kernel) that consumes this packed data and performs the computation. Passing a packed data type to a kernel written for FP32 or FP64 would produce undefined results, because the bit patterns no longer represent valid FP32 or FP64 values. Writing a kernel that takes a packed data type as input in CUDA is not straightforward and is error-prone. This is where Triton shines; it is a domain-specific language library that implements a custom intermediate representation for GPU kernels. In layman's terms, it lets you write GPU kernels in Python itself, without writing CUDA code in C.

The Triton kernels do what was described above; the algorithm is as follows:

  • Load the packed array from memory
  • Unpack it and upcast to FP32 for the accumulator
  • Perform the computation

The reader should note that upcasting is used during computation to prevent overflow. So, from a compute standpoint, there is no benefit. From a bandwidth perspective, however, we load two or four times as many values for the same memory traffic.

Triton Kernel Implementation (pseudocode)
@triton.jit
def gemv_fp8_kernel(packed_matrix_ptr, packed_vector_ptr, out_ptr): 
    # Get current row to process
    row_id = get_program_id()
    
    # Initialize accumulator for dot product
    accumulator = 0
    
    # Iterate over row in blocks
    for each block in row:
        # Load packed FP32 values (each contains 4 FP8s)
        packed_matrix = load(packed_matrix_ptr)
        packed_vector = load(packed_vector_ptr)
        
        # Unpack the FP32 into 4 FP8 values
        m_a, m_b, m_c, m_d = unpack_fp8(packed_matrix)
        v_a, v_b, v_c, v_d = unpack_fp8(packed_vector)
        
        # Upcast to FP32 and compute partial dot products
        accumulator += (m_a * v_a) + (m_b * v_b) + (m_c * v_c) + (m_d * v_d)
    
    # Store final result
    store(out_ptr, accumulator)
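For readers without a compatible GPU, the kernel's unpack-and-accumulate logic can be checked with a NumPy reference (an illustration of the same idea, not Feather's actual kernel; helper names are hypothetical):

```python
import numpy as np

def unpack_e5m2_lanes(packed):
    """Split packed U32 words into four FP32 'lanes' (one per E5M2 byte)."""
    lanes = []
    for shift in (0, 8, 16, 24):
        byte = ((packed >> shift) & 0xFF).astype(np.uint16)
        # Restore the E5M2 byte to the upper half of an FP16 bit pattern.
        lanes.append((byte << 8).view(np.float16).astype(np.float32))
    return lanes

def gemv_packed(packed_matrix, packed_vector):
    """GEMV on packed data: unpack, upcast to FP32, accumulate."""
    acc = np.zeros(packed_matrix.shape[0], dtype=np.float32)
    for m, v in zip(unpack_e5m2_lanes(packed_matrix),
                    unpack_e5m2_lanes(packed_vector)):
        acc += m @ v  # partial dot product per lane, accumulated in FP32
    return acc

# All-ones test data: 0x3C is the E5M2 byte for 1.0.
pm = np.full((2, 2), 0x3C3C3C3C, dtype=np.uint32)  # 2 rows x 8 unpacked values
pv = np.full(2, 0x3C3C3C3C, dtype=np.uint32)       # 8 unpacked values
out = gemv_packed(pm, pv)
```

Each row dot product sums eight 1.0 values, so every output entry should equal 8.0.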

Results

Hardware: NVIDIA GeForce RTX 3050 6GB VRAM

CUDA version: 13.0

Python version: 3.13.9

GEMV Benchmark (M = 16384, N = 16384) (MxN matrix)

Implementation         Time (seconds)   Speedup
PyTorch (FP32)         5.635            (baseline)
Feather (FP8-E4M3)     2.703            2.13x
Feather (FP8-E5M2)     1.679            3.3x

The theoretical maximum speedup is 4x; 3.3x is quite good in comparison, with the remaining overhead mainly coming from pack/unpack operations and kernel launch costs.

E5M2 is faster than E4M3 because it is simpler to unpack; E4M3 offers better accuracy but is more complicated to decode (Feather uses a separate GPU kernel to decode the E4M3 format).

Flash Attention Benchmark (Sequence length = 8192, Embed size = 512)

Implementation         Time (seconds)   Speedup
PyTorch (FP32)         33.290           (baseline)
Feather (FP8-E5M2)     9.887            ~3.3x

Accuracy and Precision

Testing with random matrices (uniform distributions over the range [-3, 3] and a standard normal distribution) shows that both E4M3 and E5M2 keep numerical results within reasonable tolerances for deep learning workloads. Accumulation error remains manageable for typical workload sizes; however, users who require strict numerical accuracy should validate their exact use case.


When should you use Feather?

Feather's use cases are not narrow; one can use Feather wherever packing and unpacking FP8 is beneficial, such as:

  • Large matrix-vector products, where memory loads and stores are the constraint.
  • Memory-bound kernels such as attention.
  • Inference or fine-tuning on RTX 30- or 20-series GPUs.
  • Batch processing, where packing overhead is amortized.

When should you not use Feather?

  • You have an RTX 40-series or H100 GPU (native FP8 is faster).
  • Your workloads are compute-bound rather than bandwidth- or memory-bound.
  • You need guaranteed accuracy.

Feather Limitations

Feather is currently in the early prototyping stage, with several areas to improve:

  • Limited operation support; currently, Feather only supports the dot product, a GEMV subroutine, and FlashAttention.
  • Accuracy guarantees for complete ML workloads; currently, Feather's accuracy is only validated for individual operations, not end-to-end ML workloads.
  • Integration is currently limited; Feather is a standalone implementation. Integration with PyTorch and autograd support would make it production-ready.

The project is open source; community contributions are welcome! You can try the code by following the instructions on GitHub.

Image license: All images are created by the author. Sources of adaptation are clearly cited in the respective captions.
