
Writing Your First GPU Kernel in Python with Numba and CUDA

Image by Author | Ideogram

GPUs are good for jobs where you need to perform the same operation on different pieces of data. This is known as the Single Instruction, Multiple Data (SIMD) paradigm. Unlike CPUs, which have only a few powerful cores, GPUs have thousands of smaller cores that can run these repeated operations at the same time. You will see this pattern often in machine learning, for example when adding or multiplying large vectors, because each element-wise computation is independent. This makes it an ideal scenario for using GPUs to speed up jobs.

NVIDIA created CUDA as a framework for engineers to write programs that run on the GPU instead of the CPU. It is based on C and allows you to write special functions called kernels that can perform many operations in parallel. The problem is that writing kernels in C or C++ is not straightforward. You have to deal with things like manual memory allocation, thread indexing, and understanding how the GPU works at a low level. This can be especially difficult if you are used to writing Python code.

That's where Numba can help. It provides bindings to CUDA from Python, using the LLVM (Low Level Virtual Machine) compiler infrastructure to translate your Python code into CUDA-compatible kernels. With just-in-time (JIT) compilation, you simply add a decorator to your functions, and Numba handles everything else for you.

In this article, we will use the standard example of vector addition and convert simple CPU code into a CUDA kernel with Numba. Vector addition is a good example of parallelism, since the addition at any one index is independent of all other indices. This is a perfect SIMD scenario, so all indices can be added at the same time, completing the vector addition in a single parallel operation.

Note that you will need a CUDA-enabled GPU to follow this article. You can use Colab's free T4 GPU, or a local GPU with the NVIDIA CUDA Toolkit and NVCC installed.

// Setting Up the Environment and Installing Numba

Numba is available as a Python package, and you can install it with pip. In addition, we will use NumPy for the vector operations. Set up the Python environment using the following commands:

python3 -m venv venv
source venv/bin/activate
pip install numba-cuda numpy

// Vector Addition on the CPU

Let's start with a simple example of vector addition. Given two vectors, we add the corresponding values at each index to produce the final sum. We will use NumPy to generate random vectors:

import numpy as np 

N = 10_000_000 # 10 million elements 
a = np.random.rand(N).astype(np.float32) 
b = np.random.rand(N).astype(np.float32) 
c = np.zeros_like(a) # Output array 

def vector_add_cpu(a, b, c): 
    """Add two vectors on CPU""" 
    for i in range(len(a)): 
        c[i] = a[i] + b[i]

def time_cpu(): 
    c_cpu = np.zeros_like(a) 
    vector_add_cpu(a, b, c_cpu) 
    return c_cpu

Here is a breakdown of the code:

  • We create two random vectors a and b, each with ten million elements
  • We also create an empty vector c to hold the result
  • The vector_add_cpu function simply loops over each index and adds the elements from a and b, storing the result in c

This is a serial implementation; each addition happens one after another. While this works, it is not the most efficient approach, especially for large datasets. Since each addition is independent of the others, this is a perfect candidate for GPU parallelization.
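As an aside, NumPy can already perform this addition in vectorized form (`a + b`), which pushes the loop into optimized C code on the CPU. A minimal sketch comparing the two approaches (the array size here is smaller than the article's, purely for illustration):

```python
import numpy as np

N = 100_000
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)

# Explicit Python loop, same logic as vector_add_cpu
c_loop = np.zeros_like(a)
for i in range(len(a)):
    c_loop[i] = a[i] + b[i]

# Vectorized form: one call, the loop runs in optimized C
c_vec = a + b

print(np.allclose(c_loop, c_vec))  # True
```

The explicit loop is kept in this article because it maps one-to-one onto the per-element work a GPU kernel performs, which makes the CPU-to-GPU conversion easier to follow.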

In the following section, you will see how to port this function to the GPU using Numba. By distributing the element-wise additions across thousands of GPU threads, we can complete the work much faster.

// Vector Addition on the GPU with Numba

Now you will use Numba to mark a Python function for execution on the GPU, and launch it from within Python. We perform the same vector addition, but now each array index can be processed in parallel, resulting in much faster execution.

Here is the kernel code:

from numba import config

# Required for newer CUDA versions to enable linking tools. 
# Prevents CUDA toolkit and NVCC version mismatches.
config.CUDA_ENABLE_PYNVJITLINK = 1

from numba import cuda, float32

@cuda.jit
def vector_add_gpu(a, b, c):
	"""Add two vectors using CUDA kernel"""
	# Thread ID in the current block
	tx = cuda.threadIdx.x
	# Block ID in the grid
	bx = cuda.blockIdx.x
	# Block width (number of threads per block)
	bw = cuda.blockDim.x

	# Calculate the unique thread position
	position = tx + bx * bw

	# Make sure we don't go out of bounds
	if position < len(a):
		c[position] = a[position] + b[position]

def gpu_add(a, b, c):
	# Define the grid and block dimensions
	threads_per_block = 256
	blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

	# Copy data to the device
	d_a = cuda.to_device(a)
	d_b = cuda.to_device(b)
	d_c = cuda.to_device(c)

	# Launch the kernel
	vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

	# Copy the result back to the host
	d_c.copy_to_host(c)

def time_gpu():
	c_gpu = np.zeros_like(a)
	gpu_add(a, b, c_gpu)
	return c_gpu

Let's break down what is happening above.

// Understanding the GPU Kernel

The @cuda.jit decorator tells Numba to compile the following function as a CUDA kernel: a special function that will run as many parallel threads on the GPU. When the kernel is launched, Numba compiles the function into GPU-compatible code and manages the CUDA C API calls for you.

@cuda.jit
def vector_add_gpu(a, b, c):
	...

This function will be executed by thousands of threads at the same time, but we need a way for each thread to determine which part of the data it should work on. That is what the next few lines do:

  • tx is the thread ID within its block
  • bx is the block ID within the grid
  • bw is the number of threads per block

We combine these to compute a unique global position, which tells each thread which element of the array it should add. Note that the grid may launch more threads than there are elements, since the number of blocks is rounded up and block sizes are typically powers of two. This can produce indices beyond the end of the array when the vector length is not a multiple of the block size. We therefore add a bounds check before performing the addition, which prevents any runtime errors from out-of-bounds array access.
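To make the indexing concrete, here is the same position arithmetic worked through in plain Python for a toy configuration (4 threads per block and an array of length 10; the numbers are purely illustrative):

```python
# Toy configuration: 10 elements, 4 threads per block
N = 10
threads_per_block = 4

# Round up so every element is covered: ceil(10 / 4) = 3 blocks
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

positions = []
for bx in range(blocks_per_grid):           # block ID in the grid
    for tx in range(threads_per_block):     # thread ID within the block
        position = tx + bx * threads_per_block
        positions.append(position)

print(positions)                        # 12 positions: 0 through 11
print([p for p in positions if p < N])  # only 0 through 9 pass the bounds check
```

The last two positions (10 and 11) fall outside the array, which is exactly why the kernel's `if position < len(a)` guard is needed.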

Once each thread knows its unique position, it adds the values just as we did in the CPU version. The following line matches the CPU implementation:

c[position] = a[position] + b[position]

// Launching the Kernel

The gpu_add function sets things up:

  • It defines how many threads per block and blocks per grid to use. You can experiment with different block and grid sizes, and print the computed indices inside the GPU kernel, to build intuition for how GPU indexing works.
  • It copies the input arrays (a, b, and c) from CPU memory to GPU memory, so the vectors are available in GPU RAM.
  • It launches the GPU kernel with vector_add_gpu[blocks_per_grid, threads_per_block].
  • Finally, it copies the result back from the GPU into the c array, so we can access the values on the CPU.
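For the sizes used in this article, the ceiling division in the first step works out as follows (a quick check in plain Python):

```python
N = 10_000_000
threads_per_block = 256

# Ceiling division: round up so the grid covers every element
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block
total_threads = blocks_per_grid * threads_per_block

print(blocks_per_grid)  # 39063
print(total_threads)    # 10000128
```

The grid launches 128 more threads than there are elements; those spare threads are filtered out by the kernel's bounds check.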

// Comparing Implementations and Potential Speedup

Now that we have both the CPU and GPU implementations, it is time to see how they compare. We want to verify that the results match, and to measure the speedup we can achieve through CUDA parallelism.

import timeit

c_cpu = time_cpu()
c_gpu = time_gpu()

print("Results match:", np.allclose(c_cpu, c_gpu))

cpu_time = timeit.timeit("time_cpu()", globals=globals(), number=3) / 3
print(f"CPU implementation: {cpu_time:.6f} seconds")

gpu_time = timeit.timeit("time_gpu()", globals=globals(), number=3) / 3
print(f"GPU implementation: {gpu_time:.6f} seconds")

speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")

First, we run both versions and check whether their results match. This is important to confirm that our GPU code is correct and produces the same output as the CPU version.
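We use np.allclose rather than exact equality because floating-point arithmetic can differ slightly between devices; np.allclose compares values within a small tolerance. A classic illustration (using float64, where the rounding is easy to demonstrate):

```python
import numpy as np

x = 0.1 + 0.2
print(x == 0.3)             # False: x is 0.30000000000000004 due to rounding
print(np.allclose(x, 0.3))  # True: equal within the default tolerance
```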

Next, we use Python's built-in timeit module to measure how long each version takes. We run each function a few times and take the average execution time. Finally, we compute how much faster the GPU version is compared to the CPU version. You should see a large difference, because the GPU performs many additions at the same time, while the CPU processes them one at a time in a loop.
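A note on the timing approach: timeit.timeit(..., number=3) / 3 gives an average over three runs. The standard library also offers timeit.repeat, and taking the minimum of several repeats is a common way to reduce noise from background system activity. A small sketch on a stand-in workload (the sum over a range here is purely illustrative):

```python
import timeit

def work():
    return sum(range(100_000))

# Average of 3 runs, as used in this article
avg = timeit.timeit(work, number=3) / 3

# Minimum of 5 repeats of 3 runs each: less sensitive to system noise
best = min(timeit.repeat(work, number=3, repeat=5)) / 3

print(f"avg: {avg:.6f}s, best: {best:.6f}s")
```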

Here is the expected output on an NVIDIA T4 GPU on Colab. Note that exact speedups can vary depending on CUDA versions and the underlying hardware.

Results match: True
CPU implementation: 4.033822 seconds
GPU implementation: 0.047736 seconds
GPU speedup: 84.50x

This simple test demonstrates the power of the GPU, and why it is so useful for jobs that involve large amounts of data with the same operation applied to each element.

// Wrapping Up

And that's it. You have now written your first CUDA kernel using Numba, without writing any C or CUDA code. Numba provides a simple interface for using the GPU from Python, making it very easy for Python engineers to get started with CUDA programming.

You can now use the same template to write more advanced CUDA algorithms, which are common in machine learning and deep learning. Whenever you find a problem that follows the SIMD paradigm, it is a good idea to consider using the GPU to improve execution speed.

The full code is available in a Colab notebook you can access here. Feel free to experiment and make small changes to build a better understanding of GPU indexing and execution.

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
