The NumPy API on the GPU?

Is this the future of numerical computing in Python?
Toward the end of last year, NVIDIA made an important announcement about the future of Python-based GPU computing. I wouldn't be surprised if you missed it. After all, it arrived amid the constant stream of announcements from seemingly every AI company, all of which are presented as mega-important.
That announcement introduced cuPyNumeric, a drop-in replacement for NumPy built on top of NVIDIA's Legate framework.
Who is NVIDIA?
Most people will know NVIDIA for the blisteringly fast chips that power supercomputers and data centres worldwide. You may also be familiar with NVIDIA's charismatic, leather-jacket-clad CEO, Jensen Huang, who has become one of the most recognisable faces of the AI boom.
What fewer people know is that NVIDIA also designs and builds software. One of its most important products is the Compute Unified Device Architecture (CUDA). CUDA is NVIDIA's parallel computing platform and programming model. Since its launch in 2007, it has grown into a wide-ranging stack of drivers, runtimes, compilers, libraries, debugging tools and profilers. The result is a tightly integrated hardware-and-software ecosystem that keeps NVIDIA GPUs at the centre of today's high-performance computing and AI workloads.
What is Legate?
Legate is an open-source runtime system from NVIDIA that lets Python scientific libraries scale transparently. It intercepts high-level operations, turns them into a task graph, and hands that graph to a C++ runtime that schedules the work, partitions the data, and moves tiles between CPUs, GPUs and the network links connecting them.
In short, Legate lets familiar single-node Python data science libraries run on distributed, accelerated hardware with little or no code change.
What is cuPyNumeric?
cuPyNumeric is a drop-in replacement for NumPy: its array operations are dispatched to the Legate task engine and accelerated on one or more NVIDIA GPUs (or, if no GPU is available, run on CPUs). Once installed, it requires switching just one import line to start using it in place of your regular NumPy code. For example …
# old
import numpy as np
...
...
# new
import cupynumeric as np # everything else stays the same
...
...
… and then run your script with the legate command.
Behind the scenes, cuPyNumeric converts each NumPy call it encounters (for example np.sin, np.linalg.svd, fancy indexing, reductions, etc.) into Legate tasks. Those tasks will,
- Partition your arrays into tiles sized to fit GPU memory.
- Schedule each tile on the best available device (GPU or CPU).
- Handle the communication required when the workload spans multiple GPUs or nodes.
- Spill tiles to NVMe/SSD automatically when your dataset outgrows GPU RAM.
Because cuPyNumeric's API matches NumPy's almost 1:1, existing scientific code can scale from a laptop to a multi-GPU cluster without being rewritten.
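Because the APIs match, a script can stay portable across machines that do and don't have the GPU stack installed. One common defensive pattern (my own sketch, not an official NVIDIA recipe) is to fall back to plain NumPy when cuPyNumeric isn't available:

```python
# Fall back to plain NumPy when cuPyNumeric is not installed, so the
# same script still runs on a laptop without a GPU stack.
try:
    import cupynumeric as np  # accelerated by Legate when launched via `legate`
except ImportError:
    import numpy as np        # plain CPU fallback

# Everything below is ordinary NumPy-style code, whichever module loaded.
a = np.arange(6).reshape(2, 3)
total = a.sum()
print(total)  # → 15
```

Either module executes the same array code; only the import line decides where the work runs.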
The claimed benefits
So, all of this sounds great, right? But it only matters if it translates into a visible speed-up over plain NumPy, and NVIDIA makes some solid claims that it does. As data scientists, machine learning engineers and data engineers, many of us use NumPy heavily, so this could make a real difference to the programs we write and maintain.
Now, I don't have a GPU cluster or a supercomputer to test this on, but my desktop PC has an NVIDIA RTX 4070 Ti GPU, so we'll use that to put some of NVIDIA's claims to the test.
(base) tom@tpr-desktop:~$ nvidia-smi
Sun Jun 15 15:26:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.75 Driver Version: 566.24 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti On | 00000000:01:00.0 On | N/A |
| 32% 29C P8 9W / 285W | 1345MiB / 12282MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I'll install cuPyNumeric and NumPy on my PC and run some comparative benchmarks. That will help us check whether NVIDIA's claims hold up and get a feel for the difference between the two libraries.
Setting up the development environment.
As always, I like to set up a separate development environment for my experiments. That way, nothing I do there can affect any of my other projects. At the time of writing, cuPyNumeric is not available as a native Windows install, so I'll use WSL2 under Windows instead.
I'll be using Miniconda to set up the environment, but feel free to use whichever tool you prefer.
$ conda create -n cunumeric-env python=3.10 -c conda-forge
$ conda activate cunumeric-env
$ conda install -c conda-forge -c legate cupynumeric
$ conda install -c conda-forge ucx cuda-cudart cuda-version=12
Code Example 1 – Simple matrix multiplication
Matrix multiplication is the bread and butter of many AI workloads, so it makes sense to try that operation first.
Note that in all of my examples, I run both the NumPy and cuPyNumeric code five times in sequence and take an average of the timings. I also do a warm-up run on the GPU before the timed runs to absorb one-off overheads such as just-in-time (JIT) compilation.
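That warm-up-then-average pattern can be distilled into a small helper, shown here purely as an illustrative sketch (the benchmark scripts that follow inline this logic rather than importing a helper like this):

```python
import time

def timeit_avg(fn, runs=5, warmup=1):
    """Average wall-clock time of fn() over `runs` timed calls,
    after `warmup` untimed calls to absorb one-off costs (JIT,
    cache warming, lazy initialisation)."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Example: time a cheap pure-Python workload
avg = timeit_avg(lambda: sum(range(100_000)))
print(f"average = {avg:.6f}s")
```

For GPU code the timed callable must also synchronise (for example by copying the result back to the host), otherwise you only measure how long it takes to *launch* the work.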
import time
import gc
import argparse
import sys


def benchmark_numpy(n, runs):
    """Runs the matrix multiplication benchmark using standard NumPy on the CPU."""
    import numpy as np

    print("--- NumPy (CPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print(f"Generating two {n}x{n} random matrices on CPU...")
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = np.matmul(A, B)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed. The @ operator is a convenient
        # shorthand for np.matmul.
        C = A @ B
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C  # Clean up the result matrix
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.4f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the matrix multiplication benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Import numpy for the canonical sync

    print("--- cuNumeric (GPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print(f"Generating two {n}x{n} random matrices on GPU...")
    A = cn.random.rand(n, n).astype(np.float32)
    B = cn.random.rand(n, n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    C_warmup = cn.matmul(A, B)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(C_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        C = A @ B
        # Synchronize by converting the result to a host-side NumPy array.
        np.array(C)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.4f}s\n")
    return avg


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark matrix multiplication on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # The dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
Running the NumPy side of things is a regular `python example1.py` command-line invocation. Under Legate, the syntax is slightly more involved. The command below disables Legate's automatic configuration and launches example1.py under Legate with a single CPU, one GPU, and zero OpenMP threads, using cuPyNumeric.
Here's the output.
(cunumeric-env) tom@tpr-desktop:~$ python example1.py
--- NumPy (CPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0976s
Run 2: time = 0.0987s
Run 3: time = 0.0957s
Run 4: time = 0.1063s
Run 5: time = 0.0989s
NumPy average: 0.0994s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric
[0 - 7f2e8fcc8480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f2e8fcc8480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f2e8fcc8480] 0.000049 {4}{threads}: reservation ('GPU ctxsync 0x55cd5fd34530') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0113s
Run 2: time = 0.0089s
Run 3: time = 0.0086s
Run 4: time = 0.0090s
Run 5: time = 0.0087s
cuNumeric average: 0.0093s
Yes, that is a happy start. cuPyNumeric posts a roughly 10x speed-up over NumPy.
The Legate warnings can safely be ignored. They're informational, indicating that Legate couldn't detect the CPU/memory (NUMA) topology or reserve enough CPU cores to manage the GPU.
Code Example 2 – Logistic regression
Logistic regression is a staple of data science because it provides a simple, interpretable model for binary outcomes (yes/no, click/no-click). In this example, we'll measure how long it takes to train one on a synthetic dataset. The script first generates N samples with D features (X) and corresponding 0/1 labels (y). It initialises the weight vector w to zeros and then performs 500 gradient descent iterations: compute the scores z = X.dot(w), apply the sigmoid p = 1 / (1 + exp(-z)), compute the gradient grad = X.T.dot(p - y) / N, and update the weights with w -= 0.1 * grad. The script records the elapsed time for each run, cleans up memory, and finally prints the average training time.
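For reference, the gradient used in that update falls out of differentiating the average cross-entropy loss. This is standard textbook material rather than anything specific to cuPyNumeric:

```latex
L(w) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\Big],
\qquad p_i = \sigma(z_i) = \frac{1}{1+e^{-z_i}}, \quad z_i = x_i^{\top} w .
```

Using $\sigma'(z) = \sigma(z)\big(1-\sigma(z)\big)$, the terms collapse neatly:

```latex
\nabla_w L = \frac{1}{N}\sum_{i=1}^{N} (p_i - y_i)\, x_i
           = \frac{1}{N}\, X^{\top}(p - y),
```

which is exactly `grad = X.T.dot(p - y) / X.shape[0]` in the code, and the gradient descent step $w \leftarrow w - \alpha \nabla_w L$ matches `w -= alpha * grad` with learning rate $\alpha = 0.1$.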
import time
import gc
import argparse
import sys


# --- Reusable Training Function ---
# By putting the training loop in its own function, we avoid code duplication.
# The `np` argument allows us to pass in either the numpy or cupynumeric module.
def train_logistic_regression(np, X, y, iters, alpha):
    """Performs a set number of gradient descent iterations."""
    # Ensure w starts on the correct device (CPU or GPU)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        z = X.dot(w)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T.dot(p - y) / X.shape[0]
        w -= alpha * grad
    return w


def benchmark_numpy(n_samples, n_features, iters, alpha):
    """Runs the logistic regression benchmark using standard NumPy on the CPU."""
    import numpy as np

    print("--- NumPy (CPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random dataset on CPU...")
    X = np.random.rand(n_samples, n_features)
    y = (np.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = train_logistic_regression(np, X, y, iters, alpha)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs (`args` is read from module scope).
    times = []
    for i in range(args.runs):
        start = time.time()
        # The operation being timed
        _ = train_logistic_regression(np, X, y, iters, alpha)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.3f}s\n")
    return avg


def benchmark_cunumeric(n_samples, n_features, iters, alpha):
    """Runs the logistic regression benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print("--- cuNumeric (GPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print("Generating random dataset on GPU...")
    X = cn.random.rand(n_samples, n_features)
    y = (cn.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    w_warmup = train_logistic_regression(cn, X, y, iters, alpha)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(w_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs (`args` is read from module scope).
    times = []
    for i in range(args.runs):
        start = time.time()
        # Launch the operation on the GPU
        w = train_logistic_regression(cn, X, y, iters, alpha)
        # Synchronize by converting the final result back to a NumPy array.
        np.array(w)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        del w
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.3f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark logistic regression on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    # Hyperparameters for the model
    parser.add_argument(
        "-n", "--n_samples", type=int, default=2_000_000, help="Number of data samples"
    )
    parser.add_argument(
        "-d", "--n_features", type=int, default=10, help="Number of features"
    )
    parser.add_argument(
        "-i", "--iters", type=int, default=500, help="Number of gradient descent iterations"
    )
    parser.add_argument(
        "-a", "--alpha", type=float, default=0.1, help="Learning rate"
    )
    # Benchmark control
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # Dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n_samples, args.n_features, args.iters, args.alpha)
    else:
        benchmark_numpy(args.n_samples, args.n_features, args.iters, args.alpha)
And the results.
(cunumeric-env) tom@tpr-desktop:~$ python example2.py
--- NumPy (CPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 12.292s
Run 2: time = 11.830s
Run 3: time = 11.903s
Run 4: time = 12.843s
Run 5: time = 11.964s
NumPy average: 12.166s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example2.py --cunumeric
[0 - 7f04b535c480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f04b535c480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f04b535c480] 0.001149 {4}{threads}: reservation ('GPU ctxsync 0x55fb037cf140') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 1.964s
Run 2: time = 1.957s
Run 3: time = 1.968s
Run 4: time = 1.955s
Run 5: time = 1.960s
cuNumeric average: 1.961s
Not as impressive as our first example, but a 5x to 6x speed-up over the existing NumPy code is not to be sniffed at.
Code Example 3 – Solving a linear system
This script benchmarks how long it takes to solve a dense 3000×3000 system of linear equations. Solving a system of the form Ax = b, where A is a large grid of numbers (a 3000×3000 matrix in this case) and b is a known vector, is a fundamental operation in linear algebra.
The goal is to find the unknown vector x that makes the equation true. This computationally demanding operation sits at the heart of many scientific simulations, engineering problems, financial models, and AI algorithms.
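One thing the benchmark deliberately doesn't do is check the answer. A sanity check worth running once is to confirm the residual Ax − b is essentially zero. The sketch below uses plain NumPy on a smaller system (my addition, not part of the benchmark script); because the APIs match, the same lines work under cuPyNumeric:

```python
import numpy as np

# Small stand-in system (the benchmark itself uses n = 3000)
rng = np.random.default_rng(42)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Solve A x = b, then verify that substituting x back in reproduces b
x = np.linalg.solve(A, b)
ok = np.allclose(A @ x, b)
print(ok)  # → True: the residual is down at machine precision
```

If both backends pass this check, we can be confident the timing comparison is between equivalent computations.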
import time
import gc
import argparse
import sys  # Import sys to check arguments

# Note: The library imports (numpy and cupynumeric) are done *inside*
# their respective functions to keep them separate and avoid import errors.


def benchmark_numpy(n, runs):
    """Runs the linear solve benchmark using standard NumPy on the CPU."""
    import numpy as np

    print("--- NumPy (CPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random system on CPU...")
    A = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)

    # 2. Perform one untimed warm-up run. This is good practice even for
    # the CPU to ensure caches are warm and any one-time setup is done.
    print("Performing warm-up run...")
    _ = np.linalg.solve(A, b)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        x = np.linalg.solve(A, b)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the result to be safe with memory
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the linear solve benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print("--- cuNumeric (GPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    # This ensures we are not timing the data transfer in our main loop.
    print("Generating random system on GPU...")
    A = cn.random.randn(n, n).astype(np.float32)
    b = cn.random.randn(n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run. This handles JIT
    # compilation and other one-time GPU setup costs.
    print("Performing warm-up run...")
    x_warmup = cn.linalg.solve(A, b)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(x_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        x = cn.linalg.solve(A, b)
        # Synchronize by converting the result to a host-side NumPy array.
        # This is guaranteed to block until the GPU has finished.
        np.array(x)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the GPU array result
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark linear solve on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    # Use parse_known_args() to handle potential extra arguments from Legate
    args, unknown = parser.parse_known_args()

    # The dispatcher logic: check if "--cunumeric" is in the command line
    # This is a simple and effective way to switch between modes.
    if "--cunumeric" in sys.argv or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
Results.
(cunumeric-env) tom@tpr-desktop:~$ python example4.py
--- NumPy (CPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.133075s
Run 2: time = 0.126129s
Run 3: time = 0.135849s
Run 4: time = 0.137383s
Run 5: time = 0.138805s
NumPy average: 0.134248s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example4.py --cunumeric
[0 - 7f29f42ce480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f29f42ce480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f29f42ce480] 0.000053 {4}{threads}: reservation ('GPU ctxsync 0x562e88c28700') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.009685s
Run 2: time = 0.010043s
Run 3: time = 0.009966s
Run 4: time = 0.009739s
Run 5: time = 0.009383s
cuNumeric average: 0.009763s
That is a striking result. The cuPyNumeric run averages 0.0098s against NumPy's 0.134s, a roughly 14x speed-up.
Code Example 4 – Sorting
Sorting is a fundamental part of many programs, and modern computers are so fast at it that most developers rarely give it a second thought. But let's see how much difference using cuPyNumeric can make on this common task. We'll sort a large (30,000,000-element) 1D array of random floats.
# benchmark_sort.py
import time
import sys
import gc

# Array size
n = 30_000_000  # 30 million elements


def benchmark_numpy():
    import numpy as np

    print(f"Sorting an array of {n} elements with NumPy (5 runs)\n")
    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")


def benchmark_cunumeric():
    import cupynumeric as np

    print(f"Sorting an array of {n} elements with cuNumeric (5 runs)\n")
    times = []
    for i in range(5):
        data = np.random.randn(n).astype(np.float32)
        start = time.time()
        _ = np.sort(data)
        # Force GPU sync
        _ = np.linalg.norm(np.zeros(()))
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        del data
        gc.collect()
    _ = np.linalg.norm(np.zeros(()))
    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")


if __name__ == "__main__":
    if "--cunumeric" in sys.argv:
        benchmark_cunumeric()
    else:
        benchmark_numpy()
Results.
(cunumeric-env) tom@tpr-desktop:~$ python example5.py
--- NumPy (CPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.588777s
Run 2: time = 0.586813s
Run 3: time = 0.586745s
Run 4: time = 0.586525s
Run 5: time = 0.583783s
NumPy average: 0.586529s
-----------------------------
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example5.py --cunumeric
[0 - 7fd9e4615480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7fd9e4615480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7fd9e4615480] 0.000082 {4}{threads}: reservation ('GPU ctxsync 0x564489232fd0') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.010857s
Run 2: time = 0.007927s
Run 3: time = 0.007921s
Run 4: time = 0.008240s
Run 5: time = 0.007810s
cuNumeric average: 0.008551s
-------------------------------
Another impressive showing from cuPyNumeric and Legate.
Summary
This article introduced cuPyNumeric, an NVIDIA library designed as a high-performance, drop-in replacement for NumPy. The key takeaway is that data scientists can accelerate their existing Python code on NVIDIA GPUs with minimal effort, usually by changing a single import line and launching their script with the legate command.
Two main components underpin the technology:
- Legate: an open-source runtime layer from NVIDIA that automatically translates high-level Python operations into a distributed task graph. It transparently spreads these tasks across one or more GPUs, handling data partitioning, memory management (even spilling to disk when needed), and communication.
- cuPyNumeric: the user-facing library that mirrors the NumPy API. When you make a call like np.matmul(), cuPyNumeric turns it into Legate tasks that execute on the GPU.
I was able to put NVIDIA's performance claims to the test by running my own benchmarks on a desktop PC with an NVIDIA RTX 4070 Ti GPU, a common consumer-grade setup, comparing GPU runs against the CPU.
The results showed significant performance advantages for cuPyNumeric:
- Matrix multiplication: ~10x speed-up.
- Logistic regression training: ~6x speed-up.
- Solving a linear system: ~14x speed-up.
- Sorting a large array: a huge improvement, running nearly 70x faster.
In conclusion, cuPyNumeric delivers effectively on its promise, making large-scale GPU computing accessible to the scientific Python community without a steep learning curve or a complete code rewrite.
For more information and links to related resources, check out NVIDIA's original cuPyNumeric announcement.



