
Testing LLMs with MLX and the neural accelerators in the M5 GPU

Macs with Apple silicon are increasingly popular among AI developers and researchers who want to try out the latest models and techniques on their own machines. With MLX, users can run LLMs efficiently on a Mac. It lets researchers prototype new methods, experiment with optimization techniques, and investigate models in a private environment, on their own hardware. MLX works on all Apple silicon systems and the latest macOS beta releases[1], and it now takes advantage of the neural accelerators in the new M5 chip, introduced in the 14-inch MacBook Pro. The neural accelerators provide dedicated matrix-multiplication operations, which are central to many machine learning workloads, and enable faster inference on Apple silicon, as shown in this post.

What is MLX

MLX is an open source array framework that is efficient, flexible, and highly tuned for Apple silicon. You can use MLX for a wide variety of applications, ranging from numerical simulation and scientific computing to machine learning. MLX comes with built-in support for neural network training and inference, including text and image generation. MLX makes it easy to generate text with, or fine-tune, large language models on Apple silicon devices.

MLX takes advantage of Apple silicon's unified memory architecture. Operations in MLX can run on either the CPU or the GPU without needing to move memory around. The API is intuitive and flexible. MLX also has higher-level neural network and optimizer packages, along with function transformations for automatic differentiation and graph optimization.

Getting started with MLX in Python is as easy as:

pip install mlx

To learn more, check out the documentation. MLX also has a repository of examples that serve as an entry point for building and using many standard ML models.

MLX Swift wraps the same core library as the MLX Python front end. It also has several examples to help you get started building machine learning applications quickly. If you prefer something lower-level, MLX has easy-to-use C and C++ APIs that run on any Apple silicon platform.

Running LLMs on Apple silicon

MLX LM is a package built on top of MLX for generating text with, and fine-tuning, language models. It works with many of the LLMs available on Hugging Face. You can install MLX LM with:

pip install mlx-lm

You can then start a conversation with your favorite language model by simply running mlx_lm.chat in the terminal.

MLX LM also supports quantization, which reduces the memory footprint of a language model by storing the model parameters at lower precision. Using mlx_lm.convert, a model downloaded from Hugging Face can be quantized in a few seconds. For example, converting Mistral 7B to 4-bit takes a simple command:

mlx_lm.convert \
  --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  -q \
  --upload-repo mlx-community/Mistral-7B-Instruct-v0.3-4bit
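As a rough illustration of why quantization helps, here is a back-of-the-envelope sketch of weight memory. The group size and 16-bit scale/bias values are illustrative assumptions, not MLX's exact storage layout:

```python
def model_memory_gb(n_params, bits_per_weight, group_size=64, scale_bits=16):
    """Approximate weight memory in GB.

    Quantized formats typically store a scale (and often a bias) per
    group of weights; the group size and 16-bit scales here are
    illustrative assumptions, not MLX's exact on-disk format.
    """
    weight_bits = n_params * bits_per_weight
    # Per-group scale + bias overhead only applies to quantized storage
    overhead_bits = (n_params / group_size) * 2 * scale_bits if bits_per_weight < 16 else 0
    return (weight_bits + overhead_bits) / 8 / 1e9

bf16_gb = model_memory_gb(7e9, 16)  # 7B model at bf16: ~14.0 GB
q4_gb = model_memory_gb(7e9, 4)     # after 4-bit quantization: ~3.9 GB
print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

Under these assumptions, 4-bit quantization shrinks the weights by roughly 3.5x, which is why a quantized model fits comfortably in the unified memory of a laptop.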

LLM performance on the M5 with MLX

The M5 chip introduces GPU neural accelerators, which provide dedicated matrix-multiplication operations, one of the most compute-intensive parts of many machine learning workloads. MLX leverages tensor operations (TensorOps) and the Metal Performance Primitives framework introduced in Metal 4 to support the neural accelerator features. To demonstrate the performance of the M5 with MLX, we benchmarked a collection of LLMs of different sizes and architectures, running on an M5 MacBook Pro and an M4 MacBook Pro.

We test Qwen 1.7B and 8B in native BF16 precision, and 4-bit quantized Qwen 8B and Qwen 14B models. In addition, we benchmark two mixture-of-experts (MoE) models: Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B. Tests are run with mlx_lm.generate, and results are reported as time to first token (in seconds) and generation speed (in tokens per second). For all of these benchmarks, the prompt size is 4096 tokens. Generation speed was measured while generating 128 additional tokens.
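For clarity, the two reported metrics relate to wall-clock time roughly as follows; the timings below are hypothetical, not our measurements:

```python
def generation_speed(total_time_s, ttft_s, generated_tokens):
    """Decode-phase throughput in tokens/second, excluding the prompt
    processing time that time-to-first-token (TTFT) captures."""
    return generated_tokens / (total_time_s - ttft_s)

# Hypothetical run: 10.0s wall clock, 2.5s to first token, 128 new tokens
print(f"{generation_speed(10.0, 2.5, 128):.1f} tok/s")  # -> 17.1 tok/s
```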

Model performance is reported in terms of time to first token (TTFT) on the M4 and M5 MacBook Pro, along with the corresponding speedup.

Time to first token (TTFT)

Figure 1: TTFT in seconds (lower is better) for different LLMs run with MLX on M4 and M5 MacBook Pro. Speedup values are listed under each model name.

In LLM inference, generating the first token is compute-bound, and fully leverages the neural accelerators. The M5 brings time to first token under 10 seconds for a dense 14B model, and under 3 seconds for a 30B MoE, delivering strong performance on this MacBook Pro.
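A rough rule of thumb for why prompt processing is compute-bound: each prompt token requires on the order of 2 FLOPs per parameter, so TTFT scales with prompt length and model size divided by achievable matmul throughput. The throughput and efficiency figures below are illustrative assumptions, not measured M5 numbers:

```python
def prefill_seconds(n_params, prompt_tokens, peak_tflops, efficiency=0.5):
    """Estimate prompt-processing time assuming ~2 FLOPs per parameter
    per prompt token, at a given fraction of peak matmul throughput."""
    flops = 2 * n_params * prompt_tokens
    return flops / (peak_tflops * 1e12 * efficiency)

# Hypothetical: dense 14B model, 4096-token prompt, 30 TFLOPS peak at 50%
print(f"{prefill_seconds(14e9, 4096, 30):.1f} s")  # -> 7.6 s
```

Doubling the effective matmul throughput halves this estimate, which is why dedicated matrix-multiplication hardware shows up most clearly in TTFT.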

Subsequent tokens are generated at a rate bound by memory bandwidth, rather than compute capability. For the architectures tested in this post, the M5 provides a 19–27% performance boost over the M4, thanks to its higher memory bandwidth (120GB/s for the M4, 153GB/s for the M5, 28% higher). Regarding memory footprint, the 24GB MacBook Pro can easily hold an 8B model in BF16 precision, or a 30B MoE quantized to 4-bit, keeping the inference workload below 18GB in both cases.
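Since decoding must stream every (active) weight from memory for each generated token, an upper bound on generation speed is memory bandwidth divided by the weight footprint, which is why the observed speedups track the M5/M4 bandwidth ratio (153/120 ≈ 1.28). A sketch under that simplification:

```python
def max_decode_tok_s(bandwidth_gb_s, weights_gb):
    """Roofline bound: each generated token must stream every (active)
    weight byte from memory once."""
    return bandwidth_gb_s / weights_gb

m4 = max_decode_tok_s(120, 17.46)  # 8B bf16 model (17.46 GB) on M4
m5 = max_decode_tok_s(153, 17.46)  # same model on M5
print(f"M4: {m4:.1f} tok/s, M5: {m5:.1f} tok/s, ratio: {m5 / m4:.2f}")
```

Real decode speeds fall below this bound (attention, KV-cache reads, and kernel overheads all cost time), but the ratio between two machines running the same model is set largely by their bandwidth ratio.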

Model                     TTFT speedup   Generation speedup   Memory (GB)
Qwen3-1.7B-MLX-bf16           3.57             1.27               4.40
Qwen3-8B-MLX-bf16             3.62             1.24              17.46
Qwen3-8B-MLX-4bit             3.97             1.24               5.61
Qwen3-14B-MLX-4bit            4.06             1.19               9.16
gpt-oss-20b-MXFP4-Q4          3.33             1.24              12.08
Qwen3-30B-A3B-4bit            3.52             1.25              17.31

Table 1: Speedups obtained by different LLMs with MLX on an M5 MacBook Pro (relative to M4) for TTFT and generation speed, with the corresponding memory requirements. TTFT is compute-bound, while generation is memory-bandwidth-bound.

The GPU neural accelerators shine with MLX in ML workloads dominated by large matrix multiplications, delivering up to 4x faster time to first token compared to the baseline M4. As we continue to add features and improve the performance of MLX, we look forward to the new models and architectures the ML community will explore and run on Apple silicon.

Get started with MLX:

[1] MLX works with all Apple silicon systems and can be easily installed with pip install mlx. To take advantage of the neural accelerators in the M5, MLX requires macOS 26.2 or later.
