Using local LLMs to find high-performance algorithms

Ever since I was a child, I have been drawn to painting. What fascinated me was not only the act of painting itself, but the idea that every painting can be greatly improved. I remember reaching a level I was genuinely proud of; yet when I believed I had reached the peak of perfection and tried to improve a piece even further, the results were disastrous.
Since then, the same words have stayed with me: "iterate and refine, and you will reach perfection". At university, my method was to read books many times and to expand on them through other sources, looking for the hidden layers of meaning in each text. Today, I apply the same philosophy to AI/ML and coding.
We know that matrix multiplication (matmul for short) is a key ingredient of any AI workload. A while back I built LLM.rust, a Rust port of Karpathy's LLM.c. The hardest part of using Rust was matrix multiplication: fine-tuning a GPT-style model takes thousands of iterations, so we need fast matmul. For this I had to call into the BLAS library, using unsafe blocks to work around the limitations. Leaning on unsafe goes against the philosophy of Rust, which is why I keep looking for safer ways to implement matmul in this context.
So, inspired by Sam Altman's suggestion to "ask GPT how it creates value", I decided to ask local LLMs to generate, measure, and iterate on their own algorithms to produce the best native Rust matmul implementation.
The challenge comes with additional constraints:
- everything must run locally; in my case, a MacBook Pro M3 with 36 GB of RAM;
- we must work around the model's token limits;
- we must time and benchmark the generated code inside the generation loop itself.
I know that reaching BLAS-level performance this way is nearly impossible, but I want to show how we can use AI for custom needs, even on our "small" laptops, to open up ideas and push boundaries in any field. This post aims to be an inspiration for practitioners, as well as for anyone who wants to get more familiar with Microsoft Autogen and local LLM deployments.
All the code can be found in this GitHub repo. This is an ongoing experiment, and many changes and improvements are still to come.
A general idea
The whole idea is to have a round table of agents. The starting point is a local Mixtral 8x7B Q4_K_M GGUF model (the mradermacher quantization). From this single model we create five entities:
- the Proposer comes up with a new algorithm, Strassen-like or otherwise, to find a better and more efficient way to do matmul;
- the Verifier checks the mathematical structure of the candidate matmul symbolically;
- the Coder writes the underlying Rust code;
- the Tester runs the code and stores all the information in the vector database;
- the Manager works silently, controlling the entire workflow.
| Agent | Role |
| --- | --- |
| Proposer | Analyzes the benchmark history and suggests new tuning parameters and matmul configurations. |
| Verifier | (Currently disabled in the code.) Verifies the mathematical structure of the candidate through symbolic checks. |
| Coder | Takes the parameters and fills in the Rust template code. |
| Tester | Runs the Rust code, saves it, and measures the runtime. |
| Manager | Oversees the entire workflow. |
The overall workflow can be programmed with Microsoft Autogen as shown in fig.1.
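To make the round table concrete, here is a minimal sketch of how the shared message type and the fixed speaking order could be declared with autogen-core. The names (GroupChatMessage, AGENT_ORDER, TOPICS) are my own illustration, not the exact identifiers from the repo:

```python
from dataclasses import dataclass

from autogen_core import TopicId


@dataclass
class GroupChatMessage:
    """One turn of the round table: who spoke and what they said."""
    source: str
    content: str


# The fixed speaking order enforced by the Manager on every round.
AGENT_ORDER = ["Proposer", "Verifier", "Coder", "Tester"]

# Each agent listens on its own topic; the Manager publishes each message
# to the next topic in AGENT_ORDER once the previous agent has answered.
TOPICS = {name: TopicId(type=name, source="default") for name in AGENT_ORDER}
```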
Prepare input data and vector database
The input data is collected from academic papers focused on matrix multiplication. Many of them cite, or relate to, DeepMind's work on Strassen-like algorithms. I wanted to start small, so I collected 50 papers, published between 2020 and 2025, that deal directly with matrix multiplication.
Next, I used Chroma to build a vector database. An important factor in building a good vector database is how the PDFs are processed. Here I used a semantic chunker: unlike plain text-splitting methods, a semantic chunker uses the actual meaning of the text to decide where to cut it. The goal is to keep related sentences together in a single chunk, making the final vector database more compact and accurate. This is done with an embedding model, BAAI/bge-base-en-v1.5. The GitHub source shows the full implementation; a condensed sketch follows.
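The sketch below assumes LangChain's SemanticChunker and the chromadb client; the file paths and collection name are placeholders, and the actual repo code may differ:

```python
# PDFs -> semantic chunks -> Chroma. Paths and names are placeholders.
import chromadb
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
from pypdf import PdfReader

embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
chunker = SemanticChunker(embedder)  # cuts where the meaning shifts, not at a fixed size

client = chromadb.PersistentClient(path="./matmul_papers_db")
collection = client.get_or_create_collection(name="matmul_papers")

for i, pdf_path in enumerate(["papers/example.pdf"]):  # loop over the 50 PDFs
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    chunks = chunker.split_text(text)
    collection.add(
        documents=chunks,
        embeddings=embedder.embed_documents(chunks),
        ids=[f"paper{i}-chunk{j}" for j in range(len(chunks))],
    )
```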
Main code: autogen-core and GGML models
I used Microsoft Autogen, specifically the autogen-core package (version 0.7.5). In contrast to the high-level AgentChat API, autogen-core gives us access to the low-level, event-driven building blocks needed to build the state-machine-driven workflow we want. In fact, the real challenge is keeping the workflow solid: all the agents must act in a fixed order, Proposer → Verifier → Coder → Tester.
The main building block is BaseMatMulAgent, which inherits from AutoGen's RoutedAgent. This base class defines how the LLM agents participate in the conversation and how they behave.
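A minimal sketch of the base class follows. It is simplified (the real class also wires in the local model client and tool calls), and GroupChatMessage is the same illustrative dataclass as in the earlier sketch:

```python
from dataclasses import dataclass

from autogen_core import MessageContext, RoutedAgent, message_handler


@dataclass
class GroupChatMessage:
    source: str
    content: str


class BaseMatMulAgent(RoutedAgent):
    """Shared base for the Proposer/Verifier/Coder/Tester agents (sketch)."""

    def __init__(self, description: str, llm=None) -> None:
        super().__init__(description)
        self._llm = llm                       # local llama.cpp model wrapper
        self._chat_history: list[GroupChatMessage] = []

    @message_handler
    async def handle_message(self, message: GroupChatMessage, ctx: MessageContext) -> None:
        # @message_handler reads the type hint of `message` and subscribes
        # this method to GroupChatMessage events sent to the agent's topic.
        # It only updates the agent's memory; it produces no reply here.
        self._chat_history.append(message)
```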
From this sketch we can see that the class is designed to take part in an asynchronous group chat, managing its chat history and, in the full implementation, calling external tools and generating responses through the local LLM.
The core piece is the @message_handler decorator, which registers the method as a listener (a subscriber) for a given message type. The decorator reads the type hint of the method's first argument, in our case message: GroupChatMessage, and then registers the agent to receive any event of that type sent to the agent's topic. The handle_message async method is responsible only for updating the agent's internal memory; it does not produce a response.
With the listener in place, we can focus on the Manager class. MatMulManager also inherits from RoutedAgent and orchestrates the overall agent flow.
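Here is a simplified sketch of the turn-taking logic; the real manager also builds prompts and handles termination, and the identifiers below are illustrative:

```python
from dataclasses import dataclass

from autogen_core import MessageContext, RoutedAgent, TopicId, message_handler


@dataclass
class GroupChatMessage:          # same illustrative message type as above
    source: str
    content: str


class MatMulManager(RoutedAgent):
    """Enforces the Proposer -> Verifier -> Coder -> Tester order (sketch)."""

    ORDER = ["Proposer", "Verifier", "Coder", "Tester"]

    def __init__(self) -> None:
        super().__init__("matmul workflow manager")
        self._turn = 0

    @message_handler
    async def on_agent_reply(self, message: GroupChatMessage, ctx: MessageContext) -> None:
        # When an agent finishes its turn, forward the conversation to the
        # next agent's topic; after the Tester, the round starts over.
        self._turn = (self._turn + 1) % len(self.ORDER)
        await self.publish_message(
            message, topic_id=TopicId(type=self.ORDER[self._turn], source="default")
        )
```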
The code above handles the hand-off between all the agents (skip the Verifier part for now). The Coder publishes the final code, and the Tester takes care of saving both the code and all of its context in the vector database. This way we avoid burning through our local model's tokens: on each new run, the model fetches the latest generated algorithms from the vector database and proposes a new solution.
A very important caveat: for autogen-core to run llama models with Metal acceleration on macOS, install llama-cpp-python with the following snippet:
```bash
#!/bin/bash
CMAKE_ARGS="-DGGML_METAL=on" FORCE_CMAKE=1 pip install --upgrade --verbose --force-reinstall llama-cpp-python --no-cache-dir
```
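Once installed, loading the GGUF model and offloading it to the GPU looks roughly like this (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads every layer to Metal; the path is a placeholder.
llm = Llama(
    model_path="models/mixtral-8x7b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Propose a cache-friendly matmul tiling."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```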
Fig.2 summarizes the entire code. We can roughly divide it into three main blocks:
- BaseMatMulAgent, which handles messages through the LLM agents that analyze the mathematical structure and generate code;
- MatMulManager, which orchestrates the routing between all the agents;
- autogen_core.SingleThreadedAgentRuntime, which allows the entire workflow to actually run (see the sketch below).
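As a rough sketch of how the three blocks are wired together, reusing the classes from the sketches above (the agent names and the kickoff step are illustrative):

```python
import asyncio

from autogen_core import SingleThreadedAgentRuntime, TypeSubscription


async def main() -> None:
    runtime = SingleThreadedAgentRuntime()

    # Register the manager and the four worker agents with factories
    # (the Manager's own subscription is elided here for brevity).
    await MatMulManager.register(runtime, "Manager", lambda: MatMulManager())
    for name in ["Proposer", "Verifier", "Coder", "Tester"]:
        await BaseMatMulAgent.register(
            runtime, name, lambda n=name: BaseMatMulAgent(description=n)
        )
        # Route messages published to topic `name` to the agent type `name`.
        await runtime.add_subscription(TypeSubscription(topic_type=name, agent_type=name))

    runtime.start()                 # start draining the message queue
    # ... publish the kickoff GroupChatMessage to the Proposer topic here ...
    await runtime.stop_when_idle()  # return once every message is handled


asyncio.run(main())
```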

autogen_core.SingleThreadedAgentRuntime makes all of this work on our MacBook Pro. [Image created with Nano Banana Pro.]

Results and benchmark
All the Rust code has been reviewed and reworked manually. Although the workflow is robust, working with LLMs requires a critical eye: many times the model confabulated*, producing code that looks optimized but fails to do the actual matmul work.
The first iteration produced a Strassen-like algorithm (the "Run 0" code in fig.3).
The model then moved toward a better implementation based on Rust NEON intrinsics, and after four iterations it produced the "Run 3" code in fig.3.
There we can see functions like vaddq_f32, ARM-specific NEON instructions from std::arch::aarch64. The model uses rayon to split the work across multiple CPU cores, and within each thread it uses NEON intrinsics. The code was not completely correct, though; I also hit an out-of-memory error with 1024 × 1024 matrices and had to refactor the code by hand to make it work.
This brings us back to our "iterate to perfection" mantra, and we can ask ourselves: could a local agent automate Rust code generation to the point of mastering the intricacies of NEON? The findings suggest that yes, even on consumer hardware, this level of refinement is achievable.
Fig.3 shows the final results I got after each iteration.

Benchmarks 0 and 2 are flawed, as it is physically impossible to reach such timings for a 1024×1024 matmul on a CPU:
- the first code suffers from a "diagonal fallacy": it computes only the diagonal blocks of the matrix and ignores the rest;
- the second code has a broken output buffer: it repeatedly overwrites a small cache-hot scratch buffer instead of writing out the full million or so elements (1024² ≈ 1.05 M) of the result.
The loop did, however, produce two genuine implementations: run 1 and run 3. The first comes in at about 760 ms and forms a real baseline, though it suffers from cache misses and a lack of SIMD vectorization. Run 3 records 359 ms, improving on it with NEON SIMD and Rayon parallelism.
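As a sanity check on those two numbers: a 1024×1024 matmul takes 2·1024³ ≈ 2.15 billion floating-point operations, so the timings translate into effective throughput like this:

```python
# Effective GFLOP/s for a 1024x1024 f32 matmul (2*N^3 operations),
# using the two timings reported above.
N = 1024
flops = 2 * N**3                     # ~2.15e9 operations
for label, ms in [("run 1", 760), ("run 3", 359)]:
    print(f"{label}: {flops / (ms / 1000) / 1e9:.1f} GFLOP/s")
# run 1: ~2.8 GFLOP/s, run 3: ~6.0 GFLOP/s -- plausible for a CPU kernel
# on an M3, while highly tuned BLAS kernels reach far higher throughput.
```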
*: I wrote "the model confabulated" on purpose. From a clinical point of view, LLMs don't lie: they confabulate. Hallucination is a completely different phenomenon from what LLMs do when they ramble and produce "wrong" answers.
Conclusions
This experiment started with a question that looked like an impossible challenge: can we use consumer-grade local LLMs to find efficient Rust algorithms that compete with BLAS?
We can say yes, or at least that we now have a valid, solid base on which to build better code toward a fully BLAS-like implementation in Rust.
This post showed how to use Microsoft Autogen, and specifically autogen-core, to create a round table of agents.
The base model is a GGUF quantization that runs on a MacBook Pro M3 with 36 GB of RAM.
We haven't (yet) found anything better than BLAS with a single simple kernel. However, we have shown that a local workflow on a MacBook Pro can achieve what was thought to require large clusters and large models. Finally, the model was able to find a reasonable Rust-NEON implementation ("Run 3" above) that is more than 50% faster than the plain Rayon version. It is worth highlighting that the core implementation was created by AI.
The frontier is open. I hope this blog post inspires you to explore what limits we can push with local LLM deployments.
I am writing this in a personal capacity; these opinions are my own.



