Setting Up Your Own Large Language Model

0 0 9 minutes read

Frontier AI models are increasingly at risk of being trapped behind strict export controls or mounting API costs.

As these technologies make their way into our daily lives, the open source movement isn't just a philosophical fad; it's a necessary means of keeping AI in the hands of everyday users. We aren't at parity yet: proprietary models from the big tech labs still hold a huge lead in raw performance. But we can hope that the gap closes soon. Day and night, the independent community of researchers and engineers is pushing to make this technology accessible to anyone with a computer.

Today, the foundation for real democratization is already in place: you can run a high-performance model entirely on your laptop. For today's experiment, I set out to find a large language model that would run well on my laptop, and to use it for the simple tasks I usually hand to a big lab's model.

We're going to install Qwen 3 8B on my MacBook Air, run it completely offline, and finally have a language model that lives on my machine instead of in a remote data center. The Qwen family of models is developed by Alibaba (a Chinese company) and is completely open source, available online for anyone to download. The model has about 8 billion parameters and takes about 6 GB of RAM when loaded.

What follows is a practical, start-to-finish guide to running an LLM locally on an Apple Silicon Mac, including the exact commands you need. But before we open the terminal, we need to talk about why this is worth doing.

Why Do This?

In most cases, cloud models are better and simpler. I'm not going to claim that an 8-billion-parameter model running on a laptop reaches the AI frontier. It doesn't, and I will continue to use giant cloud models for the heavy lifting.

But the ongoing price and sovereignty wars around AI could make open source and local models more relevant in a future where access to the technology makes a big difference. Every time you use Claude or ChatGPT, you are sending your data to remote servers, and your access to them can be cut off at any time.

“Digital sovereignty” is a nice phrase for a very common desire: we might want to own the thing that reads our most sensitive thoughts, in the same way that we own a notebook or keep some cash at home.

A local model answers that cleanly in the AI world. Once downloaded, nothing leaves the machine. No API keys, no changing terms of service, no silent data retention policies. You can pull out the Wi-Fi card and it keeps working. For the most sensitive part of your job, that alone may be worth the price of admission.

People like to say that local models are “democratizing” AI. I want that to be true, but we're not there yet. Running this stack still assumes that you own a €1,500 laptop with plenty of onboard memory and are comfortable with the command line. That's a small, privileged part of the world.

But the trajectory points toward democratization. Two years ago, running a decent offline model required a dedicated workstation and a lot of technical pain. This weekend, it took me a few hours and 5 gigabytes of disk space.

So let's install something.

Machine and Specs

I built this on a MacBook Air M4 with 24 GB of unified memory and 235 GB of free storage. This was a fresh start: no Homebrew, no Python environment nightmares.

The really important number here is that 24 GB. Apple Silicon's “unified memory” is the magic trick that makes Macs so good at this. Because the CPU and GPU share the exact same memory pool, neural network workloads don't have to be slowly copied back and forth.

The 8B model takes up about 5 GB on disk and occupies about 6 GB in memory when loaded. On a 24 GB machine, that's seriously comfortable. You could even run the 14B model and still keep a bunch of browser tabs open. (If you're on an 8 GB Mac, stick to the 1.5B or 3B models and close your other apps.)
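The disk figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 5 bits per weight on average for a quantized download; that number is my own rough assumption, not something Ollama reports, and the function name is hypothetical.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~8 billion parameters at ~5 bits per weight (quantized).
quantized = estimate_weights_gb(8e9, 5)
# For comparison: the same parameter count stored as 16-bit floats.
full_precision = estimate_weights_gb(8e9, 16)

print(f"quantized: ~{quantized:.1f} GB, 16-bit: ~{full_precision:.1f} GB")
```

The estimate only covers the weights. The extra gigabyte or so seen in RAM presumably goes to runtime buffers such as the context cache, which this sketch ignores.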

Why Ollama?

There are a dozen ways to run a local AI, and most of them require you to care about compile flags and dependency trees. You shouldn't have to.

Ollama is an open source framework and a simple tool: a single binary that includes a highly optimized runner (llama.cpp using Apple's Metal for GPU acceleration), a Docker-style model registry, and a local HTTP API. You install it, pull a model, and talk to it. That's all!

Step 1: Install Ollama (No Homebrew required)

Ollama ships as a standard macOS app in a zip file. The command line interface (CLI) lives quietly inside the application bundle, so we can set everything up by hand.

# Download the Apple Silicon build
cd ~/Downloads
curl -L -o Ollama-darwin.zip 
# Unzip and move the app into your Applications folder
unzip -o -q Ollama-darwin.zip
mv Ollama.app /Applications/

If you don't know how to open a terminal, just go to your Mac applications and search for “Terminal”.

Step 2: Put Ollama on Your PATH

I didn't want to fight with sudo permissions for /usr/local/bin, so I symlinked the bundled CLI into a local directory that I own. This is just a handy shortcut that makes the CLI quicker to set up and to call.

# Create a local bin directory and symlink the CLI
mkdir -p ~/.local/bin
ln -sf /Applications/Ollama.app/Contents/Resources/ollama ~/.local/bin/ollama

# Make it permanent in your zsh profile
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
# Apply it to your current shell
export PATH="$HOME/.local/bin:$PATH"
ollama --version
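If `ollama --version` reports "command not found", it helps to check what the shell can actually resolve. A minimal sketch using only the standard library; `find_cli` is a hypothetical helper name, and it simply reports where the `ollama` command resolves to, if anywhere.

```python
import os
import shutil


def find_cli(name: str = "ollama"):
    """Return the resolved path of `name` if it is on PATH, else None."""
    found = shutil.which(name)
    # realpath follows the symlink, so you can see where it really points.
    return os.path.realpath(found) if found else None


location = find_cli()
print(location or "ollama is not on PATH yet; open a new shell or re-check ~/.zshrc")
```

With the symlink from this step in place, the printed path should end up inside the Ollama.app bundle.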

Step 3: Start the Server

Ollama uses a lightweight backend server to expose an API and manage your computer's memory.

# Start the server and log output
mkdir -p ~/.ollama/logs
nohup ollama serve > ~/.ollama/logs/serve.log 2>&1 &

# Ping it to check if it's alive
curl -s

If the command above returns a version number, Ollama is ready!

Retrieving the Ollama version in the Mac Terminal

Note: You can also double-click the Ollama app in your Applications folder to run this server through your menu bar. I did it through the terminal so I could see exactly what was going on under the hood.
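The same health check can be done from a script. This is a minimal sketch, assuming the server listens on Ollama's default address `http://localhost:11434` and answers `/api/version` with a small JSON object containing a `version` field; both helper names are hypothetical.

```python
import json
import urllib.request


def parse_version(body):
    """Extract the version string from a JSON body, or None if it is absent."""
    try:
        return json.loads(body).get("version")
    except (ValueError, AttributeError):
        return None


def server_version(base_url="http://localhost:11434", timeout=2.0):
    """Ask the local server for its version; None means it is not reachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except OSError:  # covers refused connections, timeouts and URL errors
        return None
```

Calling `server_version()` returns the version string while the server is running and `None` otherwise, which makes it easy to fail early in a larger script.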

Step 4: Pull the Model

This is as simple as it gets:

ollama pull qwen3:8b     
ollama list

Go make some coffee. The download is about 5.2 GB.

After running ollama list, you will see the model available to you.
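The same inventory is available programmatically. A minimal sketch, assuming the local API's `/api/tags` endpoint returns a `models` array whose entries carry `name` and `size` (in bytes); the payload below is a made-up sample, not real output.

```python
import json


def summarize_models(tags_json: str):
    """Turn the JSON body from /api/tags into (name, size in GB) pairs."""
    models = json.loads(tags_json).get("models", [])
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in models]


# Illustrative payload only: the shape mirrors /api/tags, the numbers are made up.
sample = '{"models": [{"name": "qwen3:8b", "size": 5200000000}]}'
for name, size_gb in summarize_models(sample):
    print(f"{name}: {size_gb} GB")
```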

Step 5: Talk to the new digital brain on your computer

You have three different ways to communicate with your new local model.

1. Interactive Conversation (Very Simple)

ollama run qwen3:8b

Running this command starts an interactive chat session.

By default, the model will spend “thinking tokens”, something that is often abstracted away and hidden in many commercial tools.

I'll start by asking my local model what it has to say about open source models:

Response from the local model (Thinking Tokens)

The light gray text represents the internal thought process of the model. These models reason at length before generating an answer, and with local models this thinking phase takes up a large portion of the total time until the answer appears.

After the thinking phase, here is the response from the model:

As with many chat tools, these models also retain some context from previous interactions:

The model is generating 5.7 tokens per second because I am in battery-saving mode. If I turn that off, we will probably see 15–20 tokens per second.
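To put those speeds in perspective, here is a small worked calculation. It assumes you have a token count and a generation time in nanoseconds, which is how Ollama's API responses report them as far as I know (`eval_count` and `eval_duration`); the sample numbers are hypothetical and chosen to match the 5.7 tokens per second mentioned in the text.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from a token count and a duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


def seconds_for(n_tokens: int, rate: float) -> float:
    """How long n_tokens take at a given tokens-per-second rate."""
    return n_tokens / rate


# Hypothetical numbers: 57 tokens generated in 10 seconds.
rate = tokens_per_second(eval_count=57, eval_duration_ns=10_000_000_000)
print(f"{rate:.1f} tokens/s")
print(f"500 tokens at {rate:.1f}/s: {seconds_for(500, rate):.0f} s")
print(f"500 tokens at 15.0/s: {seconds_for(500, 15.0):.0f} s")
```

A long answer with a generous thinking phase can easily run to several hundred tokens, so the difference between the two rates is the difference between waiting a minute and a half and waiting about half a minute.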

2. One-Shot Terminal Commands

You can also send your local model a single query without entering interactive mode:

ollama run qwen3:8b "write a python script that tells me how many vowels a word has"

Here's the script generated by our local model:

```python
# Prompt the user for a word
word = input("Enter a word: ")

# Define the set of vowels
vowels = {'a', 'e', 'i', 'o', 'u'}

# Initialize a counter
count = 0

# Convert the word to lowercase and check each character
for char in word.lower():
    if char in vowels:
        count += 1

# Output the result
print(f"Number of vowels: {count}")
```

3. HTTP API (Scripts and Applications)

Can you only use this from the terminal?

Of course not! If you are comfortable with Python, you can call your local model from any script:

import json, urllib.request

req = urllib.request.Request(
    "
    data=json.dumps({
        "model": "qwen3:8b",
        "prompt": "Give me three uses for a local LLM.",
        "stream": False,
        "think": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])

Here is the response from the model after running this Python script:

Sure! Here are three common and practical uses for a **local LLM (Large Language Model)**:

1. **Personalized Assistance and Productivity**
A local LLM can act as a private AI assistant, helping with tasks like email drafting, scheduling, note-taking, and even coding. Since it runs locally, it maintains user privacy and doesn't rely on internet connectivity.

2. **Content Creation and Language Processing**
You can use a local LLM to generate creative content such as blog posts, stories, scripts, or marketing copy. It can also assist with language translation, grammar checking, and summarizing text.

3. **Custom Applications and Integration**
A local LLM can be integrated into custom applications or workflows, such as chatbots, customer support systems, or data analysis tools. This allows for tailored solutions without exposing sensitive data to external servers.

Let me know if you'd like examples of how to implement these uses!

Great! Now you can easily build your own applications on top of your local model.
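For anything beyond a one-off script, it is convenient to wrap the request in a reusable helper. The sketch below assumes Ollama's default local address and its `/api/generate` endpoint; `OLLAMA_URL`, `build_request`, and `ask` are hypothetical names introduced here for illustration.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; change it if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen3:8b", think: bool = False):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False, "think": think}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def ask(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server from Step 3 running, `ask("Give me three uses for a local LLM.")` returns the answer as a plain string. Keeping request construction separate from sending makes the payload easy to inspect without a running server.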

Optimizing the Experience: Controlling “Thinking” Tokens

Qwen 3 is a hybrid thinking model. By default, it produces a verbose <think>...</think> block laying out its train of thought before giving an actual answer. Sometimes you want to see that reasoning, but most of the time you just want a quick answer (and to skip the wait for the thinking tokens).

Here's how to skip the reasoning pass:

Disable it completely: ollama run qwen3:8b --think=false
Use it, but hide it from the UI: ollama run qwen3:8b --hidethinking
In scripts: pass "think": false in your JSON payload.
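If you end up with raw text that still contains the reasoning, you can also remove it after the fact. This is a minimal sketch, assuming the reasoning is wrapped in `<think>...</think>` tags in the raw output; `strip_thinking` is a hypothetical helper, not part of Ollama.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags in the raw text.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every <think>...</think> block and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nThe user wants a greeting.\n</think>\n\nHello there!"
print(strip_thinking(raw))
```

This only hides the reasoning; the model still spends the time generating it. Disabling thinking at the source is the option that actually saves time.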

Warning About Web Searches

Models are frozen at their training cutoff. That means they have no access to information published after training, and companies have been relying on web search tools to extend what the models can do. An example from our local model:

Training data cutoff date for our local model

But Ollama allows you to give the model a web search tool. This sounds incredible, but there is a catch.

The search itself uses a cloud service managed by Ollama. When you enable it, your queries are sent over the internet to fetch search results. The model stays local, but your questions don't. This may defeat the very privacy you set all of this up to get.

Bonus: VS Code Integration

The final piece for me was an offline coding assistant. A clean, completely free way to get one is the Continue.dev extension.

Install VS Code and the Continue extension.
Open the Continue configuration file at ~/.continue/config.yaml.
Point it at your local Ollama server:

name: Local Assistant
version: 1.0.0
models:
  - name: Qwen3 8B (local)
    provider: ollama
    model: qwen3:8b
    roles:
      - chat
      - edit
      - apply
  - name: Qwen3 8B Autocomplete
    provider: ollama
    model: qwen3:8b
    roles:
      - autocomplete

Pro tip: The 8B model is too heavy for the split-second latency you want from inline autocomplete. I highly recommend pulling a small model dedicated to that job (ollama pull qwen2.5-coder:1.5b-base), mapping it to the autocomplete role, and letting Qwen3 8B handle the heavier chat tasks.

What if I have a Windows computer?

Since I'm not covering Windows in this tutorial, I haven't tested it much yet. But the good news is that Ollama is also available for Windows computers.

The installation process may be slightly different, but the logic of using Ollama and pulling models will be exactly the same.

Where Does This Leave Me?

My total disk footprint for this project was 156 MB for the software and 5.2 GB for the model itself.

I now have a highly capable language model that resides permanently on my hard drive. For non-sensitive, complex work, I will still reach for the cloud. But for drafts I don't want included in training data, flights without internet, and legally protected client documents? That intelligence now lives on my computer.

This may still be too technical for many people, but things are becoming more democratic. And it's not just about availability. On the performance front, open source models are developing at an incredible pace, delivering results that make the future of local AI look very promising. For example, GLM 5.2 and Qwen 3.7 Max reach the performance of large lab models:

Performance Comparison of Models in Software Engineering Benchmark – Image by Author

As the technology floor continues to drop, “owning your own AI” will cease to be a luxury reserved for developers with expensive laptops. That's the version of AI democratization I believe in.

Go give your laptop another brain this weekend and long live open source!


nimda 1 hour ago