← All posts

LLM-models June 2, 2026 · Mario Keller

Your AI Should Run on Your Hardware: The Case for Local LLMs in 2026

I was automating invoice processing. Extracting line items, matching vendors, pushing results into our system. It worked.

Then I stopped and thought about what I was actually doing. I had just sent several hundred invoices vendor names, amounts, internal cost centers to a cloud API. Without thinking twice about it.

That was the start of a longer experiment. Over the past few months, we’ve been testing what it actually looks like to run AI models locally. On your own machine. No cloud, no API calls, no data leaving your environment.

Here’s what we found.

What “Running AI Locally” Actually Means

When you use ChatGPT or Claude, your prompt leaves your machine entirely. It travels to a server somewhere, gets processed, and a response comes back. The model itself is never near you.

A local LLM flips that. The model all its weights, everything that makes it think — lives on your hardware. Inference runs on your CPU or GPU. Nothing goes anywhere.

Until recently this required serious infrastructure and ML expertise. Today, a tool called Ollama makes it straightforward enough that I had a capable model running on my Mac in about ten minutes. No configuration, no driver setup. Download, pull a model, done.

Open Source Models. The Part Most People Miss

Here’s the thing about running AI locally: you can’t run GPT locally. OpenAI doesn’t release the weights. Neither does Anthropic.

What you can run are open source models models where the weights are publicly released and you can download and use them yourself. Meta, Google, Microsoft, Mistral, Alibaba, DeepSeek all releasing capable models you can run on your own hardware.

For many automation tasks the quality is close enough to commercial models that the gap stops mattering. That’s not hype. It’s just where things are in 2026.

Ollama. How You Actually Get This Running

The tool that makes local LLMs accessible is Ollama. It handles model downloads, GPU/CPU acceleration, and exposes a local API. The API is OpenAI-compatible, so most automation tools that talk to OpenAI can talk to Ollama with one config change — you just point them at your local machine.

The Models Worth Knowing About

These are the providers building open source models worth running locally, with three currently available through Ollama each.

Provider	Country	Models
Meta	🇺🇸 USA	Llama 3.1 (8B/70B), Llama 3.2 (1B/3B), Llama 4
DeepSeek	🇨🇳 China	DeepSeek-R1, DeepSeek-V3, DeepSeek-V3.1
Google DeepMind	🇺🇸 USA	Gemma 3, Gemma 4, Gemma 3n (on-device)
Alibaba (Qwen)	🇨🇳 China	Qwen 2.5, Qwen 3.5, Qwen 3-Coder
Mistral AI	🇫🇷 France	Mistral 7B, Mistral Small 3.2, Magistral 24B
Microsoft	🇺🇸 USA	Phi-3, Phi-4, Phi-4 Mini
IBM (Granite)	🇺🇸 USA	Granite 4, Granite 4.1, Granite 3.3
NVIDIA	🇺🇸 USA	Nemotron 3 Super (120B MoE), Nemotron 3, Nemotron Cascade 2
Z.ai / Zhipu (GLM)	🇨🇳 China	GLM-4.7, GLM-5, GLM-5.1
MiniMax	🇨🇳 China	MiniMax-M2.5, MiniMax-M2.7, MiniMax-M3

Full library at ollama.com/library.

Why We’re Actually Doing This

Three things drove us to take this seriously.

Data stays where it belongs. The invoice situation above is a real problem. Contracts, HR documents, customer emails, internal reports, if you’re automating workflows that touch this kind of data, every API call is a decision about who else has access to it. Running locally removes that decision entirely.

The cost math. Cloud APIs charge per token. For a single conversation that’s nothing. For a workflow processing 500 documents a day, it adds up fast. A local model running on hardware you already own costs electricity. That’s it.

No one else’s rate limits or deprecations. We’ve had workflows break because a cloud API changed a response format with no notice. A local model doesn’t change unless you decide to update it.

What Doesn’t Work (Being Honest)

A local 8B model is not GPT-5. For complex reasoning or anything that needs frontier-level capability, the quality gap is real and you’ll feel it.

You need the right hardware. Models need to fit in memory, an 8B model needs around 8GB of RAM or VRAM. A 70B model needs 40GB or more.

You own the maintenance. No automatic updates, no SLAs. If your machine goes down, your workflows go with it. And the hallucination problem is worse with smaller local models, production automation requires solid validation on top of model output, more so than with cloud models.

Worth downloading model weights only from Ollama’s official library or directly from the providers in the table. Unverified sources carry real security risk.

What We Tested and What Happened

Mac M2, 32GB this is where it works

Apple Silicon has a genuine advantage here. The M2’s unified memory is shared between CPU and GPU, which means you can run a 13–14B parameter model without a dedicated graphics card. On 32GB, this is comfortable.

We ran Mistral 7B, Llama 3.1 8B, Phi-4 14B, and Qwen 2.5 7B. Setup took minutes.

Phi-4 at 14B surprised us. Structured data extraction, classification, turning messy inputs into clean outputs it handled this reliably enough for production with validation on top.

Getsper’s workflows point at the local Ollama endpoint the same way they point at a cloud API. No workflow changes needed. It just works.

Raspberry Pi 4 waste of time for this

Tried it. A simple prompt took several minutes. The Pi is great for a lot of things. Running LLM inference is not one of them.

The Mac Mini situation

There’s a growing number of people setting up a Mac Mini M2 or M4 as a dedicated local AI server. After our testing, I completely understand why. Small, silent, low power, capable chip, large unified memory ceiling. You put it on a shelf and it runs your local AI workflows 24/7.

For a small team that wants a self-hosted inference machine without managing cloud infrastructure — it’s a solid choice. You could also run this on a rented server with a proper GPU if you’re running at scale. Same sovereignty, different operational trade-offs.

Is This a Real Alternative to Cloud AI?

For automation specifically yes, for a lot of what we actually do.

Extraction, classification, summarization, transformation, formatting these are the workhorses of automation workflows, and local models handle them well. If your workflow touches sensitive data or runs at high volume, running it locally is the obvious move.

For genuinely complex reasoning you still want a frontier model. Knowing which is which in your specific workflow is the actual work.

We’re still learning where the real limits are hybrid routing between local and cloud models, full workflow stacks running locally, how model quality degrades at edge cases. We’ll write about it as we go.

What are you running locally? Curious what setups people are actually using hardware, models, what it took to make it work in practice.