Jun 11, 2026 · Local AI / AI Agents / Self-Hosted / Data Sovereignty / Intel Arc

Keep your own keys: running a local AI agent on hardware you own

Q: What is a SOUL.md or an agent persona?

A SOUL.md is a file of durable personality and rules loaded into the agent every session. A persona is a role the agent adopts for a class of work. A skill is loaded context — often an entire API's documentation — so the agent already knows how to do a specific task. Together they turn a general local model into a reliable specialist.

How I run AI agents on my own hardware: a frontier model plans, a local model executes, and my credentials never leave the box. The setup, the risks, the why.

Rows of densely packed compute hardware in a dark server room, status LEDs glowing red and green — Photo by Matthieu Beaumont on Unsplash

A few months ago I wrote about keeping your own data — the argument that every AI company keeps a copy of every prompt you type, and that you should be keeping that copy too, on hardware you control. That post was about the capture and storage. This one is about the agent layer: using local agents to build your own infrastructure, so you can capture and act on your own data first.

Here’s the setup in one sentence. I use a frontier model — Claude Opus 4.7 specifically — as the architect that plans and researches, and a local model running on my own GPU as the agent that does the work. The smart model thinks. The small model executes. And the reason the executor is local isn’t cost. It’s that the execution is where all my API keys and infrastructure credentials live, and I don’t want those leaving my hardware.

That’s the whole post, really. But the interesting part is that it works — and works quite well. A model small enough to fit on a 32GB GPU is doing real infrastructure work, because something smarter is planning for it.

The split: a smart model to plan, a local model to build

Can a model that fits on a 32GB GPU do real work on your infrastructure? Yes — if a smarter model architects it first.

The division of labor maps to what each model is actually good at. Claude Opus 4.7 is better at going and getting active web context, and at figuring out and planning and architecting things. That’s the hard part — the part where being a frontier model matters. So I let it do that. I’ll have it plan something out, pull current docs, and produce a concrete plan.

Then the local model takes that plan and executes against it. The model I run is Qwen3.6-35B-A3B — a Mixture-of-Experts model with 35 billion total parameters but only about 3 billion active per token, which is what lets it fit and run fast on consumer hardware. It’s running on an Intel Arc Pro B70, a ~$1,000 GPU with 32GB of VRAM, through llama-server over Tailscale.

Here’s the thing people miss about this model class. The local model isn’t a toy. Simon Willison — one of the most rigorous independent model benchmarkers in the space — ran Qwen3.6-35B-A3B on his MacBook the day it released and gave it the win over Claude Opus 4.7 on his SVG benchmark:

“I’m giving this one to Qwen 3.6,” he wrote. “Opus managed to mess up the bicycle frame!”

— Simon Willison

The model I’m using as the executor beat the model I’m using as the architect on that particular task — and Willison ran it on a laptop, not even the GPU I serve it from.

So why split them at all? Because the benchmark that Simon runs isn’t the job. The job is planning a multi-step change across real infrastructure, pulling current API docs, and not getting lost — and on that, the frontier model is meaningfully better. MindStudio put the honest version of this well:

Open-weight models run roughly 3-6 months behind frontier on most benchmarks … the real skill in 2026 is knowing which is which.

— MindStudio

The split is the skill. Smart model for the thinking that’s hard, local model for the volume of execution where the keys live.

But here’s the part that makes the local executor viable: the frontier models may be way ahead on raw coding, but that’s not the same skill as operating a machine. When the job is running commands and finding your way around a terminal, even smaller quants of Qwen 32-class models are more than capable. The Qwen3.6 release leaned hard into exactly this — agentic coding, repository-level reasoning, terminal workflows — and on the agentic-terminal benchmarks it lands close to frontier models even while those models keep a clear lead on pure code-writing tests like SWE-bench. Writing a hard piece of code from scratch and navigating a filesystem to execute a known plan are different jobs, and the local model is genuinely good at the second one. That’s exactly the half I’m handing it.

Making the small model smarter with scaffolding

A weak model directed by a strong one only works if you give the weak model enough context to execute correctly. This is the actual work, and it’s where most of the leverage is.

The pattern looks like this. You use the high-level model to give you a plan. Then you use that plan to build out the scaffolding a local agent needs to actually get the work done — a skill, a persona, a SOUL.md, the components that make up a working agent. You’re not making the local model smarter by swapping the weights. You’re making it smarter by loading in the right context.

Concretely, a skill is a file that loads relevant context into the agent before it acts — for example, an entire API’s documentation, so that every time the agent goes to do a particular kind of task, it already knows how. A SOUL.md is durable personality and rules that load every session. A persona is a role the agent adopts for a class of work. Together they turn a general local model into something that can reliably do a specific job.

Take a deployment platform with an API — something like Coolify, the self-hostable Netlify-style platform. You can load the entire API into a skill, give the agent a persona for that platform, and then dispatch it as a sub-agent. The main agent says, in effect, “let the platform agent handle that,” and it does — it finds the right endpoint, pulls in the API keys it needs, writes the code, and reports back. It has enough context to do it correctly because you built that context in. The work that used to be “go read the docs and figure it out” becomes “the agent already knows.”

This is what the frontier model is for in my setup, beyond the initial plan. When I finish building something, I’ll ask Claude to write me the instructions my local agent needs to deploy it. It produces the handoff; the local agent executes it. The architect writes the plan, the local hands carry it out.

The honest part: you can blow your own foot off

Letting an agent touch real infrastructure carries real risk, and you should not pretend otherwise. I’m going to be direct about this because the alternative — a showcase that only shows the wins — isn’t useful to anyone deciding whether to do this themselves.

Here’s the honest accounting. An agent can get into a spiral. It will sometimes misdiagnose what’s going on and then make a decision based on the misdiagnosis. If you give it enough autonomy and you’re not careful, you can blow your whole foot off — lose work, break a service, take down something you needed. That’s the downside, stated plainly.

Four things make it survivable.

Back up before anything touches production. My higher-level agent always tells the sub-agent to do a backup before it starts. My entire Proxmox host gets backed up before any major change to any service. If backups aren’t part of your strategy from the start, you will eventually destroy something you can’t get back. This is rule one, not an afterthought.

Tell it to stop and check. You have to explicitly instruct the agent: if you don’t know, or if the approach I told you to use isn’t working, stop and ask me a question. Given that instruction, the agent will pause and wait for input instead of barreling ahead. Without it, it spirals. When the agent keeps hitting the same wall, the right behavior is stop-and-fix, not narrate-and-bypass — and you have to build that in.

Build infrastructure rules and make it check them. The local model gets things wrong. It’ll misread a situation and reach for the wrong fix. The pattern that works is a standing rule the agent has to consult before making another decision when something breaks — check the infrastructure rules first. You’re not getting this for free; you’re training the behavior in over time.

Stay in the loop. You drive this. At the end of the day I drive all of this myself. There’s some sitting back and enjoying my coffee while I’m on my walking pad — my daughter calls it my “walking step” — but I’m often intervening, stopping the agent the moment something goes off the rails. Giving it 100% autonomy is a sure-fire way to get yourself into trouble. The agent is doing the work; it is not running unsupervised. That distinction is the whole difference between this being useful and this being a liability.

None of this makes the risk zero. It makes it manageable. And here’s the honest cost-benefit: yes, there’s a real chance I break something. But before I had this, the work simply wasn’t getting done. I didn’t have the time. So the question isn’t “is this risk-free” — it’s “is the chance of breaking something worth weighing against a task that otherwise never happens at all.” For me, on my own infrastructure, with backups, it is. At 2am doing it by hand, tired, I could mess it up too.

Two caveats that matter for anyone reading this as a template. If you’re doing this on production infrastructure that other people depend on, it needs real technical oversight — a person who knows what they’re looking at, reviewing what the agent does. And you’d want a much smarter, more capable frontier model as the executor, not a local 32GB model. That doesn’t mean giving up the sovereignty argument, either — you might reach for a bigger open-weight model like Kimi K2.6 and spin it up on an offsite GPU you rent by the hour, so the weights and your data still run on infrastructure you control rather than someone else’s managed API. But that’s a conversation for another day. The local-executor choice is right for my use case. It is not automatically right for yours.

There’s also a security point worth being straight about, because it cuts against the easy version of the local-is-safe story. As Security Boulevard put it, a local model keeps your code private, but it does not automatically make your agent secure — the moment a self-hosted agent gets credentials, repository access, and the ability to run tools, you’ve created a different kind of exposure. Local inference solves the “my keys went to someone else’s API” problem. It does not solve the “my agent has my keys and just did something dumb with them” problem. Those are separate, and only the first one is solved by where the model runs.

An ornate golden key resting on dark folded velvet — Photo by Sarah Penney on Unsplash

Why the executor is local: the keys never leave

This is the part that ties back to the data sovereignty argument, and it’s the real reason for the architecture.

The agentic work touches secrets constantly. API keys, infrastructure credentials, config, the contents of private services. That’s not incidental to the job — it is the job; executing real work means handling the real credentials. If I ran that execution on a frontier API, every one of those secrets would flow into someone else’s model to be processed. The whole point of keeping your own data is gone if you turn around and pipe your credentials through a cloud endpoint to get work done.

So the division of labor is also a security boundary. Claude Opus 4.7 architects from a distance — it sees the plan, the shape of the problem, the public docs. It does not see my keys. The local model does the hands-on work in the place where the keys already live, and none of that leaves the hardware. The frontier model gets the thinking; the local model gets the secrets. That’s deliberate.

It’s the same asymmetry from the last post, applied one layer up. Last time it was: your activity data should exist on hardware you own before anyone else gets a copy. This time it’s: your credentials should stay on hardware you own even while an agent is actively using them. Same principle, same reason, different surface.

Two tools: Pi and Hermes

I run two different local agents for two different jobs, and the difference is worth explaining because it’s a real trade-off, not an upgrade path.

Pi is the lightweight one. (That’s Pi, p-i.) It’s a thin harness — you’re more or less interacting directly with the model and seeing exactly what it can do. That directness is the value: you get clean, direct feedback from the model about what it’s thinking, with very little between you and it. Pi did a lot of my initial setup, and it’s what I’d reach for to build a lightweight coding agent that does exactly what I want and nothing else. Experimentally it went well, especially with thinking turned on.

Hermes is the heavier one, and I switched to it for specific reasons. It has built-in skills that are useful, it has the persona system, and — the thing I was most interested in — it has Honcho, which is the memory layer. Hermes is NousResearch’s open-source agent; it learns across sessions, delegates to subagents, and you point it at whatever model you want with one command. Honcho, from Plastic Labs, gives it cross-session memory and user modeling — it derives facts from your conversations and recalls them later. That cross-session memory is the capability that made me move.

The trade-off is context. Hermes loads whatever scaffolding you give it up front, and that scaffolding isn’t free. In one session I watched it consume 22,000 of a 65,000-token context window just starting the conversation — but that was because I’d loaded a custom persona and a skill, and the persona carried a lot of API-specific instructions. That’s the cost of memory plus a heavy skill loading in before the agent does anything. Pi doesn’t carry that weight, because you’re not handing it that scaffolding in the first place. It’s not that one is better; they’re built for different things. Pi is lean and direct and good for coding. Hermes is heavier and gives you memory and personas and a skills system. I use both.

One concrete operational detail, because it’s the kind of thing you only learn by running it. I run Pi with reasoning on and Hermes with reasoning off. The reason isn’t architectural — it’s latency. On Pi, with the light harness and direct visibility into what’s happening, thinking-on was fine; the turnaround per call was acceptable and I could watch it reason. On Hermes, with everything already so heavy, thinking-on pushed the per-call turnaround time too long to be workable — a cost I wasn’t feeling on Pi. So I turned reasoning off, the results were still fine, and I kept going. A practical call, arrived at by feel, not a spec on a sheet.

A laptop screen displaying colorful source code in a dark room lit by blue and purple light — Photo by Mohammad Rahmani on Unsplash

What this actually gets you

The point of all of this is not any one thing I built with it. The point is the capability: I can hand real work to an agent and have it done, on my own hardware, without my credentials ever leaving the box.

Set up a platform with an API, use the frontier model to produce a plan and the scaffolding, build that into a skill and a persona, and the local agent can go execute against it — set things up, deploy things, manage infrastructure — as a sub-agent the main agent dispatches. And because I can reach the whole thing over Tailscale, I can kick off work from my phone when I’m away from the house and have the agent working on it while I’m not there. The work happens whether or not I’m sitting in front of the machine.

That’s the part I keep coming back to. A pretty unsophisticated local model, directed by a smart one, with enough scaffolding built in, gets real work done that I genuinely did not have time to do myself. Not perfectly. Not without oversight. But done.

FAQ

Can a small local model actually do real work? Yes, if you scaffold it. A model that fits on a 32GB GPU won’t out-plan a frontier model, but given a concrete plan and the right context loaded in — an API’s docs, a persona, standing rules — it executes real infrastructure tasks reliably. The capability gap that matters is in planning, and that’s the part you hand to the smarter model.

Why not just use the frontier model for everything? Two reasons. Cost and rate limits on high-volume agentic work, and — the bigger one for me — privacy. The execution step handles your API keys and credentials constantly. Running it locally means those never go to a cloud API. The frontier model plans from a distance without ever seeing your secrets.

Do AI agents send your API keys to the cloud? If the agent runs on a cloud API, then yes — whatever’s in its context, including credentials it’s handling, gets processed on someone else’s servers. Running the executing agent locally is how you keep credentials on your own hardware. Note this doesn’t make the agent itself secure; it just controls where the keys go.

What is a SOUL.md or an agent persona? A SOUL.md is a file of durable personality and rules loaded into the agent every session. A persona is a role the agent adopts for a class of work. A skill is loaded context — often an entire API’s documentation — so the agent already knows how to do a specific task. Together they turn a general local model into a reliable specialist.

Pi vs Hermes — which should I use? Different jobs. Pi is a lightweight harness with near-direct model access, good for building a lean coding agent and seeing exactly what the model is doing. Hermes is heavier, with built-in skills, personas, and cross-session memory via Honcho — better when you want the agent to remember things across sessions, at the cost of more context loaded up front.

Is it safe to let an AI agent touch my server? Not inherently. It can spiral, misdiagnose, and break things. Make it survivable with four rules: back up before any change, instruct it to stop and ask when it’s unsure, give it infrastructure rules it must check before acting on a failure, and stay in the loop yourself — don’t grant 100% autonomy. On production with other people depending on it, add real human oversight and use a more capable frontier model as the executor.

I’m still building this out — the agent layer is the most active part of my stack right now, and it changes most weeks. But the core pattern has held up: frontier model to architect, local model to execute, credentials staying on hardware I own the entire time. It’s the same sovereignty argument as keeping your own data, just moved one step closer to the thing actually doing work for you.

If you’re thinking about building something like this for yourself — a local agent on your own hardware, directed by a frontier model, with your secrets staying put — and you want to work through the architecture before you commit to it, reach out.

And if you’re a company sitting on infrastructure and internal data you’d rather not pipe through someone else’s API to get agentic work done, that’s the same problem at a different scale, and it’s one of the things I help with as a fractional CTO. The asymmetry between “your data and credentials on someone else’s infrastructure” and “on your own” doesn’t shrink at company scale. It grows.