May 21, 2026 · updated Jun 22, 2026 · Local LLM / Intel Arc / llama.cpp / vLLM / Qwen

Running Qwen3.6 Locally on the Intel Arc Pro B70: Why I Chose llama.cpp Over vLLM

Real benchmarks running Qwen3.6-35B-A3B on the Intel Arc Pro B70. Why llama.cpp beats vLLM for this model today, and what's coming next.

A close-up of a computer motherboard, lit from below in a dark room — Photo by Tai Bui on Unsplash

Update — June 2026: I re-ran this whole benchmark suite about two months after publishing, and the throughput numbers below are now low — in places by integer multiples. A Mesa 26.1 graphics-driver upgrade alone roughly doubled single-stream decode (the 42.5 t/s figure here is now ~76 t/s), and with --ubatch-size tuning and better concurrency scaling, aggregate throughput climbed about 8× at the same settings. The llama.cpp-over-vLLM recommendation still stands; the raw performance figures don’t. See Don’t trust your old benchmarks: a weekend of testing on the Intel Arc B70 for the full write-up.

So I’ve been building out a local AI server, and I want to share where I’m at with it. The short version: I’m trying to get a fully offline, private AI running on hardware I own, that can handle whatever workloads I want to throw at it, where I’m completely in control of the data and the stack. It’s part personal infrastructure project, part skill-building exercise — there’s no better way to develop expertise in this space than to actually do it.

The hardware is an Intel Arc Pro B70. Intel released this card in March 2026, and it’s a really interesting piece of hardware. 32GB of VRAM for about a thousand dollars. That’s the killer spec. Compare it to something like a GeForce RTX 5090, which has an MSRP of $1,999 but is actually selling for around $3,600 to $3,950 on the street right now, and the math gets really compelling really fast. If you want to run larger language models locally, you need VRAM, and the B70 gives you a lot of it for not a lot of money.

The “why bother” question for local AI is real, though, so let me address it. Right now if you want to use an AI assistant, the easy path is to pay one of the big providers for API inference. You hit their API, you get incredibly fast responses from server-grade GPUs, and the responses are high quality. The pricing is actually pretty great today — Anthropic’s Claude Max plans at twenty, a hundred, and two hundred dollars a month give you a ton of inference for the money. So why not just do that?

Two reasons. First, those prices are almost certainly being subsidized right now. The compute these requests run on is expensive, and the per-token economics are unlikely to stay this favorable forever. Building a local capability now is a hedge against that. Second, and more importantly to me, every request you send to a cloud provider is a request that leaves your machine. You’re trusting them with your code, your business data, whatever context you’re feeding the model. For a lot of work that’s fine. For some of it, it isn’t. Having a local option means you choose.

So that’s the setup. Now the actual question this post is about.

vLLM vs llama.cpp on the Intel Arc Pro B70

When you’re running an AI inference server on a Battlemage B70, should you use vLLM or llama.cpp?

These are the two main open-source inference engines for running large language models locally. llama.cpp is the more general-purpose one — it talks to your GPU through standard graphics APIs (Vulkan in my case), works on basically anything, and just runs. vLLM is the more sophisticated one — it has a smarter architecture for handling multiple requests at once, and Intel ships a specific version of it called llm-scaler-vllm that’s optimized for their Arc GPUs. On paper, vLLM is the better choice for the kind of workload I want to run.

I went into this wanting to use vLLM. Here’s what I found, and what I’d recommend.

Why Qwen3.6-35B-A3B Specifically

Before I get into the engine question, I want to be clear about the model I’m running and why, because it ends up mattering for the whole story.

The model is Qwen3.6-35B-A3B. It’s a Mixture of Experts model with about 3B active parameters per token from a 35B total pool — meaning it has the responsiveness of a much smaller model but the knowledge and reasoning of a much larger one. It was released by Alibaba in April 2026 and it’s specifically optimized for agentic coding and tool use, which is exactly the kind of work I want to do.

The reason I’m focused on 3.6 and not the older 3.5 — which is on Intel’s supported list — is that the jump between them is significant. Same architecture, same parameter count, but the training and post-training improvements between versions produce real benchmark gains:

SWE-bench Pro (real GitHub repo problem solving): 49.5 on 3.6 vs 35.7 on 3.5 — a 38% improvement
Terminal-Bench 2.0 (agentic terminal coding): 51.5 on 3.6 vs about 40.5 on 3.5 — a 27% improvement
Across agentic coding benchmarks generally, 3.6 “dramatically surpasses” 3.5
3.6 also introduces “thinking preservation” — the ability to retain reasoning context across messages, which makes iterative agent work substantially more stable

For coding and agentic tasks, 3.6 is the version you want. It’s not a small upgrade. Settling for 3.5 to get vLLM support would mean leaving a lot of capability on the table for the specific workloads I care about most. So the question wasn’t “can I get vLLM working at all” — it was “can I get vLLM working on the model that’s actually worth running.”

The Agentic Coding Stack: Pi and Local Inference

I’m building out my own internal agent infrastructure — software that connects together via MCP (a protocol that lets AI agents talk to your apps) and can do things like data retrieval, task aggregation across my projects, RAG pipelines over my own documents. The kind of stuff where I can ask “what are the tasks I need to do on my sales app today” and it goes and fetches the answer. For this kind of work, I make direct API calls into the model from my own apps. Fast, snappy, bounded — exactly what local inference is good at.

For coding workflows specifically, I’m using Pi — an open-source coding agent CLI from Mario Zechner. Pi works as a drop-in alternative to Claude Code, where I can point it at my local llama-server instead of Anthropic’s API. The ultimate vision for coding is that I can open Pi, do all my work through it, and not have to send anything to a cloud provider.

I want to be straight about where this is and isn’t working today. For the small agentic tasks I described above, it’s working great. For full end-to-end coding workflows where the agent iterates autonomously for fifteen minutes on a complex task? Not yet. Claude Code’s harness combined with their top-tier models can self-iterate through complex problems faster than my local setup can. What takes Claude Code fifteen minutes might take Pi on my hardware three hours with a lot more human-in-the-loop prompting. We’ll get there, but a single B70 with a 35B model isn’t replacing Claude Opus today. That requires either more B70s or a bigger GPU, both of which are upgrade paths I can take later.

So the use case right now: agentic infrastructure for everything except coding (via direct API calls), and Pi-driven coding workflows that work but require more babysitting than the cloud equivalent. That’s where the stack is, and it’s useful.

A rack of servers in a data center — Photo by Kevin Ache on Unsplash

Why I Wanted vLLM (PagedAttention vs Static Slots)

The kind of workload I’m describing — multiple agents firing requests in parallel, mixed sizes — has a specific need. Sometimes I need to dispatch five small agentic tasks concurrently. Sometimes I need to send one big request with a lot of context for a deeper thinking task. I want the system to handle both gracefully, dynamically, without me having to predecide what shape my requests are going to be.

llama.cpp handles concurrency, but it does it in a static way. When you launch it, you tell it how many slots you want and it carves the available memory into that many equal pieces. Pre-committed. If I set up eight slots for eight concurrent agents, I literally cannot send a large request — there’s no slot big enough to hold it. If I set up two big slots for large requests, I can only run two things at once. It’s one or the other, and I have to choose at startup.

vLLM doesn’t work that way. It uses something called PagedAttention, where the memory is a shared pool and each request grabs what it needs. Big request comes in? It grabs a lot. Small request comes in? It grabs a little. They share the pool dynamically. For my exact use case, that’s the right architecture.

So I tried to migrate.

What Happened When I Tried vLLM on Battlemage

I won’t drag you through every detail. Iteration was actually pretty painless — I’m using Claude Code itself to help write the deployment scripts, so I can quickly try something, see what breaks, and rewrite the script in seconds. That’s a big shift from how this kind of work used to go. The pain of trying something risky has gotten really low.

The short version: Qwen3.6-35B-A3B isn’t on Intel’s list of supported models for llm-scaler-vllm. I knew this going in, but the architecture is identical to Qwen3.5-35B-A3B which is supported, so I figured it was worth trying. After a few rounds of iteration, the answer was clear: it doesn’t work today. Both INT4 quantization formats I tried (AWQ and GPTQ) failed to load. Intel needs to add Qwen3.6 to their supported list, which they’ll likely do in a future image release, and at that point I’ll re-test.

I’m not willing to downgrade to 3.5 to get vLLM working. Given the benchmark gap I described earlier, that would be trading meaningful model capability for an architecturally better runtime — and the model capability matters more for what I’m trying to do. So vLLM is on hold until Intel ships Qwen3.6 support.

Intel Arc Pro B70 llama.cpp Benchmarks (Qwen3.6-35B-A3B)

For context on what “working well” actually looks like, here are the numbers from my current setup running Qwen3.6-35B-A3B on llama.cpp with the Vulkan backend on the B70:

Metric	Result
Single-stream throughput	42.5 tokens/sec
4-way concurrent aggregate	105 tokens/sec
8-way concurrent aggregate	104 tokens/sec
Context window	262K tokens (4-bit KV cache quantization)

The concurrent numbers tell you the slot architecture is doing its job — total throughput scales when you fan out requests, even if it’s not as elegant as vLLM’s continuous batching would be. 262K context is the headline number for me. That fits because llama.cpp lets me quantize the cache down to 4-bit, which makes it about four times smaller than the default. vLLM-XPU doesn’t document support for this, which means even if Qwen3.6 were supported, switching would have roughly cut my context ceiling in half.

That tradeoff calculation is worth being honest about. vLLM’s elastic concurrency would have been better for my agentic workloads. But losing half my context ceiling for it isn’t free. The right answer for now is the one that works.

This kind of decision — choosing inference frameworks for a private LLM deployment — is what I help CTOs and engineering teams work through. If you’re evaluating local AI infrastructure for your organization, get in touch.

The Recommendation: llama.cpp Today, vLLM Eventually

If you’re setting up an Intel Arc Pro B70 to run Qwen3.6, use llama.cpp with the Vulkan backend. It works, it’s fast enough, and you get the full context window the model supports. Skip vLLM until Intel adds Qwen3.6 to their supported model list.

If you’re running models that are on Intel’s vLLM supported list (Qwen3.5-35B-A3B, Qwen3-30B-A3B, DeepSeek-R1-Distill variants, gpt-oss), then vLLM is probably the better choice. The architecture genuinely is better for mixed workloads when it works.

The bigger principle here, for me, is that trying something and bailing fast when it doesn’t work is the right move. The cost of experimentation has dropped enormously — I can iterate on infrastructure scripts in seconds with Claude Code’s help. That changes the calculus on what’s worth trying. It’s not “should I commit a weekend to this,” it’s “should I spend an hour seeing if it works.” That’s a different question with a different answer.

A black computer tower with blue accent lighting — Photo by Andy Holmes on Unsplash

What’s Next

This is an ongoing project, and I’ll write more as it develops. The next checkpoint is whenever Intel ships their next llm-scaler-vllm image with Qwen3.6 support and I can re-test. Once that works — assuming it eventually does — I’ll publish an update with what the migration actually looks like and what the performance picture changes to.

In the meantime, I’ve got more local AI experiments in the queue: agent infrastructure, RAG over my own data, fine-tuning runs on hardware I own. The B70 is the first piece of an actually useful private stack, not the whole thing.

If you’re standing up local LLM infrastructure for yourself — picking an inference engine, sizing a build against a model, choosing between vLLM and llama.cpp for whatever you’re actually trying to run — and you want to work through the architecture before you commit weeks of iteration to it, reach out.

And if you’re a company looking at private AI infrastructure and you’d rather run inference on hardware you control than send everything through OpenAI or Anthropic, this is one of the things I help with. I work as a fractional CTO across a range of technical problems, and the same tradeoffs that decide between vLLM and llama.cpp on a B70 scale up — the choice of engine, the cost of a context window, the operational reality of self-hosted AI are the same shape of decision at any size. The budgets and the stakes just get bigger.