Why I built llm-queue: one local model, one queue
BEFORE: two processes, two private queues, one small GPU
jobbot      ─┐
             ├──▶ Ollama
slop filter ─┘
both hit the model at once → reload thrash, ~4x slower

AFTER: one shared queue over HTTP, in front of one model
jobbot      ─┐
             ├─ HTTP ─▶ llm-queue ─ fetch ─▶ Ollama
slop filter ─┘
one request at a time → the model loads once and stays put

I have two side projects that both lean on a local LLM, and for a while they kept making each other slow. Fixing that turned into a small package called llm-queue: one process that owns the model and runs every request, from every program, through a single serialized priority queue behind an OpenAI-compatible HTTP API.
The two projects are jobbot, a bot that scrapes job boards and classifies each posting with a local model (is this actually a job? is it remote? does it want a language I don’t speak?), and a browser extension that hides LinkedIn feed slop the same way. One’s a Node cron job, the other’s an MV3 service worker. Both reach for the same Ollama model on my laptop.
The problem: one model, one small GPU
Two facts make concurrency a bad bet here. A local model answers one request at a time, so parallel callers don’t actually run in parallel. And my GPU has 6 GB of VRAM, which barely fits a single model, so there’s no room to load a second one beside it.
So the obvious plan, let both projects run and hit Ollama whenever they want, fell apart fast. When two processes hit it at once with different models, or even the same model at a different context size, Ollama unloads one and loads the other on every call. That reload is gigabytes off disk, and it turns out to be most of your wall-clock time. I benchmarked it to be sure: throughput dropped to about a quarter of what a single client got. Concurrency made things slower. About four times slower.
Pinning both models in memory doesn’t save you either. They don’t fit, so they spill onto the CPU and fight over cores, and now both are slow. On this hardware, nothing beats one request at a time.
The fix: one shared queue
The answer is boring. Don’t run requests in parallel. Put them in one line and serve them one at a time. The line just has to be global, shared across every process, and the one thing all my programs already speak is HTTP.
So llm-queue is a single process that owns the Ollama endpoint and the model. Everything else talks to it over HTTP. One queue, one model, one request in flight, ever, so the model loads once and stays put.
There’s one wrinkle: not every request is equal. jobbot fires off dozens of background classifications while it scrapes, but when I click a job to draft an application, I’m waiting on that one answer right now. So it’s a priority queue. A request marked priority jumps the waiting backlog, though it never interrupts whatever’s already running.
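The behavior described above (one request in flight, priority requests jump the backlog, nothing running gets interrupted) can be sketched with two FIFO lanes and a single drain loop. This is a minimal illustration, not llm-queue's actual source: SerialQueue, enqueue, and the two-lane layout are hypothetical names and choices made up for the example.

```typescript
// Hypothetical sketch: SerialQueue and enqueue are illustrative names,
// not llm-queue's real API.
type Job<T> = {
  run: () => Promise<T>
  resolve: (value: T) => void
  reject: (reason: unknown) => void
}

class SerialQueue {
  private priority: Job<any>[] = [] // waiting jobs that jump the line
  private normal: Job<any>[] = []   // everything else, first in first out
  private busy = false              // true while a job is in flight

  enqueue<T>(run: () => Promise<T>, opts: { priority?: boolean } = {}): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const lane = opts.priority ? this.priority : this.normal
      lane.push({ run, resolve, reject })
      void this.drain()
    })
  }

  // Runs waiting jobs one at a time. A priority job is picked before any
  // normal job, but only once the currently running job has finished.
  private async drain(): Promise<void> {
    if (this.busy) return
    this.busy = true
    try {
      for (;;) {
        const job = this.priority.shift() ?? this.normal.shift()
        if (!job) break
        try {
          job.resolve(await job.run())
        } catch (err) {
          job.reject(err)
        }
      }
    } finally {
      this.busy = false
    }
  }
}
```

If a, b, and a priority c are enqueued back to back, a starts immediately, c runs second, and b runs last. Each caller still gets its own result back through the promise that enqueue returned.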
Making it speak OpenAI
I didn’t want to ship a client library. Everything already speaks the OpenAI API, so I made the service speak it too. Point any OpenAI client at http://localhost:11500/v1 and you get the queue for free, no SDK swap:
import OpenAI from 'openai'

const client = new OpenAI({ baseURL: 'http://localhost:11500/v1', apiKey: 'unused' })
const r = await client.chat.completions.create({
  model: 'granite4.1:8b',
  messages: [{ role: 'user', content: 'hello' }],
})

The two knobs the queue needs don’t exist in the OpenAI shape, so they ride along as extra body fields. priority is the line-jump from above. numCtx is the context window a client wants; the service runs at the largest size any client asks for and never shrinks back, so the model doesn’t reload to a smaller window mid-day. Stock OpenAI servers ignore fields they don’t recognize, so the exact same request still works against the real API. It just drops the two extras.
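The "largest size wins, never shrinks" rule for numCtx is a high-water mark. A minimal sketch, assuming the service tracks a single number per model: the ContextWindow class and the 4096 starting value are invented for illustration and are not taken from the package.

```typescript
// Hypothetical helper: remembers the largest context window requested so far.
class ContextWindow {
  private current: number

  constructor(initial: number) {
    this.current = initial
  }

  // Returns the window to run the next request with. It grows when a client
  // asks for more and otherwise stays where it is, so a smaller request
  // never triggers a reload to a smaller window.
  request(numCtx?: number): number {
    if (numCtx !== undefined && numCtx > this.current) {
      this.current = numCtx
    }
    return this.current
  }
}

const ctx = new ContextWindow(4096) // assumed starting size
ctx.request(8192) // a client asks for more: the window grows to 8192
ctx.request(2048) // a later, smaller request still runs at 8192
```

The trade-off is that one large request raises memory use for everyone until the service restarts, which fits the goal of never reloading mid-day.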
The service also hands back the model’s raw string and nothing else. No JSON parsing, no schema validation. My two consumers parse the output differently, and every time I pictured “helping” them inside the queue, I was really just guessing wrong on their behalf. The queue does the one job only it can do: serialize access, with a per-attempt timeout and one retry. The rest is the caller’s problem.
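A per-attempt timeout with one retry can be sketched as a small wrapper around any async call. This is an assumption-laden illustration rather than the package's code: attemptWithTimeout and withRetry are hypothetical names, and the choice to retry on every kind of error is a guess.

```typescript
// Hypothetical sketch: names and retry policy are illustrative only.
type Attempt<T> = (signal: AbortSignal) => Promise<T>

// Runs one attempt and rejects if it takes longer than timeoutMs.
async function attemptWithTimeout<T>(attempt: Attempt<T>, timeoutMs: number): Promise<T> {
  const controller = new AbortController()
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort() // lets a fetch-based attempt cancel its request
      reject(new Error(`attempt timed out after ${timeoutMs} ms`))
    }, timeoutMs)
  })
  try {
    return await Promise.race([attempt(controller.signal), timeout])
  } finally {
    clearTimeout(timer)
  }
}

// Tries once, then retries up to `retries` more times (one by default).
// Each attempt gets a fresh timeout; the last error is rethrown.
async function withRetry<T>(attempt: Attempt<T>, timeoutMs: number, retries = 1): Promise<T> {
  let lastError: unknown
  for (let i = 0; i <= retries; i++) {
    try {
      return await attemptWithTimeout(attempt, timeoutMs)
    } catch (err) {
      lastError = err
    }
  }
  throw lastError
}
```

With fetch, passing the signal through means a timed-out attempt actually cancels the request instead of leaving it running behind the retry, for example withRetry((signal) => fetch(url, { method: 'POST', body, signal }), 120_000), where url, body, and the two-minute budget are placeholders.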
Two projects, one queue, no shared code
The part I like most is that neither consumer depends on the package. jobbot doesn’t npm install llm-queue. Neither does the extension. They just POST to the running service:
const res = await fetch('http://localhost:11500/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: pageText },
    ],
    response_format: { type: 'json_object' }, // ask for JSON
    numCtx: 8192, // llm-queue extension
  }),
})
const { choices } = await res.json()

That’s the whole integration. The extension, running in a completely separate browser, sends its own classifications to the same service. (Browsers block a content script from calling localhost, so that fetch lives in the service worker, and the service sends permissive CORS headers so it goes through.) The two projects have never heard of each other. They don’t share a queue in code. They share one in production, because there is exactly one and it’s reachable over HTTP. On a 6 GB GPU that’s the difference between a model that loads once and one that thrashes all day.
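On the CORS point: a JSON POST from a browser context is preceded by an OPTIONS preflight, because Content-Type: application/json is not one of the "simple" content types. A permissive server answers that preflight and attaches the same headers to the real response. The sketch below shows one way to do that; the exact header values and the handlePreflight name are assumptions, not what llm-queue necessarily sends.

```typescript
// Hypothetical sketch of permissive CORS handling; header values are assumptions.
const CORS_HEADERS: Record<string, string> = {
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Methods': 'POST, OPTIONS',
  // Authorization is listed because OpenAI clients send an API key header.
  'Access-Control-Allow-Headers': 'Content-Type, Authorization',
}

type Preflight = { status: number; headers: Record<string, string> }

// Returns the response for an OPTIONS preflight, or null when the request
// should continue to the real handler (which attaches CORS_HEADERS as well).
function handlePreflight(method: string): Preflight | null {
  if (method.toUpperCase() !== 'OPTIONS') return null
  return {
    status: 204, // no body needed for a preflight
    headers: { ...CORS_HEADERS, 'Access-Control-Max-Age': '600' },
  }
}
```

A wildcard origin is only reasonable here because the service listens on localhost and carries no credentials; anything exposed beyond the loopback interface would want a tighter allow-list.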
Building it
It’s small on purpose: about 700 lines across the queue, the HTTP server, the fetch transport, and the CLI. TypeScript, ESM, Node 20+, almost no dependencies, talking to Ollama over plain fetch. tsup builds it, vitest covers the queue and the server, and release-please turns my conventional commits into the version bump, changelog, and npm publish on merge.
Try it
If you run a local model and more than one thing wants to use it, this is the missing piece:
npm i -g llm-queue
OLLAMA_MODEL=granite4.1:8b llm-queue serve # → http://127.0.0.1:11500

Then point anything OpenAI-shaped at http://localhost:11500/v1, or just fetch it directly. It’s on npm, the source is on GitHub, MIT licensed. If it saves your local model from thrashing the way it saved mine, a star on the repo is the easiest way to help the next person find it.