Local Models: Specialists, Not Substitutes
A local OCR pipeline demonstrates where small models genuinely win
For context, I’m running an M3 MacBook Pro with 36GB of unified memory. It’s a capable machine for local inference, but it has a ceiling around 30B parameters. Anything larger either won’t load or crawls to unusable speeds.
Local models haven’t wowed me for coding tasks. I’ve tried routing smaller models into my workflow, and the results consistently fall short of providing real value for me. That’s an honest assessment, not a dismissal. In The Steering Tax, I explored why prime models earn their keep for judgment-heavy work. Secondary and local models cost less per token but more in human steering.
But that framing only covers one dimension. Instead of “can local models replace API calls for coding?” I started asking “what tasks are local models genuinely better suited for?” That reframe led me somewhere useful.
The Experiment: PDF to Markdown, Locally
PDF parsing is a universal developer pain point. You’ve been there: a vendor sends a spec as a PDF, a client shares contracts, or you need to extract content from a research paper. The options aren’t great. Cloud OCR services work but introduce privacy concerns and per-page costs. Open source libraries like PyPDF often produce garbled output from complex layouts.
I’d flagged GLM-OCR in a recent Weekly Review as worth investigating. Testing it against real documents confirmed the promise. It’s a vision model purpose-built for text recognition. It runs locally via Ollama, processes one page at a time, and produces clean markdown. No API keys. No network requests. No per-page billing.
I wrapped the entire pipeline into a Claude Code Skill so it’s accessible with a single command: /pdf-ocr contract.pdf. The Skill checks dependencies, converts each PDF page to an image, runs it through GLM-OCR, and assembles the output into a single markdown file.
The Technical Decisions That Mattered
Three choices made the difference between usable and frustrating:
Image sizing is everything. GLM-OCR accuracy drops with oversized images. The pipeline renders pages at 72 DPI, then resizes to a maximum of 1024 pixels on the longest edge. That specific combination consistently produces the best recognition quality. Going higher actually hurts.
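To make that concrete, here's a minimal sketch of the render-and-resize step, assuming poppler's pdftoppm and macOS's sips are on the PATH. The function name and file layout are illustrative, not the Skill's actual code.

```python
import subprocess
from pathlib import Path

def render_page(pdf_path: str, page: int, out_dir: str) -> Path:
    """Render one PDF page at 72 DPI, then cap the longest edge at 1024 px."""
    prefix = Path(out_dir) / f"page-{page:03d}"
    # pdftoppm (from poppler) renders a single page to PNG at 72 DPI
    subprocess.run(
        ["pdftoppm", "-png", "-r", "72",
         "-f", str(page), "-l", str(page),
         pdf_path, str(prefix)],
        check=True,
    )
    image = next(Path(out_dir).glob(f"page-{page:03d}*.png"))
    # sips (macOS) resizes in place so the longest edge is at most 1024 px
    subprocess.run(
        ["sips", "--resampleHeightWidthMax", "1024", str(image)],
        check=True,
    )
    return image
```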
The prompt is weirdly specific. After testing variations, the prompt "Text Recognition:" outperforms everything else. More descriptive prompts like “Extract all text from this document” produce worse results. The model was trained with this exact prompt format. Respecting that training matters.
Zero pip dependencies. The Python script uses only stdlib: subprocess, json, base64, urllib. This means no virtual environment, no dependency conflicts, no setup friction. The heavy lifting comes from system tools (pdftoppm from poppler, sips on macOS) and Ollama.
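The stdlib-only constraint extends to the Ollama call itself: urllib against the local HTTP API instead of the ollama client package. A rough sketch, using the glm-ocr model name from the install notes below; the helper name is mine, not the Skill's.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def ocr_page(image_path: str, model: str = "glm-ocr") -> str:
    """Send one page image to a local Ollama model and return the recognized text."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = json.dumps({
        "model": model,
        "prompt": "Text Recognition:",  # the exact prompt the model was trained on
        "images": [image_b64],
        "stream": False,
    }).encode("utf-8")

    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Assembling the document is then just a loop: render each page, run it through the OCR call, and join the per-page results into a single markdown file.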
Why Local Wins Here
Until recently, models small enough to run on consumer hardware couldn’t produce results worth using for tasks like this. That’s changed. Purpose-built models in the 1-4B parameter range now handle specialist tasks with real accuracy. GLM-OCR is a clear example: small enough for a laptop, accurate enough for real documents.
For individual developer use, a local model isn’t a compromise. It’s genuinely the better tool. Privacy is binary: legal documents, contracts, and internal specs either leave your machine or they don’t. There’s no “mostly private” cloud OCR. Running locally means the content never touches an external server. And it works offline. Airplane, VPN issues, flaky coffee shop WiFi. The pipeline doesn’t care.
The economics make sense too. A 50-page document through a cloud OCR API costs real money. Locally, it costs electricity. Latency is predictable with no network variability. On my M3-series Mac, each page processes in a few seconds. For enterprise use, the trade-off shifts: you’re maintaining infrastructure to run the model. But OCR is stateless, so the operational overhead is minimal. At scale, the cost savings over per-page API pricing could be significant.
Specialists, Not Substitutes
This experiment revealed a pattern I see repeating across other local model use cases. Local models aren’t underpowered versions of cloud models. They’re specialists.
GLM-OCR does one thing: recognize text in images. It’s small, fast, and tuned for exactly that task. It doesn’t need to write code, summarize articles, or hold conversations. That focus is its advantage.
Embedding models for local semantic search. Classification models for content routing. Whisper for audio transcription. Each is a focused tool that happens to run on your hardware. The value isn’t in replacing your cloud LLM. It’s in handling the tasks where a specialist outperforms a generalist, or where local execution is the better engineering choice. And this category is growing. Small, focused models improve with each generation while consumer hardware keeps closing the gap. The range of specialist tasks worth running locally will continue to expand.
Getting Started
The pipeline is available in Altered Craft’s ac-document-gen Claude Code plugin. Install it through the plugin marketplace:
```
# Add the marketplace
/plugin marketplace add AlteredCraft/claude-code-plugins

# Install the plugin
/plugin install ac-document-gen@alteredcraft-plugins
```

After restarting Claude Code, the /pdf-ocr skill is available. You’ll need Ollama running with the glm-ocr model pulled, and poppler installed (brew install poppler on macOS). The skill checks for these on first run and walks you through any missing dependencies.
Note: This pipeline was built and tested on macOS. The Python script uses only the standard library, but the system tools (poppler, sips) may need adaptation on other platforms.
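For the curious, a first-run dependency check along those lines can be built from a couple of stdlib probes. This is a sketch of the idea, not the plugin's actual implementation; the messages and model name follow the instructions above.

```python
import shutil
import subprocess

def check_dependencies() -> list[str]:
    """Return human-readable messages for anything that's missing."""
    missing = []
    # poppler provides pdftoppm for page rendering
    if shutil.which("pdftoppm") is None:
        missing.append("poppler not found (on macOS: brew install poppler)")
    # Ollama must be installed and the OCR model pulled
    if shutil.which("ollama") is None:
        missing.append("ollama not found (https://ollama.com)")
    else:
        models = subprocess.run(["ollama", "list"], capture_output=True, text=True)
        if "glm-ocr" not in models.stdout:
            missing.append("model not pulled (run: ollama pull glm-ocr)")
    return missing
```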
The next time you catch yourself dismissing local models, try reframing the question. Not “is this as good as Claude?” but “is this the kind of task where a specialist model, running on my hardware, is the right tool?” You might be surprised how often the answer is yes.


