Whenever you offer a user interface where people can enter data freely, you give them a chance to type in Personally Identifiable Information (PII). In most cases this causes no issues. But in heavily regulated domains such as government services, asylum case management, and healthcare, leaked PII can have serious consequences. So how do you stop sensitive data from reaching your servers in the first place?
Local-first AI can be part of the solution.
What is Local-first AI?
The core idea is simple: run inference locally by default, and only escalate to cloud services when the task requires it. Local-first AI is not a replacement for frontier models. It is a design choice about where data flows and who controls the inference layer.
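To make the "local by default, cloud only when required" rule concrete, here is a minimal routing sketch. The task shape, the token budget, and the function name are all assumptions made for illustration; a real application would pick its own criteria for when a task needs to escalate.

```typescript
// Hypothetical task description; these names are illustrative only.
type TaskKind = "classify" | "summarize" | "generate";
type Task = { kind: TaskKind; inputTokens: number; containsPii: boolean };
type Route = "local" | "cloud";

// Assumed context budget for a small in-browser model.
const LOCAL_MAX_INPUT_TOKENS = 2048;

function chooseRoute(task: Task, localModelReady: boolean): Route {
  // Text flagged as containing PII stays on the client, even if that
  // means waiting for the local model to finish loading.
  if (task.containsPii) return "local";
  if (!localModelReady) return "cloud";

  // Small models handle narrow tasks well; open-ended generation escalates.
  const suitsSmallModel = task.kind === "classify" || task.kind === "summarize";
  const fitsBudget = task.inputTokens <= LOCAL_MAX_INPUT_TOKENS;
  return suitsSmallModel && fitsBudget ? "local" : "cloud";
}
```

The point of the sketch is the ordering of the checks: the privacy rule comes first, so no capability or availability concern can override it.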

There are four main drivers for architecting for Local-first AI:
- Reduced Cost: $0 marginal cost per token.
- Privacy: Data never leaves the client.
- Speed: Near-instant responses for classification and summarization, with no network round-trip.
- Offline capability: Once the model is downloaded, no internet connection is required.
Local-First AI in the Browser
Most local AI models ship as part of a native application, such as image recognition on a phone. But what about the browser? A huge share of user interactions happen in web apps, outside any downloaded native application. On December 1, 2025, WebGPU was announced as shipping in all major browsers, making it the new standard for modern web graphics and compute. By combining WebGPU with Transformers.js, you can now run small models directly in the browser with low latency.
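Even with broad browser support, a given device may not expose a usable GPU, so it can help to check before loading a model. The sketch below probes for WebGPU via `navigator.gpu.requestAdapter()` and falls back to WebAssembly. The `NavigatorLike` type and the function names are assumptions introduced here so the example compiles without WebGPU type packages.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
type GpuLike = { requestAdapter(): Promise<unknown> };
type NavigatorLike = { gpu?: GpuLike };
type Device = "webgpu" | "wasm";

// True when the browser exposes the WebGPU entry point at all.
function hasWebGPUApi(nav: NavigatorLike | undefined): boolean {
  return nav !== undefined && nav.gpu !== undefined;
}

// Pick a backend: WebGPU when an adapter is actually available, else WASM.
async function pickDevice(nav: NavigatorLike | undefined): Promise<Device> {
  if (nav === undefined || nav.gpu === undefined) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm"; // treat any adapter error as "no GPU"
  }
}
```

In a browser the argument would be the global `navigator`. The result could then be passed as the `device` option when creating a Transformers.js pipeline, assuming a release that accepts `"webgpu"` and `"wasm"` for that option.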

Real-Time PII Detection Running Entirely in the Browser

The demo GIF shows a form where, as the user types, OpenAI Privacy Filter scans the input in real time and highlights personal information such as names, emails, and addresses. All of this happens before a single byte is sent to a server. No API calls, no network round-trips, no cloud inference. The model has 1.5B total parameters but only 50M active parameters thanks to its sparse mixture-of-experts design, making it small enough to run in a browser tab via WebGPU. It is downloaded once on first load and then runs locally on every keystroke. If sensitive data is detected, the user is warned and can redact it themselves, meaning the server and any downstream logs, LLM providers, or third parties never see the raw PII.
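The warn-and-redact flow described above can be sketched independently of the model. The code below assumes a detector that returns character-offset spans with a label and a confidence score; that span shape, the `Detector` type, and both function names are hypothetical, and the wrapper that turns model output into such spans is not shown.

```typescript
// Hypothetical span shape: character offsets plus a label and a confidence.
type PiiSpan = { start: number; end: number; label: string; score: number };

// Any function that returns spans for a string. In the setup described
// above this would wrap the in-browser model (wrapper not shown).
type Detector = (text: string) => Promise<PiiSpan[]>;

// Replace each confident span with a placeholder such as "[EMAIL]".
function redact(text: string, spans: PiiSpan[], minScore = 0.5): string {
  const kept = spans
    .filter((s) => s.score >= minScore && s.start < s.end)
    .sort((a, b) => a.start - b.start);

  let out = "";
  let cursor = 0;
  for (const span of kept) {
    if (span.start < cursor) continue; // skip spans overlapping a redacted one
    out += text.slice(cursor, span.start) + `[${span.label.toUpperCase()}]`;
    cursor = span.end;
  }
  return out + text.slice(cursor);
}

// Gate a submission: pass the text through only when no confident PII is found.
async function checkBeforeSubmit(text: string, detect: Detector, minScore = 0.5) {
  const spans = (await detect(text)).filter((s) => s.score >= minScore);
  return spans.length === 0
    ? { ok: true as const, text }
    : { ok: false as const, spans, suggestion: redact(text, spans, minScore) };
}
```

Here `checkBeforeSubmit` only proposes a redacted version. Leaving the final decision to the user matches the flow in the demo, where the user is warned and edits the text themselves.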
The model is open source under Apache 2.0. This enables teams to fine-tune it on domain-specific data such as medical terminology, legal jargon, or immigration case language, without licensing restrictions.
Conclusion
Local-first AI represents an architectural shift, not a technological replacement. Its primary value lies in:
- Improved control over data flows
- Reduced dependency on external APIs
- More predictable cost structures for selected workloads
With the maturity of WebGPU, browser-based runtimes, and small language models, local inference is now a practical option in modern AI system design. The key design question is no longer whether local AI is feasible, but where it provides measurable value within a broader hybrid architecture.
By: Rasmus Haugland, AI Engineer at Computas.
Local-First AI: Protecting Users from Themselves was originally published in Compendium on Medium.


