Safety Company News

On the feasibility of real-time prompt classification at inference

A security classifier that adds 200 milliseconds to every API call is not a security classifier that anyone will deploy. This is the central constraint that has limited the adoption of adversarial prompt detection in production language model systems: the defense must be fast enough to be invisible. Today we publish our research on achieving sub-2ms classification latency for adversarial prompt detection — work that demonstrates real-time defense is not only feasible but practical at scale, and that forms the technical foundation of Prion.

The latency problem

Language model inference operates under strict latency budgets. For conversational applications, users perceive delays above 100ms as sluggish. For programmatic API consumers — agents, retrieval pipelines, multi-step reasoning chains — every millisecond of added latency compounds across sequential calls. Any security layer inserted into the inference path must therefore operate within a budget that, for practical purposes, approaches zero.

Prior approaches to adversarial prompt detection have typically relied on one of two strategies: large classifier models that achieve high accuracy but require tens or hundreds of milliseconds per inference, or rule-based filters that are fast but brittle and trivially circumvented. Neither is sufficient. The former is too slow for production deployment. The latter is too weak to provide meaningful protection. The question we set out to answer was whether a third path exists — a classifier that is both accurate enough to be useful and fast enough to be deployed.

Architecture

Our approach uses a distilled transformer classifier with approximately 8 million parameters — roughly three orders of magnitude smaller than the language models it protects. The model operates on a fixed-length token representation derived from the first 512 tokens of the input, using a learned projection that maps from the language model's tokenizer vocabulary to a compact embedding space optimized for classification rather than generation.

The architecture employs four transformer layers with a reduced hidden dimension of 256 and four attention heads. We use rotary positional embeddings and a classification head that produces logits across our seven-category attack taxonomy plus a benign class. The model is trained using knowledge distillation from a much larger classifier (1.2 billion parameters) that we developed during our adversarial robustness research, combined with hard-label supervision from our benchmark dataset.
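The quoted sizes can be sanity-checked with a rough parameter count. The sketch below assumes a standard transformer layout (Q/K/V/output attention projections and a 4x-expanded feed-forward block); the feed-forward multiplier and the compact embedding vocabulary size are illustrative assumptions, not published values.

```python
def transformer_core_params(layers=4, d_model=256, ffn_mult=4):
    """Rough parameter count for the transformer trunk described above.

    Assumes standard Q/K/V/output attention projections and an
    ffn_mult-times expanded feed-forward block; biases and layer
    norms are ignored, as they contribute well under 1% of the total.
    """
    attn = 4 * d_model * d_model            # Q, K, V, and output projections
    ffn = 2 * ffn_mult * d_model * d_model  # up- and down-projection
    return layers * (attn + ffn)

core = transformer_core_params()
print(f"trunk parameters: {core:,}")  # 3,145,728

# The rest of an ~8M-parameter budget is dominated by the learned
# vocabulary projection: e.g. a hypothetical 18k-entry compact
# embedding table at d_model=256 adds ~4.6M parameters.
embed = 18_000 * 256
print(f"trunk + embedding: {core + embed:,}")  # 7,753,728
```

The calculation shows why the compact embedding space matters: at four layers and a 256-wide hidden state, the trunk itself is only about 3M parameters, so projecting from the full tokenizer vocabulary rather than a reduced one would blow the parameter budget several times over.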

Inference is executed on dedicated classification hardware co-located with the language model serving infrastructure. We use INT8 quantization and batch the classifier across concurrent requests, achieving a median latency of 1.4ms and a p99 latency of 1.9ms on production traffic patterns.
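The median and p99 figures above are ordinary order statistics over per-request timings. As a hedged illustration of how such numbers are derived (the sample data below is synthetic, not production traffic):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least
    p percent of samples at or below it."""
    s = sorted(samples)
    k = -(-len(s) * p // 100) - 1  # ceil(n * p / 100) - 1
    return s[max(0, int(k))]

random.seed(0)
# Synthetic per-request classifier latencies in milliseconds
# (illustrative only; not the production measurements).
latencies = [random.uniform(1.0, 2.0) for _ in range(10_000)]

print(f"p50: {percentile(latencies, 50):.2f} ms")
print(f"p99: {percentile(latencies, 99):.2f} ms")
```

Note that a tight p50-to-p99 gap like the reported 1.4ms/1.9ms is itself informative: it indicates the tail is controlled, which is exactly what the serving-layer engineering described later in the post is meant to achieve.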

The accuracy-speed tradeoff

Distillation inevitably sacrifices some accuracy. The question is how much and where. We evaluated our distilled classifier against the full benchmark described in our adversarial robustness paper and found that aggregate F1 decreased from 0.94 (teacher model) to 0.91 (distilled model). However, the degradation is not uniform across categories.

Performance on prompt injection and encoding attacks remained essentially unchanged — these categories have relatively stable feature signatures that compress well into a smaller model. The largest accuracy drops occurred in multi-turn manipulation (F1 from 0.73 to 0.64) and role-play escalation (F1 from 0.69 to 0.61). These categories are inherently harder for single-pass classifiers because the adversarial signal is distributed across multiple inputs rather than concentrated in a single prompt.
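For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, so the reported drops can come from either side. The precision/recall pairs below are invented examples consistent with the published multi-turn manipulation scores (many such pairs exist); they are not the actual measurements.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical pairs matching the reported multi-turn F1 scores:
print(round(f1(0.70, 0.76), 2))  # 0.73 (teacher-scale model)
print(round(f1(0.68, 0.61), 2))  # 0.64 (distilled model)
```

Because F1 punishes whichever of precision or recall is lower, a distilled model that keeps precision but loses recall on distributed, multi-turn signals shows exactly this kind of category-specific degradation.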

To address this, we implemented a two-tier classification strategy. The fast distilled model handles the initial pass on every request. For conversations that extend beyond a configurable turn threshold, a lightweight state aggregation module accumulates classification signals across turns and triggers a secondary evaluation using a larger model when cumulative risk scores exceed a defined threshold. This secondary evaluation operates asynchronously — it does not block the current request but may flag the conversation for review or apply stricter constraints to subsequent turns.
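The post does not specify how signals are aggregated across turns, so the following is a minimal sketch of one plausible scheme: an exponentially decayed running score with a turn threshold. The decay factor, turn threshold, and risk threshold are assumed parameters for illustration, not published values.

```python
from dataclasses import dataclass

@dataclass
class ConversationRiskState:
    """Accumulates per-turn classifier scores for one conversation.

    A minimal sketch of the state-aggregation tier described above;
    decay, turn_threshold, and risk_threshold are illustrative
    assumptions, not published values.
    """
    decay: float = 0.8           # older turns contribute less
    turn_threshold: int = 3      # second tier engages past this many turns
    risk_threshold: float = 1.5  # cumulative score that triggers escalation
    turns: int = 0
    score: float = 0.0

    def observe(self, turn_score: float) -> bool:
        """Fold in one turn's adversarial score; return True when the
        conversation should be escalated for asynchronous secondary
        evaluation (the current request is never blocked)."""
        self.turns += 1
        self.score = self.score * self.decay + turn_score
        return (self.turns >= self.turn_threshold
                and self.score >= self.risk_threshold)

state = ConversationRiskState()
for s in [0.2, 0.4, 0.7, 0.9]:  # a conversation with rising per-turn risk
    escalate = state.observe(s)
print(escalate, round(state.score, 3))  # True 1.818
```

The design choice worth noting is that `observe` only ever returns a flag; it never raises or blocks, which mirrors the asynchronous, non-blocking behavior the post describes for the secondary evaluation.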

Deployment considerations

Achieving low latency in isolation is straightforward. Maintaining it under production conditions — variable batch sizes, thermal throttling, garbage collection pauses, network jitter between the classifier and the serving infrastructure — is a different problem entirely. We invested significant engineering effort in the serving layer: pre-allocated memory pools, pinned CUDA streams, and a custom request scheduler that maintains consistent batch sizes by buffering requests over 500-microsecond windows.
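A scheduler that buffers requests over a fixed window can be sketched as follows. This is a simplified, single-threaded model with an injectable clock (timestamps passed in microseconds); the production scheduler is concurrent, and `max_batch` is an assumed parameter.

```python
class WindowBatcher:
    """Buffers requests into batches, flushing either when the batch
    is full or when the oldest buffered request has waited longer
    than the window.

    A simplified sketch of the scheduler described above; the real
    serving layer is concurrent, and max_batch is an assumption.
    """
    def __init__(self, window_us=500, max_batch=32):
        self.window_us = window_us
        self.max_batch = max_batch
        self.buffer = []
        self.oldest_us = None  # arrival time of the oldest buffered request

    def submit(self, request, now_us):
        """Add a request at time now_us; return a full batch, or None."""
        if self.oldest_us is None:
            self.oldest_us = now_us
        self.buffer.append(request)
        if len(self.buffer) >= self.max_batch:
            return self._flush()
        return None

    def poll(self, now_us):
        """Flush on window expiry; called by the serving loop's ticker."""
        if self.oldest_us is not None and now_us - self.oldest_us >= self.window_us:
            return self._flush()
        return None

    def _flush(self):
        batch, self.buffer, self.oldest_us = self.buffer, [], None
        return batch

batcher = WindowBatcher(window_us=500, max_batch=3)
batcher.submit("a", now_us=0)
batcher.submit("b", now_us=100)
print(batcher.poll(now_us=400))  # None: window not yet expired
print(batcher.poll(now_us=600))  # ['a', 'b']: 600us exceeds the 500us window
```

The tradeoff this structure encodes is the one the post alludes to: a 500-microsecond buffering window bounds the worst-case queueing delay while keeping batch sizes, and therefore hardware utilization, consistent.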

The result is a system that has operated in production for six weeks with a median added latency of 1.4ms and no observed impact on language model throughput. During this period, the classifier processed approximately 340 million requests and flagged 0.12% as adversarial — a rate consistent with our expectations given the traffic profile.
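For scale, the reported rates imply a substantial absolute number of flagged requests, which a quick back-of-the-envelope check makes concrete:

```python
total_requests = 340_000_000  # requests processed over six weeks
flag_rate = 0.0012            # 0.12% flagged as adversarial

flagged = round(total_requests * flag_rate)
print(f"{flagged:,} flagged requests")  # 408,000 flagged requests
```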

What this enables

The practical consequence of sub-2ms classification is that adversarial defense becomes a default rather than an option. When the cost of defense is effectively zero — imperceptible to users and negligible in infrastructure overhead — the calculus changes. Security stops being a feature that customers must opt into and becomes a property of the inference environment itself.

This is the principle on which Prion is built. The full technical paper, including model architecture specifications, training procedures, and latency benchmarks, is available on our research page.