Claeth

The systems that handle human-generated content — conversational assistants, social platforms, search engines, creative tools, enterprise workflows — share a single architectural property that has become dangerous as the inputs they accept have diversified. Instructions and data arrive through the same channel. A legitimate request from a user and an adversarial payload from an attacker enter in the same format, through the same interface, and are interpreted by the same model. This indistinction between content and command is the technical basis for an entire category of attacks that did not meaningfully exist when these architectures were first designed.

The problem compounds with every modality that is added to the surface. When a model processed only text, the known defenses were a combination of blocklists and classifiers trained on previously observed adversarial prompts. When models began processing images, the surface doubled: an instruction could hide in the OCR of an uploaded picture, in a field of embedded metadata, or inside a region of the image that only a vision model would attend to. With audio came a further expansion: transcribed speech carrying embedded instructions, acoustic tokens imperceptible to a human but recognizable to a speech system. With video came the combinatorial case: individual frames delivering adversarial text, audio tracks delivering spoken instructions, temporal sequences distributing a payload across time. Each new modality multiplies the vectors. Defenses designed for a single format do not transfer their understanding to another.

The state of practice today illustrates the gap. There are systems dedicated to moderating visual content for explicit or illegal material. There are separate systems dedicated to detecting adversarial instructions in text. There are further systems dedicated to analyzing audio for harmful transcriptions. What does not yet exist — and what we see as the more consequential absence — is a single engine that treats all of these as instances of the same underlying task: classifying the intent of an arbitrary input before a downstream system acts on it. As long as defenses remain partitioned by format, attacks that cross formats will continue to travel unimpeded.

What Claeth does

Claeth is a classification engine that operates on inputs in any modality — text, image, audio, video — against a unified policy surface. Every input passes through a multimodal classifier, the operator's policy defines what constitutes an acceptable outcome, and the engine returns one of five decisions: permit, redact (remove specific elements while preserving the rest of the content), transform (sanitize the adversarial portion without discarding the legitimate remainder), hold for human review, or block outright. A taxonomy of five decisions, rather than two, reflects how modern products actually have to operate. Rejecting a legitimate user by mistake is often as costly as allowing an attack through, and a usable system has to admit gradients.

Cross-modal detection

The capability we consider most important in Claeth is its ability to detect attacks that cross between modalities. A naive adversarial instruction is delivered as plain text. A sophisticated one is delivered as text embedded in the OCR of an image dropped into a chat thread, as an audio clip whose transcription contains the instruction, or as a specific frame inside a video that an assistant retrieves as context. Classifiers trained on a single modality do not see the bridge between them. Claeth is trained on a distribution where text, image, and audio share a representation — allowing the engine to recognize when an adversarial payload has moved from one modality into another specifically in order to evade detection. This is a failure mode that specialized defenses systematically miss, and it is the one we believe will define the next several years of adversarial practice.

Two latency regimes on one policy

The operating conditions of a language model serving inference in real time are not compatible with the conditions of analyzing a full-length video. The first demands a decision in under two milliseconds. The second can tolerate several hundred. Trying to collapse both into a single endpoint produces compromises that damage each. Claeth exposes two explicit regimes over the same policy language: a low-latency tier for synchronous inference paths, and an asynchronous tier for heavier inspection of media that does not need to be decided in the critical path. The choice of regime is the operator's; the policy grammar is identical across both.

Explainability as a requirement, not a feature

Every decision Claeth emits is accompanied by its justification: the modality that triggered the classification, the clause of the policy that activated, the specific evidence located inside the input. This serves two purposes. The first is engineering. A team debugging a false positive needs to know exactly what the engine saw and which rule fired. The second is regulatory. Several jurisdictions now require that automated decisions with significant effect on an individual be explainable to that individual, and designing a system to satisfy this requirement from the outset is considerably less expensive than retrofitting it later.

Synthetic content as a first-class signal

A related problem that does not fit neatly into the traditional moderation taxonomy is the question of whether a piece of content was generated by a machine or by a person. On platforms built around contributions from their users, this question has become consequential — not as a judgment of value, but as context that downstream decisions depend on. Claeth exposes this signal as a first-class output of the engine. A platform that wants to route human-authored and machine-generated content along different paths, apply different policies, surface different disclosures, can do so on the basis of a signal the engine already produces. The engine reports the estimate and the confidence. The platform decides what to do with it.

How we think about the problem

There is a broader view behind Claeth that is worth making explicit. We believe input security — historically treated as a collection of disconnected problems organized by channel — is consolidating into a single primitive. The way network firewalls eventually stopped distinguishing sharply between protocols and moved toward inspecting intent across the stack, the systems that guard the input to modern AI and to modern platforms are converging on an engine that classifies intent across modalities. It is not inevitable that Claeth is that engine. It is inevitable that one will exist. We are building Claeth on the hypothesis that approaching the problem from a unified basis from the start produces more accurate and more maintainable results than assembling specialized systems after the fact.

What Claeth is not

Claeth is not a replacement for the internal alignment of a model. Models should continue to be trained to refuse harmful outputs. Systems should continue to implement authorization at the application layer. Content teams should continue to set policy. Claeth is an inference-time layer that operates alongside these things — it does not obviate them, and we would be skeptical of any defense that claimed to. What Claeth contributes is a layer that is independent of the model being protected, specialized for adversarial classification across formats, and fast enough to sit in the critical path.

Current status

Claeth is in active development. Our research team is focused on expanding the taxonomy of cross-modal attacks, reducing latency in the synchronous tier, and validating the engine against internal adversarial sets designed specifically to probe the bridges between modalities. We publish relevant research as it matures, and we expand access as the engine meets our own standards for reliability in production environments.

Teams running systems that are exposed to adversarial input in production can request access through the console.