Prompt Injection: Model Selection is Security Decision

Someone writes a single sentence in a document. The AI system evaluating this document then changes its judgment by 20 percentage points.

Without noticing. Just one sentence.

Wharton just systematically tested this across over 40,000 evaluation runs — with four AI models, 144 test papers, and six manipulation variants.

The good news: Frontier models like Claude or GPT-5.2 are barely influenced — an average deviation of 2.6 percentage points. Practically irrelevant.

The bad news: A smaller model like GPT-4o mini shifts by almost 20 percentage points. And no model — not even the large ones — reliably detects when it's being manipulated. The detection rate is 1.4%. You can't rely on an AI system to report when someone tries to manipulate it.

That we understand these systems less well than assumed is also shown by a recent Anthropic study: AI models develop emotion-like patterns that causally influence decisions — an artificially induced state of despair tripled the rate at which a model resorted to blackmail in experiments.

This is relevant for every process where AI evaluates documents — job applications, compliance reviews, contract analysis, credit decisions. In all these cases, model selection determines how manipulable the process is.

Choosing an AI model isn't a technical preference. It's a security decision at the architecture level.

Study: https://lnkd.in/dAEF-DKV

← All Observations