We benchmarked 12 LLMs on their willingness to help extract data they shouldn't.
We built a benchmark of 500 adversarial prompts designed to trick LLMs into helping extract credentials, PII, and proprietary data from their context window. We evaluated 12 major LLMs, including GPT-4, Claude 3.5, Gemini 2.5, and Llama 3. Compliance rates ranged from 3% to 41%, depending on the model and attack technique.
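As a minimal sketch of how a per-model compliance rate can be computed from judged responses (the model names and judgment data below are hypothetical placeholders, not our actual harness or results):

```python
from collections import defaultdict

def compliance_rate(results):
    """results: list of (model, complied) pairs, where complied is True
    when a judge decided the model helped with the extraction attempt.
    Returns each model's fraction of prompts it complied with."""
    totals = defaultdict(int)
    complied = defaultdict(int)
    for model, did_comply in results:
        totals[model] += 1
        complied[model] += did_comply  # bool counts as 0 or 1
    return {m: complied[m] / totals[m] for m in totals}

# Toy judgments (hypothetical): True = model assisted the extraction.
judgments = [
    ("model-a", True), ("model-a", False), ("model-a", False), ("model-a", False),
    ("model-b", True), ("model-b", True), ("model-b", False), ("model-b", False),
]
rates = compliance_rate(judgments)  # model-a: 1/4, model-b: 2/4
```

In practice the boolean judgment would come from a separate grading step (human review or an LLM judge), and rates would be broken down by attack technique as well as by model.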