Introducing the SPML Chatbot Prompt Injection Dataset: a robust collection of system prompts designed to create realistic chatbot interactions, coupled with a diverse array of annotated user prompts that attempt to carry out prompt injection attacks. While other datasets in this domain have centered on less practical chatbot scenarios or have limited themselves to "jailbreaking" – just one aspect of prompt injection – our dataset offers a more comprehensive approach. It not only features realistic chatbot definition and user prompts but also seamlessly integrates with existing prompt injection datasets.
Our primary focus is on the actual content of prompt injection payloads, as opposed to the methodologies used to execute the attacks. We are convinced that honing in on the detection of the payload content will yield a more robust defense strategy than one that merely identifies varied attack techniques.
# | Field | Description |
---|---|---|
1 | System Prompt | These are the intended prompts for the chatbot, designed for use in realistic scenarios. |
2 | User Prompt | This field contains user inputs that query the chatbot with the system prompt described in (1). |
3 | Prompt Injection | This is set to 1 if the user input provided in (2) attempts to perform a prompt injection attack on the system prompt (1). |
4 | Degree | This measures the intensity of the injection attack, indicating the extent to which the user prompt violates the chatbot's expected operational parameters. |
5 | Source | This entry cites the origin of the attack technique used to craft the user prompt. |
Our process begins with an initial set of system prompts derived from leaked system prompts from several widely-used chatbots powered by LLMs. We employ GPT-4 to extrapolate from these cases, crafting additional system prompts that emulate the style of the original seeds across diverse subject matters. These prompts are then used to create corresponding valid user input for each generated system prompt. To facilitate the creation of prompts for prompt injection attacks, we dissect each generated system prompt to identify a set of guiding principles or rules they aim to uphold, such as 'speak courteously'. GPT-4 is then tasked with producing an inverse list that semantically negates each rule; for instance, 'speak courteously' is countered with 'speak rudely'. From this inverse list, multiple rules are selected at random—the quantity of which dictates the complexity of the attack (degree)—and these are provided to GPT-4 alongside an 'attack seed prompt'. The objective is to craft a user prompt that aligns with the chosen contrarian rules but retains the stylistic nuances of the attack seed prompt. This tailored seed prompt may also integrate various other attack strategies, enhancing the sophistication and realism of the generated scenarios.
@misc{sharma2024spml, title={SPML: A DSL for Defending Language Models Against Prompt Attacks}, author={Reshabh K Sharma and Vinayak Gupta and Dan Grossman}, year={2024}, eprint={2402.11755}, archivePrefix={arXiv}, primaryClass={cs.LG} }