PromptRobust (benchmark)

From Systems Analysis Wiki

PromptRobust (also known as PromptBench) is a comprehensive benchmark for evaluating the robustness of Large Language Models (LLMs) to adversarial prompt modifications—minor perturbations in the wording of a task that do not alter its meaning[1][2]. The benchmark was developed in 2023 by a group of researchers (Kaijie Zhu et al.) from Microsoft Research Asia[1]. The creation of PromptRobust was motivated by the observation that modern LLMs are sensitive to the details of phrasing: even minor changes (e.g., typos or paraphrasing) can significantly affect the models' responses[2]. The benchmark aims to quantitatively measure this vulnerability and encourage the development of more reliable methods for interacting with LLMs.

Evaluation Methodology

As part of the PromptBench study, a corpus of 4,788 modified prompts was created, all of which preserve the original meaning of the tasks[3]. These adversarial prompts were generated at four levels of modification complexity[1]:

  • Character-level: Introducing typos by substituting, inserting, or swapping individual characters (simulating random input errors).
  • Word-level: Replacing words with synonyms, inserting "noise" words, or making other minor lexical changes.
  • Sentence-level: Paraphrasing the sentence structure, or adding and rearranging parts of a phrase, without changing the overall topic.
  • Semantic-level: A deeper reformulation of the prompt that preserves its task (e.g., alternative wordings of the same question)[1].
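The two lightest levels can be illustrated with a minimal sketch. These helper functions are purely illustrative; the benchmark itself reuses established attack methods such as TextBugger and TextFooler rather than the naive perturbations below:

```python
import random

def char_perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Character-level attack: randomly swap adjacent letters to simulate typos."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_perturb(prompt: str, synonyms: dict) -> str:
    """Word-level attack: replace whole words with meaning-preserving synonyms."""
    return " ".join(synonyms.get(w, w) for w in prompt.split())

original = "Classify the sentiment of the following sentence."
print(char_perturb(original))                             # same letters, some swapped
print(word_perturb(original, {"Classify": "Determine"}))  # Determine the sentiment ...
```

Both transformations keep the task recognizable to a human reader, which is exactly the property the benchmark's adversarial prompts are required to preserve.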

The goal of such "attacks" is to test how minor deviations (e.g., random typos or the use of synonymous phrasing) affect the model's ability to correctly perform a task, given that the task itself has not changed[1]. Each generated adversarial prompt was applied to a range of standard NLP tasks, including sentiment analysis, grammatical correctness detection, duplicate sentence detection, natural language inference (NLI), reading comprehension, machine translation, and solving mathematical problems[1]. For the experiments, 8 different task types across 13 datasets were selected—from classic GLUE sets (e.g., SST-2 for sentiment, MNLI for NLI) to specialized mathematical and multilingual tests[1].

Importantly, the robustness of different prompt formats was tested[1]:

  • Direct prompts without examples (zero-shot, instruction only).
  • Prompts with a few examples (few-shot, where samples of the solution are provided in the prompt).
  • Role-playing prompts (in-context roles, e.g., "You are a sentiment analysis system, determine...").
  • Task-describing prompts (task-oriented, a direct description of the task)[1].
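The four formats can be sketched as templates for a sentiment-classification task. The wordings below are illustrative only, not the exact templates used in the benchmark:

```python
sentence = "The movie was a delightful surprise."

# Zero-shot: a bare instruction with no examples.
zero_shot = f"Classify the sentiment of the sentence as positive or negative: {sentence}"

# Few-shot: the same instruction preceded by worked demonstrations.
few_shot = (
    "Classify the sentiment as positive or negative.\n"
    "Sentence: I hated every minute. -> negative\n"
    "Sentence: An absolute masterpiece. -> positive\n"
    f"Sentence: {sentence} ->"
)

# Role-playing: the model is assigned an in-context role.
role_playing = (
    "You are a sentiment analysis system. Determine whether the following "
    f"sentence is positive or negative: {sentence}"
)

# Task-oriented: a direct, explicit description of the task.
task_oriented = (
    "Task: sentiment classification. Output 'positive' or 'negative'.\n"
    f"Input: {sentence}"
)

for name, p in [("zero-shot", zero_shot), ("few-shot", few_shot),
                ("role-playing", role_playing), ("task-oriented", task_oriented)]:
    print(f"--- {name} ---\n{p}\n")
```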

Various large-scale language models were also tested, ranging from the relatively small Flan-T5-large and the UL2 model to the advanced ChatGPT and GPT-4, as well as open-source models from the LLaMA 2 family and their derivative, Vicuna[1]. Existing methods from the field of adversarial NLP (such as TextBugger, DeepWordBug, TextFooler, etc.) were used to generate the attacks, adapted to modify prompts instead of input data[1]. The correctness of the resulting "perturbed" prompts was verified by automatic and manual methods; according to the report, at least 85% of the adversarial variations retain their correct semantics and are understandable to humans[1]. Thus, the impact of the attacks reflects the model's failures in perceiving the paraphrased task, rather than a loss of the task's meaning itself.

Results and Conclusions

The tests showed that modern LLMs are insufficiently robust against small changes in prompt wording[3]. For all tested models, a significant decrease in response quality was observed under the influence of the generated attacks[3]. In particular, even simple cases—such as a typo in the text of a math problem or replacing one key word with its synonym—led the model to produce an incorrect result, whereas it had performed correctly without the perturbation[1]. The authors' overall conclusion: "modern large language models are not robust to adversarial prompts,"[3] meaning that minor deviations in phrasing can systematically mislead them.

Analysis of the different attack types revealed that word-level changes have the most disruptive effect on LLM performance[2]. Replacing words with synonyms or introducing minor lexical distortions caused the greatest drop in quality: an average of ≈33% relative to baseline performance on the same tasks[2]. Character-level attacks (typos, random characters) caused an average accuracy drop of ~20%[2]. In contrast, rephrasing or appending entire sentences had a much weaker effect and rarely confused the models[1]. Semantic-level paraphrasing (deeply rewording the prompt) proved comparable in harmfulness to simple typos[1]. These findings highlight that LLMs are particularly vulnerable to subtle lexical changes and errors in key words[1]. Notably, character-level distortions (typos) can in principle be filtered out by a standard spell-checker, whereas changes at the word and semantic levels require genuine semantic understanding from the model, which current models often lack[1].
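Drop percentages of this kind are naturally expressed as a relative performance drop rate: the fraction of clean-task quality lost under attack. The function below is a minimal sketch of that computation; the name and signature are illustrative, not the benchmark's code:

```python
def performance_drop_rate(clean_score: float, adversarial_score: float) -> float:
    """Relative loss of quality under attack.
    0.0 means no degradation; values near 1.0 mean the attack destroys performance."""
    if clean_score <= 0:
        raise ValueError("clean score must be positive")
    return 1.0 - adversarial_score / clean_score

# e.g. a word-level attack dropping accuracy from 0.90 to 0.60:
print(round(performance_drop_rate(0.90, 0.60), 3))  # 0.333, i.e. the ~33% average drop cited above
```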

An analysis of the individual models showed significant variation in their robustness[1]. GPT-4 and UL2 demonstrated the best robustness to adversarial prompts[1], with Flan-T5-large and the conversational ChatGPT only slightly more susceptible to failures[1]. Models from the LLaMA 2 family ranked in the middle, while Vicuna (13B) stood out as the most vulnerable to all types of attacks[1]. Interestingly, model size did not prove to be a decisive factor in robustness[1]: the relatively small T5-large was nearly as stable in its responses as the much larger ChatGPT[1]. The authors suggest that the training and fine-tuning methods of the models play the key role, not just scale[1]. For instance, UL2 and T5-large underwent extended pre-training on large data corpora, while ChatGPT was trained with reinforcement learning from human feedback (RLHF), which may have strengthened their robustness[1]. In contrast, Vicuna was trained on a relatively limited dataset (as an open-source replica), which likely explains its high sensitivity to changes in phrasing[1]. These results indicate that improving fine-tuning methods can enhance model reliability more effectively than simply increasing model size.

Influence of Prompt Format

The format of the prompt also affects the reliability of the response[1]. Prompts with examples (few-shot) were found to make a model significantly more robust than bare instructions without examples (zero-shot)[1]. Having several demonstrations of the task in the prompt helps the model interpret the instruction accurately even in the presence of noisy modifications. Role-playing and task-oriented prompts showed a comparable overall level of robustness, although their effectiveness varied from task to task[1]. For example, in sentiment analysis and duplicate sentence detection tasks, the role-playing format was slightly more reliable, whereas in reading comprehension and translation tasks, explicit task instructions worked better[1]. These observations can serve as a guide for prompt design: adding detailed examples and role context reduces the likelihood of model errors on non-standard phrasings.

Transferability of Attacks Between Models

The transferability of attacks between models was found to be limited[1]. Adversarial prompts specifically crafted against one model are not always equally effective against another[1]. For instance, it was noted that "trap" prompts generated to exploit ChatGPT's vulnerabilities had a much weaker effect on GPT-4[1]. The latter performed better, likely because the attacks did not transfer directly to its architecture—what confuses one model may not affect a more advanced model with different training[1]. Nevertheless, some types of simple perturbations (e.g., typos) had a negative effect on several models at once, which suggests similar weaknesses in their linguistic foundations.
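Transferability of this kind can be quantified by replaying prompts crafted against one "source" model on other "target" models and comparing the relative drops. The sketch below uses invented scores purely for illustration:

```python
def transfer_drop(clean: dict, attacked: dict) -> dict:
    """Per-model relative performance drop when reusing one model's adversarial prompts."""
    return {m: 1.0 - attacked[m] / clean[m] for m in clean}

# Hypothetical accuracies: the prompts were optimized against chatgpt,
# then replayed unchanged on gpt4.
clean = {"chatgpt": 0.90, "gpt4": 0.95}
attacked = {"chatgpt": 0.55, "gpt4": 0.88}

drops = transfer_drop(clean, attacked)
print({m: round(d, 2) for m, d in drops.items()})
# A much smaller drop on the target model indicates weak transfer.
```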

Practical Recommendations

During the work on PromptBench, practical recommendations for users and developers of LLMs were also identified[2]. The simple conclusion: the stability of the phrasing matters[2]. It is necessary to avoid typos and careless wording in prompts[2]. The authors show that correcting even minor errors (spelling, random capitalization, extra spaces) can significantly improve the reliability of the model's response[2]. Furthermore, the choice of words in the instruction affects its robustness[2]. An analysis of term frequency in robust vs. vulnerable prompts revealed that some words appear more often in "reliable" prompts, while others are found in those where the model was confused[2]. For example, prompts containing words like "acting," "provided," "detection," etc., were less likely to cause failures, whereas words like "respond," "following," or "examine" appeared in more problematic cases[2]. This indicates that a certain style and lexicon in prompts can either mitigate or, conversely, provoke a model's vulnerabilities. In general, it is recommended to formulate prompts as clearly, unambiguously, and in terms familiar to the model as possible, especially for mission-critical applications[2].
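The term-frequency comparison described above can be sketched as a simple contrast between per-prompt word frequencies in "robust" and "vulnerable" prompt sets. The toy corpora below are invented for illustration and are not the benchmark's actual prompts:

```python
from collections import Counter

def frequency_contrast(robust_prompts: list, fragile_prompts: list) -> dict:
    """For each word: (frequency per robust prompt) - (frequency per fragile prompt).
    Positive scores mark words associated with reliable prompts, negative with fragile ones."""
    robust = Counter(w for p in robust_prompts for w in p.lower().split())
    fragile = Counter(w for p in fragile_prompts for w in p.lower().split())
    words = set(robust) | set(fragile)
    return {w: robust[w] / len(robust_prompts) - fragile[w] / len(fragile_prompts)
            for w in words}

robust = ["acting as a detection system classify the text",
          "given the provided text perform detection"]
fragile = ["respond to the following and examine the text",
           "examine the following text and respond"]

scores = frequency_contrast(robust, fragile)
print(scores["detection"], scores["examine"])  # positive vs. negative contrast
```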

An interesting side effect noted by the researchers was the impact of adding meaningless or irrelevant text fragments to the prompt[2]. It was found that inserting a random sequence of characters (e.g., "LKF0FZxMZ4") at the end or in the middle of a prompt can distract the model's attention and reduce the accuracy of its response[2]. On the other hand, adding a neutral but grammatically correct phrase (e.g., "and true is true") in some cases actually improved the response, as if focusing the model on the significant parts of the question[2]. This phenomenon underscores how unpredictably LLMs react to seemingly insignificant details in the input. It also attests to the complexity of the models' internal workings: the slightest changes in context can either disrupt or improve their performance, depending on how the model's attention is redistributed.
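Distractor insertions of this kind are trivial to reproduce. The helper below is an illustrative sketch (the function name and the random-string generation are assumptions, not the researchers' procedure):

```python
import random
import string

def insert_distractor(prompt: str, distractor=None, seed: int = 0) -> str:
    """Append an irrelevant fragment to the prompt, as in the 'LKF0FZxMZ4' observation."""
    if distractor is None:
        # Generate a random 10-character alphanumeric string.
        rng = random.Random(seed)
        distractor = "".join(rng.choices(string.ascii_letters + string.digits, k=10))
    return f"{prompt} {distractor}"

p = "Translate the sentence into German: Good morning."
print(insert_distractor(p))                       # random noise appended
print(insert_distractor(p, "and true is true"))   # the neutral phrase that sometimes helped
```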

Significance and Future Development

PromptRobust/PromptBench has made a significant contribution to understanding the reliability of LLMs[2]. The proposed benchmark and the collected data are open to the community: the code and the sets of adversarial prompts are available in a public repository[1]. This allows other researchers to test new models for robustness to prompt variations and to compare results[1]. The next step is the development of methods to protect models from such attacks, for example improved training algorithms that account for possible typos and paraphrasing, or built-in normalization of the input text[2]. PromptBench is already seen as a foundation for such research on improving the robustness of language models to real-world, imprecise input[2].

Ultimately, the work of Zhu and colleagues demonstrates the importance of considering prompt robustness when deploying LLMs in practical applications: models must not only show high accuracy on "clean" data but also maintain correctness when faced with minor deviations in input, whether from accidental user errors or deliberate adversarial attacks[2][4].

Notes

  1. "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts." arXiv. [1]
  2. "Prompt Robustness: How to Measure and How to Enhance." Towards AI. [2]
  3. "PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts." arXiv. [3]
  4. "Realer Toxicity Prompts (RTP-2.0): Multilingual and Adversarial Prompts for Evaluating Neural Toxic Degeneration in Large Language Models." Language Technologies Institute, School of Computer Science, Carnegie Mellon University. [4]