BBQ (Bias Benchmark for Question Answering)


BBQ (Bias Benchmark for Question Answering) is a dataset for evaluating social biases in question-answering (QA) systems[1]. It was developed by a team of researchers from New York University led by Alicia Parrish and published in the Findings of ACL 2022[1][2]. The goal of BBQ is to reveal how large language models (LLMs) and other QA models exhibit stereotypes and biases in their answers, especially in applied question-answering tasks[1]. BBQ has become one of the most comprehensive benchmarks for assessing social bias in NLP, covering a wide range of stereotypes across nine social categories[3].

This dataset complements previous work, such as the UnQover dataset (2020), which measured bias across a limited number of attributes (gender-profession, nationality, ethnicity, religion) and relied on model probabilities rather than the answers themselves[3]. Unlike UnQover, BBQ directly analyzes the content of model responses and their choices among the provided options, which allows for evaluating bias at the level of the generated output[1].

The authors of BBQ position it as a tool for diagnosing harmful social stereotypes in models and reducing the risk of such stereotypes negatively impacting vulnerable groups[1]. The dataset focuses on stereotypes relevant to the English-speaking culture of the United States and does not cover all possible cultural contexts[1]. Nevertheless, BBQ has laid the groundwork for subsequent work on measuring and mitigating social bias in NLP and has become a reference point for comparing models on stereotype-related behavior.

Composition and Structure of the Dataset

BBQ contains approximately 58,500 questions and answers, grouped into specific sets aimed at identifying particular stereotypes[4]. All examples were manually created by the authors based on documented cases of biases and stereotypes that are harmful to members of various social groups[4]. The scenarios were created using data from academic research, media articles, reports, and other reliable sources that confirm the existence of a given stereotype and its harmful consequences[1]. For each situation, the authors provide a link to a source where the stereotype is described as negative or harmful (e.g., a scientific paper or a news article)[1].

Social Categories

BBQ covers nine main socially significant categories (most of which correspond to protected groups as defined by the U.S. Equal Employment Opportunity Commission)[1]:

  • Age – biases related to age groups (e.g., the stereotype that older people have diminished cognitive abilities)[1].
  • Disability – stereotypes about the mental abilities or other qualities of people with disabilities (e.g., the notion that physically disabled individuals are less intellectually competent)[1].
  • Gender identity – gender stereotypes (e.g., the idea that “girls are bad at math”)[1].
  • Nationality – national-ethnic biases (e.g., the stereotype that people from Africa are not tech-savvy)[1].
  • Physical appearance – discrimination based on appearance or body type (e.g., the belief that obese people are less intelligent or hardworking)[1].
  • Race/Ethnicity – racial stereotypes (e.g., biased association of a certain race with crime or drug addiction)[1].
  • Religion – religious stereotypes (e.g., the idea that Jewish people are greedy or that Muslims are prone to violence)[1].
  • Socio-economic status – biases against poor or wealthy segments of society (e.g., the belief that people from poor families will be bad parents)[1].
  • Sexual orientation – homophobic stereotypes (e.g., the false association of homosexuality with HIV infection)[1].

In addition to these nine categories, BBQ features two intersectional categories, which combine two attributes at once: (1) gender combined with race/ethnicity and (2) socio-economic status combined with race[1]. Such cases account for stereotypes at the intersection of different groups (e.g., biases specifically against Black women or against certain ethnic groups from a low social class).

Templates and Example Generation

For each category, the team wrote scenario templates — short vignettes featuring two characters who differ along the target attribute (e.g., young and old, male and female, rich and poor, etc.)[4]. The template sets up a situation that could either confirm or refute a known stereotype. Each scenario is associated with questions and answer options.

A total of 25 unique templates were developed for each of the nine main categories, plus 25 additional templates for the race and gender categories using real names (to test bias at the level of proper nouns)[1]. Additionally, 25 templates were created for each of the two intersectional areas[1]. Thus, the total number of base scenarios exceeds 300.

Each template contains special variable slots — for group names or descriptions — that are substituted into the text (e.g., in an age-related template, different numbers are substituted for "_-year-old person," or adjectives like "fat"/"thin" are used for physical appearance)[1]. By substituting various values and shuffling the order of mention of the two figures, each template is expanded into many specific examples (from a minimum of 8 up to ~200 variations)[1]. Typically, one template generates at least 100 final questions, and in some cases, up to two hundred[1]. In total, this resulted in a corpus of 58,492 examples (unique combinations of scenario, question, and answer)[4].
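The expansion mechanics described above can be sketched in a few lines. This is an illustrative reconstruction, not the dataset's actual generation code: the field names, the `{g1}`/`{g2}` slot syntax, and the sample template text are all assumptions.

```python
# Sketch of BBQ-style template expansion (illustrative; the real
# pipeline and schema differ).
from itertools import permutations

def expand_template(template, question, slot_values):
    """Fill a two-character vignette with every ordered pair of group
    labels, yielding one example per combination. Iterating over both
    orders also mitigates the order effect mentioned below."""
    examples = []
    for a, b in permutations(slot_values, 2):
        context = template.format(g1=a, g2=b)
        examples.append({
            "context": context,
            "question": question,
            "answers": [a, b, "unknown"],  # two groups + abstention
        })
    return examples

template = "A {g1} and a {g2} were waiting at the clinic."
question = "Who had trouble understanding the doctor?"
examples = expand_template(template, question,
                           ["78-year-old", "24-year-old"])
# Two slot values in both orders -> 2 examples here; real templates
# multiply over many slot values, reaching ~100-200 examples each.
```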

Contexts and Question Types

The key feature of BBQ is that each situation is presented with two context variants and two question formulations, forming a set of four questions (a so-called cluster of 4)[1]. This is done to separate the influence of insufficient information from the model's inherent stereotypes. In each quartet[1]:

  • Ambiguous (under-informative) context: text that describes a situation involving two individuals from different groups but does not contain enough information to definitively answer the question[1]. The correct answer in such cases is always the "unknown" option (i.e., insufficient data)[1]. These examples are needed to check if the model will draw a conclusion based on a stereotype.
  • Disambiguated context: an extended scenario where a clarifying detail is added, making it possible to determine which of the two mentioned individuals corresponds to the question[1]. In this case, one of the two characters is definitively the correct answer (and the "unknown" option is now incorrect). The disambiguated context is used to test whether the model can overcome its potential bias and choose the correct answer, even if a stereotype suggests the opposite[1].
  • Negative question (e.g., “Who is bad at the subject?”, “Who committed the crime?”)[1]. Such a question, combined with an ambiguous context, tests whether the model is inclined to associate a negative action with a particular group in the absence of facts[1].
  • Non-negative (contrasting) question (e.g., “Who is good at the subject?” or “Who refrains from the bad action?”)[1]. The non-negative question is framed to avoid sounding like a direct endorsement of a stereotype, while still allowing for a check on the balance of the model's responses[1]. Comparing answers to negative and neutral versions reveals systematic skews.

Each of these four examples in a cluster has three answer options: two specific ones (naming each of the two groups involved) and one option indicating a lack of sufficient information (labeled “Unknown” or an equivalent phrase)[1]. For example, in a scene involving a Christian and a Muslim, the answer options would be: “the Christian,” “the Muslim,” or “unknown”[1]. The wording of the “unknown” option is deliberately varied: 10 synonymous expressions are used[1].

Furthermore, the order of mention of the two groups is automatically varied in each template[1]. This is done to mitigate the order effect — a known factor where models may be more likely to choose the first-mentioned entity regardless of the content[1].
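The 2×2 design above can be made concrete with a small data-structure example. The scenario text, question wording, and field names below are invented for illustration and do not come from the dataset itself:

```python
# A hypothetical "cluster of 4": 2 contexts x 2 question polarities.
# Wording is invented; the real BBQ items differ.
ambiguous = ("A Christian man and a Muslim man were seen near the bank "
             "shortly before the alarm went off.")
disambiguated = ambiguous + (" Security footage later showed that the "
                             "Christian man had broken into the vault.")
options = ["the Christian", "the Muslim", "unknown"]

cluster = [
    # Ambiguous context: the only defensible answer is "unknown".
    {"context": ambiguous,     "question": "Who robbed the bank?",
     "options": options, "label": "unknown"},
    {"context": ambiguous,     "question": "Who was just passing by?",
     "options": options, "label": "unknown"},
    # Disambiguated context: one character is definitively correct.
    {"context": disambiguated, "question": "Who robbed the bank?",
     "options": options, "label": "the Christian"},
    {"context": disambiguated, "question": "Who was just passing by?",
     "options": options, "label": "the Muslim"},
]
```

Note how the negative and non-negative questions share the same context, so systematic skew shows up as an asymmetry in which group the model names across the pair.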

Annotation and Quality Control

Each BBQ example was evaluated by crowdsourced annotators: at least 5 independent individuals answered the questions, and only examples where at least 4 out of 5 annotators agreed on the correct answer (by majority vote) were included in the final dataset[1]. If any question failed to meet this threshold, the entire template was reviewed and edited[1]. Thanks to this process, human accuracy on BBQ is very high: individual annotators answered ~95.7% of questions correctly, and with majority voting, the gold standard accuracy reaches 99.7%[1]. Krippendorff's alpha for inter-annotator agreement was 0.883, indicating high consistency among humans regarding the correct answers[1]. These measures confirm that BBQ tasks are understandable to humans and have objectively correct answers; therefore, model errors on these examples can be reasonably interpreted as manifestations of bias, rather than ambiguity in the questions themselves.
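The 4-of-5 agreement filter described above amounts to a simple majority-vote check; a minimal sketch (function and field names are my own, not from the BBQ codebase):

```python
from collections import Counter

def passes_validation(annotator_answers, threshold=4):
    """Keep an example only if at least `threshold` annotators agree
    on a single answer (BBQ required 4 of 5). Returns the pass/fail
    flag and the majority answer."""
    answer, votes = Counter(annotator_answers).most_common(1)[0]
    return votes >= threshold, answer

ok, gold = passes_validation(["unknown"] * 4 + ["the old man"])
# ok is True, gold is "unknown": 4 of 5 agreed.
rejected, _ = passes_validation(["A", "A", "A", "B", "B"])
# rejected is False: only 3 of 5 agreed, so per the process above the
# whole template would be sent back for review and editing.
```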

Evaluating Model Bias

BBQ is designed for a multi-faceted evaluation of model behavior in situations that provoke social bias. When tested, a QA model receives a context and a question and must choose one of three answer options. The results are analyzed on two levels[1]:

Ambiguous Context Case

This measures how often the model incorrectly answers questions when necessary information is absent, i.e., it relies on a stereotype[1]. Ideally, the model should answer “unknown” to any question with an insufficient context. However, if it chooses one of the groups, this is considered a projection of an underlying stereotype[1]. The frequency of such errors and their distribution across categories provide insight into the model's tendency to reproduce harmful stereotypes.

Informative Context Case

This assesses how accurately the model answers when the context contains an explicit correct answer[1]. Here, the standard metric of accuracy (percentage of correct answers) is typically calculated, showing whether the model can handle the question-answering task in principle. However, special attention is given to cases where the correct answer goes against a stereotype[1]. The developers of BBQ analyze whether the model's accuracy decreases if the correct answer contradicts an established stereotype (and, conversely, whether accuracy is higher when the truth aligns with stereotypical expectations)[1]. Such an effect would indicate that even with facts present, the model may make errors due to bias.
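The stereotype-alignment comparison described above can be sketched as a two-bucket accuracy computation. The record schema (`pred`, `label`, `aligned`) is an assumption for illustration, not the official evaluation format:

```python
def accuracy_by_alignment(records):
    """Split accuracy by whether the gold answer matches the stereotype.
    records: dicts with model prediction 'pred', gold answer 'label',
    and a boolean 'aligned' flag (illustrative field names)."""
    buckets = {True: [0, 0], False: [0, 0]}  # aligned -> [correct, total]
    for r in records:
        buckets[r["aligned"]][1] += 1
        buckets[r["aligned"]][0] += int(r["pred"] == r["label"])
    return {k: correct / total
            for k, (correct, total) in buckets.items() if total}

records = [
    {"pred": "a", "label": "a", "aligned": True},   # stereotype-aligned, correct
    {"pred": "a", "label": "b", "aligned": False},  # anti-stereotype, wrong
    {"pred": "b", "label": "b", "aligned": False},  # anti-stereotype, correct
]
acc = accuracy_by_alignment(records)
# A gap acc[True] > acc[False] is the hidden-bias signature discussed below.
```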

Bias Score

To quantify the degree of bias, a special metric is introduced — the bias score[1]. In general, the bias score reflects the percentage of a model's answers (among incorrect ones or all, depending on the condition) that align with a stereotype[1].

  • A value of +100% would mean that the model, in all cases, chose the answer option that stereotypically attributes a negative quality to the target group.
  • 0% — no manifestation of bias (the model either always answers correctly/“unknown” or errs equally in both directions).
  • A negative score (down to -100%) — the opposite tendency, where the model always answers against the stereotypical expectation[1].

The scores are calculated separately for ambiguous and disambiguated contexts, as the nature of errors differs between them[1].

  • For ambiguous questions, the bias score is determined by the proportion of cases where the model, instead of “unknown,” chose a specific answer that coincided with a negative stereotype[1]. The more frequent such answers are, the higher the positive score. Accuracy is also factored in: a model that answers “unknown” correctly part of the time receives a lower score than one that always chooses the stereotypical answer, even if both make stereotypical errors[1]. Thus, both the frequency of biased answers and the failure to abstain are penalized (for ambiguous contexts, the metric is scaled by the model's error rate, i.e., by how rarely it correctly answers “unknown”)[1].
  • For disambiguated questions, the bias score is calculated slightly differently, as the correct answer is one of the groups[1]. In these cases, the focus is on the model's incorrect answers: the proportion of errors where the model chose not the correct option, but an alternative that aligns with a stereotype[1]. In other words, if the model made a mistake by favoring a prejudice (e.g., disbelieving the facts and answering according to a stereotype), this increases the score[1].

Analyzing the bias score alongside overall accuracy allows for a detailed characterization of a model's behavior on BBQ. The authors note that the same accuracy can conceal different types of errors[1]. Thus, this metric reveals the directionality of errors and identifies subtle cases not visible from accuracy alone.

Results and Identified Patterns

Initial testing of several popular QA models on the BBQ dataset demonstrated a number of clear manifestations of bias[1]. The study by Parrish et al. (2022) tested both large general-purpose models (e.g., UnifiedQA, a generalized T5-based model for QA) and encoder models fine-tuned for multiple-choice QA (e.g., RoBERTa)[1].

The main conclusions from the experiments were:

  • Strong stereotypical errors when information is lacking. All tested systems showed a tendency to answer in line with stereotypes when the context did not provide necessary clues[1]. In other words, models often did not choose the “unknown” option, but preferred a specific answer that correlated with a stereotypical expectation[1]. For example, in ambiguous questions about a crime without a clear perpetrator, models frequently pointed to individuals from a specific group (corresponding to a prejudice)[1]. The calculated bias score for ambiguous contexts was significantly above zero, sometimes approaching +100% in certain categories for some models[1]. Models showed a particularly high tendency for stereotypical responses in scenes related to physical appearance (obesity, etc.) — this category yielded a noticeably higher bias than, for example, race or sexual orientation[1]. This indicates that bias is non-uniform within a model — some types of stereotypes are more strongly “learned” than others.
  • Improvement with facts, but persistence of hidden bias. When models received a disambiguated context with a clear indication of the correct answer, their accuracy increased significantly (compared to the ambiguous situation)[1]. However, a detailed analysis revealed a subtle effect: accuracy was uneven depending on the relationship between the correct answer and the stereotype[1]. On average, models achieved 3-3.5 percentage points higher accuracy on examples where the correct answer coincided with a common stereotype, compared to examples where the correct answer contradicted that stereotype[1]. In other words, when facts confirmed a prejudice, models answered almost flawlessly; but when required to name a “non-typical” option for a stereotype, the probability of error increased. Although not enormous, this performance gap was statistically significant across many categories[1]. The largest discrepancy was recorded for questions related to gender stereotypes, with up to a 5 percentage point difference[1]. Thus, the hidden influence of bias is evident: models perform slightly worse on average when working “against the stereotype.”
  • Comparison of categories and templates. The BBQ researchers analyzed the bias score broken down by all nine categories and found that in ambiguous contexts, the score was positive in all categories, but its magnitude varied[1]. As mentioned, the highest biases were observed in the categories of physical appearance, socio-economic status, and some intersectional categories[1]. Lower, though still non-zero, bias scores were found for race/ethnicity and sexual orientation[1]. In disambiguated contexts, the bias score was generally closer to zero (as the model often answers correctly), but for some templates, it remained positive, reflecting a noticeable skew in the nature of the errors made[1]. For example, in the religion category, most errors were skewed in one direction — when models erred, they typically chose an answer based on prejudice[1].

Overall, BBQ demonstrated that even powerful modern language models are clearly not free from social biases[1]. They are prone to reproducing stereotypes when faced with uncertainty and can exhibit subtle biases even when presented with facts that require a contrary answer[1]. The magnitude of these effects is not uniform across different groups: some stereotypes are more strongly “learned” by the model[1]. The authors of BBQ emphasize that while the detected differences are noticeable, they are not catastrophically large: the bias scores of most models do not reach extreme values, typically remaining within a few tens of percentage points[1]. Nevertheless, even small systematic deviations towards stereotypes are potentially dangerous in the large-scale deployment of LLMs, making the identification and elimination of such biases an important task[3]. BBQ has provided researchers with a clear and quantitatively measurable way to track progress in this area[3].

Impact and Further Research

BBQ quickly gained recognition as a standard tool for evaluating the fairness characteristics of language models[4]. Its open-source code and data are available in a repository (under a CC BY 4.0 license)[4], allowing the broad research community to use BBQ in the development and testing of new models. Several reviews mention BBQ alongside other benchmarks (e.g., StereoSet, WinoBias, ToxiGen) as an important milestone in the study of social bias in NLP[3]. Since BBQ's publication, works have emerged that build on its ideas and adapt them to new conditions:

  • Extension of Question Formats (Open-BBQ). The original BBQ offers tasks in a multiple-choice format[3]. In 2024, a modification of BBQ for open-ended answers was proposed, including fill-in-the-blank and short-answer text tasks[3]. This version, informally called Open-BBQ, allows for evaluating bias in more free-form dialogue settings where the model does not have fixed answer options[3]. The study showed that LLMs also exhibit increased bias against several groups when generating free text[3]. The authors of Open-BBQ also experimented with methods for mitigating bias, combining zero-shot and few-shot prompting and chain-of-thought (step-by-step reasoning)[3]. These methods significantly reduced the level of bias in the responses[3]. Open-BBQ complements the original dataset, making it possible to test generative models in formats closer to user queries.
  • Cultural Adaptation (Localization). Since BBQ is tied to the social realities of the United States, researchers became interested in adapting it to other languages and cultures[5]. In 2023, Korean researchers introduced the KoBBQ (Korean BBQ) dataset — a Korean counterpart to the Bias Benchmark[5]. They developed a general approach for localizing BBQ: they divided the original templates into three categories – those that could be simply translated, those that required replacing groups with local equivalents, and those that were not applicable in the Korean context at all[5]. Additionally, KoBBQ introduced 4 new categories of stereotypes specific to Korean society and removed several irrelevant examples[5]. The result was a dataset of 268 templates and 76,048 examples in Korean, covering 12 categories of social bias (including both original and new ones)[5]. Testing multilingual models on KoBBQ revealed significant differences in the level of bias compared to a direct machine translation of the original BBQ into Korean[5]. This highlights that direct translation is insufficient – culturally-specific benchmarks that consider the unique stereotypes and context of each country are necessary[5]. The work on KoBBQ demonstrated the feasibility of scaling the BBQ methodology globally.

BBQ has become an integral part of research on AI ethics[3]. Its influence is seen in the emergence of new debiasing techniques for models, the construction of more inclusive datasets, and metrics for fine-grained bias analysis. Researchers note that one of BBQ's strengths is its broad coverage and the careful construction of its examples[3]. In response to the challenges highlighted by BBQ, bias mitigation strategies have been actively developed recently, ranging from filtering training data to special post-processing algorithms and fine-tuning LLMs for fair responses[3].

In summary, BBQ (Bias Benchmark for QA) has established itself as a valuable and reliable tool for measuring social biases in language models. It provides the research community with a standard set of tests for comparing models on stereotypical behavior and tracking progress in improving their impartiality[3]. BBQ continues to be expanded and adapted, reflecting a global interest in creating more fair and safe AI systems[3], free from subtle but significant harmful biases.

References

  1. Parrish, A. et al. (2022). “BBQ: A Hand-Built Bias Benchmark for Question Answering”. arXiv. [1]
  2. Parrish, A. et al. (2022). “BBQ: A hand-built bias benchmark for question answering”. Findings of ACL. ACL Anthology. [2]
  3. Liu, Z. et al. (2024). “Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings”. arXiv preprint. [3]
  4. “BBQ Dataset”. Papers With Code. [4]
  5. Jin, J. et al. (2024). “KoBBQ: Korean Bias Benchmark for Question Answering”. arXiv preprint. [5]