Text-to-Image Bias Audit Benchmark

T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

Nihal Jaiswal1†, Siddhartha Arjaria1†, Gyanendra Chaubey2*, Ankush Kumar1†, Aditya Singh1†, Anchal Chaurasiya1†
1Department of Information Technology, Rajkiya Engineering College, Banda, Uttar Pradesh, India
2School of AI and Data Science, IIT Jodhpur, Rajasthan, India
†Equal contribution. *Corresponding author: m23air005@iitj.ac.in

Abstract

Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models—the first framework to address all three dimensions simultaneously.

We evaluate three open-source models—Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning—against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score).

Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54–1.00), confirming that alignment techniques do not resolve cultural coverage gaps.

Framework

[Figure: T2I-BiasBench framework diagram]
End-to-end pipeline from prompt families to model generation, caption/attribute extraction, and metric modules that produce composite bias and diversity indicators.

Evaluation Metrics

T2I-BiasBench uses thirteen complementary metrics (six established and seven additional) to capture multiple failure modes rather than relying on a single bias score.

  • Representation Parity
  • Parity Difference
  • Bias Amplification
  • Shannon Entropy
  • KL Divergence
  • Festival CAS (Contextual Association Score)
  • Composite Bias Score
  • Cultural Accuracy Ratio
  • GMR (Grounded Missing Rate)
  • IEMR (Implicit Element Missing Rate)
  • Hallucination Score
  • Vendi Score
  • CLIP Proxy Score (prompt-caption alignment)
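As an illustration of the distributional metrics at the top of this list, the sketch below computes per-group shares and a parity gap from attribute labels extracted from generated images. This page does not give the exact definitions used by T2I-BiasBench, so the function names, the choice to ignore labels outside the listed groups (such as "unknown"), and the max-minus-min gap are assumptions for illustration only. The labels are made-up toy data, not benchmark results.

```python
from collections import Counter

def group_shares(labels, groups):
    """Share of each group among the labels that belong to `groups`.

    Labels outside `groups` (e.g. "unknown") are ignored. This is an
    assumption, not necessarily how the benchmark treats them.
    """
    counts = Counter(label for label in labels if label in groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels belong to the listed groups")
    return {group: counts[group] / total for group in groups}

def parity_gap(shares):
    """Gap between the most and least represented group (0 means parity)."""
    return max(shares.values()) - min(shares.values())

# Hypothetical labels for twelve generated images.
toy_labels = ["male"] * 8 + ["female"] * 2 + ["unknown"] * 2
toy_shares = group_shares(toy_labels, ["male", "female"])
print(toy_shares, parity_gap(toy_shares))
```

A gap of 0 would mean every listed group appears equally often; a gap near 1 would mean one group dominates almost completely.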

Main Results

[Figure: Composite bias score across models and prompts]
Composite Bias Score

Aggregate fairness profile across all prompt families (0 = fair, 1 = biased).

[Figure: Bias profile radar for all models]
Bias Profile Radar

Cross-prompt comparison in which a smaller enclosed area indicates a fairer overall behavior profile.

[Figure: Vendi score and CLIP proxy score comparison]
Vendi + CLIP Proxy

Joint view of diversity (Vendi) and prompt alignment (CLIP Proxy) per prompt family.
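For readers unfamiliar with the diversity axis, here is a minimal sketch of the standard Vendi Score: the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n x n similarity matrix over the generated samples. This page lists its Vendi Score as adapted and does not specify the adaptation or the similarity features, so the code shows only the textbook form. The name `vendi_score` and the toy matrices are illustrative assumptions, and the sketch assumes NumPy is available.

```python
import numpy as np

def vendi_score(similarity):
    """Standard Vendi Score of a symmetric PSD similarity matrix with unit diagonal."""
    K = np.asarray(similarity, dtype=float)
    n = K.shape[0]
    if K.shape != (n, n):
        raise ValueError("similarity matrix must be square")
    # Eigenvalues of K/n are non-negative and sum to 1 when diag(K) == 1.
    eigenvalues = np.linalg.eigvalsh(K / n)
    # Drop numerically zero eigenvalues; 0 * log(0) is taken as 0.
    eigenvalues = eigenvalues[eigenvalues > 1e-12]
    entropy = -np.sum(eigenvalues * np.log(eigenvalues))
    return float(np.exp(entropy))

# Toy cases: three mutually dissimilar samples vs. three identical samples.
print(vendi_score(np.eye(3)))        # close to 3: maximal diversity
print(vendi_score(np.ones((3, 3))))  # close to 1: no diversity
```

The score can be read as an effective number of distinct samples, ranging from 1 (all samples identical) to n (all samples mutually dissimilar).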

[Figure: Animal baseline puzzle and lab context fidelity]
Animal Baseline Capability

Capability stress-test highlighting puzzle accuracy and lab-context faithfulness gaps.

Demographic Prompt Findings

[Figure: Beauty prompt ethnicity distribution]
Beauty Prompt: Ethnicity Distribution

Distribution skew by model, including unknown and medium-response share.

[Figure: Doctor prompt gender distribution]
Doctor Prompt: Gender Distribution

Parity deviation and stereotype reinforcement patterns across model outputs.

[Figure: Culture prompt CAS versus cultural accuracy]
Culture Prompt: CAS vs Cultural Accuracy

Measures stereotype intensity against culturally grounded correctness.

Additional Diagnostics

[Figure: Bias amplification in beauty and doctor prompts]
Bias Amplification

Amplification above 1.0 indicates active reinforcement beyond observed training bias.
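One way to read this threshold is as a ratio between a group's share in the generated images and its share in a reference distribution. The sketch below follows that reading; it is an assumption, since this page does not state the formula or the reference distribution the benchmark uses, and `bias_amplification` and the example shares are hypothetical.

```python
def bias_amplification(generated_share, reference_share):
    """Ratio of a group's share in generated images to its reference share.

    Values above 1.0 mean the model over-represents the group relative to
    the reference; values below 1.0 mean it under-represents the group.
    """
    if not 0.0 <= generated_share <= 1.0:
        raise ValueError("generated_share must lie in [0, 1]")
    if not 0.0 < reference_share <= 1.0:
        raise ValueError("reference_share must lie in (0, 1]")
    return generated_share / reference_share

# Hypothetical shares, not benchmark results.
print(bias_amplification(0.75, 0.5))  # 1.5: over-represented vs. reference
print(bias_amplification(0.25, 0.5))  # 0.5: under-represented vs. reference
```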

[Figure: KL divergence and Shannon entropy for beauty prompts]
KL Divergence and Shannon Entropy

Distance from a fair distribution (KL divergence) and diversity signal (Shannon entropy) for beauty-prompt generations.
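Both quantities have standard definitions, sketched below for a discrete distribution over demographic groups. The sketch assumes the fair reference is the uniform distribution over the listed groups, which this page does not confirm; `shannon_entropy` and `kl_to_uniform` are illustrative names, and the inputs are toy distributions rather than benchmark data.

```python
import math

def _check_distribution(p):
    """Validate that p is a probability distribution."""
    if any(x < 0 for x in p) or not math.isclose(sum(p), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must be non-negative and sum to 1")

def shannon_entropy(p):
    """Entropy in bits; higher means the groups are more evenly represented."""
    _check_distribution(p)
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_to_uniform(p):
    """KL(p || uniform) in bits; 0 means p matches the uniform reference."""
    _check_distribution(p)
    k = len(p)
    return sum(x * math.log2(x * k) for x in p if x > 0)

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0: four groups, evenly split
print(kl_to_uniform([1.0, 0.0, 0.0, 0.0]))        # 2.0: full collapse to one group
```

With a uniform reference over k groups the two are linked by KL(p || uniform) = log2(k) - H(p), so a drop in entropy shows up as an equal rise in divergence.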

Qualitative Gallery

[Figure: Generated image gallery, part 1]
Rows: Beauty, Doctor, Animal

Side-by-side generations and composite bias scores across models.

[Figure: Generated image gallery, part 2]
Rows: Nature, Culture

Prompt-level visual comparison with low/moderate/high bias bands.

Key Contributions

  1. The first unified thirteen-metric evaluation framework jointly capturing demographic bias, omission, diversity, and cultural fidelity in T2I models. We propose four new metrics—Composite Bias Score (CBS), Grounded Missing Rate (GMR), Implicit Element Missing Rate (IEMR), and Cultural Accuracy Ratio (CAR)—and adapt three existing metrics (Hallucination Score, Vendi Score, CLIP Proxy Score) for T2I bias evaluation.
  2. Empirical evidence that Stable Diffusion v1.5 and BK-SDM exhibit Bias Amplification >1.0 for beauty-related prompts, demonstrating that these models actively reinforce stereotypes beyond underlying training data distributions.
  3. Identification of a novel phenomenon, Visual Attribute Occlusion Prompting (VAOP), wherein contextual elements such as surgical PPE obscure demographic cues and significantly reduce measurable gender bias—a retraining-free mitigation strategy.
  4. Quantification of a systemic cultural representation collapse across all evaluated models—including RLHF-aligned Gemini—showing that diverse cultural prompts are mapped to a narrow subset of dominant representations.
  5. Demonstration that model scale does not monotonically predict bias severity, highlighting the dominant role of data composition and training dynamics over parameter count.

BibTeX

@misc{jaiswal2026t2ibiasbench,
  title={T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models},
  author={Jaiswal, Nihal and Arjaria, Siddhartha and Chaubey, Gyanendra and Kumar, Ankush and Singh, Aditya and Chaurasiya, Anchal},
  year={2026},
  eprint={2604.12481},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.12481},
  doi={10.48550/arXiv.2604.12481}
}