Text-to-Image Bias Audit Benchmark

T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

Nihal Jaiswal¹†, Siddhartha Arjaria¹†, Gyanendra Chaubey²*, Ankush Kumar¹†, Aditya Singh¹†, Anchal Chaurasiya¹†

¹Department of Information Technology, Rajkiya Engineering College, Banda, Uttar Pradesh, India
²School of AI and Data Science, IIT Jodhpur, Rajasthan, India

†Equal contribution. *Corresponding author: m23air005@iitj.ac.in

Abstract

Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models—the first framework to address all three dimensions simultaneously.

We evaluate three open-source models—Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning—against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score).

Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54–1.00), confirming that alignment techniques do not resolve cultural coverage gaps.

Framework

End-to-end pipeline from prompt families to model generation, caption/attribute extraction, and metric modules that produce composite bias and diversity indicators.

Evaluation Metrics

T2I-BiasBench uses thirteen complementary metrics (six established and seven additional) to capture multiple failure modes rather than relying on a single bias score.

Representation Parity
Parity Difference
Bias Amplification
Shannon Entropy
KL Divergence
Festival Contextual Association Score (CAS)
Composite Bias Score (CBS)
Cultural Accuracy Ratio (CAR)
Grounded Missing Rate (GMR)
Implicit Element Missing Rate (IEMR)
Hallucination Score
Vendi Score
CLIP Proxy Score (prompt-caption alignment)

Main Results

Composite Bias Score

Aggregate fairness profile across all prompt families (0 = fair, 1 = biased).

Bias Profile Radar

Cross-prompt comparison in which a smaller area indicates a fairer overall behavior profile.

Vendi + CLIP Proxy

Joint view of diversity (Vendi) and prompt alignment (CLIP Proxy) per prompt family.
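
For readers unfamiliar with the diversity axis, the standard definition of the Vendi Score is given here as a reference point. This page lists the Vendi Score as an adapted metric, so the variant used in T2I-BiasBench may differ in its choice of similarity function or normalization. The symbols K (an n × n positive semidefinite similarity matrix over the n generated samples, with ones on the diagonal) and λ_i are notation introduced for this note.

```latex
\mathrm{VS}(K) \;=\; \exp\!\left(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\right),
\qquad \lambda_1, \dots, \lambda_n \ \text{the eigenvalues of } K/n,
\qquad 0 \log 0 := 0 .
```

Under this definition the score ranges from 1 (all samples identical under the similarity function) to n (all samples mutually dissimilar), so it can be read as an effective number of distinct outputs.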

Animal Baseline Capability

Capability stress-test highlighting puzzle accuracy and lab-context faithfulness gaps.

Demographic Prompt Findings

Beauty Prompt: Ethnicity Distribution

Distribution skew by model, including unknown and medium-response share.

Doctor Prompt: Gender Distribution

Parity deviation and stereotype reinforcement patterns across model outputs.
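
One common way to quantify the gap behind a parity-deviation plot is the absolute difference between two groups' shares of the generated images. The sketch below assumes that formulation; the function names, the label strings, and the counts are illustrative and are not taken from the benchmark code or its results.

```python
from collections import Counter


def group_shares(labels):
    """Return each label's share of the sample as a dict of fractions."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def parity_difference(labels, group_a, group_b):
    """Absolute gap between the shares of two groups (0 = parity).

    Assumed formulation: |share(group_a) - share(group_b)|.
    """
    shares = group_shares(labels)
    return abs(shares.get(group_a, 0.0) - shares.get(group_b, 0.0))


# Hypothetical perceived-gender labels for ten "doctor" generations.
doctor_labels = ["male"] * 7 + ["female"] * 3
doctor_gap = parity_difference(doctor_labels, "male", "female")
print(f"parity difference: {doctor_gap:.2f}")
```

A value of 0 means the two groups appear equally often, and a value of 1 means only one of them appears at all.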

Culture Prompt: CAS vs Cultural Accuracy

Measures stereotype intensity against culturally grounded correctness.

Additional Diagnostics

Bias Amplification

Amplification above 1.0 indicates active reinforcement beyond observed training bias.
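
The 1.0 threshold is easiest to read as a ratio. The sketch below assumes one simple formulation, the share of a group in the generated images divided by its share in a reference distribution. The exact definition used by T2I-BiasBench is not stated on this page, and the function name and the numbers are hypothetical.

```python
def bias_amplification(generated_share, reference_share):
    """Ratio of a group's share in generated images to its reference share.

    Assumed formulation: values above 1.0 mean the model over-represents
    the group relative to the reference; 1.0 means it mirrors the reference.
    """
    if not 0.0 < reference_share <= 1.0:
        raise ValueError("reference_share must be in (0, 1]")
    if not 0.0 <= generated_share <= 1.0:
        raise ValueError("generated_share must be in [0, 1]")
    return generated_share / reference_share


# Hypothetical numbers: a group holds 60% of the reference distribution
# but 90% of the generated images, giving a ratio of about 1.5.
amplification = bias_amplification(generated_share=0.9, reference_share=0.6)
print(f"bias amplification: {amplification:.2f}")
```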

KL Divergence and Shannon Entropy

Fair-distribution distance and diversity signal for beauty prompt generations.
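
Both quantities can be computed from the label counts of a prompt's generations. The sketch below assumes the fair reference is the uniform distribution over a fixed category list and uses base-2 logarithms. The category names and labels are illustrative placeholders, not benchmark data.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Empirical probability of each category, in the order given."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in categories]


def shannon_entropy(p):
    """Entropy in bits; higher means a more even spread across categories."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_to_uniform(p):
    """KL(p || uniform) in bits; 0 means p matches the uniform reference."""
    u = 1.0 / len(p)
    return sum(x * math.log2(x / u) for x in p if x > 0)


# Hypothetical perceived-ethnicity labels for eight "beauty" generations.
categories = ["group_a", "group_b", "group_c", "group_d"]
labels = ["group_a"] * 5 + ["group_b"] * 2 + ["group_c"]
p = distribution(labels, categories)

entropy_bits = shannon_entropy(p)
kl_bits = kl_to_uniform(p)
print(f"entropy: {entropy_bits:.3f} bits, KL to uniform: {kl_bits:.3f} bits")
```

With a uniform reference the two measures are complementary: the KL divergence equals log2 of the number of categories minus the entropy, so a collapse onto one category drives entropy toward 0 and KL toward its maximum.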

Qualitative Gallery

Rows: Beauty, Doctor, Animal

Side-by-side generations and composite bias scores across models.

Rows: Nature, Culture

Prompt-level visual comparison with low/moderate/high bias bands.

Key Contributions

The first unified thirteen-metric evaluation framework jointly capturing demographic bias, omission, diversity, and cultural fidelity in T2I models. We propose four new metrics—Composite Bias Score (CBS), Grounded Missing Rate (GMR), Implicit Element Missing Rate (IEMR), and Cultural Accuracy Ratio (CAR)—and adapt three existing metrics (Hallucination Score, Vendi Score, CLIP Proxy Score) for T2I bias evaluation.
Empirical evidence that Stable Diffusion v1.5 and BK-SDM exhibit Bias Amplification >1.0 for beauty-related prompts, demonstrating that these models actively reinforce stereotypes beyond underlying training data distributions.
Identification of a novel phenomenon, Visual Attribute Occlusion Prompting (VAOP), wherein contextual elements such as surgical PPE obscure demographic cues and significantly reduce measurable gender bias—a retraining-free mitigation strategy.
Quantification of a systemic cultural representation collapse across all evaluated models—including RLHF-aligned Gemini—showing that diverse cultural prompts are mapped to a narrow subset of dominant representations.
Demonstration that model scale does not monotonically predict bias severity, highlighting the dominant role of data composition and training dynamics over parameter count.

BibTeX

@misc{jaiswal2026t2ibiasbench,
  title={T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models},
  author={Jaiswal, Nihal and Arjaria, Siddhartha and Chaubey, Gyanendra and Kumar, Ankush and Singh, Aditya and Chaurasiya, Anchal},
  year={2026},
  eprint={2604.12481},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.12481},
  doi={10.48550/arXiv.2604.12481}
}
