Racial Bias in Facial Recognition: The Role of Training Data and Human Prejudice


Summary

This project investigates how dataset imbalance and annotator prejudice contribute to racial bias in facial recognition systems.
Studies 1 and 2 focus on how training data diversity affects model performance across racial groups.
Study 3 examines how human bias in labeling influences classification accuracy, integrating psychological measures with deep learning pipelines.

Key Findings

  1. Imbalanced datasets reduced categorization accuracy for underrepresented racial groups, especially racially ambiguous faces.
  2. Annotator bias was embedded in the training pipeline, leading to misclassification patterns that reflected human prejudice.

Team & Collaboration

  • Jeff J. Berg, Ph.D., New York University
  • David M. Amodio, Ph.D., University of Amsterdam & New York University

Goal

To understand how training set balance and human social biases shape the fairness of AI models—bridging psychology and computer vision to uncover where inequality truly originates in algorithmic decision-making.

Why it matters

  • Reveals how bias originates in human labeling, not just in algorithms.
  • Provides insight for building responsible, fair, and interpretable AI systems.
  • Encourages interdisciplinary collaboration between psychology and machine learning to design socially aware technologies.

Challenge

Training Set Imbalance (Study 1–2):

  • Navigate the abundance of existing face datasets and determine which best fit the research goals.
  • Handle scale with automation: ingestion, labeling, and file movement for millions of face images.
  • Move beyond monoracial test sets by evaluating performance on racially ambiguous and multiracial faces to reflect real-world diversity.

Addressing Human Annotator Bias (Study 3):

  • Determine whether human bias—beyond dataset imbalance—contributes to bias observed in AI systems.
  • Integrate psychological data into machine learning pipelines.
  • Develop automated systems to collect large-scale human data and prepare it efficiently for model training.

Solutions to the challenge

Training Set Imbalance (Study 1–2):

  • Reviewed and compared existing datasets to analyze their racial composition and representation.
  • Built custom Python scripts for automation (image handling, labeling, and organization) in the era before LLM-based coding assistants, learning primarily from Stack Overflow (a sketch follows this list).
  • Generated racially ambiguous faces using WebMorph to bridge gaps in existing datasets.
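The original automation scripts were study-specific, but a minimal sketch of the kind of image-organization step involved might look like the following; the CSV name, folder names, and column layout are hypothetical, not the project's actual files.

```python
# Minimal sketch of an image-organization step: sort face images into
# race/gender subfolders based on a metadata file.
# Assumes a hypothetical labels.csv with columns: filename, race, gender.
import shutil
from pathlib import Path

import pandas as pd

SRC_DIR = Path("raw_images")        # hypothetical source folder
DST_DIR = Path("organized_images")  # destination, organized by race/gender

labels = pd.read_csv("labels.csv")

for row in labels.itertuples(index=False):
    target_dir = DST_DIR / row.race / row.gender
    target_dir.mkdir(parents=True, exist_ok=True)
    src = SRC_DIR / row.filename
    if src.exists():
        shutil.copy2(src, target_dir / row.filename)
```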

Addressing Human Annotator Bias (Study 3):

  • Statistically controlled for annotator bias to isolate human contributions to AI bias.
  • Launched large-scale data collection on MTurk to gather annotations from diverse human participants.
  • Coded automation pipelines for data handling and processing, again before LLM-based coding assistants, learning primarily from Stack Overflow (a cleaning sketch follows below).
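As a rough illustration of this kind of pipeline, the sketch below concatenates and normalizes annotation exports with Pandas and regex; the file names, columns, and label mapping are assumptions for illustration, not the study's actual schema.

```python
# Minimal sketch of an annotation-cleaning step, assuming hypothetical
# export files named batch_*.csv with columns: worker_id, image_id, race_label.
import glob
import re

import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("batch_*.csv")]
labels = pd.concat(frames, ignore_index=True)

def normalize(label: str) -> str:
    """Map raw label strings onto a small canonical set (illustrative only)."""
    label = re.sub(r"[^a-z ]", "", str(label).strip().lower())
    if "black" in label:
        return "Black"
    if "white" in label:
        return "White"
    return "Other"

labels["race_label"] = labels["race_label"].map(normalize)
labels = labels.drop_duplicates(subset=["worker_id", "image_id"])
labels.to_csv("clean_labels.csv", index=False)
```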

Study 1-2: Dataset Imbalance

Research Process

Stage 1: Identifying the Problem

  • Challenge: Automated face classification models frequently misclassify non-White faces due to biased training datasets.
  • Key Question: How does dataset racial balance affect a model’s ability to accurately classify both monoracial and mixed-race faces?
  • Psychological Context: Reflects findings from social categorization research—biased exposure skews perceptual boundaries, analogous to biased data exposure in AI.

Dataset composition overview
The racial compositions of the most popular publicly available face image datasets with race annotations.



Stage 2: Developing the Study Design

  • Objective: Examine how dataset diversity, independent of dataset size, influences classification performance.
  • Approach:
    • Study 1: Tested overall race/gender classification accuracy across datasets.
    • Study 2: Examined racial ambiguity via mixed-race face morphs.
  • Datasets:
    • UTKFace: ~23,000 images, overrepresenting White and male faces.
    • FairFace: ~108,000 images balanced across racial and gender categories.
  • Programming Environment:
    • Implemented in Python using PyTorch (for deep learning) and NumPy / Pandas (for data handling).
    • Visualizations generated with Matplotlib and Seaborn for plotting classification accuracy, and with R and the quickpsy package for PSE curves.
  • Model Architecture: All experiments used ResNet-50, pretrained on ImageNet, retrained on customized datasets.
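A minimal PyTorch sketch of this setup is shown below: ResNet-50 with ImageNet weights and a replaced classification head. The class count, optimizer, and learning rate are illustrative assumptions, not the study's exact configuration.

```python
# Minimal PyTorch sketch: ImageNet-pretrained ResNet-50 with the final
# fully connected layer replaced for face-attribute classification.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # e.g., race categories in the training labels (assumption)

# ImageNet weights (older torchvision versions used models.resnet50(pretrained=True)).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative hyperparameters

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    """One supervised update on a batch of face images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```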

Model performance comparison (UTKFace vs FairFace)
Demographic composition of the original UTKFace dataset (UTKoriginal) and the original FairFace dataset (FForiginal).

Stage 3: Dataset Training

  • Gathered 40,000+ face images from both UTKFace and FairFace.
  • Constructed balanced and size-matched subsets to disentangle effects of diversity vs. dataset size.
  • Generated mixed-race face morphs (Black↔White continuum) for Study 2 to test classification thresholds.
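A minimal Pandas sketch of how balanced, size-matched subsets can be constructed is shown below; the metadata file, column names, and per-cell sample size are assumptions for illustration.

```python
# Sketch of building a balanced, size-matched subset: sample an equal number
# of images per race x gender cell so subsets differ in diversity but not size.
import pandas as pd

PER_CELL = 500  # illustrative cell size (assumption)

metadata = pd.read_csv("face_metadata.csv")  # hypothetical: filename, race, gender

balanced = (
    metadata.groupby(["race", "gender"], group_keys=False)
    .apply(lambda g: g.sample(n=min(PER_CELL, len(g)), random_state=42))
    .reset_index(drop=True)
)

# Size-matched but unbalanced comparison set: same total N, original proportions.
unbalanced = metadata.sample(n=len(balanced), random_state=42).reset_index(drop=True)
```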

Stage 4: Analysis

  • Metrics:
    • Race and gender classification accuracy.
    • Race × gender interaction effects.
    • Point of Subjective Equality (PSE): Morph level where faces are equally likely to be labeled Black or White.
  • Comparisons:
    • UTKFace vs. FairFace models.
    • Balanced vs. unbalanced subsets.
    • Male vs. female face classifications.
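For illustration, a per-group accuracy breakdown of this kind can be computed with a simple Pandas group-by; the prediction file and column names below are hypothetical.

```python
# Sketch of the per-group accuracy metrics: compare predictions with ground
# truth by race and by race x gender. Column names are assumptions.
import pandas as pd

results = pd.read_csv("model_predictions.csv")  # hypothetical: true_race, pred_race, gender
results["correct"] = results["true_race"] == results["pred_race"]

accuracy_by_race = results.groupby("true_race")["correct"].mean()
accuracy_by_race_gender = (
    results.groupby(["true_race", "gender"])["correct"].mean().unstack()
)

print(accuracy_by_race)
print(accuracy_by_race_gender)
```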

More on analysis strategy

  • Generated racially ambiguous faces using webmorph.org (by Dr. Lisa DeBruine).
  • Quantified the effect of dataset diversity on classification of racially ambiguous faces using Points of Subjective Equality (PSEs).
  • PSE indicates the morph point at which a face is equally likely to be classified as Black or White:
    • 50% PSE → classification matches morph composition.
    • <50% PSE → face is classified as Black with less than 50% Black content (hypodescent pattern).
  • Computed PSEs for two models trained on different datasets (UTKoriginal vs. FForiginal) to assess dataset influence.
  • Analyses implemented in R using the quickpsy package.
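The PSE analyses themselves were run in R with quickpsy; purely for illustration, a rough Python equivalent fits a logistic psychometric curve to the classification probabilities and reads off the 50% point. The response data below are made-up placeholders.

```python
# Illustrative PSE estimation: fit a logistic psychometric function to the
# probability of a "Black" classification as a function of % Black morph
# content, then take the 50% point as the PSE.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """Psychometric function: P('Black') given morph level x (0-100)."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Placeholder data: morph levels and proportion of "Black" classifications.
morph_level = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
p_black = np.array([0.02, 0.05, 0.08, 0.15, 0.30, 0.55, 0.78, 0.90, 0.96, 0.99, 1.0])

(x0, k), _ = curve_fit(logistic, morph_level, p_black, p0=[50.0, 0.1])
pse = x0  # morph level at which P('Black') = .5
print(f"PSE = {pse:.1f}% Black content")
```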

Stage 5: Results

  • Accuracy: Balanced datasets (FairFace) produced higher cross-group accuracy and fewer racial misclassifications.
  • Bias Direction: UTKFace-trained models showed systematic over-classification of White faces and under-classification of Black faces.
  • Ambiguity Sensitivity:
    • Imbalanced models classified mixed-race faces as White until ~70–80% Black composition.
    • Balanced models’ PSEs approached 50%, aligning more closely with perceptual reality.
  • Stability: PSE values ranged widely (30–93%) across data subsets, showing extreme sensitivity to sampling bias.

Dataset morphing examples
Example of White and Black race classification probabilities for a single face morph spectrum (as classified by the UTKoriginal model).



Dataset PSE distribution
Separate PSE curves for the model trained on the original UTKFace dataset (UTKoriginal) and the model trained on the original FairFace dataset (FForiginal), separately for male faces and for female faces. The error bars represent the 95% confidence interval around the PSE for each gender category.



Stage 6: Interpretation

  • Core Finding: Dataset racial diversity—not size—determines fairness and perceptual accuracy.
  • Psychological Parallel: Similar to own-race bias in human face perception—exposure diversity improves perceptual generalization.
  • UX Implication: Data representativeness is a design decision, shaping user-facing fairness and trust outcomes.
  • Ethical Takeaway: Diversity audits should be built into the dataset design process, much like user diversity testing in UX workflows.

Study 3: Annotator Prejudice Propagation

Research Process

Stage 1: Identifying the Problem

  • Challenge: Even with balanced datasets, human labelers may inject bias through subjective demographic judgments.
  • Key Question: How do individual prejudice levels among annotators affect the racial labeling of faces and downstream model behavior?
  • Psychological Context: Builds on decades of research showing that prejudice alters categorical perception and ambiguous face processing.

Stage 2: Developing the Study Design

  • Objective: Quantify how annotator racial prejudice influences face classification models.
  • Approach:
    • Recruit annotators with varying prejudice levels (measured via validated psychological scales).
    • Have each group label identical faces by race.
    • Train models separately on high-prejudice and low-prejudice label sets.
  • Focus: Observe how label bias affects both accuracy and sensitivity to racial ambiguity.

Annotator procedure diagram
Schematic design of the study's face label division and model-training processes.



Stage 3: Data Collection

  • Recruitment Platform:
    Participants were recruited via Amazon Mechanical Turk (MTurk) using the CloudResearch platform, which supported high-quality data collection and demographic diversity across annotators.

  • Target Structure:
    The study aimed to obtain at least 10 annotators per image set, with 100 unique image sets total (each set included 160 images, representing every combination of race × gender).

  • Final Sample Size:

    • Total Recruited: 1,203 annotators
    • Valid Participants after exclusions: 1,196
    • Label Output: 192,480 race labels and 192,480 gender labels
  • Annotator Demographics:

    • Gender: 539 female, 648 male, 9 non-binary or other
    • Race:
      • 953 White
      • 100 Black
      • 21 Chinese
      • 13 Asian Indian
      • 12 Filipino
      • 10 Korean
      • 31 “Other”
      • 56 Multiracial
    • Age: Mean = 39.5 years, SD = 11.3 (Range: 19–79)
  • Analytic Subsample for Focal Analyses:
    To examine the effects of racial prejudice on labeling and model bias, only White monoracial annotators (N = 953) were included in the main analysis. This decision follows precedent from social perception literature that focuses on how dominant-group biases influence classification accuracy.

  • Annotator Assessments:
    Each annotator completed a demographic, experiential, and psychological profile including measures of racial attitudes and prejudice.

    • Key Psychological Variable: Self-reported racial prejudice, assessed through feeling thermometers toward four racial groups (White, Black, East Asian, South Asian).
    • Derived Index: The prejudice score was computed as the difference between warmth toward Whites and the mean warmth toward non-White groups, where higher scores indicate greater prejudice (a computation sketch follows at the end of this list).
  • Programming Environment:
    Data collection and stimulus presentation were implemented using a Python-Flask web interface (see screenshot below).

  • Analytic Note:
    The data collection design allowed the creation of separate face label datasets (high-prejudice vs. low-prejudice), enabling causal testing of how annotator bias propagates through training data into algorithmic classification errors.
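A minimal sketch of the derived prejudice index and the high- versus low-prejudice label split might look like the following; the survey and label file names, column names, and the median-split rule are assumptions for illustration, not the study's exact procedure.

```python
# Sketch of the prejudice index (warmth toward Whites minus mean warmth toward
# non-White groups) and a split of annotations into high- vs. low-prejudice
# label sets. File and column names are hypothetical; split rule is assumed.
import pandas as pd

annotators = pd.read_csv("annotator_surveys.csv")  # hypothetical survey export
nonwhite_cols = ["therm_black", "therm_east_asian", "therm_south_asian"]

annotators["prejudice"] = (
    annotators["therm_white"] - annotators[nonwhite_cols].mean(axis=1)
)

cutoff = annotators["prejudice"].median()  # assumed median split
annotators["group"] = (annotators["prejudice"] > cutoff).map({True: "high", False: "low"})

labels = pd.read_csv("clean_labels.csv")  # hypothetical: worker_id, image_id, race_label
labels = labels.merge(annotators[["worker_id", "group"]], on="worker_id")
high_prejudice_labels = labels[labels["group"] == "high"]
low_prejudice_labels = labels[labels["group"] == "low"]
```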


Screenshot of web interface for Annotation Task

Annotator accuracy plot
Race-specific race classification accuracy rates for the high-prejudice model and low-prejudice model, when tested on the held-aside test set with labels from the same annotator groups (high-prejudice versus low-prejudice). The error bars represent the 95% confidence interval around the average accuracy rate for a given combination of face classification model and racial group.

Stage 4: Analysis

  • Programming Tools:
    • Python (PyTorch, Pandas).
    • Statistical comparisons using R (for between-group analyses and significance testing).
  • Metrics:
    • Race classification accuracy on held-out data.
    • Accuracy on mid-prejudice label validation sets.
    • PSE curves for mixed-race morph stimuli.
    • Label agreement rates across annotator groups.
  • Analytic Comparisons:
    • High- vs. low-prejudice models.
    • Accuracy differences by race category.
    • Boundary shifts in ambiguous classification (PSE displacement).
  • PSE Score Analysis Plan
    • PSEs were computed for each morph continuum and then averaged within each model type:
      • Mean PSE for low-prejudice model
      • Mean PSE for high-prejudice model
    • Used nonparametric bootstrapping (1,000 samples) to estimate 95% CIs and compare mean PSEs.
    • Model differences inferred from non-overlapping confidence intervals.
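A minimal sketch of this percentile bootstrap is shown below; the per-continuum PSE values are placeholders, and only the 1,000-sample count comes from the plan above.

```python
# Sketch of a nonparametric bootstrap for mean PSEs: resample per-continuum
# PSEs with replacement (1,000 samples) and take percentile 95% CIs.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(values: np.ndarray, n_boot: int = 1000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

pse_low = np.array([48.2, 51.0, 47.5, 50.3])   # placeholder per-continuum PSEs
pse_high = np.array([41.1, 43.8, 39.9, 42.5])  # placeholder per-continuum PSEs

ci_low = bootstrap_ci(pse_low)
ci_high = bootstrap_ci(pse_high)
# Non-overlapping intervals are taken as evidence of a model difference.
print(ci_low, ci_high)
```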

Analysis Detail

  • Generated racially ambiguous faces using webmorph.org (by Dr. Lisa DeBruine).
  • Quantified the effect of annotator prejudice on classification of racially ambiguous faces using Points of Subjective Equality (PSEs).
  • PSE indicates the morph point at which a face is equally likely to be classified as Black or White:
    • 50% PSE → classification matches morph composition.
    • <50% PSE → face is classified as Black with less than 50% Black content (hypodescent pattern).
  • Computed PSEs for the two models trained on different label sets (high-prejudice vs. low-prejudice) to assess annotator influence.
  • Analyses implemented in R using the quickpsy package.

Annotator PSE estimates
PSE curves for the model trained on the high-prejudice annotator labels (high-prejudice model) and the model trained on the low-prejudice annotator labels (low-prejudice model). The error bars represent the 95% confidence interval around the PSE for each model.



Stage 5: Results

  • Race Classification Accuracy:

    • Significant main effect of face race on accuracy (χ² = 36.6, p < .001).
    • Average accuracy across models:
      • Black faces — 79.6%
      • White faces — 75.4%
      • East Asian faces — 64.1%
      • South Asian faces — 58.6%
    • Model × Face Race Interaction: χ² = 30.4, p < .001
      • Low-prejudice model: 85.1% accuracy for Black faces (95% CI [81.9%, 87.7%])
      • High-prejudice model: 72.7% accuracy for Black faces (95% CI [69.0%, 76.1%])
      • Difference significant at p < .001
    • No significant accuracy differences for White (p = .982), East Asian (p = .344), or South Asian (p = .088) faces.
  • Ambiguity Sensitivity (PSE Analysis):

    • Analyzed 4,180 Black–White morph images ranging from 0–100% racial composition.
    • PSE = morph level with equal (.5) classification probability for Black vs. White.
    • High-prejudice model:
      • Showed lower PSE, labeling faces as “Black” even with less Black facial content.
      • Reflected a steeper, more categorical boundary in race perception.
    • Low-prejudice model:
      • Displayed smoother, more continuous PSE transitions, indicating flexible boundary detection and higher fairness in ambiguous classifications.
  • Bias Propagation:

    • Annotator-level prejudice measurably altered the model’s internal race classification boundary.
    • Bias effects were strongest for Black faces, consistent with U.S. racial salience hierarchies.
    • Demonstrated that social bias is not random noise—it’s a quantifiable variable that directly impacts model performance.
    • Models trained on low-prejudice labels showed better cross-racial generalization when tested on mixed-prejudice label sets.

Stage 6: Interpretation

  • Core Insight:
    Human social prejudices are systematically transmitted to machine learning systems through the labeling process, shaping how algorithms perceive and categorize race.

  • Psychological Parallel:
    Mirrors cognitive findings that implicit and explicit prejudice narrow perceptual boundaries—machines “inherit” this perceptual compression when trained on biased human data.

  • Technical Implication:
    Even identical architectures (ResNet-50) diverge in fairness outcomes solely due to human label source differences—emphasizing that training data quality is a psychological, not just computational, variable.

  • UX and Ethical Design Lesson:

    • Treat data annotation as a human-factors design challenge rather than a purely technical step.
    • Implement bias audits, annotator diversity training, and label documentation as part of UX-oriented ethical curation workflows.
    • Integrate fairness monitoring early in model development—analogous to usability testing in UX research.

Stage 7: Broader Implications

The study offers empirical evidence that reducing human prejudice in labeling improves algorithmic fairness, demonstrating how psychology and machine learning must co-design ethical AI systems.

Programming & Tools Summary

Stage | Programming Language/Software | Key Libraries / Methods
Data Collection | Python (Flask web app), HTML/CSS | Flask task interface; data hosted via CloudResearch/MTurk integration
Data Cleaning | Python | Pandas, NumPy, and regex for preprocessing and cleaning annotator label files
Model Training & Evaluation | Python | PyTorch for ResNet-50 training
Statistical Analysis | R & Python | R (quickpsy) for PSE comparisons; Python (SciPy, bootstrapping routines) for accuracy analyses
Visualization | Python & R | Matplotlib, quickpsy, and ggplot2 for plotting classification accuracy and PSE curves
Lee Qianqian Cui
Ph.D. Researcher in Social Psychology