Who Does the Giant Number Pile Like Best: Analyzing Fairness in Hiring Contexts
Abstract
Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring, yet their potential for unfair decision-making and outcomes remains understudied, particularly in generative settings. In this work, we examine the fairness of LLM-based hiring systems through two real-world tasks: resume summarization and retrieval. By constructing a synthetic resume dataset and curating job postings, we investigate whether model behavior differs across demographic groups and is sensitive to demographic perturbations. Our findings reveal that race-based differences appear in approximately 10% of generated summaries, while gender-based differences occur in only 1%. In the retrieval setting, all evaluated models display non-uniform selection patterns across demographic groups and exhibit high sensitivity to both gender- and race-based perturbations. Surprisingly, retrieval models demonstrate comparable sensitivity to non-demographic changes, suggesting that fairness issues may stem, in part, from general model brittleness. Overall, our results indicate that LLM-based hiring systems, especially at the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world contexts.
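The retrieval analyses summarized above reduce to two measurable checks: comparing selection rates across demographic groups, and measuring how much a top-k ranking changes when a single attribute of each resume is perturbed. The Python sketch below illustrates both checks under stated assumptions: the token-overlap `score` function, the resume dict fields (`id`, `group`, `text`), and the `perturb` hook are illustrative placeholders, not the paper's actual models, pipeline, or data.

```python
from collections import Counter

def score(job: str, resume: str) -> float:
    # Placeholder relevance score (token overlap). An LLM-based hiring
    # system would use an embedding or reranking model here instead.
    j, r = set(job.lower().split()), set(resume.lower().split())
    return len(j & r) / max(len(j), 1)

def top_k(job, resumes, k):
    # Rank resumes by relevance and keep the k highest-scoring ones.
    return sorted(resumes, key=lambda r: score(job, r["text"]), reverse=True)[:k]

def selection_rates(job, resumes, k):
    # Check 1: fraction of each demographic group's resumes selected.
    # Uniform rates across groups are the fairness baseline.
    chosen = Counter(r["group"] for r in top_k(job, resumes, k))
    totals = Counter(r["group"] for r in resumes)
    return {g: chosen[g] / totals[g] for g in totals}

def perturbation_sensitivity(job, resumes, k, perturb):
    # Check 2: fraction of the top-k that changes after each resume is
    # perturbed (e.g., a name swap signaling a different demographic).
    # `perturb` must return a modified copy that keeps the same "id".
    base = {r["id"] for r in top_k(job, resumes, k)}
    after = {r["id"] for r in top_k(job, [perturb(r) for r in resumes], k)}
    return len(base ^ after) / (2 * k)
```

Running a non-demographic control perturbation (e.g., reordering bullet points) through the same `perturbation_sensitivity` function is what would separate demographic bias from the general brittleness the abstract notes.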
- Publication:
- arXiv e-prints
- Pub Date:
- January 2025
- DOI:
- 10.48550/arXiv.2501.04316
- arXiv:
- arXiv:2501.04316
- Bibcode:
- 2025arXiv250104316S
- Keywords:
- Computer Science - Computation and Language