Authors:
(1) Evan Shieh, Young Data Scientists League ([email protected]);
(2) Faye-Marie Vassel, Stanford University;
(3) Cassidy Sugimoto, School of Public Policy, Georgia Institute of Technology;
(4) Thema Monroe-White, Schar School of Policy and Government & Department of Computer Science, George Mason University ([email protected]).
Table of Links
1.1 Related Work and Contributions
2.1 Textual Identity Proxies and Socio-Psychological Harms
2.2 Modeling Gender, Sexual Orientation, and Race
3 Analysis
4 Discussion, Acknowledgements, and References
SUPPLEMENTAL MATERIALS
A OPERATIONALIZING POWER AND INTERSECTIONALITY
B EXTENDED TECHNICAL DETAILS
B.1 Modeling Gender and Sexual Orientation
B.3 Automated Data Mining of Textual Cues
B.6 Median Racialized Subordination Ratio
B.7 Extended Cues for Stereotype Analysis
C ADDITIONAL EXAMPLES
C.1 Most Common Names Generated by LM per Race
C.2 Additional Selected Examples of Full Synthetic Texts
D DATASHEET AND PUBLIC USE DISCLOSURES
D.1 Datasheet for Laissez-Faire Prompts Dataset
2.1 Textual Identity Proxies and Socio-Psychological Harms
We analyze LM-generated synthetic texts for bias using language cues that have been shown to induce socio-psychological harms that disproportionately affect minoritized consumers. In this study, we focus specifically on textual identity proxies for race, gender, and sexual orientation. Our approach is guided by established cognitive studies showing how stereotypes can be automatically activated in the minds of people who are shown specific words associated with race and gender (an example of priming [49]). Once primed, these stereotypes can lead to significant changes in behavior [50], attitudes [21], performance [22, 40, 51, 52], and self-perception [41], in addition to reinforcing prejudiced perceptions of other identity groups [42]. One relevant example is stereotype threat [40], where priming for stereotypes contributes to decreased cognitive performance for minoritized individuals, including women in quantitative classrooms [58] and African-American and Latine students across academic disciplines [40, 52]. Stereotype threat is therefore a form of cognitive load impairment, and it helps explain persistent performance gaps between identity groups that socio-economic factors alone do not account for [51]. Alarmingly, activating stereotype threat does not require the reader to be consciously aware that they are being primed; this lack of awareness may in fact magnify the effect [49]. This fits our study setting, where race, gender, and sexual orientation are not explicitly prompted for (see Table 1), leaving consumers of LMs especially susceptible to these triggered harms.
Following stereotyping studies that prime participants using word lists [21, 22], we analyze LM-generated texts for race proxies (names) and gender proxies (pronouns, titles, and gendered references). Table 2 shows the similarities between the textual proxies that we match in our study and words that have been shown in psychology studies to prime stereotype threat by race and gender. This experimental design has additional precedent in sociotechnical studies that report discriminatory outcomes in hiring [28] and targeted search advertisements [3] in response to equivalent proxies.
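To illustrate the word-list matching step, the sketch below counts gendered cues in a generated text. The cue sets are abbreviated, hypothetical stand-ins for the full proxy lists referenced in Table 2 and Table S6a, and the function name is illustrative, not our implementation.

```python
# Hypothetical sketch of word-list matching for gender proxies. The cue sets
# below are abbreviated stand-ins; the full proxy lists are in Tables 2 / S6a.
import re

GENDER_CUES = {
    "NB": {"they", "them", "theirs", "mx."},
    "F": {"she", "her", "hers", "ms.", "mrs.", "mother", "daughter"},
    "M": {"he", "him", "his", "mr.", "father", "son"},
}

def match_gender_proxies(text: str) -> dict:
    """Count gendered pronouns, titles, and references per gender category."""
    tokens = re.findall(r"[a-z]+\.?", text.lower())
    return {label: sum(tokens.count(cue) for cue in cues)
            for label, cues in GENDER_CUES.items()}

print(match_gender_proxies("She thanked her mentor, and he smiled."))
# {'NB': 0, 'F': 2, 'M': 1}
```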
To extract textual identity proxies at scale, we fine-tune a coreference resolution model (ChatGPT 3.5) using 150 hand-labeled examples to address the underperformance of pretrained LMs on underrepresented groups (e.g., non-binary) [58]. On an evaluation dataset of 4,600 uniformly down-sampled LM-generated texts, our fine-tuned model achieves 98.0% gender precision, 98.1% name precision, 97.0% gender recall, and 99.3% name recall (95% CI: 0.0063). Overall name coverage of our fractionalized counting datasets is 99.98%.
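For concreteness, a minimal sketch of this evaluation arithmetic follows, assuming a normal-approximation (Wald) interval for proportions; the confidence-interval method used in our evaluation is not specified here, so the helper names and numbers are illustrative only.

```python
# Minimal sketch of precision/recall and a proportion confidence interval.
# Assumes a normal-approximation (Wald) 95% CI; illustrative inputs only.
import math

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for a proportion p over n samples."""
    return z * math.sqrt(p * (1 - p) / n)

print(precision_recall(tp=980, fp=20, fn=30))   # (0.98, ~0.970)
print(round(ci_halfwidth(0.98, 4600), 4))       # ~0.004 with these inputs
```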
2.2 Modeling Gender, Sexual Orientation, and Race
Our model quantifies three categories of gendering by directly matching on gender references found in LM-generated text (Table S6a): non-binary (NB), feminized (F), and masculinized (M). For prompts specific to romantic relationships, these correspond to six relationship pairs implying various sexual orientations (NB-NB, NB-F, NB-M, F-F, M-M, F-M). Our model quantifies seven categories of racialization corresponding to the latest OMB-proposed Census categories [53]: American Indian or Alaska Native (AI/AN), Native Hawaiian or Pacific Islander (NH/PI), Middle Eastern or North African (MENA), Hispanic or Latino (we adopt Latine as a gender-neutral label), Asian, African-American or Black, and White.
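A minimal sketch of collapsing two per-character gender labels into one of the six unordered relationship pairs is shown below; the function name and canonical ordering are illustrative assumptions, not our implementation.

```python
# Hypothetical sketch: collapse two per-character gender labels into one of
# the six unordered relationship pairs (NB-NB, NB-F, NB-M, F-F, M-M, F-M).
ORDER = {"NB": 0, "F": 1, "M": 2}  # canonical ordering, an illustrative choice

def relationship_pair(gender_a: str, gender_b: str) -> str:
    """Return an order-independent label for a romantic pair."""
    first, second = sorted((gender_a, gender_b), key=ORDER.get)
    return f"{first}-{second}"

print(relationship_pair("M", "F"))   # F-M
print(relationship_pair("NB", "M"))  # NB-M
print(relationship_pair("F", "F"))   # F-F
```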
We model race using first name, as the majority (90.9%) of LM responses to our prompts refer to individuals by first name only. While first names do not correspond to racial categories in a mutually exclusive manner (for example, "Joy" may depict individuals of any race), they still carry perceived racial signal, as shown by bias studies across multiple settings [3, 17, 19, 28, 29, 54]. We adopt the fractionalized counting approach described in Kozlowski et al. [55], which has been shown to outperform single-category modeling in reducing racial over- and under-counting biases. Following this method, we associate each first name with a categorical distribution across races, based on datasets of named individuals who provide self-identified race, as Equation 1 shows below.
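As a sketch of the fractionalized count (with assumed notation that may differ from the published Equation 1), the fraction of name n attributed to race r can be written as

\[
P(\mathrm{race} = r \mid \mathrm{name} = n) \;=\; \frac{C(n, r)}{\sum_{r'} C(n, r')},
\]

where C(n, r) denotes the number of individuals in the reference data with first name n who self-identify as race r; each occurrence of name n in a generated text then contributes the fraction P(r | n) to the count for race r, rather than a whole count to a single race.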
We are unable to use U.S. Census data directly, as the Census releases only surname information. We therefore base our fractional counting on two complementary datasets in which first names are present. The first dataset we leverage is open-sourced Florida Voter Registration Data from 2017 and 2022 [56], which contains names and self-identified races for 27,420,716 people comprising 447,170 unique first names. Of the seven racial categories in the latest OMB-proposed Census [53], the Florida Voter Registration Data contains five: White, Hispanic or Latino, Black, Asian Pacific Islander (API), and American Indian or Alaska Native (AI/AN). While any non-Census dataset is an approximation of racial categories (indeed, the Census itself approximates the general population), we find this dataset to be the most appropriate publicly available option among the comparison datasets we found in which a large number of named individuals self-report race [56, 86, 87]. First, it allows us to model a greater number of race/ethnicity categories than some more recent datasets; for example, [87] leverages voter registration data from six states but categorically omits AI/AN as a label by aggregating it under "Other". Second, we find that the sampling bias introduced by the data collection process of voting (e.g., through voting restrictions) is lower than the comparable sampling bias introduced by other collection methods such as mortgage applications [86], which systematically under-represent Black and Latine individuals. Of all comparison datasets we evaluated, [56] most closely approximates the racial composition of the U.S. Census, deviating by no more than 4.57 percentage points for any racial group (with the largest gap due to representing White individuals at 63.87% compared to 2021 Census levels of 59.30%). By contrast, [86] overcounts White individuals with a representation of 82.33% (a deviation of +23.03 points) while undercounting Black individuals with a representation of 4.20% (a deviation of -9.32 points).
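A small sketch of the deviation arithmetic behind these comparisons is shown below; only the shares quoted above are used (the Census Black share is inferred from the quoted deviation), and the helper name is illustrative.

```python
# Sketch of the composition-deviation arithmetic described above.
# Shares are percentages quoted in the text; the Black Census share is
# inferred from the quoted -9.32 point deviation. Helper name is illustrative.
CENSUS_2021 = {"White": 59.30, "Black": 13.52}
FL_VOTER    = {"White": 63.87}            # dataset [56]
MORTGAGE    = {"White": 82.33, "Black": 4.20}  # dataset [86]

def deviation(dataset_share: float, census_share: float) -> float:
    """Deviation from Census composition, in percentage points."""
    return round(dataset_share - census_share, 2)

print(deviation(FL_VOTER["White"], CENSUS_2021["White"]))  # 4.57
print(deviation(MORTGAGE["White"], CENSUS_2021["White"]))  # 23.03
print(deviation(MORTGAGE["Black"], CENSUS_2021["Black"]))  # -9.32
```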
These datasets have several limitations, particularly where self-reported data are unavailable. First, we acknowledge that countries of origin are not equivalent to racial identities, and that broad race categorizations can obscure the identities of meaningful sub-groups. For example, the exclusion of country-of-origin identities (e.g., Chinese, Indian, Nigerian) and the omission (via aggregation) of individuals identifying as MENA or NH/PI into categories such as "White" or "Asian/Pacific Islander", respectively, masks their marginalization within these categories. These limitations remain a persistent issue within widely adopted data collection methods for race and/or ethnicity, including the U.S. Census (which only proposed adding MENA as a race in 2023). We observed this shortcoming in all comparison datasets we considered that pair first names with self-reported race for a large number of individuals [56, 86, 87]. Therefore, in the absence of self-reported race information, we identified an additional data source to approximate racial likelihood for MENA and NH/PI. We build on the approach developed in [57], which uses data on named individuals from Wikipedia to analyze disparities in academic honorees by country of origin. Our approach leverages OMB's proposed hierarchical race and ethnicity classifications to approximate race for the two missing categories by mapping existing country lists for both racial groups to Wikipedia's country taxonomy. For MENA, we build upon OMB's country list [53], which is based on a study of MENA-identifying community members [88]. For NH/PI, we build upon public health guides for Asian American individuals intended for disaggregating Pacific Islanders from "API" [89]. The full list of countries we use is provided in Table S6b.
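To make the mapping step concrete, the sketch below assigns country-of-origin labels to the two categories missing from the voter data. The country sets are abbreviated, illustrative excerpts only (the full lists appear in Table S6b), and the function is not our implementation.

```python
# Hypothetical sketch of mapping country-of-origin labels (e.g., from Wikipedia
# biographies) to the two categories missing from the voter data. The country
# sets are abbreviated illustrations; the full lists appear in Table S6b.
MENA_COUNTRIES = {"Egypt", "Morocco", "Lebanon", "Iran", "Israel", "Syria"}
NHPI_COUNTRIES = {"Samoa", "Tonga", "Fiji", "Marshall Islands"}

def country_to_race(country: str) -> str | None:
    """Approximate MENA / NH/PI membership from a country-of-origin label."""
    if country in MENA_COUNTRIES:
        return "MENA"
    if country in NHPI_COUNTRIES:
        return "NH/PI"
    return None  # fall back to the voter-registration-based estimates

print(country_to_race("Lebanon"))  # MENA
print(country_to_race("Fiji"))     # NH/PI
```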
In both datasets we use, the methods of creation and collection themselves skew racial composition, due to factors such as voting restrictions and the demographic bias of Wikipedia editors [90]. Wikipedia is likely to over-represent Anglicized names and under-represent MENA and NH/PI names. We would therefore expect names extracted for these categories to yield aggregate results in our study that are more similar to those for White names than to those for other minoritized races. However, our study finds the opposite to be true. Despite this Western bias, we show that language models nevertheless generate synthetic texts that under-represent names approximated from MENA and NH/PI countries in power-neutral portrayals, and subordinate these names when power dynamics are introduced, similar to other minoritized races, genders, and sexual orientations. For technical details and replication, see Supplement B, Tables S7-S9.