ABSTRACT

Language models are now widely used for a variety of tasks, including open-ended generation and writing assistance. However, generated texts can encapsulate biases and harm users. A variety of articles aim at detecting, measuring and mitigating stereotypical biases, but focus mainly on English and on pre-training tasks. Thus, we propose a framework to automatically measure gender biases generated by language models in inflected languages, in a practical setting. Herein, we report experiments using this framework on seven autoregressive language models used to generate more than 52,000 cover letters in French, addressing 203 industries and sectors, and over 4,100 cover letters in Italian, covering 55 sectors. Associations between occupation and gender are studied using a system that we introduce to automatically identify morpho-syntactic gender markers in text. Results suggest that all models are strongly biased towards generating texts containing masculine gender markers. Overall, generated texts contain twice as many masculine (vs. feminine) markers in French, and eight times as many in Italian. Models also exacerbate gender stereotypes that are evidenced in social science studies: they associate feminine inflections with occupations related to care, children and physical appearance, whereas occupations that require physical, technical and manual skills are strongly associated with masculine markers.

1 INTRODUCTION

In the past few years, pretrained Large Language Models (LLMs) have become the go-to approach for most Natural Language Processing (NLP) tasks such as text classification, named entity recognition, or machine translation [1–3], as well as for general public use. Nonetheless, LLMs exhibit and amplify stereotypical biases [4–6] that can be difficult to detect and assess.
Stereotypical biases are “skewed and undesirable association[s] in language representations which ha[ve] the potential to cause representational or allocational harms” [7], that are based on stereotypes, i.e. “beliefs about the characteristics, attributes and behaviors of members of certain groups” [8]. This study focuses on gender stereotypes, and henceforth we use the term bias to refer to gender-based stereotypical bias. More specifically, we aim to assess the impact of gender biases on a common application of generative LLMs in a grounded, professional use case: assistance with writing a cover letter [9, 10]. Gender segregation in the workplace has been documented for decades in various socio-cultural contexts [11, 12]. Correlations between mental representations, stereotypes and gender associations have also been established, as has the role that language can play in the dissemination of such limiting representations [13, 14]. In parallel, it has been shown that humans “inherit artificial intelligence biases” [15]. Therefore, it is important to detect and evaluate the presence of biases in LLMs to prevent them from perpetuating and amplifying discrimination. While bias studies are receiving increasing attention, most efforts focus on US-centric biases and on NLP models targeting English. Moreover, Talat et al. [16] highlight the lack of bias evaluation in downstream tasks, close to real use cases of NLP.

In this work, we propose a framework to automatically generate, detect and quantify binary gender biases in cover letters produced by different LLMs. This study focuses on binary gender biases, which can be addressed systematically by leveraging gender markers in inflected languages, rather than relying on lists of semantic clues [17]. We replicate a scenario close to real use cases, to assess biases that users encounter in a realistic setting.
Moreover, we study two languages other than English, namely French and Italian, using gender inflections. We then draw on sociological studies to relate the results of our analysis to real-world stereotypes. The contributions of this work are:

1. A framework to uncover gender biases in inflected languages, based on morpho-syntactic clues and a realistic use case;
2. A freely available automatic gender marker detection system for French and Italian;
3. A study of biases in 7 LLMs using the proposed framework and social science studies.
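To make the marker-counting idea behind contribution 2 concrete, the following is a minimal sketch of how masculine and feminine morpho-syntactic markers could be tallied in generated French text. The lexicon below is a tiny, hypothetical stand-in for illustration only: the actual system described in the paper performs full morpho-syntactic analysis and covers far more forms (agreement on determiners, pronouns, adjectives and occupation nouns).

```python
import re
from collections import Counter

# Illustrative lexicon of French gender markers (hypothetical, not the
# authors' resource): determiners, pronouns, a gendered occupation noun,
# and an adjective with gender agreement.
MASCULINE = {"un", "le", "il", "directeur", "infirmier", "motivé"}
FEMININE = {"une", "la", "elle", "directrice", "infirmière", "motivée"}

def count_gender_markers(text: str) -> Counter:
    """Count masculine vs. feminine gender markers in a text.

    Tokenizes on word characters (Unicode-aware, so accented forms like
    'infirmière' are kept whole) and looks each token up in the lexicon.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for tok in tokens:
        if tok in MASCULINE:
            counts["masc"] += 1
        elif tok in FEMININE:
            counts["fem"] += 1
    return counts

letter = "Je suis une infirmière motivée et elle est directrice."
print(count_gender_markers(letter))  # Counter({'fem': 5})
```

Aggregating such counts over all letters generated for a given occupation yields the masculine-to-feminine ratios of the kind reported in the abstract (e.g. twice as many masculine markers in French overall).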