
LLMs generally behave according to one-dimensional values: the dominant values in an alignment dataset are learned by the fine-tuned model, and these values are very often dictated by those who develop the model and by the data available to them. To overcome this issue, this project aims to create a pluralistic alignment dataset and fine-tune a series of models that behave according to the values of different social demographics. Taking gender as a case study, we collect human preferences from diverse social demographics in the US and Germany through participatory feedback. By asking diverse stakeholders, particularly marginalised and underrepresented communities, about different values in the context of gender, such as helpfulness, harmlessness, emotional empowerment, and sensitivity, we collect diverse perspectives that capture cultural nuances and varied experiences.
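To make the shape of such a dataset concrete, the sketch below shows one possible annotation record combining a preference judgement, per-value ratings, and the rater's demographic context. The field names and structure are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceRecord:
    """One participatory annotation: a prompt, two candidate responses,
    the participant's choice, and the demographic context of the rater.
    Field names are illustrative, not the project's actual schema."""
    prompt: str                  # gender-related scenario shown to the participant
    response_a: str              # candidate model response A
    response_b: str              # candidate model response B
    preferred: str               # "a" or "b"
    value_ratings: dict = field(default_factory=dict)  # e.g. {"helpfulness": 4, "harmlessness": 5, "sensitivity": 3}
    country: str = ""            # "US" or "DE"
    age_band: str = ""           # e.g. "51-60"
    gender_identity: str = ""    # e.g. "female", "male", "non-binary"
    ethnicity: str = ""
    political_leaning: str = ""  # e.g. "liberal", "conservative"
```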

We use these diverse preferences in two ways. First, we analyse them to uncover underlying associations and perceptions about values in gender across different social demographics, highlighting the divergence of attitudes across stakeholders. For example, participants from Germany were less likely to perceive responses as helpful than those from the United States. Age was also influential, with older participants (aged 51–60) being less likely to rate responses as helpful than younger participants. Gender identity had a notable impact: participants identifying as male had lower odds of perceiving responses as toxic or biased than those identifying as female. Moreover, individuals identifying as rather conservative were less likely to rate responses as stereotypically biased than those identifying as liberal.
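Associations of this kind (differences in the odds of a rating across demographic groups) can be estimated, for example, with a logistic regression of rating outcomes on demographic predictors. The sketch below is one possible way to run such an analysis, using synthetic data and statsmodels; column names, effect sizes, and the choice of method are assumptions, not the project's actual analysis pipeline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Synthetic illustrative data: demographic covariates and a binary
# "rated helpful" outcome. Column names and values are assumptions.
df = pd.DataFrame({
    "country": rng.choice(["US", "DE"], size=n),
    "age_band": rng.choice(["18-30", "31-50", "51-60"], size=n),
    "gender": rng.choice(["female", "male", "non-binary"], size=n),
})
# Simulate lower odds of a "helpful" rating for German raters and for the 51-60 age band.
logit = 0.5 - 0.8 * (df["country"] == "DE") - 0.6 * (df["age_band"] == "51-60")
df["helpful"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression: coefficients are log-odds of a "helpful" rating
# relative to the reference level of each categorical predictor.
model = smf.logit("helpful ~ C(country) + C(age_band) + C(gender)", data=df).fit(disp=False)
print(model.summary())
```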

Second, informed by these results, we iteratively filter the alignment dataset and fine-tune LLMs that comply with the values expressed by underrepresented participants, such as non-white, female, and non-binary individuals. We show how the behaviour of these fine-tuned models diverges from that of a model fine-tuned on all the data and from that of a model fine-tuned only on data from dominant demographics. We benchmark the models on metrics such as toxicity, stereotyping, and helpfulness, and analyse the implications of the findings.
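As a minimal sketch of the filtering step, the code below splits an annotated preference dataset into per-demographic subsets, each in the (prompt, chosen, rejected) format commonly used by preference-tuning methods such as DPO or reward-model training. It assumes a Hugging Face `datasets.Dataset` with the illustrative fields from the record sketch above; it is not the project's actual pipeline.

```python
from datasets import Dataset

def to_preference_format(example):
    # Convert one annotation into the (prompt, chosen, rejected) triple
    # expected by standard preference-tuning recipes.
    chosen = example["response_a"] if example["preferred"] == "a" else example["response_b"]
    rejected = example["response_b"] if example["preferred"] == "a" else example["response_a"]
    return {"prompt": example["prompt"], "chosen": chosen, "rejected": rejected}

def subset_for(dataset: Dataset, **criteria) -> Dataset:
    """Keep only annotations from raters matching all demographic criteria,
    e.g. subset_for(ds, gender_identity="non-binary", country="DE")."""
    filtered = dataset.filter(lambda ex: all(ex[k] == v for k, v in criteria.items()))
    return filtered.map(to_preference_format, remove_columns=dataset.column_names)

# Example usage: one subset per demographic of interest, each of which could be
# used to fine-tune a separate model for comparison against the all-data model.
# splits = {
#     "all": full_dataset.map(to_preference_format, remove_columns=full_dataset.column_names),
#     "female": subset_for(full_dataset, gender_identity="female"),
#     "non_binary": subset_for(full_dataset, gender_identity="non-binary"),
# }
```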

The project contributes to Participatory AI development by actively involving diverse communities in value alignment for multilingual LLMs. By leveraging participatory feedback from individuals with varied cultural, political, gender, and demographic backgrounds, the project ensures that the development process is inclusive and directly shaped by those most negatively affected by AI harms. This approach promotes fairness and inclusivity and serves as a model for ethical AI development that is responsive to the diverse needs and values of global communities.