Check out this article by Jan Fillies and Adrian Paschke, which investigates how emerging slurs in youth language introduce bias into BERT-based hate speech models, ultimately affecting their accuracy and fairness.
The research focuses on identifying and mitigating the bias that emerging youth language introduces into hate speech detection models. The study measures how new slurs affect model performance and whether existing classifiers can fairly and accurately detect hateful language used by young people. To do this, the researchers develop a new framework to detect and quantify youth language bias in hate speech classifiers. They create three distinct test sets: one featuring newly identified youth slang, another containing established slurs, and a third built from neutral words to expose model overfitting. These test sets allow them to compare how well different models handle both familiar and emerging hate speech terms.
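To make the evaluation setup concrete, here is a minimal sketch of how a classifier could be scored separately on the three kinds of test sets. The example sentences, the model checkpoint path, and the label mapping are illustrative assumptions, not the authors' actual data or code.

```python
# Minimal sketch: score one hate speech classifier on three test sets
# (emerging youth slang, established slurs, neutral control words).
# All sentences below are placeholders; the paper's test sets are not reproduced here.
from transformers import pipeline

# Hypothetical labelled examples: (text, 1 = hateful, 0 = not hateful).
test_sets = {
    "emerging_slang": [
        ("<post containing a newly identified youth slur>", 1),
        ("<harmless post using youth slang>", 0),
    ],
    "established_slurs": [
        ("<post containing a well-known slur>", 1),
        ("<harmless post without slurs>", 0),
    ],
    "neutral_control": [
        ("<post built around a neutral word>", 0),
        ("<another neutral post>", 0),
    ],
}

# Substitute the checkpoint you want to audit; the path here is a placeholder.
classifier = pipeline("text-classification", model="path/to/hate-speech-checkpoint")

def accuracy(examples):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = 0
    for text, gold in examples:
        pred = classifier(text)[0]["label"]
        # Label names are model-specific; adjust this mapping to the checkpoint used.
        predicted_hateful = pred.upper() in {"HATE", "HATEFUL", "TOXIC", "LABEL_1"}
        correct += int(predicted_hateful == bool(gold))
    return correct / len(examples)

for name, examples in test_sets.items():
    print(f"{name}: accuracy = {accuracy(examples):.3f}")
```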
Four BERT-based hate speech classifiers, three from academic research and one commercial model (Google Jigsaw’s Perspective API), are evaluated with this framework. The study introduces a new Youth Language Bias Score (YLS) to quantify how well these models handle emerging youth language slurs. The results show that all tested models struggle with new youth slang, performing significantly worse on the test set of emerging slurs than on the set of well-known slurs. The IMSyPP model stands out as the most effective, achieving the highest accuracy on both emerging and established slurs while maintaining the lowest bias score. To address the problem, the researchers fine-tune the IMSyPP model on a dataset that includes youth language, improving its accuracy on emerging slurs from 41.9% to 68.9%. This demonstrates that fine-tuning models with new linguistic data can substantially reduce bias and improve performance.
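As a rough illustration of that adaptation step, the sketch below continues training an existing BERT-based hate speech classifier on a handful of youth-language examples using the Hugging Face Trainer. The checkpoint placeholder, example texts, label scheme, and hyperparameters are assumptions made for illustration, not the authors' exact data, model, or training setup, and the YLS formula itself is not reproduced here.

```python
# Hedged sketch: continue training an existing hate speech classifier on
# additional youth-language data. Checkpoint, examples, and hyperparameters
# are illustrative placeholders, not the paper's setup.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/base-hate-speech-checkpoint"  # placeholder for the model being adapted
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Hypothetical youth-language examples: 1 = hateful, 0 = not hateful.
train_data = Dataset.from_dict({
    "text": [
        "<post using an emerging youth slur in a hateful way>",
        "<harmless post that happens to use youth slang>",
    ],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="youth-language-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

# After training, the adapted model would be re-scored on the emerging-slang
# and established-slur test sets to check whether the performance gap has narrowed.
Trainer(model=model, args=args, train_dataset=train_data).train()
```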
One of the study’s key takeaways is that overfitting remains a major challenge in hate speech detection. Some models, such as the R4 Target model, perform well on known slurs but struggle when faced with new ones, which suggests they learn specific sentence patterns rather than the meaning of the words themselves. The findings carry important policy and research implications. AI-driven moderation tools used by online platforms need to be updated with youth-specific datasets to reduce bias and ensure that harmful content does not go undetected. Ethical AI development requires continuous adaptation to evolving language trends, and further research is needed to determine whether similar biases exist in larger language models such as GPT-4 or LLaMA 3.
This research underscores the importance of adapting AI-driven moderation tools to dynamic linguistic changes, ensuring that online spaces remain safe for all users, especially young people.