Check out this new article by Do-Kyung Kim and colleagues on a smart way to make hate speech detection models better—especially when the hate is subtle and hard to spot.
The paper tackles a key issue in hate speech detection: implicit hate. This kind of speech doesn’t rely on slurs or obvious abuse but hides in jokes, metaphors, and coded language. It’s often ambiguous and context-dependent, which makes it hard for both people and machines to detect—and existing models don’t generalise well to new datasets. The authors aim to improve generalisability by cleaning up the training data using a new strategy they call CONELA.
So what’s the method? The team combines two ideas: training dynamics (how easy or hard each training example is for a model to learn) and human annotation agreement (whether annotators agree on the label). They split the data into three groups based on training dynamics: easy to learn, ambiguous, and hard to learn. Each group is then divided into consensual and non-consensual examples, depending on whether the human annotators agreed. Finally, they remove the non-consensual examples from the easy and hard groups, keeping every ambiguous example along with the easy and hard examples that annotators agree on.
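To make the filtering step concrete, here is a minimal Python sketch of how such a pass might look, assuming per-example training-dynamics statistics (mean confidence in the gold label across epochs and its variability, in the spirit of dataset cartography) and an annotator-agreement score have already been computed. The thresholds and variable names are illustrative placeholders, not values from the paper.

```python
import numpy as np

def conela_style_filter(confidence, variability, agreement,
                        conf_hi=0.75, conf_lo=0.25,
                        var_thresh=0.2, agree_thresh=0.8):
    """Illustrative CONELA-style filtering; all thresholds are hypothetical.

    confidence:  mean model confidence in the gold label across training epochs
    variability: standard deviation of that confidence across epochs
    agreement:   fraction of human annotators agreeing with the gold label
    Returns a boolean mask over examples to keep.
    """
    confidence = np.asarray(confidence)
    variability = np.asarray(variability)
    agreement = np.asarray(agreement)

    easy = (confidence >= conf_hi) & (variability < var_thresh)   # easy to learn
    hard = (confidence <= conf_lo) & (variability < var_thresh)   # hard to learn
    ambiguous = ~(easy | hard)                                    # everything else
    consensual = agreement >= agree_thresh                        # annotators agree

    # Keep all ambiguous examples; keep easy/hard examples only when
    # annotators agreed on the label.
    return ambiguous | ((easy | hard) & consensual)

# Toy example: the 2nd and 3rd examples are easy/hard but non-consensual, so dropped.
mask = conela_style_filter(confidence=[0.9, 0.9, 0.1, 0.5],
                           variability=[0.05, 0.05, 0.04, 0.30],
                           agreement=[1.0, 0.4, 0.3, 0.2])
print(mask)  # [ True False False  True]
```

In practice, the confidence and variability statistics would be logged during an initial training run on the full dataset, and the filtered set would then be used to retrain the model.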
This refined dataset, they argue, helps the model focus on useful patterns instead of noisy or misleading data. They test this approach using well-known datasets like SBIC, OLID, and ETHOS for training, and DYNAHATE and ToxiGen for out-of-domain testing. They run experiments across multiple models, including BERT, RoBERTa, HateBERT, and GPT-4 variants.
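Since the headline claim is about out-of-domain generalisation, the evaluation protocol is worth spelling out: train once on a (filtered) source dataset, then report macro-F1 both on held-out in-domain data and on entirely different datasets. The sketch below uses a deliberately simple TF-IDF plus logistic-regression stand-in rather than the transformer models in the paper, just to show the shape of that protocol; the data here is placeholder text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def cross_dataset_eval(train_texts, train_labels, eval_sets):
    """Train once on the (filtered) source dataset, then report macro-F1
    on each evaluation set, whether in-domain or out-of-domain."""
    vectorizer = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(train_texts), train_labels)

    scores = {}
    for name, (texts, labels) in eval_sets.items():
        preds = clf.predict(vectorizer.transform(texts))
        scores[name] = f1_score(labels, preds, average="macro")
    return scores

# Placeholder data standing in for, e.g., ETHOS (train) vs. DYNAHATE (OOD test).
train_texts = ["example hateful post", "example benign post"] * 10
train_labels = [1, 0] * 10
eval_sets = {
    "in-domain": (["another benign post", "another hateful post"], [0, 1]),
    "out-of-domain": (["ood benign text", "ood hateful text"], [0, 1]),
}
print(cross_dataset_eval(train_texts, train_labels, eval_sets))
```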
The results are strong: CONELA consistently improves F1 scores across models and datasets, especially in out-of-domain scenarios. For example, models trained with CONELA on ETHOS scored up to 12.88 percentage points higher than the baseline when tested on OLID. It also holds up well against large language models (LLMs) like GPT-4—traditional models trained with CONELA sometimes outperform them.
One challenge they address is data scarcity, especially with smaller datasets like ETHOS. To deal with this, they introduce a weighted loss function that maximises disagreement in ensemble models: encouraging the ensemble members to make diverse predictions pushes the model towards more robust features, even when there is less data to learn from.
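Here is a rough PyTorch sketch of what such an objective could look like: each ensemble member contributes a standard cross-entropy term, and a second term rewards disagreement, measured here as the average pairwise KL divergence between member predictions. The disagreement measure and the weighting are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ensemble_loss_with_disagreement(logits_list, labels, disagreement_weight=0.1):
    """Combine per-member cross-entropy with a term that rewards prediction
    diversity across ensemble members (weighting scheme is illustrative).

    logits_list: list of [batch, num_classes] tensors, one per ensemble member
    labels:      [batch] tensor of gold labels
    """
    # Standard supervised loss, averaged over ensemble members.
    ce = torch.stack([F.cross_entropy(logits, labels) for logits in logits_list]).mean()

    # Disagreement term: average pairwise KL divergence between the members'
    # predictive distributions.
    probs = [F.softmax(logits, dim=-1) for logits in logits_list]
    log_probs = [F.log_softmax(logits, dim=-1) for logits in logits_list]
    pairwise_kl = []
    for i in range(len(probs)):
        for j in range(len(probs)):
            if i != j:
                pairwise_kl.append(F.kl_div(log_probs[i], probs[j], reduction="batchmean"))
    disagreement = torch.stack(pairwise_kl).mean()

    # Subtracting the disagreement term means minimising the loss pushes
    # members apart while still fitting the labels.
    return ce - disagreement_weight * disagreement

# Toy usage with two ensemble members on a binary task.
logits_a = torch.randn(4, 2)
logits_b = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = ensemble_loss_with_disagreement([logits_a, logits_b], labels)
```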
Policy-wise, this matters because content moderation systems need to work well in real-world, unpredictable scenarios. Training on cleaner, more reliable data—while preserving the ambiguity that reflects real-world hate speech—could help models detect hate more accurately across different platforms, languages, and cultures.
For researchers, the big takeaway is that refining datasets using both machine learning insights and human perspectives improves model generalisation. The code is open source too: github.com/kdkcode/CONELA.