Interpretable Hate Speech Detection via Large Language Model-extracted Rationales
Description
Social media platforms have become widely used for open communication, yet a lack of moderation has allowed harmful content, including hate speech, to proliferate. Manually monitoring such vast amounts of user-generated data is impractical, which makes automated hate speech detection necessary. Pre-trained language models have shown strong base capabilities: they excel at in-distribution language modeling and also perform well at out-of-distribution language modeling, transfer learning, and few-shot learning. However, these models operate as complex function approximators that map input text to a hate speech classification without providing any insight into the reasoning behind their predictions. Existing detection methods built on them therefore often lack transparency, which limits their effectiveness, particularly in sensitive content moderation contexts. Recent efforts have sought to combine such detectors with large language models (LLMs) like ChatGPT and Llama2, which exhibit reasoning capabilities and broad knowledge utilization. This thesis explores leveraging the reasoning abilities of LLMs to enhance the interpretability of hate speech detection. A novel framework is proposed that uses state-of-the-art LLMs to extract interpretable rationales from input text, highlighting the key phrases or sentences relevant to hate speech classification. Because these rationale features are incorporated into the hate speech classifier, the framework's results are transparent and interpretable by design. This approach combines the language understanding of LLMs with the discriminative power of advanced hate speech classifiers, offering a promising solution to the challenge of interpreting automated hate speech detection models.
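The two-stage idea described above (an LLM extracts rationale phrases, which are then supplied to the classifier alongside the original text) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the thesis: the prompt wording, the `extract_rationales` and `build_classifier_input` helpers, the `[SEP]` joining convention, and the `stub_llm` function, which stands in for a real call to a model such as ChatGPT or Llama2.

```python
from typing import Callable, List

# Hypothetical prompt; the abstract does not give the actual prompt used.
PROMPT_TEMPLATE = (
    "List the phrases in the following post that are relevant to deciding "
    "whether it is hate speech, one per line. Post: {text}"
)


def extract_rationales(text: str, llm: Callable[[str], str]) -> List[str]:
    """Ask an LLM for rationale phrases; keep only those found verbatim in the text."""
    response = llm(PROMPT_TEMPLATE.format(text=text))
    candidates = [line.strip() for line in response.splitlines() if line.strip()]
    lowered = text.lower()
    # Dropping phrases that do not occur in the post guards against
    # rationales the LLM invented rather than extracted.
    return [c for c in candidates if c.lower() in lowered]


def build_classifier_input(text: str, rationales: List[str], sep: str = " [SEP] ") -> str:
    """Append the rationales to the post so a downstream classifier sees both."""
    if not rationales:
        return text
    return text + sep + "; ".join(rationales)


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): returns phrases from a tiny fixed list."""
    post = prompt.split("Post: ", 1)[1].lower()
    flagged = ["vermin", "do not belong here"]
    return "\n".join(p for p in flagged if p in post)


post = "Those people are vermin and do not belong here."
rationales = extract_rationales(post, stub_llm)
augmented = build_classifier_input(post, rationales)
print(rationales)
print(augmented)
```

In this sketch the extracted phrases double as the explanation shown to a moderator and as additional input to the classifier, which is one simple way to make the prediction traceable to specific parts of the post. How the thesis actually encodes and fuses rationale features may differ.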