Abstract:
People use informal language on microblog platforms to share their opinions on
products, events, sports, or politics. These platforms also harbor hate speech
and cyberbullying. Together, this activity produces a massive amount of data
for natural language processing (NLP) applications. Most studies have focused
on widely spoken languages such as English for tasks such as hate speech
detection, sentiment analysis, and emotion analysis. Dialectal Arabic presents
additional challenges due to its morphological richness and complexity, which
make NLP applications harder to build.
While recent research has explored Arabic dialects and Arabizi, Lebanese
Arabizi has received limited attention. To address this gap, our objective
was to construct a substantial Lebanese Arabizi dataset and make it accessible
for NLP research. Additionally, we sought to develop a new approach to Arabizi
detection and explored sarcasm detection and emotion recognition.
The dataset comprised 11,000 rows, combining Instagram comments and tweets.
We used a pre-trained DziriBERT model for Arabizi identification and sarcasm
detection, and compared the performance of contextual embedding and semantic
embedding models. The word embeddings were then fed into a Bidirectional Long
Short-Term Memory (BiLSTM) model for emotion recognition.
The Arabizi identification model achieved a macro F1 score of 98%, while the
sarcasm detection model achieved an average macro F1 score of 63%. The Arabizi
detection model not only contributes to expanding the Arabizi dataset but also
holds potential for broader applications. Sarcasm detection is crucial for
microblog platforms to filter content, particularly since such filtering
currently relies heavily on the manual reporting of offensive material.
Additionally, emotion recognition assists companies in understanding
customers’ opinions about their products and services.
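
For readers unfamiliar with the metric, the macro F1 score reported above is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many examples it has. The following is a minimal sketch of that computation in plain Python. The function name `macro_f1` and the toy labels are illustrative assumptions, not the evaluation code or data used in this work.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores.

    Hypothetical helper for illustration only; labels are taken from the
    union of the true and predicted label sets.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels (not taken from the dataset)
y_true = ["arabizi", "arabizi", "other", "other"]
y_pred = ["arabizi", "other", "other", "other"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because each class contributes equally, a model that performs poorly on a minority class (such as sarcastic posts) is penalized more heavily under macro F1 than under plain accuracy.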