Abstract:
With the rapid growth in the number of social media users over the past few years, millions of product
reviews are posted and discussed in online forums and on social media such as Facebook
and Twitter. There are many applications for sentiment analysis and opinion mining in
which governments or stock market observers use social media data to study the opinion
of the public and predict election results or stock fluctuations. This is also useful for
companies to collect feedback on their product releases. Filling out rating surveys is no
longer efficient when a free, growing database of public opinion is available. It is
therefore natural to use the textual data of social media to build automated
software that predicts public sentiment; however, the challenge lies in
analyzing informal language. Most sentiment analysis research is currently
conducted on formal English, and one major challenge is applying sentiment
analysis techniques to other languages. With approximately four million tweets posted
daily in several dialects of Arabizi, an informal form of Arabic in which sentences are written using
English letters and numerals (e.g., Yalla 7abibi), it is very useful to have a data mining tool that
can analyze the sentiment of Twitter users in the Arab world. We took the initiative to
make use of this abundance of data by analyzing it and predicting sentiment. Applying
the sentiment analysis techniques used for English to Arabic is not a
simple task because of the semantic and structural differences between the two languages,
and because Arabic is a morphologically rich language. Informal Arabic lacks
standardization and has no codified grammar, so sentiment analysis in this area is
considered a complex process. Sentiment analysis has been studied for Modern Standard
Arabic (MSA) but rarely for informal Arabic, and not at all for Arabizi, even though most
young people in Lebanon
text in Arabizi, claiming that it is easier than texting in Arabic. The prevalence of this
expanding linguistic trend motivated us to target this NLP challenge. In this study, we
created a novel lexicon of around 10,000 informal opinion words, using regular
expressions to match over 50,000 words. We also created an algorithm that lemmatizes
Arabizi words and classifies input sentences into positive, negative, or neutral
categories. We collected around 400,000 lines of Arabizi data from WhatsApp,
Facebook, and Twitter, filtered them, and tested our classifier on a small sample,
achieving 80% classification accuracy. The dialect chosen for the lexicon is Lebanese,
our native dialect.
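
As a rough illustration of how a regular-expression lexicon can absorb the spelling variation of Arabizi, the following minimal Python sketch scores a sentence by summing the polarities of the words it matches. It is not the authors' implementation: the four patterns, the names LEXICON, word_polarity, and classify, and the simple sum-of-polarities rule are all assumptions made for this example, and the lemmatization step described above is omitted.

```python
import re

# Hypothetical lexicon entries (not taken from the authors' lexicon).
# Each regular expression covers several spellings of one opinion word,
# e.g. repeated letters or interchangeable vowels.
LEXICON = [
    (re.compile(r"7+[ei]l+[ou]+"), 1),      # 7elo, 7ilo, 7eloooo
    (re.compile(r"mn[ei]+7+"), 1),          # mni7, mne7
    (re.compile(r"b[ae]?sh[ei]+3+"), -1),   # beshe3, bashe3
    (re.compile(r"3a+t+[ei]l+"), -1),       # 3atel, 3aatil
]

# Arabizi tokens mix Latin letters and digits, so keep both.
TOKEN = re.compile(r"[a-z0-9]+")


def word_polarity(word):
    """Return +1, -1, or 0 for a single lowercased token."""
    for pattern, polarity in LEXICON:
        if pattern.fullmatch(word):
            return polarity
    return 0


def classify(sentence):
    """Label a sentence by the sum of its word polarities (assumed rule)."""
    tokens = TOKEN.findall(sentence.lower())
    score = sum(word_polarity(token) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["el film 7eloooo", "el akl beshe3", "Yalla 7abibi"]:
        print(text, "->", classify(text))
```

With only four patterns, the sketch already accepts many distinct spellings, which is the sense in which a lexicon of a given size can match a much larger set of surface words. A sentence with no matched opinion word, such as Yalla 7abibi here, falls into the neutral category.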