Sentiment analysis for Arabizi in social media. (c2015)

Tobaili, Taha

URI: http://hdl.handle.net/10725/2702

DOI: https://doi.org/10.26756/th.2015.27

Date: 2016-02-02

Abstract:

With the vast increase in social media users over the past few years, millions of product reviews are posted and discussed in online forums and on social media platforms such as Facebook and Twitter. Sentiment analysis and opinion mining have many applications: governments and stock market observers use social media data to study public opinion and predict election results or stock fluctuations, and companies use it to collect feedback on their product releases. Filling out rating surveys is no longer efficient when a free, growing database of public opinion is available. It is therefore natural to use the textual data of social media to build automated software that predicts public sentiment; the challenge, however, lies in analyzing informal languages. Most sentiment analysis research is currently conducted on formal English, and one major challenge is applying sentiment analysis techniques to other languages. With approximately four million tweets posted daily in several dialects of Arabizi, an informal Arabic in which sentences are written using English letters and numerals (e.g. Yalla 7abibi), it is very useful to have a data mining tool that can analyze the sentiment of Twitter users in the Arab world. We took the initiative to make use of this abundance of data by analyzing it and predicting sentiment. Applying the sentiment analysis techniques used for English to Arabic is not a simple task, because the two languages differ semantically and structurally and because Arabic is a morphologically rich language. Informal Arabic lacks standardization and has no formal grammar, so sentiment analysis in this area is considered a complex process. Sentiment analysis for Arabic has been studied for Modern Standard Arabic (MSA) but rarely for informal Arabic and not at all for Arabizi, even though most of the youth in Lebanon text in Arabizi, claiming that it is easier than texting in Arabic.
The prevalence of this expanding linguistic trend motivated us to target this NLP challenge. In this study, we created a novel lexicon of around 10,000 informal opinion words, using regular expressions to match over 50,000 word variants. We also created an algorithm that lemmatizes Arabizi words and classifies input sentences into positive, negative, or neutral categories. We collected around 400,000 lines of Arabizi data from WhatsApp, Facebook, and Twitter, filtered them, and tested a small sample with our classifier, achieving 80% classification accuracy. The lexicon targets the Lebanese dialect, our native language.
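
The approach described above (a lexicon whose regular expressions each cover several spellings of one opinion word, a lemmatization step, and a three-way classification) can be illustrated with a minimal sketch. It is not the thesis's implementation: the five lexicon entries, their patterns and polarities, the function names lemmatize and classify, and the rule of summing polarities are all assumptions made for illustration.

```python
import re

# Hypothetical lexicon: (lemma, pattern covering spelling variants, polarity).
# The words, patterns, and polarities are illustrative assumptions only.
LEXICON = [
    ("7abibi", re.compile(r"7+a?b+i+b+[iy]*"), 1),         # 7abibi, 7bibi, 7abiiibi
    ("7elo",   re.compile(r"7+[ei]?l+[ow]+[ue]?"), 1),     # 7elo, 7ilo, 7elwe, 7eloooo
    ("mni7",   re.compile(r"m+n+i+7+a?"), 1),              # mni7, mnii7, mni7a
    ("bashe3", re.compile(r"b+a?[sc]h?e?3+a?"), -1),       # bashe3, bach3a, bshe3
    ("zift",   re.compile(r"z+i+f+t+"), -1),               # zift, ziiift
]


def lemmatize(token):
    """Return (lemma, polarity) for a token whose whole spelling matches a
    lexicon pattern, or None if the token is not an opinion word."""
    for lemma, pattern, polarity in LEXICON:
        if pattern.fullmatch(token):
            return lemma, polarity
    return None


def classify(sentence):
    """Label a sentence positive, negative, or neutral by summing the
    polarities of the opinion words it contains (an assumed scoring rule)."""
    # Arabizi mixes letters and digits, so tokens are runs of [a-z0-9].
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    score = 0
    for token in tokens:
        hit = lemmatize(token)
        if hit is not None:
            score += hit[1]
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ["Yalla 7abibi", "el film ktir bashe3", "yalla bye"]:
        print(text, "->", classify(text))
```

In this sketch a single pattern absorbs letter elongation and vowel variation, which is one way a lexicon of about 10,000 entries could cover a much larger number of surface forms. A real system would also need to handle negation, intensifiers, and words shared with English, none of which are addressed here.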