Preprocessing steps for English-Arabic translation. (c2007)

Salameh, Mohammad

dc.contributor.author	Salameh, Mohammad
dc.date.accessioned	2011-10-20T06:25:38Z
dc.date.available	2011-10-20T06:25:38Z
dc.date.copyright	2007	en_US
dc.date.issued	2011-10-20
dc.date.submitted	2007-11-30
dc.identifier.uri	http://hdl.handle.net/10725/826
dc.description	Includes bibliographical references (leaves 84-86).	en_US
dc.description.abstract	A parallel corpus is an essential resource in any statistical machine translation system. It constitutes the basic building block for training the system and creating the translation and language models that act as the knowledge base for translation. In this thesis, we present preprocessing steps for English-Arabic translation. These steps help improve word alignment in the machine learning phase of statistical machine translation. The aim is to increase the frequency of Arabic words in the text and to reduce the number of words in English and Arabic sentences by splitting them. The preprocessing steps include removing diacritics and other unnecessary characters from the Arabic text, separating prefixes and suffixes from Arabic words, and splitting English-Arabic sentence pairs according to predetermined stopwords. We apply our technique to a parallel corpus taken from United Nations documents. Our results show that it is essential to preprocess English-Arabic text. We obtained an error rate of 7.7% when splitting sentences on stopwords and around 5% when splitting on the comma.	en_US
dc.language.iso	en	en_US
dc.subject	English language -- Machine translating	en_US
dc.subject	Arabic language -- Machine translating	en_US
dc.subject	Machine translating	en_US
dc.title	Preprocessing steps for English-Arabic translation. (c2007)	en_US
dc.type	Thesis	en_US
dc.term.submitted	Fall	en_US
dc.author.degree	MS in Computer Science	en_US
dc.author.school	Arts and Sciences	en_US
dc.author.idnumber	200103921	en_US
dc.author.commembers	Dr. Rached Zantout
dc.author.commembers	Dr. Faisal Abukhzam
dc.author.commembers	Dr. Lama Hamandi
dc.author.woa	OA	en_US
dc.description.physdesc	1 bound copy: 86 leaves; 30 cm. available at RNL.	en_US
dc.author.division	Computer Science	en_US
dc.author.advisor	Dr. Nashaat Mansour
dc.keywords	Word alignment	en_US
dc.keywords	Sentence alignment	en_US
dc.keywords	Parallel corpora	en_US
dc.keywords	Statistical machine translation	en_US
dc.identifier.doi	https://doi.org/10.26756/th.2007.26	en_US
dc.publisher.institution	Lebanese American University	en_US
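
Illustrative note: the abstract mentions removing diacritics from Arabic text and splitting English-Arabic sentence pairs on the comma. The following is a minimal sketch of what those two steps might look like, not the thesis's actual implementation. The function names (`strip_diacritics`, `split_pair_on_comma`), the choice of Unicode ranges, and the fallback rule for mismatched segment counts are all assumptions made here for illustration.

```python
import re

# Assumed character set: Arabic short-vowel and related marks (U+064B..U+0652),
# superscript alef (U+0670), and tatweel/kashida (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Remove the diacritic and elongation characters listed above."""
    return _DIACRITICS.sub("", text)


def split_pair_on_comma(english: str, arabic: str) -> list[tuple[str, str]]:
    """Split a sentence pair into shorter aligned pairs at commas.

    Hypothetical rule: if both sides yield the same number of segments,
    pair them in order; otherwise keep the original pair unsplit.
    """
    en_parts = [p.strip() for p in english.split(",") if p.strip()]
    # Arabic text may use the Arabic comma (U+060C) or the ASCII comma.
    ar_parts = [p.strip() for p in re.split("[,\u060C]", arabic) if p.strip()]
    if len(en_parts) != len(ar_parts):
        return [(english.strip(), arabic.strip())]
    return list(zip(en_parts, ar_parts))
```

Keeping a pair unsplit when the segment counts differ is a conservative choice in this sketch: a wrong split would feed misaligned pairs to the word aligner, which is the kind of error the reported splitting error rates measure.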