Preprocessing steps for English-Arabic translation. (c2007)

LAUR Repository

dc.contributor.author Salameh, Mohammad
dc.date.accessioned 2011-10-20T06:25:38Z
dc.date.available 2011-10-20T06:25:38Z
dc.date.copyright 2007 en_US
dc.date.issued 2011-10-20
dc.date.submitted 2007-11-30
dc.identifier.uri http://hdl.handle.net/10725/826
dc.description Includes bibliographical references (leaves 84-86). en_US
dc.description.abstract A parallel corpus is an essential resource in any statistical machine translation system. It is the basic building block for training the system and for creating the translation and language models that act as the knowledge base for translation. In this thesis, we present preprocessing steps for English-Arabic translation. These steps help improve word alignment in the machine learning phase of statistical machine translation. The aim is to increase the frequency of Arabic words in the text and to reduce the number of words in English and Arabic sentences by splitting them. The preprocessing steps include filtering diacritics and other unnecessary characters out of the Arabic texts, separating the prefixes and suffixes from Arabic words, and splitting English-Arabic sentence pairs according to predetermined stopwords. We apply our technique to a parallel corpus taken from United Nations documents. Our results show that it is essential to preprocess English-Arabic text. We obtained an error rate of 7.7% when splitting sentences on stopwords and around 5% when splitting on the comma. en_US
dc.language.iso en en_US
dc.subject English language -- Machine translating en_US
dc.subject Arabic language -- Machine translating en_US
dc.subject Machine translating en_US
dc.title Preprocessing steps for English-Arabic translation. (c2007) en_US
dc.type Thesis en_US
dc.term.submitted Fall en_US
dc.author.degree MS in Computer Science en_US
dc.author.school Arts and Sciences en_US
dc.author.idnumber 200103921 en_US
dc.author.commembers Dr. Rached Zantout
dc.author.commembers Dr. Faisal Abukhzam
dc.author.commembers Dr. Lama Hamandi
dc.author.woa OA en_US
dc.description.physdesc 1 bound copy: 86 leaves; 30 cm. available at RNL. en_US
dc.author.division Computer Science en_US
dc.author.advisor Dr. Nashaat Mansour
dc.keywords Word alignment en_US
dc.keywords Sentence alignment en_US
dc.keywords Parallel corpora en_US
dc.keywords Statistical machine translation en_US
dc.identifier.doi https://doi.org/10.26756/th.2007.26 en_US
dc.publisher.institution Lebanese American University en_US
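
Two of the preprocessing steps named in the abstract above (filtering diacritics from Arabic text, and splitting sentences at predetermined stopwords) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the exact set of characters removed (the harakat U+064B to U+0652, the superscript alef U+0670, and the tatweel U+0640), and the choice to start a new segment at each marker token are all assumptions made for illustration.

```python
import re

# Assumed character set: Arabic harakat (U+064B-U+0652), superscript alef
# (U+0670), and tatweel (U+0640). The thesis may filter a different set.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text):
    """Remove the assumed diacritic and elongation characters from Arabic text."""
    return _DIACRITICS.sub("", text)


def split_on_markers(tokens, markers):
    """Split a token list into segments, opening a new segment at each marker.

    `markers` is a hypothetical set of stopwords or punctuation tokens;
    the marker token is kept as the first token of the segment it opens.
    """
    segments, current = [], []
    for tok in tokens:
        if tok in markers and current:
            segments.append(current)
            current = []
        current.append(tok)
    if current:
        segments.append(current)
    return segments
```

In a parallel corpus, the same kind of split would have to be applied to both sides of a sentence pair at corresponding points so that the resulting shorter segments still align; this sketch only shows the single-sentence case.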
