dc.contributor.author |
Salameh, Mohammad |
|
dc.date.accessioned |
2011-10-20T06:25:38Z |
|
dc.date.available |
2011-10-20T06:25:38Z |
|
dc.date.copyright |
2007 |
en_US |
dc.date.issued |
2011-10-20 |
|
dc.date.submitted |
2007-11-30 |
|
dc.identifier.uri |
http://hdl.handle.net/10725/826 |
|
dc.description |
Includes bibliographical references (leaves 84-86). |
en_US |
dc.description.abstract |
A parallel corpus is an essential resource in any statistical machine
translation system. It constitutes the basic block for training the system and
creating translation and language models that act as the knowledge base for
translation. In this thesis, we present preprocessing steps for English-Arabic
translation. These steps help improve word alignment in the
machine learning phase of statistical machine translation. The aim is to
increase the frequency of Arabic words in the text and to reduce the
number of words in English and Arabic sentences by splitting them. The
preprocessing steps include removing diacritics and other unnecessary
characters from the Arabic text, separating prefixes and suffixes from
Arabic words, and splitting English-Arabic sentence pairs according to
predetermined stopwords. We apply our technique to a parallel corpus taken
from United Nations documents. Our results show that it is essential to preprocess English-Arabic text.
We obtained an error rate of 7.7% when splitting sentences on stopwords and
around 5% when splitting on the comma. |
en_US |
dc.language.iso |
en |
en_US |
dc.subject |
English language -- Machine translating |
en_US |
dc.subject |
Arabic language -- Machine translating |
en_US |
dc.subject |
Machine translating |
en_US |
dc.title |
Preprocessing steps for English-Arabic translation. (c2007) |
en_US |
dc.type |
Thesis |
en_US |
dc.term.submitted |
Fall |
en_US |
dc.author.degree |
MS in Computer Science |
en_US |
dc.author.school |
Arts and Sciences |
en_US |
dc.author.idnumber |
200103921 |
en_US |
dc.author.commembers |
Dr. Rached Zantout |
|
dc.author.commembers |
Dr. Faisal Abukhzam |
|
dc.author.commembers |
Dr. Lama Hamandi |
|
dc.author.woa |
OA |
en_US |
dc.description.physdesc |
1 bound copy: 86 leaves; 30 cm. available at RNL. |
en_US |
dc.author.division |
Computer Science |
en_US |
dc.author.advisor |
Dr. Nashaat Mansour |
|
dc.keywords |
Word alignment |
en_US |
dc.keywords |
Sentence alignment |
en_US |
dc.keywords |
Parallel corpora |
en_US |
dc.keywords |
Statistical machine translation |
en_US |
dc.identifier.doi |
https://doi.org/10.26756/th.2007.26 |
en_US |
dc.publisher.institution |
Lebanese American University |
en_US |