Abstract:
Multilingual natural language processing systems increasingly rely on parallel corpora to improve their
output. Parallel corpora constitute the basic building block for training a statistical natural language processing system and creating
translation and language models. Several systems have been devised that automatically align the words of a pair of sentences,
each in a different language. Such systems have been used successfully with European languages. In this paper, one such system is used
to align sentences in an English-Arabic corpus. The system performs poorly when given raw, unaligned English-Arabic
sentence pairs. This prompted the development of a preprocessing step to be applied to the Arabic sentences. The same corpus
was then preprocessed, and a significant improvement is reported when alignment is attempted on the preprocessed
unaligned sentences.
Citation:
Salameh, M., Zantout, R., & Mansour, N. (2011). Improving the accuracy of English-Arabic statistical sentence alignment. Int. Arab J. Inf. Technol., 8(2), 171-177.