Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation

Publication year: 1393 (Solar Hijri)
Document type: Journal article
Language: English

The full text of this article is available as a 7-page PDF file.

National scientific document ID:

JR_ACSIJ-3-5_008

Indexing date: 12 Aban 1393

Abstract:

Bilingual parallel corpora are very important in various fields of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system depends strongly on the amount of training data. For low-resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach that uses Wikipedia as a comparable corpus to extract Persian-English parallel sentences and eventually improve SMT system performance. This new approach is also applicable to other low-resource language pairs. To calculate the similarity score between two sentences, a novel bi-directional translation-based information retrieval system is proposed. A length penalty score is introduced to increase the accuracy of the extracted corpus. Using the extracted parallel sentences, the performance of an existing Persian-English SMT system is improved drastically.
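
The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of a bi-directional similarity score combined with a length penalty. Everything in it is an assumption made for illustration: the toy word lexicons `FA_EN` and `EN_FA` stand in for a real translation system, Jaccard token overlap stands in for the paper's information retrieval scoring, and the function names and the min/max length ratio are hypothetical choices, not the authors' definitions.

```python
# Illustrative sketch only: toy lexicons and Jaccard overlap are assumptions,
# not the scoring method described in the paper.

FA_EN = {"این": "this", "کتاب": "book", "خوب": "good", "است": "is"}
EN_FA = {en: fa for fa, en in FA_EN.items()}


def tokenize(sentence):
    """Lowercase and split on whitespace."""
    return sentence.lower().split()


def translate_tokens(tokens, lexicon):
    """Word-by-word lookup; tokens missing from the lexicon are dropped."""
    return [lexicon[t] for t in tokens if t in lexicon]


def overlap(candidate, reference):
    """Jaccard overlap between two token lists (0.0 if both are empty)."""
    a, b = set(candidate), set(reference)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def length_penalty(src_len, tgt_len):
    """Ratio of the shorter length to the longer one, in [0, 1]."""
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def similarity(src, tgt, src_to_tgt, tgt_to_src):
    """Average the forward and backward overlaps, scaled by the length penalty."""
    s, t = tokenize(src), tokenize(tgt)
    forward = overlap(translate_tokens(s, src_to_tgt), t)    # source -> target
    backward = overlap(translate_tokens(t, tgt_to_src), s)   # target -> source
    return length_penalty(len(s), len(t)) * (forward + backward) / 2


parallel = similarity("این کتاب خوب است", "this book is good", FA_EN, EN_FA)
unrelated = similarity("این کتاب خوب است", "the weather is cold today", FA_EN, EN_FA)
print(parallel, unrelated)
```

In this sketch a candidate pair is scored in both translation directions so that a match has to hold from Persian to English and from English to Persian, and the length ratio lowers the score of pairs whose lengths differ widely. A threshold on the resulting score, chosen on held-out data, would then decide which pairs are kept.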

Authors

Ebrahim Ansari

Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran

Mohammad Hadi Sadreddin

Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran

Alireza Tabebordba

Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran

Richard WALLAC

Distributed Systems Architecture Research Group, Complutense University, Madrid, Spain