Title

Grammar-based preprocessing for PPM compression and classification

Author

Aljehane, Nojood Obaidallah M.

Subject

Computer Science

Classification

Library

Center and Library for Islamic Studies in European Languages

Location

Province: Qom, City: Qom

Library contact: 32910706-025

National bibliography number

Number

TLets760217

Title and statement of responsibility

Title proper

Grammar-based preprocessing for PPM compression and classification

General material designation

[Thesis]

First statement of responsibility

Aljehane, Nojood Obaidallah M.

Other statements of responsibility

Teahan, William

Publication, distribution, etc.

Name of publisher, distributor, etc.

Bangor University

Date of publication, distribution, etc.

2018

Dissertation (thesis) notes

Thesis details and degree type

Thesis (Ph.D.)

Date of degree

2018

Summary or abstract notes

Text of note

The aim of this study is to investigate the efficiency of novel methods that combine context-free grammars with Prediction by Partial Matching (PPM) in order to build and evaluate compression models for text in languages such as English, Arabic, Persian, Welsh and Chinese. A further aim is to apply these models to text classification to see how well they perform in that application. We apply grammar-based preprocessing prior to using the PPM compression algorithm. The methods achieve significantly better compression for different natural language texts compared with other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed, and then, in a preprocessing phase prior to compression, substitutes these sequences with the respective non-terminal symbols defined by the grammar. We describe further improvements using a two-pass scheme in which grammar-based preprocessing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved compression results compared with other compression algorithms, including a grammar-based approach known as the Sequitur algorithm. Despite the advances of the PPM method in predicting upcoming symbols or words in English, more research is required to devise better compression methods for other languages such as Arabic, owing, for example, to the rich morphology of Arabic text, in which a single word can take many different forms. In this dissertation, we propose a new method that achieves the best compression rates not only for Arabic text but also for other languages that use Arabic script in their writing systems, such as Persian.
Our word-based method (GRW-PPM) constructs a context-free grammar for the text; this grammar is then encoded using PPM to achieve excellent compression rates. Finally, we investigate genre classification of English and Arabic text using our new character-based text compression scheme (GRB-PPM). Experimental results on a parallel Arabic and English corpus show that our new method is very effective compared with traditional compression-based classification methods. We have also confirmed that good compression leads to good classification.
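
The bigraph substitution step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the number of rules, and the use of Unicode private-use code points as non-terminal symbols are all assumptions made here for illustration, and the PPM coding stage that would follow is not shown. The sketch also assumes the input text contains no private-use characters.

```python
from collections import Counter


def build_bigraph_grammar(text, num_rules=4, first_symbol=0xE000):
    """Map the most frequent two-character sequences to fresh symbols.

    Assumption: non-terminals are taken from the Unicode private-use
    area, so they do not collide with ordinary text characters.
    """
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    grammar = {}
    for offset, (bigraph, _) in enumerate(counts.most_common(num_rules)):
        grammar[chr(first_symbol + offset)] = bigraph
    return grammar


def substitute(text, grammar):
    """Scan left to right, replacing each grammar bigraph with its symbol."""
    by_bigraph = {bigraph: symbol for symbol, bigraph in grammar.items()}
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in by_bigraph:
            out.append(by_bigraph[pair])
            i += 2  # both characters of the bigraph are consumed
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def expand(encoded, grammar):
    """Invert the substitution by expanding every non-terminal symbol."""
    return "".join(grammar.get(ch, ch) for ch in encoded)


sample = "the cat sat on the mat that day"
rules = build_bigraph_grammar(sample)
encoded = substitute(sample, rules)
print(len(sample), len(encoded), expand(encoded, rules) == sample)
```

In a full pipeline of the kind the abstract describes, the substituted text and the grammar would then be passed to a PPM coder; a second pass could apply the same functions again to the already substituted text with a different range of symbols.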

Subject (topical name or topical noun phrase)

Uncontrolled subject term

Computer Science

Personal name as main entry (primary intellectual responsibility)

Unverified personal name authority

Aljehane, Nojood Obaidallah M.

Personal name (secondary intellectual responsibility)

Unverified personal name authority

Teahan, William

Added entry (corporate body)

Unverified corporate body name authority

Bangor University

Electronic location and access

Electronic name

Publication status

Publication format

Bibliographic record information

Material type

[Thesis]

Worksheet code

276903

Record access information

Access level

Completed
