Grammar-based preprocessing for PPM compression and classification
نام عام مواد
[Thesis]
نام نخستين پديدآور
Aljehane, Nojood Obaidallah M.
نام ساير پديدآوران
Teahan, William
وضعیت نشر و پخش و غیره
نام ناشر، پخش کننده و غيره
Bangor University
تاریخ نشرو بخش و غیره
2018
یادداشتهای مربوط به پایان نامه ها
جزئيات پايان نامه و نوع درجه آن
Thesis (Ph.D.)
امتياز متن
2018
یادداشتهای مربوط به خلاصه یا چکیده
متن يادداشت
The aim of this study is to investigate the efficiency of novel methods using context-free grammars and Prediction by Partial Matching (PPM) in order to build and evaluate the quality of compression models for text files such as English, Arabic, Persian,Welsh and Chinese. A further aim is then to apply these models to the problem of the classification of text to see how well they perform at this application. We apply grammar-based pre-processing prior to using the PPM compression algorithm. The methods achieve significantly better compression for different natural language texts compared with other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then substitutes these sequences using the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to compression. We describe further improvements using a two-pass scheme where grammar-based preprocessing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression when compared with other compression algorithms, including a grammar-based approach known as the Sequitur algorithm. Despite the advances of the PPM method in predicting upcoming symbols or words in the English language, more research is required to devise better compression methods for other languages, such as Arabic due to, for example, the rich morphological nature of Arabic text, in which a single word can take many different forms. In this dissertation, we propose a new method that achieves the best compression rates not only for Arabic language text but also for other languages that use Arabic script in their writing systems, such as Persian. Our word-based method (GRW-PPM) constructs a contextfree grammar for the text; this grammar is then encoded using PPM to achieve excellent compression rates. Finally, we investigate the classification of genre in English and Arabic text by using our new character-based text compression scheme (GRB-PPM). Experimental results on a parallel Arabic and English corpus show that our new method is very effective compared with traditional compression-based classification methods. We have also confirmed that good compression leads to good classification.
موضوع (اسم عام یاعبارت اسمی عام)
موضوع مستند نشده
Computer Science
نام شخص به منزله سر شناسه - (مسئولیت معنوی درجه اول )