Feature Representation in Mining and Language Processing
General Material Designation
[Thesis]
First Statement of Responsibility
Vu, Thuy
Subsequent Statement of Responsibility
Parker, D. Stott
PUBLICATION, DISTRIBUTION, ETC.
Date of Publication, Distribution, etc.
2017
DISSERTATION (THESIS) NOTE
Body granting the degree
Parker, D. Stott
Text preceding or following the note
2017
SUMMARY OR ABSTRACT
Text of Note
Feature representation has been one of the most important factors in the success of machine learning algorithms. Since 2006, deep learning has been applied widely to problems across many disciplines and has frequently reset state-of-the-art results, thanks to its ability to learn highly abstract representations of data. I focus on extracting additional structural features in network analysis and natural language processing (NLP) by learning novel vector-based representations, usually known as embeddings. For network analysis, I propose learning representations for nodes, called node embeddings, for social network applications. These embeddings are computed from the attributes and links of nodes in the network. Experimental studies on community detection and mining tasks suggest that node embeddings can reveal deeper structure in the network. For NLP, I address the learning of representations at three levels: words, word relations, and linguistic expressions. First, I propose extending the standard word embedding training process into two phases, treating context as second-order in nature. This strategy can effectively compute embeddings for polysemous concepts of words, adding an extra conceptual layer on top of standard word embeddings. Second, I introduce representations of "semantic binders" for words. These representations are learned using categorial grammar and are shown to handle disambiguation effectively, especially when the meaning of a word depends heavily on a specific context. Finally, I present a three-layer framework for learning representations of linguistic expressions, addressing the semantic compositionality problem with recurrent neural networks driven by categorial-grammar combinatory rules. This strategy specifically addresses the limitation of recurrent neural network approaches in deciding how, and when, to incorporate individual pieces of information into the compositional embedding.
The framework is flexible and can be integrated with the proposed representations. I study the effectiveness of the proposed representations in several NLP applications: word analogies, subject-verb-object agreement, paraphrasing, and sentiment analysis.
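The compositional idea sketched in the abstract, merging child embeddings bottom-up along a parse, can be illustrated with a minimal toy example. This is not the thesis's implementation: the dimensions, the random stand-in word vectors, and the single composition matrix `W` are all hypothetical assumptions; the actual framework selects composition behavior per categorial combinatory rule rather than using one shared matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size (hypothetical)

# Random stand-ins for trained word embeddings.
vocab = {w: rng.normal(size=DIM) for w in ["the", "cat", "sat"]}

# A single composition matrix; the thesis's framework would instead
# drive composition with categorial-grammar combinatory rules.
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))

def compose(x, y):
    """Merge two child embeddings into one parent embedding."""
    return np.tanh(W @ np.concatenate([x, y]))

# Compose along a binary parse: ((the cat) sat)
np_phrase = compose(vocab["the"], vocab["cat"])
sentence = compose(np_phrase, vocab["sat"])
print(sentence.shape)  # prints (8,)
```

Each merge produces a vector of the same dimension as its children, so the sentence embedding can be fed to any downstream task (e.g. sentiment classification) regardless of sentence length.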