Authors:
(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;
(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.
Abstract and 1 Introduction and Background
2 Statistical Parsing and Extracted Features
7 Conclusions, Discussion, and Future Work
A. A Brief Introduction to Statistical Parsing
B. Dimension Reduction: Some Mathematical Details
Part-of-speech (POS) tagging classifies the words in a sentence according to their part of speech, such as noun, verb, or interjection. Because of the complexity of the English language, there is potential for ambiguity. For example, many words (such as “seat,” “bat,” or “eye”) can be either nouns or verbs. This ambiguity can be dealt with using statistical parsing, in which a large corpus of language is used to develop probabilistic models for words based on their surrounding context. These models are typically trained with the assistance of human linguistic experts. The parser used in this work employs a language model developed from the annotated corpus called the Penn Treebank, a corpus of over 7 million words of American English, collected from multiple sources and labeled using human and semi-automated markup [22, 23]. The parser is described in [21]. It is a probabilistic context-free grammar (PCFG) parser [24], with language transition probabilities determined from the Penn Treebank corpus. The parser software is known as the Stanford Parser [25]. Parsing results presented here were produced by version 4.2.0, released November 17, 2020.
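For readers who wish to see Penn Treebank POS tagging in action, the following minimal sketch uses NLTK’s off-the-shelf tagger rather than the Stanford Parser used in this work; it only illustrates the tagset and the contextual noun/verb disambiguation described above, and the example sentence is an illustrative assumption.

```python
# Minimal sketch of Penn Treebank POS tagging using NLTK's default tagger.
# Illustration only; the results in the paper were produced with the
# Stanford Parser 4.2.0, not with NLTK.
import nltk

# Resource names may differ slightly across NLTK versions.
nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger model

sentence = "They eye the bat before they seat themselves."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('They', 'PRP'), ('eye', 'VBP'), ('the', 'DT'), ('bat', 'NN'), ...]
# "eye", "bat", and "seat" are resolved as verbs or nouns from their context.
```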
Table 1 lists the POS labels (the POS tagset) assigned to words when a sentence is parsed by this parser, as well as the syntactic tagset produced by the parser when doing grammatical parsing (see [23, Table 1.1, Table 1.2], [26, Chapter 5]).
A brief introduction to statistical parsing is provided in Appendix A.
As an example of the parsing, consider the first sentence of The Federalist Papers No. 1, by Alexander Hamilton:
Parsing this sentence yields the tree representation portrayed in figure 1(a). The leaf nodes correspond to the words of the sentence, each labeled with a POS. The non-leaf (interior) nodes represent syntactic (grammatical structure) information determined by the parser. The label of each node of the tree is referred to as a token. The parse tree can also be represented by the text string shown in figure 1(b), which is formatted in figure 1(c) to show the various levels of the tree implied by the nesting of the parentheses.
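The correspondence between the bracketed text string and the nested tree can be reproduced with standard tools. The sketch below uses NLTK’s Tree class on a short stand-in parse string (not the actual Federalist sentence) to display both the tree and the indented, nested form.

```python
# Minimal sketch: round-tripping a bracketed parse string with nltk.Tree.
# The parse string below is a short stand-in, not the Federalist sentence.
from nltk import Tree

parse_str = ("(ROOT (S (NP (DT The) (NN bat)) "
             "(VP (VBD flew) (PP (IN over) (NP (DT the) (NN seat)))) (. .)))")

tree = Tree.fromstring(parse_str)   # build the tree from the flat string (cf. figure 1(b))
tree.pretty_print()                 # ASCII rendering of the tree (cf. figure 1(a))
print(tree.pformat(margin=40))      # indented form showing the nesting levels (cf. figure 1(c))
```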
In preparing to extract feature vectors from a parse tree, some additional tidying-up is performed. The parser creates a ROOT node for every tree; since it appears in every parse, it is uninformative and is removed. Punctuation nodes in the tree, such as (, ,), (. .), or (. ?), are also removed. Since the intent is to explore how the parsed information, rather than the words of the document, can be used for classification, the words of the sentence are removed from the parse tree. With these edits, the sentence (1) has the parsed representation
From this prepared data, various feature vectors were extracted, as described below. (The text manipulation and data extraction were done using the Python language, making extensive use of Python’s dictionary type. The parsed string (2) can be used, for example, as a key in a Python dictionary.)
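As a concrete illustration of this preparation and of keying a dictionary by parse strings, the sketch below uses NLTK’s Tree class to drop the ROOT node, punctuation nodes, and the words themselves from a bracketed parse, and then counts the tidied parse strings in a dictionary. This is a minimal sketch of the idea, not the authors’ code; the punctuation tag set and the exact string format of the tidied parse are assumptions.

```python
# Minimal sketch (not the authors' code): tidy a bracketed parse by removing
# ROOT, punctuation nodes, and the words, then use the resulting strings as
# dictionary keys for counting. Punctuation tags and the output format are
# illustrative assumptions.
from collections import defaultdict
from nltk import Tree

PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

def tidy(parse_str):
    tree = Tree.fromstring(parse_str)
    if tree.label() == "ROOT":              # drop the uninformative ROOT node
        tree = tree[0]

    def strip(t):
        parts = [t.label()]
        for child in t:
            if isinstance(child, Tree) and child.label() not in PUNCT_TAGS:
                parts.append(strip(child))  # keep syntactic/POS structure
            # leaf strings (the words) and punctuation nodes are dropped
        return "(" + " ".join(parts) + ")"

    return strip(tree)

parses = [
    "(ROOT (S (NP (DT The) (NN bat)) (VP (VBD flew)) (. .)))",
    "(ROOT (S (NP (DT A) (NN seat)) (VP (VBD broke)) (. .)))",
]
counts = defaultdict(int)
for p in parses:
    counts[tidy(p)] += 1                    # tidied parse string as dict key
print(dict(counts))
# {'(S (NP (DT) (NN)) (VP (VBD)))': 2}
```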
This paper is available on arxiv under CC BY 4.0 DEED license.