Authors:
(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;
(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.
Abstract and 1 Introduction and Background
2 Statistical Parsing and Extracted Features
7 Conclusions, Discussion, and Future Work
A. A Brief Introduction to Statistical Parsing
B. Dimension Reduction: Some Mathematical Details
At the suggestion of an anonymous reviewer, this appendix provides a brief discussion of statistical parsing, drawing closely from [26, Chapter 14, Statistical Parsing]. More detailed discussions are provided in [21, 24]. The probabilistic grammar employed is a probabilistic context-free grammar (PCFG). The grammar consists of rules for expanding a nonterminal symbol into a string of symbols, each of which may itself be a nonterminal or a terminal. In a PCFG, each rule is accompanied by a probability. As an example, Table 17 shows a grammar for a toy language (used for airline reservations). Each rule in the table of the form A → B [p] means that p is the probability that the nonterminal A will be expanded to the sequence B. This can alternatively be written as P(A → B), as P(A → B | A), or as P(RHS | LHS), where LHS and RHS denote the rule's "left-hand side" and "right-hand side," respectively.
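As a concrete illustration of the P(RHS | LHS) convention, here is a minimal sketch (in Python; the data structure and helper name are hypothetical, not from the paper) that stores each rule's probability keyed by its left-hand side and checks the PCFG requirement that the expansions of each nonterminal sum to one. The rules and probabilities echo the toy grammar of Table 17:

```python
# Sketch: PCFG rules as data. Each nonterminal (LHS) maps to a list of
# (RHS, probability) pairs, so P(A -> B) is read as P(RHS | LHS).
pcfg_rules = {
    "S":   [(("NP", "VP"), 0.80), (("Aux", "NP", "VP"), 0.15), (("VP",), 0.05)],
    "Det": [(("that",), 0.10), (("a",), 0.30), (("the",), 0.60)],
}

def rule_prob(lhs, rhs):
    """Return P(LHS -> RHS), i.e., P(RHS | LHS); zero if the rule is absent."""
    return dict(pcfg_rules.get(lhs, [])).get(tuple(rhs), 0.0)

# In a proper PCFG, the expansion probabilities of each nonterminal sum to 1.
for lhs, expansions in pcfg_rules.items():
    total = sum(p for _, p in expansions)
    assert abs(total - 1.0) < 1e-9, f"probabilities for {lhs} sum to {total}"

print(rule_prob("S", ("NP", "VP")))  # 0.8
```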
In Table 17, S denotes the start symbol (for a sentence). The grammar's first rule says that a sentence may consist of an NP (noun phrase) followed by a VP (verb phrase), and that this expansion occurs with probability 0.8. The second rule says that a sentence may consist of an auxiliary (such as does or can) followed by an NP and then a VP, with probability 0.15. The next rule says that a sentence can consist of a verb phrase, VP, alone. The tokens obtained by applying a rule can be recursively expanded using their own rules, also shown in the table. Thus, an NP may consist of a pronoun, or a proper noun, etc. The probabilities are estimated from a large corpus of data hand-parsed by linguists.
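Concretely, following the treatment in [26], these rule probabilities are maximum-likelihood estimates obtained by counting expansions in the hand-parsed corpus (treebank): for a rule α → β,

$$P(\alpha \rightarrow \beta \mid \alpha) = \frac{\operatorname{Count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \operatorname{Count}(\alpha \rightarrow \gamma)} = \frac{\operatorname{Count}(\alpha \rightarrow \beta)}{\operatorname{Count}(\alpha)}.$$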
There is also a lexicon (or dictionary) of terms, each of which carries a probability within its lexical category. The right side of Table 17 illustrates a lexicon for this application. For example, a determiner Det can be that (with probability 0.1), a (with probability 0.3), or the (with probability 0.6). A noun (in this application context) can be book, flight, meal, money, flights, or dinner, each with its respective probability. These probabilities are again estimated from a corpus.
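The fragment below sketches how such a grammar and lexicon might be encoded with NLTK's PCFG class. The S and Det probabilities are those quoted from Table 17 above; the remaining rule and lexicon probabilities are illustrative placeholders chosen only so that each left-hand side sums to one, and should not be read as the paper's values:

```python
import nltk

# Toy PCFG in the spirit of Table 17. Probabilities not quoted in the text
# are illustrative placeholders (each LHS must still sum to 1.0).
grammar = nltk.PCFG.fromstring("""
    S       -> NP VP        [0.80]
    S       -> Aux NP VP    [0.15]
    S       -> VP           [0.05]
    NP      -> Det Nominal  [0.60]
    NP      -> Pronoun      [0.40]
    Nominal -> Noun         [0.70]
    Nominal -> Nominal Noun [0.30]
    VP      -> Verb NP      [0.50]
    VP      -> Verb         [0.50]
    Det  -> 'that' [0.10] | 'a' [0.30] | 'the' [0.60]
    Noun -> 'book' [0.10] | 'flight' [0.25] | 'meal' [0.15]
    Noun -> 'money' [0.10] | 'flights' [0.25] | 'dinner' [0.15]
    Verb -> 'book' [0.50] | 'prefer' [0.50]
    Aux  -> 'does' [0.60] | 'can' [0.40]
    Pronoun -> 'I' [0.50] | 'me' [0.50]
""")
print(grammar)
```

Note that book appears in the lexicon both as a noun and as a verb; it is precisely this kind of lexical ambiguity that the probabilities help resolve.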
A PCFG can be used to estimate the probability of a parse tree, which in turn can be used to disambiguate among multiple parsings, or to determine the probability of a sentence in a language-modeling setting.
“The probability of a particular parse tree T is defined as the product of the probabilities of all the rules used to expand each of the n non-terminal nodes in the parse tree T, where each rule i can be expressed as $\mathrm{LHS}_i \rightarrow \mathrm{RHS}_i$:

$$P(T, S) = \prod_{i=1}^{n} P(\mathrm{RHS}_i \mid \mathrm{LHS}_i).$$

The resulting probability P(T, S) is . . . the joint probability of the parse and the sentence and the probability of the parse P(T).” [26, p. 462]. In computing the probability of a tree, there is one factor for every rule applied, corresponding to an expansion edge in the tree.
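A direct transcription of that product, reusing the hypothetical rule_prob helper sketched earlier: walk the tree recursively and multiply one factor P(RHS | LHS) per internal node (equivalently, per expansion edge), with terminals contributing no factor.

```python
import math

# A tree is (label, children...) for an internal node, or a bare string
# for a word. P(T) multiplies P(RHS | LHS) over every internal node,
# per the definition quoted from [26, p. 462] above.
def tree_prob(tree, rule_prob):
    if isinstance(tree, str):   # terminal (word): contributes no factor
        return 1.0
    lhs, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob(lhs, rhs)     # 0.0 if the rule is missing from the table
    return p * math.prod(tree_prob(c, rule_prob) for c in children)

# e.g. tree_prob(("S", ("NP", ...), ("VP", ...)), rule_prob)
```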
As an example, consider two different ways of parsing the sentence “Book the dinner flight.” It can be parsed (understood) either with “dinner flight” grouped as a single nominal (book a flight that serves dinner), or with “the dinner” and “flight” treated as separate constituents.
The parse tree, rules, and corresponding probability for parsing 1 are shown here:
The parse tree, rules, and corresponding probability for parsing 2 are shown here:
Comparing the probabilities computed for these two parse structures, parsing 1 has a much higher probability than parsing 2, which accords with the common understanding of the sense of the sentence.
The parser works through the text being parsed, probabilistically associating each word with its grammatical element in the context of the tree being built. When competing trees are constructed, the tree with the highest probability is accepted.
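This "keep the highest-probability tree" procedure is what a probabilistic CKY/Viterbi parser implements; a minimal sketch using NLTK's ViterbiParser with the illustrative grammar above (the printed probability reflects the placeholder lexicon probabilities, not the paper's Table 17):

```python
from nltk.parse import ViterbiParser

# ViterbiParser performs probabilistic CKY: it builds competing subtrees
# bottom-up and keeps only the most probable tree for each span/nonterminal.
parser = ViterbiParser(grammar)   # `grammar` from the NLTK sketch above
for tree in parser.parse("book the dinner flight".split()):
    print(tree)           # most probable parse, with rule structure
    print(tree.prob())    # its probability P(T)
```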
This paper is available on arxiv under CC BY 4.0 DEED license.