Authors:
(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;
(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.
3 Parse Tree Features
The richness of the parsed representation admits many different feature vectors. Of the many that might be chosen, four are discussed here. Examples based on the sentence above illustrate each feature.
All Subtrees One set of features is the set of all subtrees of a given depth encountered among all the parsed sentences. For example, Figure 2 shows eleven subtrees of depth 3 extracted from (2). Subtrees of a given depth may appear more than once within a sentence. For example, the subtree
(NP(NP(DT)(JJ)(NN))(PP(IN)(NP(NP)(PP))))
appears twice in (2).
Across all the sentences in the documents considered, the number of distinct subtrees is very large, which leads to feature vectors of very high dimension. This problem is dealt with later.
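As a minimal sketch of the subtree extraction described above, the following Python counts the depth-limited subtrees of a label-only bracketed parse. It rests on assumptions not stated in the text: a subtree of depth d is taken to be a node together with everything at most d edges below it, and only nodes whose own subtree reaches that depth contribute. The helper names (`parse`, `height`, `truncate`, `to_string`, `subtree_counts`) and the toy parse string are hypothetical and are not taken from the paper.

```python
from collections import Counter


def parse(s):
    """Parse a label-only bracketed string, e.g. '(NP(DT)(NN))', into (label, children)."""
    s = "".join(s.split())  # drop whitespace so only labels and parentheses remain
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume '('
        start = pos
        while s[pos] not in "()":
            pos += 1
        label = s[start:pos]
        children = []
        while s[pos] == "(":
            children.append(node())
        pos += 1  # consume ')'
        return (label, tuple(children))

    return node()


def height(tree):
    """Number of edges on the longest path from this node down to a leaf."""
    _, children = tree
    return 1 + max(map(height, children)) if children else 0


def truncate(tree, depth):
    """Keep only the nodes at most `depth` edges below the root of `tree`."""
    label, children = tree
    if depth == 0:
        return (label, ())
    return (label, tuple(truncate(c, depth - 1) for c in children))


def to_string(tree):
    """Serialize a tree back into the label-only bracketed form."""
    label, children = tree
    return "(" + label + "".join(to_string(c) for c in children) + ")"


def subtree_counts(tree, depth):
    """Count each depth-`depth` subtree (assumed definition: one per tall-enough node)."""
    counts = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        if height(node) >= depth:
            counts[to_string(truncate(node, depth))] += 1
        stack.extend(node[1])
    return counts


# Hypothetical toy parse, not the sentence (2) from the text.
toy = parse("(S(NP(DT)(NN))(VP(VBD)(NP(DT)(NN))))")
depth1 = subtree_counts(toy, 1)
depth2 = subtree_counts(toy, 2)
```

In this toy parse, the depth-1 subtree `(NP(DT)(NN))` is counted twice, which mirrors the observation that a subtree may appear more than once within a sentence. Summing such counters over all sentences of a document would give one (very sparse) count vector per document.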
Rooted Subtrees A rooted subtree is a subtree whose root node is the root node of the overall tree, extending down to some specified level. The first few rooted subtrees can be thought of as summarizing the general structure of a sentence, with the amount of detail in the summary determined by the number of levels of the subtree. Fig. 3 illustrates the rooted subtrees of levels one, two, and three for the tree of Fig. 1.
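A rooted subtree can be sketched as a truncation of the full parse tree. The Python below assumes that "level n" means keeping every node at most n edges below the root; the text does not pin down this convention, so the numbering may be off by one relative to Fig. 3. The function names and the toy parse string are hypothetical.

```python
def parse(s):
    """Parse a label-only bracketed string, e.g. '(NP(DT)(NN))', into (label, children)."""
    s = "".join(s.split())  # drop whitespace so only labels and parentheses remain
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume '('
        start = pos
        while s[pos] not in "()":
            pos += 1
        label = s[start:pos]
        children = []
        while s[pos] == "(":
            children.append(node())
        pos += 1  # consume ')'
        return (label, tuple(children))

    return node()


def truncate(tree, depth):
    """Keep only the nodes at most `depth` edges below the root of `tree`."""
    label, children = tree
    if depth == 0:
        return (label, ())
    return (label, tuple(truncate(c, depth - 1) for c in children))


def to_string(tree):
    """Serialize a tree back into the label-only bracketed form."""
    label, children = tree
    return "(" + label + "".join(to_string(c) for c in children) + ")"


def rooted_subtrees(tree, max_level):
    """Return the rooted subtrees of levels 1..max_level as bracketed strings."""
    return [to_string(truncate(tree, level)) for level in range(1, max_level + 1)]


# Hypothetical toy parse, not the tree of Fig. 1.
toy = parse("(S(NP(DT)(NN))(VP(VBD)(NP(DT)(NN))))")
levels = rooted_subtrees(toy, 3)
```

Here `levels[0]` is the coarse summary `(S(NP)(VP))`, and each later entry adds one more level of detail until the full tree is reached.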
Part-of-Speech A simple set of features ignores the tree structure, and simply extracts the counts of tokens in the parse tree. For (2), the counts of the POS are
POS by Level A more complicated set of features is the histogram of tokens at each level of the tree. For the tree of (2), this is shown in Table 2.
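The two count-based features above can be sketched in a few lines of Python. Two assumptions are made here that the text does not confirm: every node label in the tree is counted (phrase labels such as NP as well as part-of-speech tags), and the root is numbered as level 0. The function names and the toy parse string are hypothetical.

```python
from collections import Counter, defaultdict


def parse(s):
    """Parse a label-only bracketed string, e.g. '(NP(DT)(NN))', into (label, children)."""
    s = "".join(s.split())  # drop whitespace so only labels and parentheses remain
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume '('
        start = pos
        while s[pos] not in "()":
            pos += 1
        label = s[start:pos]
        children = []
        while s[pos] == "(":
            children.append(node())
        pos += 1  # consume ')'
        return (label, tuple(children))

    return node()


def label_counts(tree):
    """Count node labels, ignoring where they sit in the tree."""
    counts = Counter()
    stack = [tree]
    while stack:
        label, children = stack.pop()
        counts[label] += 1
        stack.extend(children)
    return counts


def label_counts_by_level(tree):
    """Histogram of node labels at each level (root assumed to be level 0)."""
    by_level = defaultdict(Counter)
    stack = [(tree, 0)]
    while stack:
        (label, children), level = stack.pop()
        by_level[level][label] += 1
        stack.extend((child, level + 1) for child in children)
    return dict(by_level)


# Hypothetical toy parse, not the sentence (2) from the text.
toy = parse("(S(NP(DT)(NN))(VP(VBD)(NP(DT)(NN))))")
flat = label_counts(toy)
per_level = label_counts_by_level(toy)
```

The flat counts discard all structure, while the per-level histogram keeps a coarse record of how deep each label occurs; for instance, NP appears once at level 1 and once at level 2 in the toy parse.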
For purposes of author classification, the idea, of course, is to see how the patterns in the feature vectors obtained from the sentences of one author compare with the patterns in the feature vectors of other authors.
4 Classifier
This section describes the basic operation of the classifier employed in the tests for this paper. Throughout, the term “classes” refers to the different authors under consideration. Let k denote the number of classes (authors).
In the tests performed for the investigation in this paper, the classifier works as follows (see figure ??).
This paper is available on arXiv under the CC BY 4.0 DEED license.