Can Grammar Patterns Unmask a Writer’s Identity?

Too Long; Didn't Read

By analyzing parse tree features like subtrees, rooted structures, and part-of-speech distributions, this study refines authorship classification. These NLP-based features help distinguish writing styles more effectively than traditional word-based methods.

Authors:

(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;

(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.

Abstract and 1 Introduction and Background

2 Statistical Parsing and Extracted Features

3 Parse Tree Features

4 Classifier

5 Dimension Reduction

6 The Federalist Papers

6.1 Sanditon

7 Conclusions, Discussion, and Future Work

A. A Brief Introduction to Statistical Parsing

B. Dimension Reduction: Some Mathematical Details

References


3 Parse Tree Features

The richness of the parsed representation makes many different feature vectors possible. Four of them are discussed here, each illustrated with an example based on the parsed sentence (2).


All Subtrees. One set of features is the set of all subtrees of a given depth encountered among all the parsed sentences. For example, Figure 2 shows eleven subtrees of depth 3 extracted from (2). A subtree of a given depth may appear more than once within a sentence; for instance, the subtree


(NP(NP(DT)(JJ)(NN))(PP(IN)(NP(NP)(PP))))


appears twice in (2).
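
As a rough illustration of how such subtree counts might be gathered, the Python sketch below parses a bracketed tree, truncates the subtree under every node to a fixed depth, and tallies the results. The helper names (parse, truncate, subtrees_of_depth) and the toy tree are hypothetical. Two conventions are also assumed rather than taken from the paper: depth counts the edges below the subtree's root (which agrees with the depth-3 subtree printed above), and only subtrees that actually reach the full depth are kept.

```python
import re
from collections import Counter


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples; words are dropped."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                pos += 1              # skip a word attached to a POS tag
        pos += 1                      # consume ")"
        return (label, tuple(children))

    return node()


def height(tree):
    """Number of edges on the longest path from this node down to a leaf."""
    _, children = tree
    return 0 if not children else 1 + max(height(c) for c in children)


def truncate(tree, depth):
    """Copy of the tree with everything more than `depth` edges below the root removed."""
    label, children = tree
    if depth == 0:
        return (label, ())
    return (label, tuple(truncate(c, depth - 1) for c in children))


def show(tree):
    """Render a tree in the compact bracketed form used in the text."""
    label, children = tree
    return "(" + label + "".join(show(c) for c in children) + ")"


def subtrees_of_depth(tree, depth):
    """Count every subtree of the given depth (assumption: it must reach that depth)."""
    counts = Counter()
    stack = [tree]
    while stack:
        current = stack.pop()
        if height(current) >= depth:
            counts[show(truncate(current, depth))] += 1
        stack.extend(current[1])
    return counts


# Hypothetical toy tree (labels only); it is NOT the sentence (2) of the paper.
toy = """
(S
  (NP
    (NP (DT) (JJ) (NN))
    (PP (IN)
      (NP
        (NP (DT) (NN))
        (PP (IN) (NP (NN))))))
  (VP (VBD)))
"""
counts = subtrees_of_depth(parse(toy), 3)
for subtree, n in sorted(counts.items()):
    print(n, subtree)
```

Summing such counters over all sentences of a document, and indexing the distinct subtree strings, would give one count vector per document; the size of that index is what drives the dimension up.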


Across all the sentences in the documents considered, the number of distinct subtrees is very large, which leads to feature vectors of very high dimension. This problem is addressed by the dimension reduction described in Section 5.


Rooted Subtrees. A rooted subtree is a subtree whose root node is the root node of the overall tree, extending down to some specified level. The first few rooted subtrees can be thought of as summarizing the general structure of a sentence, with the amount of detail in the summary determined by the number of levels in the subtree. Figure 3 illustrates the rooted subtrees of one, two, and three levels for the tree of Figure 1.
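
A minimal sketch of rooted-subtree extraction follows, assuming that a subtree of k levels keeps the root and everything up to k edges below it (the paper's exact level convention is not stated in this section). The function names and the toy tree are hypothetical and are not the tree of Figure 1.

```python
import re


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples; words are dropped."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                pos += 1              # skip a word attached to a POS tag
        pos += 1                      # consume ")"
        return (label, tuple(children))

    return node()


def truncate(tree, levels):
    """Keep the root and at most `levels` levels of descendants below it."""
    label, children = tree
    if levels == 0:
        return (label, ())
    return (label, tuple(truncate(c, levels - 1) for c in children))


def show(tree):
    """Render a tree in compact bracketed form."""
    label, children = tree
    return "(" + label + "".join(show(c) for c in children) + ")"


def rooted_subtrees(tree, max_levels):
    """Rooted subtrees of 1, 2, ..., max_levels levels, coarsest first."""
    return [show(truncate(tree, k)) for k in range(1, max_levels + 1)]


# Hypothetical toy tree (labels only).
toy = "(S (NP (DT) (NN)) (VP (VBD) (NP (DT) (JJ) (NN))))"
features = rooted_subtrees(parse(toy), 3)
for k, subtree in enumerate(features, start=1):
    print(k, subtree)
```

Each additional level refines the previous summary, so two sentences with the same one-level rooted subtree may still differ at levels two or three.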


Part-of-Speech. A simple set of features ignores the tree structure and extracts only the counts of the part-of-speech (POS) tokens in the parse tree. For (2), the counts of the POS are





POS by Level. A more complicated set of features is the histogram of tokens at each level of the tree. For the tree of (2), this is shown in Table 2.
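
The two count-based features can be sketched as follows. This is an illustration under assumptions, not the authors' code: "tokens" is taken to mean every node label in the tree (phrase labels such as NP as well as POS tags), the root is taken to be level 0, and the function names and toy tree are hypothetical.

```python
import re
from collections import Counter, defaultdict


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples; words are dropped."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                pos += 1              # skip a word attached to a POS tag
        pos += 1                      # consume ")"
        return (label, tuple(children))

    return node()


def label_counts(tree):
    """Counts of node labels, ignoring where in the tree they occur."""
    counts = Counter()
    stack = [tree]
    while stack:
        label, children = stack.pop()
        counts[label] += 1
        stack.extend(children)
    return counts


def label_counts_by_level(tree):
    """One histogram of node labels per level (assumption: the root is level 0)."""
    levels = defaultdict(Counter)
    stack = [(tree, 0)]
    while stack:
        (label, children), level = stack.pop()
        levels[level][label] += 1
        stack.extend((child, level + 1) for child in children)
    return dict(levels)


# Hypothetical toy tree (labels only); it is NOT the tree of (2).
toy = "(S (NP (DT) (NN)) (VP (VBD) (NP (DT) (JJ) (NN))))"
flat = label_counts(parse(toy))
by_level = label_counts_by_level(parse(toy))
print(dict(flat))
for level in sorted(by_level):
    print(level, dict(by_level[level]))
```

The per-level histograms sum to the flat counts, so the by-level feature carries strictly more information at the cost of a longer vector.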


For author classification, the idea is to compare the patterns in the feature vectors obtained from the sentences of one author with the patterns in the feature vectors of other authors.


4 Classifier

This section describes the basic operation of the classifier employed in the tests for this paper. Here, “classes” refers to the different authors under consideration. Let k denote the number of classes (authors).



Figure 1: Example parse tree


Figure 2: Some subtrees of depth 3 extracted from the tree in (2)


Table 2: POS counts by level for the tree (2).


Figure 3: Rooted Subtrees of the tree in (2) of one, two, and three levels



In the tests performed for this paper, the classifier works as follows (see Figure ??).



This paper is available on arXiv under a CC BY 4.0 DEED license.

