Can AI Tell Who Wrote Something Just by Analyzing Grammar?

by Authoring, March 7th, 2025

Too Long; Didn't Read

By using statistical parsing and probabilistic grammar models, this research refines authorship detection. The study removes words from parsed structures, focusing on syntactic patterns to create feature vectors, enhancing classification accuracy with NLP techniques.


Authors:

(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;

(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.

Abstract and 1 Introduction and Background

2 Statistical Parsing and Extracted Features

3 Parse Tree Features

4 Classifier

5 Dimension Reduction

6 The Federalist Papers

6.1 Sanditon

7 Conclusions, Discussion, and Future Work

A. A Brief Introduction to Statistical Parsing

B. Dimension Reduction: Some Mathematical Details

References

2 Statistical Parsing and Extracted Features

Part-of-speech (POS) tagging classifies the words in a sentence according to their part of speech, such as noun, verb, or interjection. Because of the complexity of the English language, there is potential for ambiguity. For example, many words (such as “seat”, “bat”, or “eye”) can be either nouns or verbs. This ambiguity can be dealt with using statistical parsing, in which a large corpus of language is used to develop probabilistic models of words based on their context. These models are typically trained with the assistance of human linguistic experts. The parser used in this work relies on a language model developed from the annotated corpus called the Penn Treebank, a corpus of over 7 million words of American English collected from multiple sources and labeled using human and semi-automated markup [22, 23]. The parser is described in [21]. It is a probabilistic context-free grammar (PCFG) parser [24], with language transition probabilities determined from the Penn Treebank corpus. The parser software is known as the Stanford Parser [25]. Parsing results presented here were produced by version 4.2.0, released November 17, 2020.
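To make the idea of a PCFG concrete, the following is a minimal sketch of a Viterbi CKY parser over a toy grammar. It is not the Stanford Parser and does not use the Penn Treebank: the grammar, the probabilities, the example sentence, and the function name `viterbi_cky` are all assumptions made up for illustration. It shows how a word that can be a noun or a verb ("seat") ends up with the tag that fits the most probable complete parse.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form:
# (parent, left child, right child) -> probability.
BINARY_RULES = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "DT", "NN"): 1.0,
    ("VP", "VB", "NP"): 1.0,
}
# Hypothetical lexical rules: (tag, word) -> probability.
# "seat" may be a noun (NN) or a verb (VB).
LEXICAL_RULES = {
    ("DT", "the"): 1.0,
    ("NN", "ushers"): 0.3,
    ("NN", "guests"): 0.3,
    ("NN", "seat"): 0.4,
    ("VB", "seat"): 1.0,
}


def viterbi_cky(words):
    """Return (probability, bracketed tree) of the best S parse, or None."""
    n = len(words)
    # best[(i, j)][label] = (probability, tree string) for words[i:j]
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for (tag, w), p in LEXICAL_RULES.items():
            if w == word:
                best[(i, i + 1)][tag] = (p, f"({tag} {word})")
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for (parent, left, right), p in BINARY_RULES.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        p_left, t_left = best[(i, k)][left]
                        p_right, t_right = best[(k, j)][right]
                        prob = p * p_left * p_right
                        # Keep only the most probable derivation per label.
                        if prob > best[(i, j)].get(parent, (0.0, ""))[0]:
                            best[(i, j)][parent] = (
                                prob,
                                f"({parent} {t_left} {t_right})",
                            )
    return best[(0, n)].get("S")


prob, tree = viterbi_cky("the ushers seat the guests".split())
print(tree)
# (S (NP (DT the) (NN ushers)) (VP (VB seat) (NP (DT the) (NN guests))))
```

Both tags for "seat" enter the chart, but only the verb reading takes part in a complete sentence, so the returned tree labels it VB. A production parser works on the same principle with a far larger grammar estimated from a treebank.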


Table 1 lists the POS labels (the POS tagset) associated with words when a sentence is parsed by this parser. It also lists the syntactic tagset produced by the parser during grammatical parsing (see [23, Table 1.1, Table 1.2] and [26, Chapter 5]).


A brief introduction to statistical parsing is provided in Appendix A.


As an example of the parsing, consider the first sentence of The Federalist Papers 1 by Alexander Hamilton:



Parsing this sentence yields the tree representation portrayed in Figure 1(a). The leaf nodes correspond to the words of the sentence, each labeled with a POS. The non-leaf (interior) nodes represent syntactic (grammatical structure) information determined by the parser. The label of each node of the tree is referred to as a token. The parse tree can be represented using the text string shown in Figure 1(b). In Figure 1(c), the same string is formatted to show the levels of the tree implied by the nesting of the parentheses.
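The relationship between the bracketed string and the indented, level-by-level layout can be sketched in a few lines of Python. The parse string below is a short made-up example, not the parse of the Hamilton sentence, and the helper names `read_tree` and `show_levels` are hypothetical. The exact layout used in the paper's figure may differ.

```python
import re


def read_tree(text):
    """Read a bracketed parse string into nested lists: [label, child, ...]."""
    stack = [[]]
    for token in re.findall(r"\(|\)|[^\s()]+", text):
        if token == "(":
            stack.append([])          # open a new node
        elif token == ")":
            node = stack.pop()        # close the node and attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(token)   # a label or a word
    return stack[0][0]


def show_levels(tree, depth=0):
    """Return one text line per node, indented according to its depth."""
    label, *children = tree
    indent = "  " * depth
    if all(isinstance(child, str) for child in children):
        # Leaf node: a POS label followed by its word.
        return [f"{indent}({label} {' '.join(children)})"]
    lines = [f"{indent}({label}"]
    for child in children:
        lines.extend(show_levels(child, depth + 1))
    lines[-1] += ")"                  # close this node after its last child
    return lines


# Hypothetical parser output for the sentence "the bat sleeps."
parse = "(ROOT (S (NP (DT the) (NN bat)) (VP (VBZ sleeps)) (. .)))"
print("\n".join(show_levels(read_tree(parse))))
```

Each additional level of parenthesis nesting becomes one further level of indentation, which is all the formatted view adds to the flat string.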


Table 1: Penn Treebank POS Tagset and Syntactic Tagset.


In preparing to extract feature vectors from a parse tree, some additional tidying-up is performed. The parser creates a ROOT node for every tree; this node is therefore uninformative and is removed. Punctuation nodes in the tree, such as (, ,), (. .), or (. ?), are removed. Since the intent is to explore how the parsed information, rather than the words of the document, can be used for classification, the words of the sentence are removed from the parse tree. With these edits, sentence (1) has the parsed representation



From this prepared data, various feature vectors were extracted, as described below. (The text manipulation and data extraction were done in Python, making extensive use of Python’s dictionary type. The parsed string (2) can be used, for example, as a key to a Python dictionary.)
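The tidying-up steps and the use of the prepared string as a dictionary key might look roughly like the sketch below. This is not the authors' code: the function names, the example parses, the output format for word-free leaves such as `(DT)`, and the set of punctuation labels in `PUNCT_LABELS` are all assumptions. `Counter` is used here as a convenient dictionary subclass.

```python
import re
from collections import Counter

# Assumed set of Penn Treebank punctuation labels; the text names only "," and ".".
PUNCT_LABELS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}


def read_tree(text):
    """Read a bracketed parse string into nested lists: [label, child, ...]."""
    stack = [[]]
    for token in re.findall(r"\(|\)|[^\s()]+", text):
        if token == "(":
            stack.append([])
        elif token == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    return stack[0][0]


def strip_tree(tree):
    """Drop words and punctuation nodes, keeping only the remaining labels."""
    label, *children = tree
    kept = [
        strip_tree(child)
        for child in children
        if not isinstance(child, str) and child[0] not in PUNCT_LABELS
    ]
    return [label, *kept]


def to_string(tree):
    """Write nested lists back out as a bracketed string."""
    label, *children = tree
    return "(" + " ".join([label, *map(to_string, children)]) + ")"


def prepare(parse):
    """Remove ROOT, punctuation nodes, and words from one parse string."""
    tree = read_tree(parse)
    if tree[0] == "ROOT" and len(tree) == 2:
        tree = tree[1]                # ROOT is on every tree, so it is dropped
    return to_string(strip_tree(tree))


# Hypothetical parser output for two short sentences.
parses = [
    "(ROOT (S (NP (DT the) (NN bat)) (VP (VBZ sleeps)) (. .)))",
    "(ROOT (S (NP (DT a) (NN seat)) (VP (VBZ creaks)) (. .)))",
]
counts = Counter(prepare(p) for p in parses)  # prepared string is the key
print(counts)
```

Because the words are gone, both example sentences reduce to the same string, `(S (NP (DT) (NN)) (VP (VBZ)))`, and the dictionary records that structure twice. Counts of this kind are one plausible starting point for the feature vectors described in the following sections.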


This paper is available on arXiv under the CC BY 4.0 DEED license.