
When Words Won’t Talk, Sentence Structures Spill the Truth

by Authoring, March 7th, 2025

Too Long; Didn't Read

Statistical parsing proves effective in distinguishing authors by analyzing grammatical structures rather than traditional word-based methods. While The Federalist Papers benefited from dimensionality reduction, Sanditon did not, highlighting the method’s adaptability. This technique complements traditional stylometric approaches and could enhance machine learning models in authorship attribution.


Authors:

(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;

(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.

Abstract and 1 Introduction and Background

2 Statistical Parsing and Extracted Features

3 Parse Tree Features

4 Classifier

5 Dimension Reduction

6 The Federalist Papers

6.1 Sanditon

7 Conclusions, Discussion, and Future Work

A. A Brief Introduction to Statistical Parsing

B. Dimension Reduction: Some Mathematical Details

References

7 Conclusions, Discussion, and Future Work

As this paper has demonstrated, information drawn from statistical parsing of a text can be used to distinguish between authors. Different sets of features have been considered (all subtrees, rooted subtrees, POS, and POS by level), with different degrees of performance among them. To the authors' knowledge, none of these features other than POS has been previously considered, including in the large set of features examined in [16]. This suggests that these tree-based features, especially those based on all subtrees, may be beneficially included among other features.
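To make the feature types concrete, the following is a minimal sketch (not the paper's implementation) of how POS counts and one-level rooted-subtree counts might be extracted from a parse tree. The nested-tuple tree encoding and the helper names are illustrative assumptions, not part of the original work.

```python
from collections import Counter

def pos_counts(tree):
    """Count part-of-speech (preterminal) tags in a nested-tuple parse tree.

    A tree is (label, child, ...) where each child is either another tree
    or a terminal word (a plain string).
    """
    counts = Counter()
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        counts[label] += 1          # preterminal node: label is a POS tag
    else:
        for c in children:
            counts.update(pos_counts(c))
    return counts

def rooted_subtree_shapes(tree):
    """Count one-level rooted-subtree shapes: a node label together with
    the labels of its immediate nonterminal children."""
    counts = Counter()
    label, *children = tree
    child_labels = tuple(c[0] for c in children if not isinstance(c, str))
    if child_labels:
        counts[(label,) + child_labels] += 1
    for c in children:
        if not isinstance(c, str):
            counts.update(rooted_subtree_shapes(c))
    return counts

# Toy parse of "the dog saw a cat"
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "saw"),
               ("NP", ("DT", "a"), ("NN", "cat"))))

print(pos_counts(tree))            # DT and NN each appear twice, VBD once
print(rooted_subtree_shapes(tree))  # (NP DT NN) appears twice
```

Normalizing such counts over a whole document yields the kind of feature vector a classifier can compare across authors.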


It appears that the Sanditon texts are easier to classify than The Federalist Papers. Even without the generally performance-enhancing step of dimension reduction, Sanditon classifies well, even using the POS feature vectors, which are not as strong when applied to The Federalist Papers. This is amusing, since the completer of Sanditon attempted to write in an imitative style; the result suggests that these structural features are not easily faked.


The methods examined here do not supplant the excellent work on author identification that has previously been done, which typically relies on more readily observable features of a document (such as counts of words drawn from some appropriately chosen set). Such features make previous methods easier to compute, but they may also make the author identification easier to spoof. Grammatical parsing provides more subtle features that should be more difficult to spoof.


Another tradeoff is the amount of data needed to extract a statistically meaningful feature vector. The number of trees (and hence the number of feature elements) quickly becomes very large. To be statistically significant, a feature element should have multiple counts. (Recall that for the chi-squared test in classical statistics, a rule of thumb is that at least five counts are needed.) This need for many counts per feature indicates that the method is best applied to large documents.
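One simple way to act on the five-count rule of thumb is to prune feature elements seen too rarely before classification. This is a hypothetical sketch, not the paper's procedure; the threshold and the subtree strings are illustrative.

```python
from collections import Counter

MIN_COUNT = 5   # chi-squared rule of thumb: expect at least five counts

def prune_rare_features(counts, min_count=MIN_COUNT):
    """Keep only feature elements observed at least min_count times."""
    return Counter({f: c for f, c in counts.items() if c >= min_count})

raw = Counter({"(S NP VP)": 120, "(NP DT NN)": 300, "(VP VBZ ADJP)": 3})
pruned = prune_rare_features(raw)
print(pruned)   # the rare subtree "(VP VBZ ADJP)" is dropped
```

In a short document most subtree features fall below such a threshold, which is why the method favors large documents.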


In light of these considerations, the method described here may be considered supplemental to more traditional author identification methods.


The method is naturally agnostic to the particular content of a document (it does not require selecting some subset of words to use for comparisons) and so should be applicable to documents across different styles and genres. The analysis could be applied to any document amenable to statistical parsing. (Documents with a lot of specialized notation, such as mathematical or chemical notation, would likely require adapting the parser.)


This paper suggests many possibilities for future work. Of course there is the question of how this approach will combine with other work in author identification. It is curious that dimension reduction behaves so differently for the Federalist and Sanditon texts: the Federalist classifies best in smaller dimensions, while Sanditon works better in larger dimensions. Given the recent furor over machine learning, it would be interesting to see whether the features extracted by the grammatical parser correspond in any way to features that would be extracted by an ML tool. (Our suspicion is that training current ML tools does not extract grammatical information applicable to the author identification problem.)


Table 17: Example rules for a PCFG (see Figure 14.1 of [26]). S = start symbol (sentence); NP = noun phrase; VP = verb phrase; PP = prepositional phrase.
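To illustrate what rules like those in Table 17 describe, here is a toy probabilistic context-free grammar sketch. The grammar, probabilities, and vocabulary are invented for illustration and are not the rules from Table 17; in a PCFG, each nonterminal's expansion probabilities sum to one.

```python
import random

# Toy PCFG: each nonterminal maps to (probability, expansion) pairs.
PCFG = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.7, ["DT", "NN"]), (0.3, ["NP", "PP"])],
    "VP":  [(0.6, ["VBD", "NP"]), (0.4, ["VP", "PP"])],
    "PP":  [(1.0, ["IN", "NP"])],
    "DT":  [(1.0, ["the"])],
    "NN":  [(0.5, ["dog"]), (0.5, ["telescope"])],
    "VBD": [(1.0, ["saw"])],
    "IN":  [(1.0, ["with"])],
}

def generate(symbol="S", rng=random):
    """Sample a sentence by expanding nonterminals left to right,
    choosing each expansion according to its rule probability."""
    if symbol not in PCFG:          # terminal word: emit it
        return [symbol]
    r, acc = rng.random(), 0.0
    for p, expansion in PCFG[symbol]:
        acc += p
        if r <= acc:
            return [w for s in expansion for w in generate(s, rng)]
    # Guard against floating-point rounding: fall back to the last rule.
    return [w for s in PCFG[symbol][-1][1] for w in generate(s, rng)]

random.seed(0)
print(" ".join(generate()))
```

A statistical parser runs this process in reverse: given a sentence, it finds the most probable derivation under the grammar, and it is the shape of that derivation tree that supplies the features used in this paper.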


This paper is available on arxiv under CC BY 4.0 DEED license.