Authors:
(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;
(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.
Abstract and 1 Introduction and Background
2 Statistical Parsing and Extracted Features
7 Conclusions, Discussion, and Future Work
A. A Brief Introduction to Statistical Parsing
B. Dimension Reduction: Some Mathematical Details
Over the years there has been ongoing interest in detecting the authorship of a text based on statistical properties of the text, such as the occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine the authorship of all of The Federalist Papers. Such methods may also prove useful in modern settings to detect fake or AI-generated authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore that possibility, detecting authorship using grammatical structural information extracted by a statistical natural language parser. The paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth rooted at any level; rooted subtrees of some depth; part of speech; and part of speech by level in the parse tree. Projecting the features into a lower-dimensional space was found to be helpful. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
There has been considerable effort over the years to identify the authorship of texts using statistical methods, based on examples from candidate authors, in what is sometimes called “stylometry” or “author identification.” Statistical analysis of documents goes back to Augustus de Morgan in 1851 [1, p. 282], [2, p. 166], who proposed that word-length statistics might be used to determine the authorship of the Pauline epistles. Stylometry was employed as early as 1901 to explore the authorship of Shakespeare [3]. Since then, it has been employed in a variety of literary studies (see, e.g., [4, 5, 6]), including twelve of The Federalist Papers whose authorship was uncertain [7] and an unfinished novel by Jane Austen, both of which we re-examine here. Information-theoretic techniques have also been used more recently [8]. Earlier work in stylometry has been based on “noncontextual words,” words which do not convey the primary meaning of the text but which act in its background to provide structure and flow. Noncontextual words are at least plausible as features, since an author may address a variety of topics, so particular distinguishing words are not necessarily revealing of authorship. In noncontextual-word studies, a set of the most common noncontextual words is selected [2], and documents are represented by word counts, or by ratios of word counts to document length. A review of the statistical methods is given in [9]. As a variation, sets of ratios of counts of noncontextual word patterns to other word patterns have also been employed [10]. Statistical analysis based on author vocabulary size vs. document length — the “vocabulary richness” — has also been explored [11]. For other related work, see [12, 13, 14, 15].
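As a concrete illustration of the noncontextual-word representation, the sketch below computes per-word occurrence rates for a short function-word list; the list here is illustrative only, not the set used in the studies cited above.

```python
from collections import Counter

# Illustrative function-word list; NOT the set used in the cited studies.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "by", "upon", "on", "would"]

def function_word_rates(text):
    """Represent a document by the rate of each function word:
    its count divided by the document length in words."""
    tokens = text.lower().split()  # crude tokenization, for illustration only
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]
```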
A more recent paper [16] considers the effectiveness of a wide variety of feature sets. Feature sets considered there include vectors comprising frequencies of: pronouns; function words (that is, articles, pronouns, particles, expletives); part of speech (POS); most common words; syntactic features (such as noun phrase or verb phrase); tense (e.g., use of present or past tense); and voice (active or passive). In [16], feature vectors are formed from combinations of histograms, then reduced in dimensionality using a two-stage process of principal component analysis [17] followed by linear discriminant analysis (LDA). In their LDA, the within-cluster scatter matrix is singular (due to the high dimension of the feature vectors relative to the number of available training vectors), so the scatter matrix is regularized. To set the regularization parameter, the authors consider a range of values, selecting the one which gives the best performance.
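A minimal sketch of this two-stage reduction, using scikit-learn with shrinkage-regularized LDA standing in for the regularization described in [16] (the PCA dimension and shrinkage value are placeholders that would be chosen by validation):

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# PCA first, so the within-cluster scatter seen by LDA is better conditioned;
# then LDA whose scatter matrix is regularized via the shrinkage parameter.
reducer = make_pipeline(
    PCA(n_components=50),                                       # placeholder
    LinearDiscriminantAnalysis(solver="eigen", shrinkage=0.1),  # placeholder
)
# features: (n_documents, n_features) array; labels: author of each document
# Z = reducer.fit_transform(features, labels)
```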
More recent work [18] cites the survey in [15], in which the most commonly used features in the authorship field are word and character n-grams. As noted there, such statistical methods risk being biased by topic-related patterns. As the authors of [18] observe, “an authorship classifier (even a seemingly good one) might end up unintentionally performing topic identification if domain-dependent features are used. ... In order to avoid this, researchers might limit their scope to features that are clearly topic-agnostic, such as function words or syntactic features.” The work presented here falls in the latter category, making use of grammatical structures statistically extracted from the text. These appear to be difficult to spoof. Examination of other recent works [19, 20] indicates that there is ongoing interest in author identification methods, but none make use of the grammatical structures used here; the tendency is to rely on more traditional n-grams.
In this work the feature vectors are obtained from parse trees produced by a natural language parsing tool [21]. These features were not among those considered in [16]. The grammatical structures are, it seems, more subtle than simple counts of classes of words, and hence may be less subject to spoofing or topic bias: it seems unlikely that an author intending to imitate another could coherently track such complicated patterns of usage, and the features include no words from the documents. The tree-based features are found to perform better than the POS features on the test data considered.
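To make the tree-based features concrete, the following sketch (written against NLTK's Tree type; the parser [21] and the paper's exact feature definitions may differ) counts depth-limited subtree shapes rooted at every node of a parse, discarding the words themselves:

```python
from collections import Counter
import nltk

def subtree_shape(tree, depth):
    """The label-only shape of `tree`, truncated to `depth` levels;
    words become a placeholder so no vocabulary leaks into the feature."""
    if not isinstance(tree, nltk.Tree):
        return "w"                      # a word: keep only a placeholder
    if depth == 0:
        return tree.label()             # truncate: keep the node label only
    return (tree.label(), tuple(subtree_shape(c, depth - 1) for c in tree))

def subtree_counts(parse, depth=2):
    """Count every depth-limited subtree shape rooted at any node."""
    return Counter(subtree_shape(node, depth) for node in parse.subtrees())

# Example on a Penn-Treebank-style parse:
t = nltk.Tree.fromstring("(S (NP (DT The) (NN author)) (VP (VBZ writes)))")
print(subtree_counts(t, depth=2))
```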
The feature vectors so obtained can be of very high dimension, so dimension reduction is also performed here. However, to deal with the singularity of the within-cluster scatter matrix, a generalized SVD approach is used, which avoids the need to select a regularization parameter.
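A compact sketch of this idea, in the spirit of the LDA/GSVD algorithm of Howland and Park (Appendix B gives the details actually used, which may differ):

```python
import numpy as np

def lda_gsvd(X, y, n_components=None):
    """LDA projection without inverting the (singular) within-cluster scatter.
    X: (n_samples, n_features); y: class labels. Returns projection matrix G."""
    classes = np.unique(y)
    k, mu = len(classes), X.mean(axis=0)
    Hb, Hw = [], []
    for c in classes:
        Xi = X[y == c]
        Hb.append(np.sqrt(len(Xi)) * (Xi.mean(axis=0) - mu))
        Hw.append(Xi - Xi.mean(axis=0))
    # Stack so S_b = Hb^T Hb and S_w = Hw^T Hw, without ever forming either.
    K = np.vstack([np.stack(Hb), np.vstack(Hw)])
    P, s, Qt = np.linalg.svd(K, full_matrices=False)
    t = int(np.sum(s > s[0] * max(K.shape) * np.finfo(float).eps))  # rank of K
    P, s, Qt = P[:, :t], s[:t], Qt[:t]
    # SVD of the between-class block of P orders directions by discriminability.
    _, _, Wt = np.linalg.svd(P[:k, :], full_matrices=False)
    G = (Qt.T / s) @ Wt.T    # generalized discriminant directions as columns
    d = n_components if n_components is not None else k - 1
    return G[:, :d]

# Usage (illustrative): Z = X @ lda_gsvd(X, y)
```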
This paper provides a proof of concept for these tree-based features, applying them to distinguish authorship in documents which have been previously examined, The Federalist Papers and Sanditon. The ability to classify by authorship is explored for several feature vectors obtained from the parsed information.
This paper is available on arxiv under CC BY 4.0 DEED license.