
How LLMs Learn to Recognize Different Writing Styles

by Authoring, March 7th, 2025

Too Long; Didn't Read

Reducing the dimensionality of feature vectors improves authorship classification by filtering out noise and emphasizing key structural patterns. Using projection techniques similar to PCA, this method enhances accuracy while simplifying high-dimensional data.


Authors:

(1) Todd K. Moon, Electrical and Computer Engineering Department, Utah State University, Logan, Utah;

(2) Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, Utah.

Abstract and 1 Introduction and Background

2 Statistical Parsing and Extracted Features

3 Parse Tree Features

4 Classifier

5 Dimension Reduction

6 The Federalist Papers

6.1 Sanditon

7 Conclusions, Discussion, and Future Work

A. A Brief Introduction to Statistical Parsing

B. Dimension Reduction: Some Mathematical Details

References

5 Dimension Reduction

As described in section 3, the number of elements m in the feature vectors can be very large. It has been found helpful to reduce the dimensionality of the feature vectors by projecting them into a lower-dimensional space. The reduction of dimension is similar to principal component analysis (PCA) [17], but is used when the dimension of the vectors exceeds the number of observed vectors in the classes. This approach has been used in other textual analysis problems [27, 28] and in facial recognition problems [29]. (In [16], dimension reduction is accomplished in a two-stage process, with PCA followed by a process similar to the one described here.) In this section we introduce the criterion used to perform the projection. Appendix B provides a few more details (see [27, 29] for further detail).
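To make the projection step concrete, here is a minimal sketch (not the authors' code; the random data, array shapes, and names such as n_docs and n_components are illustrative assumptions) of projecting high-dimensional feature vectors onto a small number of PCA-style directions in NumPy.

```python
# Minimal sketch: project feature vectors (far more features than documents)
# onto a few leading directions. Data and dimensions are made up.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_features = 60, 5000            # many more features than observations
X = rng.standard_normal((n_docs, n_features))

# Center the data and take leading directions via SVD (the PCA-like step).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 10
W = Vt[:n_components].T                  # m x d projection matrix
X_low = Xc @ W                           # n x d reduced feature vectors
print(X_low.shape)                       # (60, 10)
```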


While the feature vectors live in a high-dimensional space, the salient concepts of dimension reduction can be illustrated in a low-dimensional space, as in figure 4. In that figure there are two 2-dimensional data sets, denoted by ◦ and ×, respectively. The problem is to determine, for a given vector, which class it belongs to. Also shown in the figure are two axes onto which the data are projected. (For "projection," think of the shadow cast by the data points under a light source high above the projection line.) The 1-dimensional data produced by Projection 1 have cluster widths denoted by SW1 and SW2. This is the within-cluster scatter, a measure of the variance (or width) of the densities. There is also a between-cluster scatter, a measure of how far the cluster centroids are from the overall centroid. In Projection 1, the between-cluster measure is rather small compared with the cluster widths. By contrast, the 1-dimensional data produced by Projection 2 have a much larger between-cluster measure SB. The within-cluster scatters SW1 and SW2 are also larger, but the between-cluster measure has grown more than these within-cluster measures. Projection 2 therefore produces data that would be more easily classified than Projection 1.
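A small numeric companion to figure 4 may help. The sketch below uses assumed synthetic clusters (not the paper's data, and only loosely mimicking the figure) and compares within-cluster and between-cluster scatter of the 1-dimensional data obtained by projecting onto two candidate directions.

```python
# Sketch: within-cluster vs. between-cluster scatter of 1-D projections
# of two synthetic 2-D clusters onto two candidate directions.
import numpy as np

rng = np.random.default_rng(1)
spread = np.array([[3.0, 0.0], [0.0, 0.5]])                 # elongated clusters
A = rng.standard_normal((100, 2)) @ spread                  # the "o" cluster
B = rng.standard_normal((100, 2)) @ spread + [0.0, 2.0]     # the "x" cluster, shifted

def scatters_1d(direction, clusters):
    """Project each cluster onto `direction`; return (within, between) scatter."""
    d = direction / np.linalg.norm(direction)
    proj = [c @ d for c in clusters]
    overall_mean = np.mean(np.concatenate(proj))
    within = sum(((p - p.mean()) ** 2).sum() for p in proj)
    between = sum(len(p) * (p.mean() - overall_mean) ** 2 for p in proj)
    return within, between

for name, d in [("Projection 1", np.array([1.0, 0.0])),
                ("Projection 2", np.array([0.0, 1.0]))]:
    SW, SB = scatters_1d(d, [A, B])
    print(f"{name}: within = {SW:.1f}, between = {SB:.1f}, ratio = {SB / SW:.3f}")
```

A direction along which the clusters separate gives a larger between-to-within ratio, which is the quantity the projection criterion below favors.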


Figure 4: Illustration of within-cluster and between-cluster scattering and projection.


More generally, one can conceive of rotating the data through various angles with respect to the axis the data are projected upon. At some angles, the between-cluster scatter will be large compared to the within-cluster scatters.


In light of this discussion, the goal of the projection operation is to determine the best "angles of rotation" to project upon, those that maximize the between-cluster scatter while minimizing the within-cluster scatter. In general, there are k clusters of data to deal with (not just the two portrayed in figure 4). All of this takes place in a very high-dimensional space, where we cannot visualize the data, so it is done via mathematical transformations. In higher dimensions, it is also not merely a matter of projecting onto a single direction: starting from m dimensions, the projected data could have dimension 1, 2, and so on, up to m − 1. The best dimensionality to project onto is not known in advance, so it is one of the parameters examined in the experiments described below.
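One standard way to carry out this search, sketched below, is to solve a generalized eigenvalue problem that trades between-cluster against within-cluster scatter and to keep the directions with the largest generalized eigenvalues. This is a Fisher-discriminant-style sketch, not necessarily the exact formulation of Appendix B; the ridge regularizer and function names are assumptions, and the scatter matrices S_W and S_B are computed as in the sketch that follows the equations below.

```python
# Sketch: choose projection directions from precomputed scatter matrices
# (Fisher/LDA-style; see Appendix B for the paper's exact formulation).
import numpy as np
from scipy.linalg import eigh

def projection_from_scatter(S_W, S_B, n_dims, ridge=1e-6):
    """Return an m x n_dims matrix whose columns give directions that
    maximize between-cluster scatter relative to within-cluster scatter."""
    m = S_W.shape[0]
    # A small ridge keeps S_W invertible when the feature dimension m
    # exceeds the number of observations, as it does here.
    evals, evecs = eigh(S_B, S_W + ridge * np.eye(m))
    order = np.argsort(evals)[::-1]       # largest generalized eigenvalues first
    return evecs[:, order[:n_dims]]

# Usage: X_low = X @ projection_from_scatter(S_W, S_B, n_dims=5)
```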


With this discussion in place, we now describe the mathematics of how the projection is accomplished. For class i, the within-cluster scatter matrix, that is, a measure of how the data in the class vary around the centroid of the class, is

$$ S_i = \sum_{\mathbf{x} \in \text{class } i} (\mathbf{x} - \boldsymbol{\mu}_i)(\mathbf{x} - \boldsymbol{\mu}_i)^T, $$

where $\boldsymbol{\mu}_i$ denotes the centroid (mean) of the feature vectors in class i. The total within-cluster scatter matrix is the sum of the within-class scatter matrices,

$$ S_W = \sum_{i=1}^{k} S_i. $$
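In code, these definitions, together with a between-cluster scatter matrix measuring how far the class centroids sit from the overall centroid (as described above), can be written as the following sketch; the variable names X and labels are illustrative assumptions, with the rows of X holding the feature vectors.

```python
# Sketch: per-class scatter S_i accumulated into the total within-cluster
# scatter S_W, plus a between-cluster scatter S_B built from the centroids.
import numpy as np

def scatter_matrices(X, labels):
    overall_mean = X.mean(axis=0)
    m = X.shape[1]
    S_W = np.zeros((m, m))
    S_B = np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[labels == c]                   # feature vectors in class c
        mu = Xc.mean(axis=0)                  # class centroid
        D = Xc - mu
        S_W += D.T @ D                        # S_i, summed into S_W
        diff = (mu - overall_mean)[:, None]
        S_B += len(Xc) * (diff @ diff.T)      # centroid spread about overall mean
    return S_W, S_B
```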



Table 3: Number of subtrees in union and intersection of sets of all subtrees

Table 4: Summary statistics of Federalist data

It may be surprising that projecting into lower-dimensional spaces can improve performance; it seems to throw away information that may be useful in the classification. What happens, however, is that the information discarded lies in directions that are noisy, or that may confuse the classifier. As the results below indicate, projecting onto lower dimensions can significantly improve the classifier.
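To see why the projected dimension matters in practice, a sweep like the sketch below varies the projection dimension and records leave-one-out accuracy. It uses synthetic data standing in for the parse-tree features, a simple nearest-centroid rule standing in for the classifier of section 4, and reuses the scatter_matrices and projection_from_scatter sketches above; the experiments in the following sections examine this parameter with the actual classifier.

```python
# Sketch: sweep the projection dimension and measure leave-one-out accuracy
# of a nearest-centroid rule on synthetic data. Builds on the
# scatter_matrices and projection_from_scatter sketches above.
import numpy as np

rng = np.random.default_rng(2)
k, n_per_class, m = 3, 20, 200                 # classes, docs per class, features
centers = 0.5 * rng.standard_normal((k, m))
X = np.vstack([c + rng.standard_normal((n_per_class, m)) for c in centers])
labels = np.repeat(np.arange(k), n_per_class)

def nearest_centroid_loo_accuracy(Z, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier on Z."""
    correct = 0
    for i in range(len(Z)):
        keep = np.arange(len(Z)) != i
        centroids = [Z[keep & (labels == c)].mean(axis=0) for c in np.unique(labels)]
        pred = np.argmin([np.linalg.norm(Z[i] - c) for c in centroids])
        correct += int(pred == labels[i])
    return correct / len(Z)

S_W, S_B = scatter_matrices(X, labels)
for d in (1, 2, 5, 10, 50):
    W = projection_from_scatter(S_W, S_B, n_dims=d)
    acc = nearest_centroid_loo_accuracy(X @ W, labels)
    print(f"projected dimension {d}: accuracy = {acc:.2f}")
```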


This paper is available on arxiv under CC BY 4.0 DEED license.