
Media Slant: How We Classified Transcripts by TV Source Using Machine Learning


Abstract and 1. Introduction

2. Data

3. Measuring Media Slant and 3.1. Text pre-processing and featurization

3.2. Classifying transcripts by TV source

3.3. Text similarity between newspapers and TV stations and 3.4. Topic model

4. Econometric Framework

4.1. Instrumental variables specification

4.2. Instrument first stage and validity

5. Results

5.1. Main results

5.2. Robustness checks

6. Mechanisms and Heterogeneity

6.1. Local vs. national or international news content

6.2. Cable news media slant polarizes local newspapers

7. Conclusion and References


Online Appendices

A. Data Appendix

A.1. Newspaper articles

A.2. Alternative county matching of newspapers and A.3. Filtering of the article snippets

A.4. Included prime-time TV shows and A.5. Summary statistics

B. Methods Appendix, B.1. Text pre-processing and B.2. Bigrams most predictive for FNC or CNN/MSNBC

B.3. Human validation of NLP model

B.4. Distribution of Fox News similarity in newspapers and B.5. Example articles by Fox News similarity

B.6. Topics from the newspaper-based LDA model

C. Results Appendix

C.1. First stage results and C.2. Instrument exogeneity

C.3. Placebo: Content similarity in 1995/96

C.4. OLS results

C.5. Reduced form results

C.6. Sub-samples: Newspaper headquarters and other counties and C.7. Robustness: Alternative county matching

C.8. Robustness: Historical circulation weights and C.9. Robustness: Relative circulation weights

C.10. Robustness: Absolute and relative FNC viewership and C.11. Robustness: Dropping observations and clustering

C.12. Mechanisms: Language features and topics

C.13. Mechanisms: Descriptive Evidence on Demand Side

C.14. Mechanisms: Slant contagion and polarization

3.2. Classifying transcripts by TV source

We train a machine-learning classifier to predict whether a transcript snippet m comes from FNC or CNN/MSNBC. We split the corpus into 80% training data and 20% test data. We build the classifier in the training set and evaluate it in the test set.
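As a concrete illustration of this step, the sketch below is not the authors' code; the snippets, labels, and scikit-learn tooling are illustrative assumptions. It splits a labeled corpus of transcript snippets 80/20 and builds the bigram count matrix on the training data only:

```python
# Minimal sketch of the 80/20 split (illustrative, not the authors' code).
# `snippets` and `labels` are toy placeholders; label 1 = FNC, 0 = CNN/MSNBC.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

snippets = [
    "sean hannity reports on the southern border tonight",
    "anderson cooper discusses the health care debate",
    "the far left agenda and crime in american cities",
    "world leaders meet to discuss climate policy",
]
labels = [1, 0, 1, 0]  # 1 = FNC, 0 = CNN/MSNBC

# Hold out 20% of snippets for testing; learn the bigram vocabulary on training data only.
train_snip, test_snip, y_train, y_test = train_test_split(
    snippets, labels, test_size=0.20, random_state=0
)
vectorizer = CountVectorizer(ngram_range=(2, 2))  # bigram features, as in Section 3.1
X_train = vectorizer.fit_transform(train_snip)
X_test = vectorizer.transform(test_snip)
```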


We take two steps to further pre-process the features, both using only the training set to ensure a clean evaluation on the test set. First, we perform supervised feature selection to reduce the dimensionality of the predictor matrix: out of the 65,000-bigram dictionary, we select the 2,000 most predictive features based on their χ² score for the true label FNC. Second, we scale all selected predictors to variance one (we do not subtract the mean, however, as that would destroy sparsity). Let S be the set of selected and scaled features, indexed by b. Let B_bm be the frequency of bigram b in transcript m, and B_m the vector of these frequencies for transcript m, of length |S| = 2,000.
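A hedged sketch of these two pre-processing steps, continuing from the split above; SelectKBest and StandardScaler are one way to implement the description, not necessarily the authors' tooling:

```python
# Sketch of supervised feature selection and scaling (illustrative).
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler

# Keep the 2,000 bigrams with the highest chi-squared score for the FNC label
# (capped by the toy vocabulary size so the example runs end to end).
k = min(2000, X_train.shape[1])
selector = SelectKBest(chi2, k=k)
X_train_sel = selector.fit_transform(X_train, y_train)  # fitted on training data only
X_test_sel = selector.transform(X_test)

# Scale to unit variance without centering, so the matrices stay sparse.
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train_sel)
X_test_scaled = scaler.transform(X_test_sel)
```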


Our classification method is a penalized logistic regression (Hastie et al., 2009). We parametrize the probability that a transcript is from Fox News as

Pr(FNC_m = 1 | B_m) = exp(ψ′B_m) / (1 + exp(ψ′B_m)),

where ψ is a 2,000-dimensional vector of coefficients on each feature. The L2-penalized logistic regression model chooses ψ to minimize the cost objective

J(ψ) = −(1/M*) Σ_m [ FNC_m · log Pr(FNC_m = 1 | B_m) + (1 − FNC_m) · log(1 − Pr(FNC_m = 1 | B_m)) ] + λ Σ_b ψ_b²,

where the first sum runs over the M* documents in the training sample, the second sum runs over the selected bigrams b, and λ governs the strength of the L2 penalty.
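Under these definitions, the training step amounts to fitting a standard L2-regularized logistic regression. A sketch with scikit-learn, continuing from the code above; the regularization strength C and solver settings are illustrative assumptions, not values taken from the paper:

```python
# Fit an L2-penalized logistic regression on the scaled training features (illustrative).
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)  # C is the inverse penalty weight
clf.fit(X_train_scaled, y_train)

# Predicted probability that each held-out snippet comes from FNC.
p_fnc = clf.predict_proba(X_test_scaled)[:, 1]
```

Note that in scikit-learn's parametrization a larger penalty λ corresponds to a smaller C.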



We evaluate the classifier’s performance in the test set, obtaining an accuracy of 0.73 (with a standard deviation of 0.02 across five folds). This performance is much better than guessing (i.e., an accuracy of 0.5 in the balanced sample) and comparable with other work in this literature.[6] Table 1 shows good precision and recall across the two categories.
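For reference, this kind of test-set evaluation (overall accuracy plus the per-class precision and recall shown in Table 1) could be computed as in the sketch below, continuing from the earlier code; the numbers it prints for the toy data are meaningless, and the 0.73 accuracy above comes from the authors' corpus:

```python
# Evaluate the trained classifier on the held-out snippets (illustrative).
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test_scaled)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(
    y_test, y_pred, labels=[0, 1],
    target_names=["CNN/MSNBC", "FNC"], zero_division=0,
))
```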


Next, we compare our model to human judgment. Human annotators (U.S. college students) guessed whether 80-word TV transcript snippets came from FNC or CNN/MSNBC. The annotators are between 73% and 78% accurate in their guesses, and they agree with one another 58% of the time (if guessing randomly, their agreement rate would be 25%). Thus, our machine-learning model performs comparably to human annotators: the 80-word snippets contain significant information about the source network, and our text-based model captures it. Appendix B.3 further describes the human validation.


Table 1: Test-Set Prediction Performance for Identifying Cable News Source


We now examine which bigrams are most important for classification. An advantage of logistic regression is its interpretability: the estimated coefficients of the trained model, ψ̂_b, rank the 2,000 predictive bigrams by their relative contribution to the predictions. Table B.1 shows example bigrams with positive (predictive of FNC transcripts) or negative (predictive of CNN/MSNBC) values of ψ̂_b, and Table B.2 provides a longer list. Prominent figures like Sean Hannity (predictive of FNC) or Anderson Cooper (predictive of CNN/MSNBC) appear among the bigrams. FNC bigrams allude to intuitively conservative priorities, such as the troops, crime, terrorism, and the (implied) extremism of political counterparts (“far left”). CNN/MSNBC bigrams have a more liberal flavor, with mentions of health-policy-related tokens and an emphasis on international perspectives.
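Continuing the illustrative sketch, this coefficient-based ranking can be read directly off the fitted model (variable names follow the earlier code, not the authors' implementation):

```python
# Rank selected bigrams by their estimated coefficient (illustrative).
import numpy as np

selected_bigrams = np.asarray(vectorizer.get_feature_names_out())[selector.get_support()]
order = np.argsort(clf.coef_[0])  # most negative (CNN/MSNBC) to most positive (FNC)

print("Most CNN/MSNBC-predictive:", selected_bigrams[order[:10]].tolist())
print("Most FNC-predictive:", selected_bigrams[order[-10:]].tolist())
```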


This paper is available on arXiv under a CC 4.0 license.


[6] The prediction accuracy for partisan affiliation in the U.K. parliament obtained by Peterson and Spirling (2018) is between 60% and 80%, depending on the time period. According to Gentzkow et al. (2019b), one can correctly guess a speaker’s party from a one-minute speech with 73% accuracy in the U.S. Congress (2007–2009). Kleinberg et al. (2017) obtain an AUC of 0.71 in predicting recidivism from criminal defendant characteristics.

Authors:

(1) Philine Widmer, ETH Zürich and [email protected];

(2) Sergio Galletta, ETH Zürich and [email protected];

(3) Elliott Ash, ETH Zürich and [email protected].