How We Measured Media Slant and Text Pre-Processing and Featurization

by @mediabias


Too Long; Didn't Read

This section describes how the study measures media slant: newspaper article snippets and TV show transcripts are pre-processed, stemmed, and converted into bigrams, producing a 65,000-bigram vocabulary. These features feed a supervised machine-learning model that predicts whether a newspaper article's content resembles that of Fox News (FNC) or of CNN/MSNBC.

Abstract and 1 Introduction 2. Data

3. Measuring Media Slant and 3.1. Text pre-processing and featurization

3.2. Classifying transcripts by TV source

3.3. Text similarity between newspapers and TV stations and 3.4. Topic model

4. Econometric Framework

4.1. Instrumental variables specification

4.2. Instrument first stage and validity

5. Results

5.1. Main results

5.2. Robustness checks

6. Mechanisms and Heterogeneity

6.1. Local vs. national or international news content

6.2. Cable news media slant polarizes local newspapers

7. Conclusion and References


Online Appendices

A. Data Appendix

A.1. Newspaper articles

A.2. Alternative county matching of newspapers and A.3. Filtering of the article snippets

A.4. Included prime-time TV shows and A.5. Summary statistics

B. Methods Appendix, B.1. Text pre-processing and B.2. Bigrams most predictive for FNC or CNN/MSNBC

B.3. Human validation of NLP model

B.4. Distribution of Fox News similarity in newspapers and B.5. Example articles by Fox News similarity

B.6. Topics from the newspaper-based LDA model

C. Results Appendix

C.1. First stage results and C.2. Instrument exogeneity

C.3. Placebo: Content similarity in 1995/96

C.4. OLS results

C.5. Reduced form results

C.6. Sub-samples: Newspaper headquarters and other counties and C.7. Robustness: Alternative county matching

C.8. Robustness: Historical circulation weights and C.9. Robustness: Relative circulation weights

C.10. Robustness: Absolute and relative FNC viewership and C.11. Robustness: Dropping observations and clustering

C.12. Mechanisms: Language features and topics

C.13. Mechanisms: Descriptive Evidence on Demand Side

C.14. Mechanisms: Slant contagion and polarization

3. Measuring Media Slant

This section describes how we construct the language measures used as outcomes in our regression analysis. We aim to capture the textual similarity between (i) the newspaper article snippets and (ii) the TV show transcripts. To this end, we implement a supervised machine-learning approach to predict whether a newspaper article’s content resembles that of a particular TV station (FNC or CNN/MSNBC).[4]

3.1. Text pre-processing and featurization

First, we pre-process the newspaper articles and TV transcripts, stem all words, and form bigrams (two-word phrases); see Appendix B.1 for details.
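The exact pipeline is given in Appendix B.1. As a rough, illustrative sketch of this kind of pre-processing (the toy suffix-stripping stemmer below stands in for a proper stemmer such as Porter's; none of this is the authors' actual code):

```python
import re

def crude_stem(word):
    # Toy suffix stripper, a stand-in for a real stemming algorithm.
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_bigrams(text):
    # Lowercase, keep alphabetic tokens, stem, then join adjacent pairs.
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

print(to_bigrams("The anchor reported breaking news"))
```

Stemming collapses inflected forms ("report", "reported", "reporting") into one token, so bigram counts aggregate over surface variants of the same phrase.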



When featurizing the documents, we retain only bigrams that occur sufficiently often in the corpus. This frequency threshold excludes infrequent bigrams that are highly distinctive for a given channel but carry little substantive political or topical information. The procedure produces a vocabulary V of 65,000 bigrams. Supervised learning models using n-grams are rarely sensitive to specific pre-processing and featurization choices (e.g., Denny and Spirling, 2018).
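Under the same illustrative assumptions (documents already reduced to bigram lists; the threshold value here is arbitrary, not the paper's), building the vocabulary and count features could look like:

```python
from collections import Counter

def build_vocabulary(docs_bigrams, min_count=2):
    # Count each bigram across the whole corpus and keep only those
    # meeting the frequency threshold; rarer bigrams are dropped.
    counts = Counter(bg for doc in docs_bigrams for bg in doc)
    return sorted(bg for bg, c in counts.items() if c >= min_count)

def featurize(doc_bigrams, vocab):
    # Represent one document as a count vector over the vocabulary.
    c = Counter(doc_bigrams)
    return [c[bg] for bg in vocab]

docs = [["tax_cut", "border_wall", "tax_cut"],
        ["border_wall", "health_care"],
        ["health_care", "tax_cut"]]
vocab = build_vocabulary(docs, min_count=2)
print(vocab)
print(featurize(docs[0], vocab))
```

The resulting count vectors, one per document over the shared vocabulary, are what a supervised classifier can then be trained on.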


This paper is available on arXiv under a CC 4.0 license.


[4] The approach is related to Gentzkow et al. (2019b), who also use a regularized linear model with n-gram inputs. Our different approach reflects a different scientific objective. Gentzkow et al. (2019b) are interested in measuring the level of polarization between groups in language. We are interested in forming a predicted probability of the source of a document for scoring influence in a second corpus. Other related methods are Peterson and Spirling (2018) and Osnabrügge et al. (2021).
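To illustrate the scoring step described in this footnote (the weights and feature counts below are made up; the actual model is a regularized linear classifier fit on the TV corpus), a document in the second corpus can be scored by applying the logistic function to a fitted linear index over its bigram counts:

```python
import math

def source_probability(features, weights, intercept=0.0):
    # Predicted probability that a document comes from a given source:
    # the logistic transform of a linear score over bigram counts.
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical fitted weights over a 3-bigram vocabulary.
weights = [0.8, -0.5, 0.2]
print(source_probability([2, 0, 1], weights))
```

A document with no signal (all-zero counts and zero intercept) scores exactly 0.5, i.e., the model is indifferent between the two sources.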


[5] We have fewer snippets from FNC than from CNN/MSNBC. Thus, we randomly under-sample the snippets from the CNN/MSNBC corpus to match the number of snippets from FNC.
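A minimal sketch of such random under-sampling, assuming the snippets are held in plain lists (the names and seed are illustrative):

```python
import random

def undersample(majority, minority, seed=0):
    # Randomly draw from the larger class so both classes
    # have equal size before training.
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

cnn_msnbc = [f"cnn_snippet_{i}" for i in range(10)]
fnc = [f"fnc_snippet_{i}" for i in range(6)]
balanced_cnn, balanced_fnc = undersample(cnn_msnbc, fnc)
print(len(balanced_cnn), len(balanced_fnc))  # 6 6
```

Balancing the classes this way keeps the classifier from simply favoring the more numerous CNN/MSNBC label.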

Authors:

(1) Philine Widmer, ETH Zürich and [email protected];

(2) Sergio Galletta, ETH Zürich and [email protected];

(3) Elliott Ash, ETH Zürich and [email protected].