Authors:
(1) Philine Widmer, ETH Zürich and [email protected];
(2) Sergio Galletta, ETH Zürich and [email protected];
(3) Elliott Ash, ETH Zürich and [email protected].
Our data come from cable news channels and local newspapers. The resulting panel covers 2005 through 2008, the years for which we can construct cable news viewership by locality. For summary statistics, see Table A.2.
Local newspaper article excerpts. Our analysis starts with a corpus of local newspaper articles. Our source is the news aggregation site NewsLibrary, from which we obtain the headlines and first 80 words of all articles published by a range of local U.S. newspapers in 2005-2008. We focus on the first 80 words because, at the time of our data construction (06-08/2019), our subscription allowed us to access these article previews in bulk.[3] We programmatically read through the snippets and extract the newspaper name, the headline, the plain article text, and the article date (for an example of an article snippet, see Appendix Figure A.1). Our main dataset contains 16 million article snippets from 305 unique newspapers. Appendices A.1 to A.3 provide more information on the sample.
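The extraction step above can be sketched as follows. The record layout, field order, and date format in this example are assumptions for illustration only; the actual NewsLibrary format differs.

```python
from datetime import datetime

def parse_snippet(record):
    """Parse a raw snippet record into its fields.

    Assumed (hypothetical) layout: paper name on line 1, date on line 2,
    headline on line 3, and the article text on the remaining lines.
    """
    paper, date_str, headline, *body = record.splitlines()
    return {
        "paper": paper.strip(),
        "date": datetime.strptime(date_str.strip(), "%B %d, %Y").date(),
        "headline": headline.strip(),
        "text": " ".join(line.strip() for line in body),
    }

raw = """The Example Gazette
July 4, 2006
City council approves budget
The city council voted on Monday to approve the annual budget ..."""

snippet = parse_snippet(raw)
```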
News show transcripts for FNC, CNN, and MSNBC. Our second corpus is from cable news networks. We gather the news show transcripts for FNC, CNN, and MSNBC from LexisNexis. The corpus includes transcripts from around 40,000 episodes of prime-time shows for the three networks for 2005-2008 (for a list of the included shows, see Section A.4). We have several scripts that read through the transcripts to filter out metadata and other non-speech content.
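The filtering step can be sketched as below, assuming typical transcript conventions (speaker tags in capitals, parenthesized stage cues). The patterns are illustrative assumptions, not the paper's actual scripts.

```python
import re

# Assumed markers for non-speech content (stage directions, clip boundaries).
CUE = re.compile(r"\((?:COMMERCIAL BREAK|APPLAUSE|CROSSTALK|BEGIN VIDEO CLIP|END VIDEO CLIP)\)")
# Assumed speaker-tag pattern, e.g. "O'REILLY: " at the start of a line.
SPEAKER = re.compile(r"^[A-Z][A-Z .'-]+: ")

def clean_transcript(lines):
    """Keep only the spoken text from a list of raw transcript lines."""
    out = []
    for line in lines:
        line = CUE.sub("", line)      # drop stage directions
        line = SPEAKER.sub("", line)  # strip leading speaker tags
        if line.strip():
            out.append(line.strip())
    return " ".join(out)
```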
While the newspaper article snippets contain roughly the article's first 80 words, the transcripts tend to be much longer. We thus segment the transcripts into 80-word snippets to match the length of the newspaper article snippets. This lets us verify that our text-based slant prediction works well on short transcript snippets before applying it to the newspaper article snippets.
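The segmentation itself is straightforward; here is a minimal sketch. How the paper treats the final, shorter remainder is not specified, so keeping it is an assumption.

```python
def segment_words(text, size=80):
    """Split a transcript into consecutive snippets of `size` words.

    The trailing remainder (fewer than `size` words) is kept as its own
    snippet; this handling is an assumption, not the paper's stated choice.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```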
Newspaper-level circulation data. Next, we match each local newspaper outlet to one or more counties. We use audited county-level circulation data from the Alliance for Audited Media (AAM), which is available for around 305 unique newspapers (which also appear in the NewsLibrary and the Nielsen rating data). Our main analyses thus include 3,781 observation units at the newspaper-county level (see Section 4). The AAM also provides information on the newspapers' headquarters location, which we exploit in additional analyses (Section 5.2).
Appendix A.2 describes an alternative method to match newspapers to counties (not relying on the AAM data, but based on the newspaper's name). This procedure results in fewer observation units, namely 682. However, this alternative sample represents slightly more underlying newspaper articles (24 million instead of 16 million). We use this alternative sample in robustness checks. Since the audited county-level circulation data allows for more precise county matches for each newspaper, we use the 3,781-observation-unit sample for the main analysis.
Channel positions and viewership. From Nielsen, we have yearly data on channel positions and ratings for Fox News Channel, CNN, and MSNBC. These are the same data used by Martin and Yurukoglu (2017). First, we have the channel lineup for all U.S. broadcast operators and the respective zip code areas served. Second, we have viewership information representing the share of individuals tuned in to each channel by zip code; this value is proportional to the average number of minutes spent watching a channel per household. As the original data are at the zip code level, we follow Ash and Galletta (2023) and aggregate the ratings and channel positions at the county level. Specifically, we create county-year average channel positions, weighting observations by zip code population, while we weight ratings by the number of Nielsen survey individuals in the zip code. These variables are then collapsed at the county level by computing the mean across the years 2005-2008.
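The zip-to-county aggregation amounts to a weighted average with different weights per variable. A minimal sketch with pandas; the column names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical zip-code-level records: channel position is weighted by
# population, ratings by the number of Nielsen survey individuals.
zips = pd.DataFrame({
    "county": ["A", "A", "B"],
    "position_fnc": [30, 40, 25],     # channel slot of FNC in the lineup
    "rating_fnc": [0.02, 0.05, 0.03], # share of individuals tuned in
    "population": [1000, 3000, 2000],
    "survey_n": [50, 150, 80],
})

def wavg(g, col, w):
    """Weighted average of `col` using weights `w` within group `g`."""
    return (g[col] * g[w]).sum() / g[w].sum()

cols = ["position_fnc", "rating_fnc", "population", "survey_n"]
county = zips.groupby("county")[cols].apply(
    lambda g: pd.Series({
        "position_fnc": wavg(g, "position_fnc", "population"),
        "rating_fnc": wavg(g, "rating_fnc", "survey_n"),
    })
)
```

For county A, the population-weighted position is (30·1000 + 40·3000) / 4000 = 37.5, while the survey-weighted rating is (0.02·50 + 0.05·150) / 200 = 0.0425.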
Other demographic covariates. Finally, we have a rich set of demographic covariates from the 2000 census (see Table A.2), measured at the zip code level. To obtain aggregate county values, we weight them by zip code population.
This paper is available on arxiv under CC 4.0 license.
[3] Full articles, available on a pay-per-piece basis, were prohibitively expensive given our broad coverage in time and space.