Collection 20: U.S. Top Newspapers, 2000-2018

(sample of all articles)

A collection of word-frequency and other data representing 29,183 unique articles (no duplicates or close variants) published during 2000-2018 in 15 top U.S. newspapers and their associated online blogs. WE1S and other researchers use this data to look for broad patterns and help guide closer study.

Collection 20 contains data representing all 15,692 articles from its set of sources in these years mentioning “humanities” and a sample of the 641,617 articles on everything else from those same sources and years (“random” documents found through searching on common English words). It downsamples these “random” articles (while maintaining the proportions of articles from particular sources and years) to achieve a 50/50 balance of articles from each category. The purpose is to allow media discourse on the humanities to be studied alongside “everything else” and not be buried so far down in the statistical pile that it cannot easily be seen in detail. Collection 20 is thus not a representation of the relative weight of discussion of the humanities in media discourse in general but instead an aid to studying the fine features and structures of each.
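The downsampling described above, which shrinks the "random" pool while keeping the proportions of articles from each source and year, can be illustrated with a minimal sketch. This is not the WE1S pipeline itself. The function name `downsample_stratified`, the `source` and `year` dictionary keys, and the toy counts are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def downsample_stratified(articles, target, seed=0):
    """Return roughly `target` articles, keeping each (source, year)
    group's share of the total about the same as in the input.

    `articles` is a list of dicts with hypothetical "source" and "year" keys.
    Because each group size is rounded, the result may differ from `target`
    by a few articles.
    """
    rng = random.Random(seed)

    # Group articles into strata keyed by (source, year).
    strata = defaultdict(list)
    for article in articles:
        strata[(article["source"], article["year"])].append(article)

    # Keep the same fraction of every stratum.
    fraction = target / len(articles)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        k = min(len(group), round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data (invented counts): 100 "random" articles from two sources and two years.
toy_counts = {("Paper A", 2000): 40, ("Paper A", 2001): 20,
              ("Paper B", 2000): 30, ("Paper B", 2001): 10}
toy = [{"source": s, "year": y, "id": f"{s}-{y}-{i}"}
       for (s, y), n in toy_counts.items() for i in range(n)]

sample = downsample_stratified(toy, target=50)
```

In this toy case every group is halved, so a source and year that supplied 40% of the original pool still supplies 40% of the sample.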

News sources in Collection 20 include: Boston Globe, Chicago Tribune, Daily News (New York), Dallas Morning News, Denver Post, Houston Chronicle, Los Angeles Times, New York Post, New York Times (and its blogs), Newsday (New York), Seattle Times, Star Tribune (Minneapolis, MN), Tampa Bay Times, USA Today, Washington Post.

Kinds of Sources (by Tags)

Sources in Collection 20 are associated with the following non-exclusive metadata categories, which describe the kinds of sources in the collection. Of the 29,183 documents in the collection: all are from top-circulating newspapers; 15,150 are from publications located in the Northeast; 5,037 are from publications located in the Midwest; 3,857 are from publications located in the South; 3,481 are from publications located on the West Coast; 965 are from publications located in the Rocky Mountain region or the Southwest; and 693 are categorized as multiregional or as having national reach. Sources are assigned to categories based solely on explicit publication information and/or self-identification.
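As a quick arithmetic check, the regional counts listed above can be tallied against the collection total. The sketch below uses only the figures given in this section. The variable names are illustrative, and the shares are computed rather than quoted from the source.

```python
# Regional document counts as stated in this section.
region_counts = {
    "Northeast": 15_150,
    "Midwest": 5_037,
    "South": 3_857,
    "West Coast": 3_481,
    "Rocky Mountain region or Southwest": 965,
    "Multiregional or national reach": 693,
}
collection_total = 29_183

regional_sum = sum(region_counts.values())
print(regional_sum == collection_total)  # prints True

# Share of the collection contributed by each regional category.
for region, count in region_counts.items():
    print(f"{region}: {count / collection_total:.1%}")
```

The regional counts sum exactly to 29,183, which is consistent with each document carrying a single regional tag even though the tag categories as a whole are described as non-exclusive.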

Suggested Citation

WhatEvery1Says (WE1S) Project. (June 20, 2019). Collection 20: U.S. Top Newspapers, 2000-2018. Zenodo. DOI 10.5281/zenodo.4927419.


Collection Metadata


Topic Models of This Collection

Model Family 1 (created June 20, 2019): models with 25, 50, 100, 150, 200, and 250 topics

Visualizations for this model family:

Model sizes: 25 topics, 50 topics, 100 topics, 150 topics, 200 topics, 250 topics
Visualization tools: Dfr-browser, TopicBubbles, pyLDAvis, DendrogramViewer, Diagnostics

This start page for the collection was last revised June 10, 2021.