# Twitter
"""
The WE1S Twitter dataset contains 5,024,756 tweets posted to Twitter between December 6, 2013 and June 30, 2019. The dataset is divided into subcollections based on the query terms "humanities", "liberal arts", "stem", "science", and "science-es" (the last being a query for the presence of either "science" or "sciences"). Subcollections can be identified in the dataset by the value of the `metapath` property. The number of tweets in each subcollection is as follows:
- humanities: 1,705,038
- liberal-arts: 7,663
- stem: 865,156
- science: 2,089,985
- science-es: 356,914
The tweets are distributed over the following date range:
- 2013: 16,335
- 2014: 862,746
- 2015: 1,711,823
- 2016: 947,561
- 2017: 976,971
- 2018: 324,133
- 2019: 185,187
Collectively, the tweets represent the work of 1,886,739 distinct usernames.
Each tweet's mentions, hashtags, and links are recorded, as well as its number of likes and retweets. Unlike most other WE1S datasets, the Twitter dataset does not contain extracted features. Instead, it contains the original text of the tweet (the value of the `content` property), along with a `tidy_tweet` property, which contains the text of the tweet after preprocessing. Tweets were preprocessed using a modified form of the WE1S preprocessing algorithm. Details can be found in the WE1S Tweet-Suite repository.
"""
# `comparison_not_humanities`
The `comparison_not_humanities` collection contains data and metadata about collected documents that do not contain the word "humanities." This collection includes word-frequency and other data representing 1,380,456 unique documents (no duplicate or close-variant documents) published from 2000-2019 in XX sources. WE1S researchers use this data to better understand the place of documents containing the word "humanities" within public discourse more generally. We know that only a small fraction of newspaper articles contain the word "humanities," but how small is this fraction?
WE1S researchers collected this data using keyword searches for three of the most common words in the English language (based on a well-known analysis of the Oxford English Corpus) that LexisNexis indexes and thus makes available for search: "person," "say," and "good." We collected data from the top 15 circulating newspapers in the U.S. from 2000-2019, randomly selecting 1 month per year for each keyword in order to limit results to more manageable numbers (each year searched therefore includes data from 3 months of that year). We also collected data from every other LexisNexis source we had collected humanities keyword data from (we were not able to fully replicate previous searches, so some sources do not have comparison data). For this collection run, we focused on the years 2013-2019 and randomly selected 1 month per year for each keyword in order to limit results. To exclude articles containing the word "humanities" from the results, we searched within each of our selected sources for articles containing "person AND NOT humanities," "say AND NOT humanities," and "good AND NOT humanities."
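The sampling scheme described above (one random month per year per keyword, so each year is represented by three sampled months) can be sketched as follows. Variable names and the fixed seed are illustrative assumptions, not taken from the WE1S codebase:

```python
import random

# One random month per year per keyword; a fixed seed makes the
# sample reproducible across runs (seed value is an assumption).
random.seed(42)

keywords = ["person", "say", "good"]
years = range(2000, 2020)  # 2000-2019 inclusive

# sample[year][keyword] -> month number (1-12) searched for that pair
sample = {
    year: {kw: random.randint(1, 12) for kw in keywords}
    for year in years
}
```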
# `comparison_sciences`
The `comparison_sciences` collection contains data and metadata about collected documents that contain the word "science" or "sciences." This collection includes word-frequency and other data representing 553,699 unique documents (no duplicate or close-variant documents) published from 1977-2019 in XX sources. Most of the data in this collection is from 1985-2019. WE1S researchers use this data to understand how public discourse about the humanities compares to public discourse about science.
WE1S researchers collected this data using keyword searches for "science." This search includes articles containing either the word "science" or "sciences" (unlike with the word "humanities," we did not limit our results to only plural forms of the word). We collected data from the top 10 circulating newspapers in the U.S. and from University Wire sources (student newspapers). Documents in this collection may contain the word "humanities," just as documents in the humanities keyword collection may contain the words "science" or "sciences."
# `tvarchive`
This data comes from a scrape of the Internet Archive that Ryan and Jeremy did during summer camp 2020, and it is a scrape of some TV news transcripts from 2009-2017. There is one major complication, however. Currently, the "tvarchive" dataset stored in MongoDB contains only 85 documents. Ryan pointed me in the direction of 2,854 more documents that are stored on harbor that never made it into MongoDB (these are located in the "internet-archive-humanities-sample" folder in this directory: http://harbor.english.ucsb.edu:11111/tree/write/projects/media_team/archive_news_cc-master/scripts). Ryan said that he and Jeremy ran into some complications in trying to import this data to MongoDB, and looking at the data on harbor, I think I can understand why. Although this data has been converted from CSV format (the format it takes once scraped from the Internet Archive) to JSON, it will take some significant work to get this JSON data into a format that is more in line with the rest of our JSON data in our MongoDB datasets. Many fields and the values within those fields would need to be massaged. For a total archive of ~3,000 documents, I'm not sure it's worth publishing this dataset at this time, but I'm interested to hear what others think.
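The kind of field massaging this would require might look roughly like the sketch below. The source column names (`identifier`, `date`, `transcript`) and the target schema are hypothetical stand-ins; the real Internet Archive CSV columns and the WE1S MongoDB document shape would need to be checked against the actual data on harbor:

```python
def row_to_doc(row):
    """Hypothetical sketch of mapping one scraped CSV row (as a dict)
    into a WE1S-style JSON document. Column names and the target
    schema are assumptions, not taken from the actual data."""
    return {
        "name": row.get("identifier", "").strip(),   # unique document name
        "title": row.get("title", "").strip(),
        "pub_date": row.get("date", "").strip(),
        "content": row.get("transcript", "").strip(),
        "metapath": "Corpus,tvarchive",              # placeholder metapath
    }
```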
It also seems like Ryan and Jeremy downloaded another, larger dataset of TV transcripts from the Internet Archive (the "archive-all-complete.csv" file located in the above directory on harbor). This is a 52 GB file that I haven't tried to download from harbor, but it seems to contain many more documents of the kind that are in the "internet-archive-humanities-sample" folder ("archive-all-complete.csv" seems to contain 2,778,179 documents). This data hasn't been converted to JSON format, however. While it looks like code exists to do this conversion, it would be a big job, and without a streamlined process to address the above issues, I definitely don't think it's worth doing.