Nineteenth-Century Fiction

(external data pipeline test)

A collection of word-frequency and other data representing 2,731 works of fiction, mostly from the nineteenth century (with a few works from earlier or later). The data was collected by the Stanford Literary Lab and is used with their permission. The primary purpose of this collection is to test the import of external data into the WE1S Workspace. It has not been processed with the aim of studying nineteenth-century fiction, so conclusions based on the models in this collection must be drawn with extreme caution. Known problems include OCR errors, especially in Internet Archive texts; editorial boilerplate at the beginnings and ends of texts; and the use of potentially inappropriate stop words (or the omission of potentially appropriate ones). In addition, titles do not include chunk numbers, so the only way to identify which chunk is referenced in the visualisation is via the name property of the JSON file.
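As a minimal sketch of reading the name property mentioned above, the following Python snippet parses one JSON document and returns its name. It assumes that name is a top-level property. The helper chunk_name and the sample record (including its name and title values) are hypothetical and invented for illustration, since the actual naming scheme is not described here.

```python
import json


def chunk_name(json_text):
    """Return the top-level `name` property of one JSON document.

    Assumption: `name` sits at the top level of the document.
    """
    doc = json.loads(json_text)
    return doc["name"]


# Hypothetical record: these field values are invented for illustration only.
sample = '{"name": "example-work_chunk-003", "title": "Example Work"}'
print(chunk_name(sample))
```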

The documents were first chunked into segments of 2,000 space-separated tokens. These chunks were then processed using an algorithm that mimics MALLET's default tokenizer. Tokens were lower-cased, and all characters other than Unicode word characters were deleted. Tokens shorter than three characters were deleted. Stop words were removed using the WE1S Standard Stoplist in combination with Matthew Jockers' Expanded Stop Word List.
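The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used to build the collection: the function names chunk_text and clean_chunk are hypothetical, splitting is done on any whitespace rather than on single spaces only, "word characters" is taken to mean Python's Unicode-aware \w class (which also keeps digits and underscores), and the tiny stop word set stands in for the two real stoplists.

```python
import re

# On Python 3 strings, \w is Unicode-aware: letters, digits, and underscore.
NON_WORD = re.compile(r"[^\w]+")


def chunk_text(text, size=2000):
    """Split text on whitespace and group the tokens into chunks of `size`."""
    tokens = text.split()
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def clean_chunk(tokens, stopwords, min_len=3):
    """Lower-case, strip non-word characters, drop short tokens and stop words."""
    cleaned = []
    for token in tokens:
        token = NON_WORD.sub("", token.lower())
        if len(token) >= min_len and token not in stopwords:
            cleaned.append(token)
    return cleaned


# Hypothetical stand-in for the combined stoplists named in the text.
stopwords = {"the", "and"}

# A small chunk size is used here only to keep the example readable.
chunks = chunk_text("The Whale, and the sea-captain's log; it is 1851!", size=5)
processed = [clean_chunk(chunk, stopwords) for chunk in chunks]
print(processed)
```

Note that in this sketch deleting punctuation joins the parts of a hyphenated or possessive form, so "sea-captain's" becomes "seacaptains".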

Available Metadata

All metadata accompanying the texts is available through the JSON files.

Suggested Citation

WE1S Project, Nineteenth-Century Fiction (external data pipeline test), 2020, doi:[TBD].

Collection Metadata

Topic Models of This Collection

Model Family 1 (created April 6, 2020): models for 50, 100 topics

25 topics 50 topics 100 topics 150 topics 200 topics 250 topics
