The filename contains the date, chatroom, and number of posts.

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.
This corpus contains text from 500 sources, and the sources have been categorized by genre.

Next, we need to obtain counts for each genre of interest.
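As a rough illustration of counting per genre, the following is a minimal sketch that uses only the Python standard library rather than a corpus reader. The function name `genre_word_counts`, the toy `tagged_docs` data, and the chosen target words are all assumptions made for this example; they are not taken from the corpus itself.

```python
from collections import Counter, defaultdict

def genre_word_counts(tagged_docs, targets):
    """Count how often each target word occurs in each genre.

    tagged_docs: iterable of (genre, list_of_words) pairs (hypothetical toy input).
    targets: words to count, compared case-insensitively.
    """
    wanted = {t.lower() for t in targets}
    counts = defaultdict(Counter)
    for genre, words in tagged_docs:
        for word in words:
            w = word.lower()
            if w in wanted:
                counts[genre][w] += 1
    return counts

# Toy data standing in for a genre-categorized corpus (invented for illustration).
tagged_docs = [
    ("news", ["The", "council", "will", "meet", "and", "may", "vote"]),
    ("news", ["Officials", "will", "respond"]),
    ("romance", ["She", "could", "not", "stay", "and", "could", "not", "go"]),
]
counts = genre_word_counts(tagged_docs, ["will", "could", "may"])
for genre in sorted(counts):
    print(genre, dict(counts[genre]))
```

Keeping one counter per genre means the table of counts can be read off directly, one row per genre and one column per target word.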
This particular corpus actually contains dozens of individual texts, one per address, but for convenience we glued them end-to-end and treated them as a single text. We also used various pre-defined texts.

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score).
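The three statistics named above can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the program the text refers to: the function name `text_stats` is hypothetical, the text is assumed to be available as a raw string, a flat word list, and a list of tokenized sentences, and the sample sentences are invented.

```python
def text_stats(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity score).

    raw:   the text as a single string (character count includes spaces)
    words: flat list of word tokens
    sents: list of sentences, each a list of word tokens
    """
    vocab = {w.lower() for w in words}   # case-folded vocabulary
    return (
        len(raw) / len(words),           # characters per word
        len(words) / len(sents),         # words per sentence
        len(words) / len(vocab),         # average uses of each vocabulary item
    )

# Invented toy text, already tokenized (punctuation counts as a token here).
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
words = [w for s in sents for w in s]
raw = "The cat sat. The cat ran."
stats = text_stats(raw, words, sents)
print(stats)
```

For this toy text the result is `(3.125, 4.0, 1.6)`. Because the character count includes whitespace, the first figure slightly overstates the true average word length.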
Sometimes these categories overlap, notably in the case of topical categories, since a text can be relevant to more than one topic.
Occasionally, text collections have temporal structure, news collections being the most common example.
The graph in fig-inaugural used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address.
However, the corpus is actually a collection of 55 texts, one for each presidential address.
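To make the relationship between a word offset and the individual texts concrete, here is a minimal sketch that computes where each text starts when the texts are joined end-to-end, and maps an offset back to the text that contains it. The function names `start_offsets` and `text_index_at` and the three short sample texts are assumptions introduced for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

def start_offsets(texts):
    """Return the word offset at which each text begins in the concatenation."""
    lengths = [len(t) for t in texts]
    # Running totals give the end of each text; shift by one to get the starts.
    return [0] + list(accumulate(lengths))[:-1]

def text_index_at(offsets, word_offset):
    """Return the index of the text that contains the given word offset."""
    return bisect_right(offsets, word_offset) - 1

# Invented stand-ins for individual addresses, each a list of word tokens.
texts = [
    ["We", "the", "people"],
    ["Fellow", "citizens"],
    ["My", "countrymen", "today"],
]
offsets = start_offsets(texts)
print(offsets)                    # [0, 3, 5]
print(text_index_at(offsets, 4))  # 1, i.e. the second text
```

Counting starts at zero here, so offset 0 is the first word of the first text. The binary search keeps the lookup fast even when the collection holds many texts.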