Corpus description and texts

The IPCHG is structured to represent syntactic variation across time and space and will ultimately consist of about 165 texts.

Each text is annotated only up to approximately 10,000 words, and at least 30 of the texts are shorter than that. The corpus will contain approximately 1.4 million words.

Each parsed text is stored as a UTF-8 text file with the .txt extension. Every word is tagged for part of speech and morphological features (and will eventually be lemmatized). Sentences are syntactically parsed according to the Penn annotation system.
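As a rough illustration of working with such files, the sketch below reads a UTF-8 text file and parses Penn-style labeled bracketing into nested lists. It assumes that trees are written as balanced parentheses with a label followed by either a word or child nodes; the function names (`parse_trees`, `leaves`, `load_trees`) and the sample sentence with its tags are hypothetical and are not taken from the corpus itself.

```python
from pathlib import Path


def tokenize(text):
    """Split bracketed text into '(', ')' and label/word tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_trees(text):
    """Parse Penn-style bracketing into a list of nested lists."""
    stack = [[]]  # stack[0] collects the top-level trees
    for tok in tokenize(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced closing bracket")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced opening bracket")
    return stack[0]


def leaves(tree):
    """Yield (tag, word) pairs for every terminal node in a tree."""
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        yield tree[0], tree[1]
    else:
        for child in tree:
            if isinstance(child, list):
                yield from leaves(child)


def load_trees(path):
    """Read one corpus file (assumed UTF-8, .txt) and parse its trees."""
    return parse_trees(Path(path).read_text(encoding="utf-8"))


# Hypothetical sample; real corpus tags and words will differ.
SAMPLE = "(IP-MAT (NP-SBJ (PRO ik)) (VBPI se))"
trees = parse_trees(SAMPLE)
tagged = list(leaves(trees[0]))
```

Here `tagged` holds the part-of-speech tag and word for each terminal node. Real files may also carry morphological features, sentence IDs, or metadata nodes that this sketch does not handle specially.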

Currently available texts

Texts come from three source corpora and are available under a CC license. An explanation of the version numbers is shown here.

You can download individual texts, but note that these may not be the most current versions and may contain errors or annotations that we no longer use. The most current versions are on our GitHub page, which is also the easiest way to download the whole corpus:

Download all texts on GitHub

Please report any errors in the corpus files by emailing Elliott Evans, evansell at iu dot edu.