We generally rely on the sentence divisions in our source corpora:
- MHG texts (from ReM) and ENHG texts (from ReF) indicate the end of a sentence with the annotation $MSBI (for "moderne satzbeendende Interpunktion"). Because this punctuation is not part of the original text, we annotate it as <.>, <?>, etc.
- Modern German texts in the DTA are divided into sentences, each with a sentence ID number.
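As a rough illustration of how a sentence-final annotation can drive the division into sentence tokens, here is a minimal Python sketch. It assumes a simplified, hypothetical input: a flat sequence of (word form, annotation) pairs in which the annotation is "$MSBI" on the last word of a sentence and None elsewhere. The actual ReM and ReF file formats are richer than this, and the function name split_sentences is ours, not part of any corpus tool.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical simplified token: (word form, boundary annotation or None).
Token = Tuple[str, Optional[str]]


def split_sentences(tokens: Iterable[Token],
                    boundary: str = "$MSBI") -> List[List[str]]:
    """Group word forms into sentence tokens, closing a sentence
    after every token that carries the boundary annotation."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for form, annotation in tokens:
        current.append(form)
        if annotation == boundary:
            sentences.append(current)
            current = []
    # Keep trailing material that has no closing boundary annotation.
    if current:
        sentences.append(current)
    return sentences


# Placeholder word forms, not real corpus data.
example = [("w1", None), ("w2", "$MSBI"), ("w3", None), ("w4", "$MSBI")]
print(split_sentences(example))
```

The sketch deliberately does not merge or re-split anything: it reproduces the source division as given.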
If we feel that the source corpus has erroneously split a sentence into two sentence tokens, we generally maintain this split. As a result, some sentence tokens in our corpus may consist of only a subordinate clause. Any sentence token whose root node is not IP-MAT, CP-QUE-MAT, or INTJP is given the root node FRAG; this includes subordinate clauses, chapter titles, etc.
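The root-node rule can be stated as a small check. The following Python sketch is illustrative only: the helper name root_label is hypothetical, labels are compared by exact match (any label extensions would need extra handling), and the sketch only decides which label the root receives, not how the material beneath it is structured. The label CP-ADV in the usage lines is a stand-in for "some subordinate clause label".

```python
# Root labels that a full sentence token may carry, per the rule above.
SENTENCE_ROOTS = {"IP-MAT", "CP-QUE-MAT", "INTJP"}


def root_label(top_label: str) -> str:
    """Return the root label for a sentence token: the token's own
    top-level label if it is a permitted sentence root, else FRAG."""
    return top_label if top_label in SENTENCE_ROOTS else "FRAG"


print(root_label("IP-MAT"))  # a matrix clause keeps its label
print(root_label("CP-ADV"))  # a bare subordinate clause is rooted in FRAG
```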
On the other hand, the Penn annotation system specifies that each sentence token consist of at most one matrix clause (with a small number of exceptions). The following structures require that we split what is considered a single sentence in the source corpus into additional sentence tokens in our corpus.