Sentence tokenization

We generally rely on the sentence divisions in our source corpora:

MHG texts (from ReM) and ENHG texts (from ReF) indicate the end of a sentence with the annotation $MSBI (for "moderne satzbeendende Interpunktion", i.e. modern sentence-final punctuation). Because this punctuation is not part of the original text, we annotate it as (CODE <.>), (CODE <?>), etc.
Modern German texts in the DTA are divided into sentences, each with a sentence ID number.

If we judge that the source corpus has erroneously split a sentence into two sentence tokens, we generally maintain this split. As a result, some sentence tokens in our corpus may consist of only a subordinate clause. Any sentence token that does not have the root node IP-MAT, CP-QUE-MAT, or INTJP is given the root node FRAG; this includes subordinate clauses, chapter titles, etc.
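The root-node rule above can be illustrated with a minimal sketch. The function name `root_label` and the exact-match comparison are assumptions made for illustration; the guidelines do not say whether extended labels (for example, labels with additional dash tags) also count as acceptable roots.

```python
# Root labels that the guidelines name as acceptable for a sentence token.
ALLOWED_ROOTS = {"IP-MAT", "CP-QUE-MAT", "INTJP"}

def root_label(label: str) -> str:
    """Return the label unchanged if it is an allowed root, otherwise FRAG.

    Hypothetical helper: exact matching is an assumption, not a documented rule.
    """
    return label if label in ALLOWED_ROOTS else "FRAG"
```

Under this sketch, a token whose top node is a subordinate-clause label such as IP-SUB (used here purely as an illustrative label) would be rooted in FRAG, while IP-MAT would be left as it is.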

On the other hand, the Penn annotation system specifies that each sentence token consist of at most one matrix clause (with a small number of exceptions). The following structures require that we split what is considered a single sentence in the source corpus into additional sentence tokens in our corpus.

Reasons to create a new sentence token

The following are taken to indicate a new main clause. At this stage of the project, rather than creating a new sentence token, we flag such clauses as IP-MAT-CONJ, to be split into a new sentence token at a later stage.

What seems semantically and syntactically like a second-conjunct main clause, with or without a conjunction:

( (IP-MAT
          …
          (ADVP-RSP (ADV doe))
          (VBDI^3^PL geyngen)
          (NP-SBJ (PRO sij))
          (ADVP (ADV eyuer))
          (PP (P an)
              (NP (D den) (N Rait)))
          (, /)
          (IP-MAT-CONJ                 <- temporary flag for a new main clause
                  (CONJ ind)           <- no CONJP when und introduces a main clause
                  (NP-SBJ *con*)       <- null subject (conjunction reduction)
                  (VBDI^3^PL sachten)
                  (NP-OB2 (PRO yn))
          …
  (ID 1360_Hauwe_NeuesBuch_Cologne.,11))

( (IP-MAT (NP-SBJ (D der))
          (MDDI^3^SG wolte)
          (PP (P in)
              (NP (NPR Prasilien)))
          (VB fahren)
          (, /)
          (PP (P auff)
              (NP (N kauffmanschafft)))
          (, /)
          (IP-MAT-CONJ                 <- new main clause, even though no conjunction
                  (NP-SBJ *con*)
                  (HVDI^3^SG Hatte)
                  (ADVP (ADV auch))
                  (NP-OB1 (N vrlaub))
          ...
  (ID 1557_Staden_Historia_Hesse.,110))

The first sentence of a direct quotation (as in HeliPaD, but unlike CHLG and PPCHE, where the first sentence of a quote is the complement of the matrix verb of saying):

( (IP-MAT (PP (ADV+P Hieruff))
          (VBDI^3^SG antwort)
          (NP-SBJ (NPR^N^SG Carle))
          (, /)
          (CODE <:>)
          (IP-MAT-SPE-CONJ (PUNCQL ")      <- direct speech IP-MAT = new sentence
                           (NP-ADT (PRO$^G^SG deiner) (N^G^SG rede))
                           (BEPI^1^SG bin)
                           (NP-SBJ (PRO^N^SG ich))
                           (ADJP-PRD (ADV fast) (ADJ^N^SG vnmuotig)))
          ...
  (ID 1533_Rodler_Fierrabras_Moselfrk.,25))
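The later splitting step, in which flagged clauses become sentence tokens of their own, might look roughly like the following sketch. It is not the project's actual tooling: the helper names (`parse`, `render`, `is_flagged`, `detach`) are hypothetical, the tree is an abbreviated and fully bracketed version of the first example above with a placeholder ID, and the sketch assumes balanced brackets and no parentheses inside words. How new ID nodes are assigned is not specified in these guidelines, so the detached clauses are returned without IDs.

```python
import re

def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = ""                        # the outer wrapper has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word or a trace such as *con*
                pos += 1
        pos += 1                          # consume ")"
        return [label] + children

    return node()

def render(node):
    """Turn a nested list back into a one-line bracketed string."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [render(c) for c in node[1:]]) + ")"

def is_flagged(node):
    """True for clauses carrying the temporary flag, e.g. IP-MAT-CONJ."""
    return (isinstance(node, list)
            and node[0].startswith("IP-MAT")
            and node[0].endswith("-CONJ"))

def detach(node, detached):
    """Copy node without flagged clauses; collect those in document order."""
    kept = [node[0]]
    for child in node[1:]:
        if isinstance(child, str):
            kept.append(child)
        elif is_flagged(child):
            nested = []
            clause = detach(child, nested)
            clause[0] = clause[0][: -len("-CONJ")]   # drop the temporary flag
            detached.append(clause)
            detached.extend(nested)
        else:
            kept.append(detach(child, detached))
    return kept

# Abbreviated from the first example; "EXAMPLE,1" is a placeholder ID.
example = """( (IP-MAT (ADVP-RSP (ADV doe)) (VBDI^3^PL geyngen)
  (NP-SBJ (PRO sij)) (, /)
  (IP-MAT-CONJ (CONJ ind) (NP-SBJ *con*) (VBDI^3^PL sachten)
               (NP-OB2 (PRO yn))))
  (ID EXAMPLE,1))"""

new_clauses = []
remainder = detach(parse(example), new_clauses)
print(render(remainder))
for clause in new_clauses:
    print(render(clause))
```

Because `is_flagged` only matches labels that begin with IP-MAT and end in -CONJ, labels such as IP-MAT-SPE-CONJ are detached as well, while other IP-MAT variants stay where they are.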

Exceptions

New main clauses will be kept within a sentence token if:

The clause is parenthetical and embedded in another main clause. Such clauses are annotated as IP-MAT-PRN and are frequently inquits (reporting clauses such as "he said").
There is clearly coordination of main clauses, with elided material in one of the clauses that cannot be explained as pro-drop or auxiliary-drop. The conjunct clause with elided material is annotated IPX-MAT.
The clause can reasonably be interpreted as a relative clause: it begins with der or welcher, and its sole apparent purpose is to elaborate on an NP in the main clause. Keeping such clauses within the sentence token ensures that potential relative clauses remain accessible to queries.
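As a rough illustration of how these exceptions interact with the temporary flag, the sketch below checks clause labels only. The function name `splits_off` is hypothetical, and the relative-clause exception is not modelled, since the guidelines describe it as an interpretive judgment rather than a label.

```python
def splits_off(label: str) -> bool:
    """True only for main clauses carrying the temporary -CONJ flag.

    Hypothetical helper: IP-MAT-PRN (parentheticals) and IPX-MAT
    (conjuncts with elided material) stay inside the sentence token.
    """
    parts = label.split("-")
    return parts[:2] == ["IP", "MAT"] and parts[-1] == "CONJ"

for label in ["IP-MAT-CONJ", "IP-MAT-SPE-CONJ", "IP-MAT-PRN", "IPX-MAT"]:
    print(label, splits_off(label))
```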