Selected scripts from my Master's thesis. The thesis is a large-scale logistic regression analysis of Kiezdeutsch syntax that aims to identify computationally the syntactic differences between standard German and the urban vernacular Kiezdeutsch.
The thesis uses generalized linear models (GLMs) to learn which syntactic constructions found in Kiezdeutsch distinguish it from standard German. This is done on both the word and the phrase level using part-of-speech (POS) n-grams. The thesis identified several POS n-gram types that support the following phenomena: bare NPs, ADV SVO, and V1. It also identified significant associations between Kiezdeutsch and POS trigrams containing negation, and found limited evidence that a lack of relative clauses is associated with Kiezdeutsch.
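The POS n-gram approach described above can be sketched as follows. This is a minimal illustration, not the thesis code: the tag sequences are invented toy data using STTS-style tags, the function names (`pos_ngrams`, `featurize`, `fit_logistic`) are hypothetical, and a plain gradient-descent logistic regression stands in for whatever GLM implementation the actual scripts use.

```python
import math


def pos_ngrams(tags, n):
    """Return all contiguous POS n-grams of length n as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def featurize(tags, vocab, n):
    """Binary presence vector of the utterance over the n-gram vocabulary."""
    present = set(pos_ngrams(tags, n))
    return [1.0 if gram in present else 0.0 for gram in vocab]


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Unregularized logistic regression fitted by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label = 1)
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Invented toy utterances (assumption): label 1 = Kiezdeutsch-like, 0 = standard-like.
utterances = [
    (["ADV", "PPER", "VVFIN", "NN"], 1),                # ADV SVO with a bare NP
    (["PPER", "VVFIN", "NN"], 1),                       # bare NP
    (["ADV", "VVFIN", "PPER", "APPR", "ART", "NN"], 0),  # verb-second, full PP
    (["PPER", "VVFIN", "APPR", "ART", "NN"], 0),        # full PP
]

N = 2  # bigrams here; trigrams work the same way with N = 3
vocab = sorted({g for tags, _ in utterances for g in pos_ngrams(tags, N)})
X = [featurize(tags, vocab, N) for tags, _ in utterances]
y = [label for _, label in utterances]

w, b = fit_logistic(X, y)
weights = dict(zip(vocab, w))

# List n-grams from most to least associated with label 1.
for gram, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(" ".join(gram), round(weight, 2))
```

In this toy setup, n-grams with positive weights occur mostly in the utterances labeled 1 and n-grams with negative weights mostly in those labeled 0. With four utterances the weights only illustrate the mechanics; they say nothing about the real corpora.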
The datasets used in this thesis are two comparable corpora of German. The first contains spoken dialogs in Kiezdeutsch (KiDKo), while the second contains dialogs in mostly standard German (GRAIN).
The Kiezdeutsch corpus, KiDKo (from the German KiezDeutsch Korpus), is not publicly available. Therefore, I do not provide data samples from it in this repository.
The GRAIN corpus (German RAdio INterviews) can be found here.