This repository contains a number of UIMA components for processing dramatic texts, as well as an executable pipeline. We follow the general design ideas implemented in DKPro Core. The full pipeline reads files in several TEI/XML dialects (see below) and applies the most important NLP tools to them, while keeping the structural annotation of the plays intact (and, if necessary, processing different text layers separately).
- Clone the repository:
git clone https://github.com/quadrama/DramaNLP.git
- Enter the directory:
cd DramaNLP
- If necessary, switch to a branch:
git checkout develop/1.0
- Download dependencies, compile everything and install it locally:
mvn compile install
This produces a lot of output, but at the end, you should see something like BUILD SUCCESS.
- To compile a runnable binary, enter the directory:
cd de.unistuttgart.ims.drama.main
and run mvn package. This creates a file called drama.Main.jar in the directory target/assembly/. This file contains the code and all its dependencies.
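Before moving on, it can help to confirm that the build actually produced the file. The following is a minimal sketch, assuming it is run from the directory de.unistuttgart.ims.drama.main. The helper name check_jar is hypothetical and not part of the repository.

```shell
# check_jar is a hypothetical helper, not part of DramaNLP.
# It reports whether the assembled jar exists at the given path.
check_jar() {
  jar="$1"
  if [ -f "$jar" ]; then
    echo "found: $jar"
  else
    echo "missing: $jar (re-run mvn package)"
    return 1
  fi
}

# Path taken from the build step above, relative to de.unistuttgart.ims.drama.main.
check_jar target/assembly/drama.Main.jar || true
```

If the file is reported as missing, the Maven output of the package step is the first place to look.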
As an example, we'll work on the data from the GerDraCor collection (which is based on TextGrid). Download the files from GitHub and store the XML files in a directory. We will call the directory $TEIDIR in the following examples. The directory $OUTDIR is used to store the output of the pipeline. You'll need the file drama.Main.jar.
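The two directory variables can be set in the shell before running the pipeline. This is a minimal sketch: the paths gerdracor/tei and output are placeholder assumptions, not locations prescribed by the project.

```shell
# Placeholder paths (assumptions); point these at your own directories.
TEIDIR="${TEIDIR:-$PWD/gerdracor/tei}"   # where the downloaded XML files are stored
OUTDIR="${OUTDIR:-$PWD/output}"          # where the pipeline should write its results

# Create the output directory if it does not exist yet.
mkdir -p "$OUTDIR"

# Sanity check: count the XML files that would be processed.
if [ -d "$TEIDIR" ]; then
  n=$(find "$TEIDIR" -name '*.xml' -type f | wc -l | tr -d ' ')
  echo "$n XML file(s) in $TEIDIR"
else
  echo "TEIDIR does not exist: $TEIDIR"
fi
```

Variables that are already set in the environment are left untouched, so the defaults only apply when nothing else has been configured.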
From the directory de.unistuttgart.ims.drama.main, enter the following command on the command line:
java -cp target/assembly/drama.Main.jar de.unistuttgart.ims.drama.main.TEI2XMI --input $TEIDIR --output $OUTDIR/xmi --csvOutput $OUTDIR/csv --conllOutput $OUTDIR/conll --skipSpeakerIdentifier --corpus GERDRACOR --collectionId "gdc" --doCleanup
After running, the directory $OUTDIR contains three subdirectories, xmi, csv and conll, which hold the plays in different file formats.
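To check the result of a run, the three subdirectories can be inspected from the shell. The sketch below assumes the layout described above. The function name check_output is hypothetical, and the fallback directory output is a placeholder.

```shell
# check_output is a hypothetical helper, not part of DramaNLP.
# It prints the number of files in each expected output subdirectory.
check_output() {
  outdir="$1"
  for sub in xmi csv conll; do
    if [ -d "$outdir/$sub" ]; then
      n=$(find "$outdir/$sub" -type f | wc -l | tr -d ' ')
      echo "$sub: $n file(s)"
    else
      echo "$sub: missing"
    fi
  done
}

# Falls back to a placeholder directory if OUTDIR is not set.
check_output "${OUTDIR:-output}"
```

A subdirectory reported as missing or empty suggests that the corresponding output option was not passed or that the run stopped early.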
This package supports the following drama corpora:
- TextGrid (German)
- GerDraCor (German)
- Théâtre classique (French)