nevenjovanovic edited this page Feb 21, 2020 · 20 revisions

Explore (Greek) treebanks with XQuery: descriptions of queries and the database

The XML database uses linguistic annotations from the GitHub repository of Vanessa Gorman's Greek treebank collection.

Visualizations of sentences are here: Gorman Trees – Perseids collections

Our exploration of Gorman's collection was structured around the following tasks:

  1. From the XML sources, create the database and get basic statistical information on the collection (such as word and sentence count, distribution of syntactic relations, lemmata and morphological categories)
  2. Transform the sources into nested XML
  3. Measure some aspects of graphs (degree of node, degree distribution, longest path)
  4. Translate rules from a traditional grammar into XQuery queries, and test the rules by querying the corpus

Create databases

XQueries:

  1. createGrcTBG.xq -- create the database from a locally cloned repository of Vanessa Gorman's treebanks
  2. createGrcTBGpull.xq and populateGrcTBGfromGit.xq -- pull treebank files directly from Vanessa Gorman's GitHub repository
  3. Other: a list of XQuery scripts for treebank analysis, using Vanessa Gorman's collection as an example

Statistics on Vanessa Gorman's Greek treebank collection, 2020-02-16+01:00

As produced by GrcTBStatsGeneral.xq.

Texts (documents): 153

Sentences: 25647

Words: 601706

Words excluding punctuation marks: 539062

Ellipses (words originally omitted, added for analysis): 8341

Words with missing or undefined annotations: 346
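The counts above come from GrcTBStatsGeneral.xq, which is not reproduced here. As a rough illustration of what such counts involve, here is a minimal Python sketch (not the XQuery script itself) run on the single sample sentence shown on this page. The `<treebank>` wrapper, the function name `basic_stats`, and the rules "punctuation has a postag starting with u" and "ellipses carry artificial="elliptic"" are assumptions read off that sample.

```python
import xml.etree.ElementTree as ET

# One sentence in the flat AGLDT format, copied from the example on this page.
# The <treebank> wrapper is an assumption about the file layout.
SAMPLE = """<treebank>
<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>
</treebank>"""


def basic_stats(root):
    """Return sentence and word counts for a parsed treebank (hypothetical helper)."""
    words = root.findall(".//word")
    return {
        "sentences": len(root.findall(".//sentence")),
        "words": len(words),
        # Assumption: punctuation marks have a postag beginning with "u".
        "words_without_punctuation": sum(
            1 for w in words if not w.get("postag", "").startswith("u")
        ),
        # Assumption: inserted words are flagged with artificial="elliptic".
        "ellipses": sum(1 for w in words if w.get("artificial") == "elliptic"),
    }


stats = basic_stats(ET.fromstring(SAMPLE))
print(stats)
```

The count of words with missing or undefined annotations is left out because the page does not say how the script defines them.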

Sentences (S) grouped by word count (W), in descending order

See the table on a separate wiki page

Statistics on syntactic relations

See results on a separate page

Statistics on parts of speech (POS)

See results on a separate page

Statistics on individual relations

Which POS take on a role of a specific relation?

XQuery script: FindPOStag.xq.

See the report for AuxZ (emphasizing particle): POStagAuxZ.
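The question "which POS take on a given relation" amounts to grouping words by the first character of @postag. A minimal Python sketch follows, assuming the flat AGLDT layout of the sample sentence on this page; `pos_for_relation` is a hypothetical name, and FindPOStag.xq may group differently.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Sample sentence copied from this page (flat AGLDT format).
SAMPLE = """<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>"""


def pos_for_relation(root, relation):
    """Count the POS (first postag character) of all words with `relation`."""
    counts = Counter()
    for word in root.iter("word"):
        if word.get("relation") == relation:
            postag = word.get("postag", "")
            # Words without a postag are collected under "(empty)".
            counts[postag[:1] or "(empty)"] += 1
    return counts


auxz = pos_for_relation(ET.fromstring(SAMPLE), "AuxZ")
print(auxz.most_common())
```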

Count of wordforms, first 200 wordforms with frequency count

XQuery script: FindWordforms1.xq

Report: WordformsList (66,313 wordforms on 2019-10-16)

Lemmata by occurrence count

How many lemmata, how many occurrences of each one?

XQuery script: FindLemmata.xq

The report: LemmataList (18,056 lemmata on 2019-10-16)
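Both the word-form and the lemma reports are frequency lists over one attribute of each word. The Python sketch below shows the idea on the sample sentence from this page; `frequency_list` is a hypothetical helper, and whether the XQuery scripts exclude punctuation or normalize case is not stated here, so this sketch does neither.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Sample sentence copied from this page (flat AGLDT format).
SAMPLE = """<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>"""


def frequency_list(root, attribute, limit=200):
    """Return the `limit` most frequent values of `attribute` with their counts."""
    counts = Counter(
        word.get(attribute) for word in root.iter("word") if word.get(attribute)
    )
    return counts.most_common(limit)


root = ET.fromstring(SAMPLE)
wordforms = frequency_list(root, "form")    # word forms, first 200
lemmata = frequency_list(root, "lemma")     # lemmata, first 200
print(wordforms)
print(lemmata)
```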

Statistics on syntactic trees (nested trees format)

Transform the AGLDT XML into nested trees

We transform the Alpheios AGLDT XML (used in Gorman's collection) into nested trees as proposed by Bozia 2018.

The transformation makes it easier to retrieve child and parent nodes using XPath or XQuery.

Use the script ListToTree.xq to populate a new database, created by createGrcTBGTree.xq. Only complete graphs (i.e. those with existing @head attribute values) are included.

A "flat" AGLDT XML of a sentence graph looks like this:

<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>

A transformation into nested XML looks like this:

<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0">
    <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
    <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  </word>
</sentence>
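The flat-to-nested transformation can be sketched in a few lines. The Python version below is an illustration of the idea, not a port of ListToTree.xq: each word is attached under the word whose @id equals its @head, and words with head="0" stay directly under the sentence. The function name `list_to_tree` is hypothetical, and treating a missing or dangling @head as "incomplete" is an assumption based on the note above. Cycles are not detected in this sketch.

```python
import xml.etree.ElementTree as ET

# Flat sentence copied from this page.
FLAT = """<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>"""


def list_to_tree(sentence):
    """Return a nested copy of a flat sentence, or None if the graph is incomplete."""
    words = sentence.findall("word")
    # One fresh element per word, keyed by @id, keeping all attributes.
    copies = {w.get("id"): ET.Element("word", dict(w.attrib)) for w in words}
    nested = ET.Element("sentence", dict(sentence.attrib))
    for w in words:
        wid, head = w.get("id"), w.get("head")
        if head == "0":
            nested.append(copies[wid])          # root-level word
        elif head and head in copies and head != wid:
            copies[head].append(copies[wid])    # attach under its head
        else:
            return None                         # missing or dangling @head
    return nested


tree = list_to_tree(ET.fromstring(FLAT))
print(ET.tostring(tree, encoding="unicode"))
```

Because words are processed in document order, siblings keep their original order, which matches the nested example above (words 3 and 4 at the top level, words 1 and 2 under word 4).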

Statistics on Vanessa Gorman's Greek treebank collection (nested tree format), 2020-02-16+01:00

Database: grc-tb-g-tree

Texts (documents): 140

Sentences: 23696

Words: 559469

Words excluding punctuation marks: 501361

Ellipses (words originally omitted, added for analysis): 7812

Words with missing or undefined annotations: 281

Sentences (S) grouped by word count (W), in descending order

S W
1 33
1 30
1 22
2 21
4 19
3 18
5 17
11 16
14 15
20 14
22 13
47 12
59 11
120 10
169 9
326 8
539 7
1079 6
1950 5
3540 4
5732 3
10025 2
26 1

See the analysis on a separate page.

Degree of node, degree distribution, longest path

See the pages:

  1. Degree of node
  2. Degree distribution
  3. Longest path distribution
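On the nested format, these measures reduce to simple recursion. The Python sketch below is illustrative only: it assumes that the degree of a node means its number of direct dependents (child word elements) and that the longest path is counted in word nodes from the sentence element downward. The linked pages may define both differently, and the function names are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Nested sentence copied from this page.
NESTED = """<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0">
    <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
    <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  </word>
</sentence>"""


def degree(word):
    """Number of direct dependents of a word (assumed definition of degree)."""
    return len(word.findall("word"))


def degree_distribution(sentence):
    """Map each degree value to the number of words that have it."""
    return Counter(degree(w) for w in sentence.iter("word"))


def longest_path(node):
    """Number of word nodes on the longest downward chain below `node`."""
    children = node.findall("word")
    if not children:
        return 0
    return 1 + max(longest_path(child) for child in children)


sentence = ET.fromstring(NESTED)
distribution = degree_distribution(sentence)
depth = longest_path(sentence)
print(distribution, depth)
```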

Testing the Cambridge Grammar of Classical Greek (2019)

Without graph properties

CGCG 26.3 Most sentences (...) contain at least a predicate (nearly always a finite verb) and one or more obligatory constituents that belong to that predicate; together these make up the sentence core.

We want to test the following claims:

  1. Most sentences contain a predicate
  2. The predicate is nearly always a finite verb

To test (1), we want to see how many sentences from the corpus (absolutely and relatively) do not contain a predicate. This is done by the FindSentencesWithoutPRED.xq script.

Of 25,647 sentences, 152 (0.6%) contain no word with a PRED function of any kind (the relations counted are "PRED", "PNOM", "PRED_CO", "PRED_AP"). Reading these 152 sentences raises new questions, but that is another matter.
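The test for claim (1) is a filter over sentences. Here is a minimal Python sketch of that filter (not FindSentencesWithoutPRED.xq itself); the set of relations is the one named above, the first sentence is the sample from this page, and the second sentence is constructed for illustration only and does not come from the corpus.

```python
import xml.etree.ElementTree as ET

# Relations counted as "some kind of PRED", as listed in the text.
PRED_RELATIONS = {"PRED", "PNOM", "PRED_CO", "PRED_AP"}

# Sentence 89 is the sample from this page; "hypothetical-1" is made up
# for this sketch and is NOT taken from the corpus.
SAMPLE = """<treebank>
<sentence id="89">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>
<sentence id="hypothetical-1">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="0"/>
  <word id="2" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
</sentence>
</treebank>"""


def sentences_without_pred(root):
    """Return the sentences in which no word carries a PRED-type relation."""
    return [
        s for s in root.iter("sentence")
        if not any(w.get("relation") in PRED_RELATIONS for w in s.iter("word"))
    ]


missing = sentences_without_pred(ET.fromstring(SAMPLE))
print([s.get("id") for s in missing])

# Relative share reported in the text: 152 of 25,647 sentences.
share = round(100 * 152 / 25647, 1)
print(share)
```

Since `iter("word")` walks all descendants, the same function works on both the flat and the nested format.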

To test (2), we retrieve all "PRED" and associated relations and analyze their @postag values; the script is AnalyzePRED.xq. The results are shown in the following table:

POS count percentage
(empty) 1119 2%
-- 1 0%
a- 6353 13%
c- 36 0%
d- 272 1%
l- 67 0%
m- 1 0%
n- 4121 9%
p- 759 2%
p1 69 0%
p2 4 0%
r- 6 0%
t- 1 0%
un 1 0%
v- 1947 4%
v1 2623 6%
v2 1893 4%
v3 27847 59%
v_ 167 0%
x- 1 0%

It turns out that finite verbs (v1, v2, v3) account for about 68% of all predicates (32,363 of 47,288), which is some way from "nearly always", as claimed by the CGCG.
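The share of finite verbs can be recomputed directly from the counts in the table. Summing the rounded percentages (6 + 4 + 59) gives 69, while the unrounded counts give a slightly lower figure, as this short Python check shows. The dictionary simply restates the table; the variable names are illustrative.

```python
# Counts per POS key, copied from the table above.
PRED_POS_COUNTS = {
    "(empty)": 1119, "--": 1, "a-": 6353, "c-": 36, "d-": 272,
    "l-": 67, "m-": 1, "n-": 4121, "p-": 759, "p1": 69,
    "p2": 4, "r-": 6, "t-": 1, "un": 1, "v-": 1947,
    "v1": 2623, "v2": 1893, "v3": 27847, "v_": 167, "x-": 1,
}

total = sum(PRED_POS_COUNTS.values())
# Finite verbs: verbs marked for first, second or third person.
finite = sum(PRED_POS_COUNTS[key] for key in ("v1", "v2", "v3"))
finite_share = round(100 * finite / total, 1)

print(total, finite, finite_share)  # 47288 32363 68.4
```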

Using graph properties

CGCG 26.20 Attributive genitives, adverbs and prepositional phrases serve as head of the noun phrase, always with the article.

  1. Attributive genitive as NP: NPGenitiveAttr.xq (670 results)
  2. Adverbs as NP: NPAdverb.xq (198 results)
  3. Prepositional phrases as NP: NPPrepositionalPhrase.xq (1176 results)
  4. Analyze tree patterns of adverbs as NP: NPAdverbPatterns.xq

The last query returns configurations of relations such as these:

<word relation="SBJ">
  <word relation="AuxZ"/>
  <word relation="AuxP"/>
  <word relation="ATR"/>
  <word relation="ATR"/>
</word>
<word relation="SBJ">
  <word relation="ATR"/>
  <word relation="ADV"/>
  <word relation="AuxP"/>
  <word relation="ATR"/>
  <word relation="ATR"/>
</word>
<word relation="SBJ">
  <word relation="ATR"/>
  <word relation="ADV"/>
  <word relation="AuxP"/>
  <word relation="ATR"/>
  <word relation="ATR"/>
</word>
<word relation="SBJ">
  <word relation="ATR"/>
  <word relation="ATR"/>
  <word relation="AuxP"/>
  <word relation="ATR"/>
  <word relation="ATR"/>
</word>
<word relation="SBJ">
  <word relation="AuxP"/>
  <word relation="COORD"/>
  <word relation="ATR_CO"/>
  <word relation="ATR"/>
  <word relation="ATR_CO"/>
</word>
<word relation="SBJ">
  <word relation="AuxP"/>
  <word relation="COORD"/>
  <word relation="ATR_CO"/>
  <word relation="ATR"/>
  <word relation="ATR_CO"/>
</word>
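Patterns like these are subtrees reduced to their @relation attributes. The Python sketch below shows one way to produce such a skeleton from a nested word element. It is an assumption about what the pattern output represents, not a port of NPAdverbPatterns.xq, and `relation_skeleton` is a hypothetical name. The input is the nested sample sentence from this page.

```python
import xml.etree.ElementTree as ET

# Nested sentence copied from this page.
NESTED = """<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0">
    <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
    <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  </word>
</sentence>"""


def relation_skeleton(word):
    """Copy a word subtree, keeping only the @relation attribute of each node."""
    skeleton = ET.Element("word", {"relation": word.get("relation", "")})
    for child in word.findall("word"):
        skeleton.append(relation_skeleton(child))
    return skeleton


predicate = ET.fromstring(NESTED).find("word[@relation='PRED']")
skeleton = relation_skeleton(predicate)
print(ET.tostring(skeleton, encoding="unicode"))
```

Serializing the skeletons to strings and counting identical ones would give a frequency list of tree patterns.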

Repeat existing research

Some ideas (not realized yet) on a separate page: RepeatResearch.

Other features

Visualizing a tree in the text: Visualize Tree

Composed and decomposed characters in Greek Unicode

A brief explanation of the difference between Unicode NFC and NFD sequences, and how to work with them in XQuery.
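The difference matters because the same Greek letter can be stored either as one precomposed code point (NFC) or as a base letter plus combining marks (NFD), and the two do not compare as equal. XQuery offers the standard function normalize-unicode() for this; the Python sketch below illustrates the same idea with the standard `unicodedata` module. The helper `same_text` is a hypothetical name.

```python
import unicodedata

composed = "\u1f00"          # ἀ as one code point (alpha with psili)
decomposed = "\u03b1\u0313"  # α followed by COMBINING COMMA ABOVE

# The two strings look alike but differ code point by code point.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(composed, decomposed))  # True
```

Normalizing both the query string and the corpus values to the same form before comparing avoids missed matches.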