Home
The XML database uses linguistic annotations from Vanessa Gorman's Greek treebank collection, hosted in her GitHub repository.
Visualizations of sentences are here: Gorman Trees – Perseids collections
Our exploration of Gorman's collection was structured around the following tasks:
- From the XML sources, create the database and get basic statistical information on the collection (such as word and sentence count, distribution of syntactic relations, lemmata and morphological categories)
- Transform the sources into nested XML
- Measure some aspects of graphs (degree of node, degree distribution, longest path)
- Translate rules from a traditional grammar into XQuery queries, test the rules by querying the corpus
XQueries:
- createGrcTBG.xq -- create the database from a locally cloned copy of Vanessa Gorman's treebank repository
- createGrcTBGpull.xq and populateGrcTBGfromGit.xq -- pull treebank files directly from Vanessa Gorman's GitHub repository
- Other: a list of XQuery scripts for treebank analysis, using Vanessa Gorman's collection as the example
As produced by GrcTBStatsGeneral.xq.
Texts (documents): 153
Sentences: 25647
Words: 601706
Words excluding punctuation marks: 539062
Ellipses (words originally omitted, added for analysis): 8341
Words with missing or undefined annotations: 346
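The counts above come from the project's XQuery script. As a rough illustration of what is being counted, here is a minimal Python sketch (not the project's code) run on an abridged copy of the sample sentence shown further down this page. It assumes that punctuation is marked by a `postag` starting with `u` and that inserted ellipses carry `artificial="elliptic"`, as in that sample; the function name `basic_stats` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample sentence shown on this page (flat AGLDT XML).
SAMPLE = """
<sentence id="89">
  <word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
  <word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
  <word id="4" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>
"""


def basic_stats(sentences):
    """Count sentences, words, non-punctuation words and ellipses."""
    sentences = list(sentences)
    words = [w for s in sentences for w in s.iter("word")]
    return {
        "sentences": len(sentences),
        "words": len(words),
        # Assumption: punctuation is marked by a postag starting with "u".
        "words_without_punctuation": sum(
            1 for w in words if not w.get("postag", "").startswith("u")
        ),
        # Assumption: inserted ellipses carry artificial="elliptic".
        "ellipses": sum(1 for w in words if w.get("artificial") == "elliptic"),
    }


stats = basic_stats([ET.fromstring(SAMPLE.strip())])
print(stats)
```

For the single sample sentence this reports four words, three of them non-punctuation, and one ellipsis.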
See the table on a separate wiki page.
See results on a separate page
See results on a separate page
Which POS take on a role of a specific relation?
XQuery script: FindPOStag.xq.
See the report for AuxZ (emphasizing particle): POStagAuxZ.
XQuery script: FindWordforms1.xq
Report: WordformsList (66,313 wordforms on 2019-10-16)
How many lemmata, how many occurrences of each one?
XQuery script: FindLemmata.xq
The report: LemmataList (18,056 lemmata on 2019-10-16)
We transform the Alpheios AGLDT XML (used in Gorman's collection) into nested trees as proposed by Bozia 2018.
The transformation makes it easier to retrieve child and parent nodes using XPath or XQuery.
Use the script ListToTree.xq to populate a new database, created by createGrcTBGTree.xq. Only complete graphs (i.e., those with existing @head attribute values) are included.
A "flat" AGLDT XML of a sentence graph looks like this:
<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
<word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
<word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
<word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
<word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0"/>
</sentence>
A transformation into nested XML looks like this:
<sentence id="89" document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1" subdoc="13.45" span="οὐκ0:;2">
<word id="3" form=";" lemma="punc1" postag="u--------" relation="AuxK" head="0"/>
<word id="4" insertion_id="0003e" artificial="elliptic" relation="PRED" lemma="εἰμί" postag="v2spia---" form="εἶ" head="0">
<word id="1" form="οὐκ" lemma="οὐ" postag="d--------" relation="AuxZ" head="4"/>
<word id="2" form="αἰσχροποιός" lemma="αἰσχροποιός" postag="a-s---fn-" relation="PNOM" head="4"/>
</word>
</sentence>
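The idea behind the transformation can be sketched in a few lines of Python. This is an independent illustration, not a port of ListToTree.xq: the function names `is_complete` and `nest` are invented here, and "complete" is read as "every @head is either 0 or the @id of a word in the same sentence", which is an assumption about the script's check.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the flat sample sentence shown above.
FLAT = """
<sentence id="89">
  <word id="1" form="οὐκ" relation="AuxZ" head="4"/>
  <word id="2" form="αἰσχροποιός" relation="PNOM" head="4"/>
  <word id="3" form=";" relation="AuxK" head="0"/>
  <word id="4" form="εἶ" artificial="elliptic" relation="PRED" head="0"/>
</sentence>
"""


def is_complete(sentence):
    """True if every @head is "0" or the @id of a word in this sentence."""
    words = sentence.findall("word")
    ids = {w.get("id") for w in words}
    return all(w.get("head") in ids | {"0"} for w in words)


def nest(sentence):
    """Return a new sentence element in which each word is a child of its head.

    Assumes is_complete(sentence) holds and that the heads contain no cycles;
    words caught in a cycle would never be attached to the result.
    """
    nested = ET.Element(sentence.tag, sentence.attrib)
    words = sentence.findall("word")
    copies = {w.get("id"): ET.Element("word", w.attrib) for w in words}
    for w in words:
        head = w.get("head")
        parent = nested if head == "0" else copies[head]
        parent.append(copies[w.get("id")])
    return nested


flat = ET.fromstring(FLAT.strip())
tree = nest(flat) if is_complete(flat) else None
print(ET.tostring(tree, encoding="unicode"))
```

On the sample sentence, words 3 and 4 end up directly under the sentence and words 1 and 2 under word 4, the same arrangement as in the nested example above.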
Database: grc-tb-g-tree
Texts (documents): 140
Sentences: 23696
Words: 559469
Words excluding punctuation marks: 501361
Ellipses (words originally omitted, added for analysis): 7812
Words with missing or undefined annotations: 281
S | W |
---|---|
1 | 33 |
1 | 30 |
1 | 22 |
2 | 21 |
4 | 19 |
3 | 18 |
5 | 17 |
11 | 16 |
14 | 15 |
20 | 14 |
22 | 13 |
47 | 12 |
59 | 11 |
120 | 10 |
169 | 9 |
326 | 8 |
539 | 7 |
1079 | 6 |
1950 | 5 |
3540 | 4 |
5732 | 3 |
10025 | 2 |
26 | 1 |
See the analysis on a separate page.
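The graph measures named in the task list (degree of node, degree distribution, longest path) can be illustrated on the sample sentence from this page. The following Python sketch is not the project's code; it takes "degree" to mean the number of direct dependents of a node and "longest path" to mean the largest number of edges between a word and the root, both of which are assumptions made for this example.

```python
from collections import Counter

# id -> head for the sample sentence on this page; head 0 is the artificial root.
HEADS = {1: 4, 2: 4, 3: 0, 4: 0}


def child_counts(heads):
    """Number of direct dependents of every node, the root 0 included."""
    counts = Counter(heads.values())
    return {node: counts.get(node, 0) for node in [0, *heads]}


def depth(node, heads):
    """Number of edges between a node and the root (assumes no cycles)."""
    steps = 0
    while node != 0:
        node = heads[node]
        steps += 1
    return steps


degrees = child_counts(HEADS)
distribution = Counter(degrees.values())  # degree -> number of nodes
longest_path = max(depth(node, HEADS) for node in HEADS)

print(degrees)
print(dict(distribution))
print(longest_path)
```

In the sample, the root and word 4 each have two dependents, the other three words have none, and the longest path has two edges (word 1 or 2, then word 4, then the root).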
See the pages:
CGCG 26.3 Most sentences (...) contain at least a predicate (nearly always a finite verb) and one or more obligatory constituents that belong to that predicate; together these make up the sentence core.
We want to test the following claims:
1. Most sentences contain a predicate
2. The predicate is nearly always a finite verb
To test (1), we want to see how many sentences from the corpus (absolutely and relatively) do not contain a predicate. This is done by the FindSentencesWithoutPRED.xq script.
There are 152 of 25,647 sentences (0.6%) without any kind of PRED function (these include "PRED", "PNOM", "PRED_CO", "PRED_AP"). Reading the 152 sentences themselves raises new questions, but that is another matter.
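The check itself is simple to express. Below is a hedged Python sketch of the same test, not a copy of FindSentencesWithoutPRED.xq: the first sentence is an abridged copy of the sample from this page, while the second sentence (id `x1`) is invented purely to give the filter something to find.

```python
import xml.etree.ElementTree as ET

# The relations counted as "any kind of PRED function" on this page.
PRED_RELATIONS = {"PRED", "PNOM", "PRED_CO", "PRED_AP"}

TREEBANK = """
<treebank>
  <sentence id="89">
    <word id="1" form="οὐκ" relation="AuxZ" head="4"/>
    <word id="2" form="αἰσχροποιός" relation="PNOM" head="4"/>
    <word id="3" form=";" relation="AuxK" head="0"/>
    <word id="4" form="εἶ" relation="PRED" head="0"/>
  </sentence>
  <!-- Invented sentence, for illustration only -->
  <sentence id="x1">
    <word id="1" form="ναί" relation="ExD" head="0"/>
  </sentence>
</treebank>
"""


def lacks_predicate(sentence):
    """True if no word in the sentence has one of the PRED relations."""
    return not any(
        w.get("relation") in PRED_RELATIONS for w in sentence.iter("word")
    )


sentences = ET.fromstring(TREEBANK.strip()).findall("sentence")
without_pred = [s.get("id") for s in sentences if lacks_predicate(s)]
print(without_pred)

# The share reported on this page: 152 of 25,647 sentences.
share = round(100 * 152 / 25647, 1)
print(share)
```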
To test (2), we retrieve all "PRED" and associated relations and analyze their @postag values; the script is AnalyzePRED.xq. The results are shown in the following table:
POS | count | percentage |
---|---|---|
(empty) | 1119 | 2% |
-- | 1 | 0% |
a- | 6353 | 13% |
c- | 36 | 0% |
d- | 272 | 1% |
l- | 67 | 0% |
m- | 1 | 0% |
n- | 4121 | 9% |
p- | 759 | 2% |
p1 | 69 | 0% |
p2 | 4 | 0% |
r- | 6 | 0% |
t- | 1 | 0% |
un | 1 | 0% |
v- | 1947 | 4% |
v1 | 2623 | 6% |
v2 | 1893 | 4% |
v3 | 27847 | 59% |
v_ | 167 | 0% |
x- | 1 | 0% |
It turns out that finite verbs (v1, v2, v3) account for about 68% of all predicates (32,363 of 47,288; the rounded percentages in the table sum to 69%), which falls well short of "nearly always", as claimed by the CGCG.
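The share of finite verbs can be recomputed from the unrounded counts in the table. The Python sketch below does only that arithmetic; `pos_key` is a hypothetical helper showing how the table's row labels appear to be derived (the first two characters of @postag, i.e. part of speech and person), which is an assumption about AnalyzePRED.xq.

```python
def pos_key(postag):
    """First two characters of a postag, or "(empty)" if there is none."""
    return postag[:2] if postag else "(empty)"


# Counts copied from the table above.
COUNTS = {
    "(empty)": 1119, "--": 1, "a-": 6353, "c-": 36, "d-": 272,
    "l-": 67, "m-": 1, "n-": 4121, "p-": 759, "p1": 69,
    "p2": 4, "r-": 6, "t-": 1, "un": 1, "v-": 1947,
    "v1": 2623, "v2": 1893, "v3": 27847, "v_": 167, "x-": 1,
}

total = sum(COUNTS.values())
finite = sum(COUNTS[key] for key in ("v1", "v2", "v3"))
finite_share = round(100 * finite / total, 1)

print(pos_key("v2spia---"))
print(total, finite, finite_share)
```

Working from the counts gives 32,363 finite verbs out of 47,288 predicates, or 68.4%, slightly below the figure obtained by adding the rounded percentages of the three rows.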
CGCG 26.20 Attributive genitives, adverbs and prepositional phrases serve as head of the noun phrase, always with the article.
- Attributive genitive as NP: NPGenitiveAttr.xq (670 results)
- Adverbs as NP: NPAdverb.xq (198 results)
- Prepositional phrases as NP: NPPrepositionalPhrase.xq (1176 results)
- Analyze tree patterns of adverbs as NP: NPAdverbPatterns.xq
The last query returns configurations of relations such as the following:
<word relation="SBJ">
<word relation="AuxZ"/>
<word relation="AuxP"/>
<word relation="ATR"/>
<word relation="ATR"/>
</word>
<word relation="SBJ">
<word relation="ATR"/>
<word relation="ADV"/>
<word relation="AuxP"/>
<word relation="ATR"/>
<word relation="ATR"/>
</word>
<word relation="SBJ">
<word relation="ATR"/>
<word relation="ADV"/>
<word relation="AuxP"/>
<word relation="ATR"/>
<word relation="ATR"/>
</word>
<word relation="SBJ">
<word relation="ATR"/>
<word relation="ATR"/>
<word relation="AuxP"/>
<word relation="ATR"/>
<word relation="ATR"/>
</word>
<word relation="SBJ">
<word relation="AuxP"/>
<word relation="COORD"/>
<word relation="ATR_CO"/>
<word relation="ATR"/>
<word relation="ATR_CO"/>
</word>
<word relation="SBJ">
<word relation="AuxP"/>
<word relation="COORD"/>
<word relation="ATR_CO"/>
<word relation="ATR"/>
<word relation="ATR_CO"/>
</word>
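Patterns of this kind can be obtained by reducing a nested word to its @relation values. The Python sketch below shows one way to do it and to tally identical patterns; it is not a port of NPAdverbPatterns.xq, the function name `skeleton` is made up, and the input is the nested sample sentence from this page rather than an actual adverb-as-NP match.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abridged copy of the nested sample sentence shown earlier on this page.
NESTED = """
<sentence id="89">
  <word id="3" form=";" relation="AuxK" head="0"/>
  <word id="4" form="εἶ" relation="PRED" head="0">
    <word id="1" form="οὐκ" relation="AuxZ" head="4"/>
    <word id="2" form="αἰσχροποιός" relation="PNOM" head="4"/>
  </word>
</sentence>
"""


def skeleton(word):
    """Copy a nested word, keeping only @relation on it and its descendants."""
    bare = ET.Element("word", {"relation": word.get("relation", "")})
    for child in word.findall("word"):
        bare.append(skeleton(child))
    return bare


sentence = ET.fromstring(NESTED.strip())
patterns = [skeleton(w) for w in sentence.findall("word")]

# Serialize each pattern so that identical configurations can be counted.
tally = Counter(ET.tostring(p, encoding="unicode") for p in patterns)
for pattern, count in tally.items():
    print(count, pattern)
```

Counting the serialized patterns makes repeated configurations, like the duplicated ones in the listing above, show up as a single entry with a frequency.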
Some ideas (not realized yet) on a separate page: RepeatResearch.
Visualizing a tree in the text: Visualize Tree
A brief explanation of the difference between Unicode NFC and NFD sequences, and how to handle it in XQuery.
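The difference is easy to see on a single polytonic Greek word. The sketch below uses Python's standard `unicodedata` module for illustration only; in XQuery the corresponding standard function is `fn:normalize-unicode`. The helper name `same_text` is invented for this example.

```python
import unicodedata

# "εἶ" written with escapes so the starting form is unambiguous:
# U+03B5 (epsilon) + U+1F36 (iota with psili and perispomeni), precomposed.
nfc = "\u03b5\u1f36"

# NFD splits the precomposed iota into iota + two combining marks.
nfd = unicodedata.normalize("NFD", nfc)


def same_text(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(nfc), len(nfd))   # number of code points in each form
print(nfc == nfd)           # a plain comparison treats them as different
print(same_text(nfc, nfd))  # a normalized comparison treats them as equal
```

The two strings look identical on screen, yet a plain comparison (or an index lookup) treats them as different, so both the corpus and the search term should be normalized to the same form before querying.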