A language independent probabilistic model for disambiguation of abstract syntax trees

This project aims to develop a probabilistic model for disambiguating abstract syntax trees of natural language. The model parameters are estimated with unsupervised methods from linguistic data in multiple languages. Applications include parsing abstract syntax trees in Grammatical Framework (GF), and using the disambiguated trees to extract word sense information and to perform machine translation. Emphasis is put on making the model as language independent as possible. The main approach uses Expectation Maximization (EM) to estimate parameters from Universal Dependencies (UD) treebanks and from automatically parsed UD trees from various text corpora.
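To make the EM idea concrete, here is a minimal sketch on toy data. It is not the code in src: the function name `em_unigram`, the toy words and the latent labels are all hypothetical, and the model is deliberately simplified to a single distribution over latent items, where observing a word only reveals that the latent item is one of its candidates.

```python
from collections import defaultdict


def em_unigram(counts, possibilities, iterations=50):
    """Estimate P(latent) with EM from counts of ambiguous observed items.

    counts:        observed item -> frequency (toy stand-in for feature counts)
    possibilities: observed item -> list of candidate latent items
                   (toy stand-in for a possibility dictionary)
    """
    # Only items with a positive count and a known candidate list are used.
    data = {w: n for w, n in counts.items() if n > 0 and possibilities.get(w)}
    latents = sorted({z for w in data for z in possibilities[w]})
    total = sum(data.values())

    # Start from a uniform distribution over the latent items.
    prob = {z: 1.0 / len(latents) for z in latents}

    for _ in range(iterations):
        expected = defaultdict(float)
        # E-step: split each observed count over its candidates in
        # proportion to the current probabilities.
        for w, n in data.items():
            candidates = possibilities[w]
            norm = sum(prob[z] for z in candidates)
            for z in candidates:
                expected[z] += n * prob[z] / norm
        # M-step: renormalise the expected counts.
        prob = {z: expected[z] / total for z in latents}
    return prob


# Hypothetical toy data: "bank" is ambiguous, the other two words are not.
counts = {"bank": 6, "shore": 1, "lender": 3}
possibilities = {
    "bank": ["money_bank", "river_bank"],
    "shore": ["river_bank"],
    "lender": ["money_bank"],
}

estimate = em_unigram(counts, possibilities)
print({z: round(p, 3) for z, p in estimate.items()})
# prints {'money_bank': 0.75, 'river_bank': 0.25}
```

The unambiguous words pull the estimate for the ambiguous one: "lender" is three times as frequent as "shore", so the six ambiguous "bank" counts end up split 3:1 as well.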

Directory structure

src - main script files for estimating probabilities
evaluation - scripts for evaluation of estimated probabilities
data/feature_counts/{name}/{lang} - raw syntactic n-gram data from (parsed) corpora
data/possibility_dictionaries/{gf/wn}/{lang} - dictionaries describing the possible latent representations of each vocabulary item; GF-based and WordNet-based dictionaries are currently included

Running the code

To run the estimation, first preprocess the data by running src/make_all_em_data.sh. The estimation itself can then be run with src/run_em.sh. Depending on the type of probabilities you are interested in, you may want to add autoparsed data to the data/feature_counts/autoparsed directory, and to build a combined WordNet/GF possibility dictionary by running data/possibility_dictionaries/combine_gf_wn.sh.
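The order of the two required steps can be sketched as a small driver. The script paths are the ones named above; everything else is an assumption made for illustration, including the hypothetical `run_pipeline` helper, invoking the scripts with bash, passing no arguments, and running from the repository root.

```python
import subprocess
from pathlib import Path

# Script paths as named in this README. Running them with bash, without
# arguments, from the repository root is an assumption.
STEPS = ["src/make_all_em_data.sh", "src/run_em.sh"]


def run_pipeline(repo_root=".", dry_run=False):
    """Run preprocessing, then EM estimation, stopping at the first failure."""
    commands = []
    for script in STEPS:
        command = ["bash", script]
        commands.append(command)
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so estimation never starts on incomplete preprocessed data.
            subprocess.run(command, cwd=Path(repo_root), check=True)
    return commands


# Dry run only: lists the commands without executing anything.
run_pipeline(dry_run=True)
```

Calling `run_pipeline()` without `dry_run=True` from a checkout of the repository would execute both scripts in order.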

okalldal/gf-exjobb
