This repository contains a neural transition-based dependency parser for Spanish built around the arc-eager algorithm and a static oracle. It was developed as part of the_ Natural Language Processing_ course at the University of Padua (2022-23).
The project explores dependency parsing with two neural approaches: a Bi-LSTM baseline trained from scratch and a BERT-based model using the Spanish BETO transformer. Both models predict the sequence of parser actions needed to build unlabeled dependency trees from raw sentences.
Training and evaluation use the Universal Dependencies es_AnCora treebank. The code filters non-projective trees and prepares dataloaders for train, validation and test splits.
The parser follows the arc-eager algorithm with classes for the parser and a static oracle. The Bi-LSTM model embeds tokens and processes them with a bidirectional LSTM before classification. The BERT-based model fine-tunes only the BETO pooler layer while using frozen contextual embeddings.
Both models are trained for 20 epochs and evaluated with the Unlabeled Attachment Score (UAS). The Notebook file includes utilities to save checkpoints and plot learning curves.
Final trained models are available here:
- Baseline model: Google Drive folder
- BERT-based model: Google Drive folder
Both models achieved strong parsing accuracy on the test set, with the BERT-based approach slightly outperforming the Bi-LSTM baseline. The learning curves show stable convergence over 20 epochs and confirm that fine-tuning only the BETO pooler layer avoids overfitting while preserving the benefits of pretrained contextual embeddings.
Clone the repository and run the Notebook to:
- Download and preprocess the dataset.
- Train either the Bi-LSTM or the BETO model.
- Evaluate on the test set and visualize results.
Python 3.9 or later with PyTorch, transformers, datasets, conllu, matplotlib, and tqdm. We strongly recommend to use [Google Colab].
This project demonstrates how neural architectures—both recurrent and transformer-based—can be applied to transition-based dependency parsing for Spanish text.