Niels Tiben s203131
Denis Ghiletki s210714
Jakob Schneider s202432
Kristin Anett Remmelgas s203129
Wybren Meinte Oppedijk s203130
The goal of the project is to use natural language processing to solve a binary classification task: predicting whether a given tweet is about a real disaster or not.
Since we chose a natural language processing problem, we plan to use the Hugging Face Transformers framework.
One of the strengths of Transformers is that it provides thousands of pretrained models for a wide range of tasks. As a starting point we intend to run some of these pretrained models on our data and then see how we can further improve from there.
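One cheap way to get such a first baseline, before any fine-tuning, could be zero-shot classification with a pretrained model from the hub. This is only a minimal sketch, not our committed approach; the checkpoint and the label names are our own illustrative choices:

```python
from transformers import pipeline

# Zero-shot baseline: no fine-tuning, just a pretrained NLI model from the hub.
# "facebook/bart-large-mnli" is one common choice; others would work as well.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example tweet (made up for illustration).
result = classifier(
    "Forest fire near the town, everyone is being evacuated",
    candidate_labels=["real disaster", "not a disaster"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label first
```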
We are using the Kaggle dataset Natural Language Processing with Disaster Tweets. Each sample in the train and test set has the following information: a unique identifier, the text of the tweet, a keyword from that tweet (may be blank), and the location the tweet was sent from (may also be blank). The training set additionally has a target value indicating whether a tweet is about a real disaster (1) or not (0). The dataset was chosen because it is simple and straightforward, which makes it a great dataset for getting started with natural language processing, and it seems feasible to work with in such a short timeframe.
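A quick sketch of what loading and inspecting the data could look like, assuming the train.csv and test.csv files from Kaggle are placed under a hypothetical data/ directory:

```python
import pandas as pd

# Kaggle ships the competition data as train.csv and test.csv;
# the data/ prefix is our assumed local layout.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.columns.tolist())  # expected: ['id', 'keyword', 'location', 'text', 'target']
print(train["target"].value_counts())                 # 1 = real disaster, 0 = not
print(train[["keyword", "location"]].isna().mean())   # share of blank keyword/location fields
```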
Due to the limited time, we intend to start from pre-trained models and fine-tune them further on our dataset. Since we are working with tweets, one of the models we plan to use is BERTweet, the first public large-scale pre-trained language model for English tweets.
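Loading BERTweet from the Hugging Face hub could look roughly like this; vinai/bertweet-base is the base-sized public checkpoint, and everything else here is an untuned placeholder:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# normalization=True asks the BERTweet tokenizer to apply its tweet-specific
# normalization (user mentions, URLs) before encoding.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# num_labels=2 replaces the pre-training head with a fresh binary
# classification head for disaster / not-disaster.
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=2
)

# Encode one made-up example tweet; BERTweet was pre-trained with a
# 128-token sequence limit.
inputs = tokenizer("House on fire near the station!!", return_tensors="pt",
                   truncation=True, max_length=128)
print(model(**inputs).logits)
```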
We might also look into the ALBERT and DistilBERT models, which are lighter variants of BERT that make training faster. That would be beneficial for us given our time constraints.
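Because all of these models share the same Auto* interface, comparing them should mostly be a matter of swapping the checkpoint name. A rough fine-tuning sketch using the Trainer API; the paths, hyperparameters, and train/validation split below are placeholders we would still have to tune:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Any of the candidate checkpoints can be dropped in here, e.g.
# "vinai/bertweet-base", "albert-base-v2" or "distilbert-base-uncased".
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Assumed local path; Trainer expects the label column to be called "labels".
df = pd.read_csv("data/train.csv")[["text", "target"]].rename(columns={"target": "labels"})
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Placeholder hyperparameters; these are not tuned values.
args = TrainingArguments(output_dir="models/baseline",
                         num_train_epochs=2,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()
```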
See CHECKLIST.md