This unguided research project served as the final paper for the Natural Language Processing graduate course at UT Austin. My partner and I sought to contribute to the field of large language models by attacking an ELECTRA-small model with adversarial data that mimics the imperfect inputs of a production environment. We then analyzed the errors these adversarial attacks produced and proposed methods for improving the robustness and safety of consumer-facing LLMs.
To learn more, please read the full report, whose abstract is reproduced below. Unfortunately, because this was an assignment for an active class at the University of Texas, sharing the project files would breach the Academic Honesty agreement, but the paper should include enough detail for an interested party to recreate the work.
Question answering is a popular NLP task, driven in part by popular interest in commercializing recent advances in LLMs; however, the excellent performance of these models on common academic QA benchmarks does not always transfer cleanly to industrial contexts (Ribeiro et al. 2020). One egregious example is that seemingly innocuous changes to the input (e.g., a typo or a missing word) can drastically reduce performance (Gardner et al. 2020). Such model “blind spots” are commonly referred to as dataset artifacts. In this paper we first identify some dataset artifacts that approximate the data imperfections and difficulties these models might encounter when launched into commercial production. Then we explore methods to mitigate those artifacts during the fine-tuning process of an ELECTRA transformer model on the SQuAD QA benchmark (Clark et al. 2020; Rajpurkar et al. 2016).
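To make the kind of input perturbation described above concrete, here is a minimal sketch of the two perturbations the abstract mentions, a typo and a dropped word, applied to a question string. This is an illustrative example only, not the project's actual attack code; the function names and the sample question are my own.

```python
import random


def inject_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters in a random word to simulate a typo."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def drop_word(text: str, rng: random.Random) -> str:
    """Delete one randomly chosen word to simulate a missing word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    question = "What is the capital city of Texas?"  # hypothetical SQuAD-style question
    print(inject_typo(question, rng))
    print(drop_word(question, rng))
```

Feeding perturbed questions like these to a fine-tuned QA model and comparing its answers against the originals is one simple way to surface the "blind spots" the paper investigates.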