This repository combines three assignments from the NLP with LLMs course, supervised by Prof. Michael Elhadad. Each assignment focuses on a different aspect of using large language models (LLMs) in Natural Language Processing (NLP).
The course has two primary objectives: exploring key NLP tasks and examining the capabilities of LLMs when applied to in-depth linguistic analysis. Special emphasis is placed on the linguistic nature of the problems and on defining quality evaluation criteria for LLM-based solutions. Key themes include:
- The fundamental sources of complexity in NLP, such as ambiguity and variability.
- Different levels of linguistic analysis, including syntax, semantics, and pragmatics.
- How LLMs can be used to model a range of complex NLP tasks.
- Advanced methods for evaluating LLM performance.
- The systematic limitations of LLMs when performing certain linguistic tasks.
The curriculum is structured around the application of LLMs to classical NLP problems and modern techniques for working with these models:
- LLM Fundamentals: The course introduces the general use of LLMs, covering topics such as fine-tuning, alignment (instruction and preference alignment), and prompting methods such as Zero-shot, Few-shot, and Chain-of-Thought (CoT).
- Programming with LLMs: A significant portion of the course is dedicated to programming LLMs, specifically using frameworks like DSPy. This includes optimizing prompts, controlling structured output, and integrating external tools (a minimal DSPy sketch follows this list).
- Classical NLP Problems: The course covers how LLMs can be applied to and evaluated on traditional NLP tasks, including POS tagging, NER, and Natural Language Inference (NLI). Special attention is given to adapting these problems to leverage the strengths of LLMs.
- The course reframes traditional tasks through the lens of LLMs, asking how model-based approaches compare with established pipelines.
- Assignments are designed to show how LLMs can be integrated into tasks such as tagging, extraction, and question answering, with attention to both literal understanding and pragmatic (context-aware) reasoning.
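To make the programming-with-LLMs theme concrete, here is a minimal DSPy sketch contrasting a zero-shot predictor with a Chain-of-Thought predictor on a toy classification signature. The model name and signature fields are illustrative assumptions, not part of the course materials.

```python
import dspy

# Configure DSPy with a language model (the model name here is a placeholder).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class Sentiment(dspy.Signature):
    """Classify the sentiment of a sentence."""
    sentence: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="one of: positive, negative, neutral")

zero_shot = dspy.Predict(Sentiment)    # prompt built directly from the signature
cot = dspy.ChainOfThought(Sentiment)   # adds an intermediate reasoning step

example = "The plot was thin, but the acting saved it."
print(zero_shot(sentence=example).sentiment)
print(cot(sentence=example).sentiment)
```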
Compared traditional POS tagging using tools such as scikit-learn with model-based approaches using LLMs such as Gemini and Grok, focusing on how LLMs handle word-level linguistic tasks differently from classical models.
Built a range of prompt-driven tools, from counting features in text and adversarial NLI to entity extraction and embedding generation, showing the versatility of LLMs across diverse NLP tasks.
Leveraged the PragmatiCQA dataset and DSPy to investigate how LLMs can answer questions cooperatively, taking into account dialogue history, context, and pragmatic intent, moving beyond literal answers.
This assignment focused on Part-of-Speech (POS) tagging, with a particular emphasis on identifying challenging cases and exploring the use of Large Language Models (LLMs) to address them. The project uses a subset of the English Universal Dependencies (UD) dataset, provided in the CoNLL-U format.
The repository is structured into the following key parts:
- Data Exploration and Baselines: The initial phase involves exploring the dataset by computing basic statistics and establishing a simple statistical baseline using a Unigram tagger. The `count_pos.py` script demonstrates how to count UD POS tags from a CoNLL-U file (a minimal counting sketch appears after this list).
- Classical Tagger Implementation: A classical POS tagger is implemented using the scikit-learn library and Logistic Regression. This approach relies on a feature-engineering process that builds a dictionary of features for each token, including word shape, lexical information, and context from neighboring words (see the feature-extraction sketch after this list). The `ud_pos_tagger_sklearn.ipynb` notebook details this process, from data preprocessing and vectorization to model training and evaluation.
- Error Analysis: The classical tagger's performance is analyzed to identify common errors and their root causes. The analysis in the `ud_pos_tagger_sklearn.ipynb` notebook shows that the model struggles with grammatically ambiguous function words such as 'that', 'as', and 'like', which require broader syntactic context to classify correctly. Based on this analysis, "hard sentences" are constructed to challenge the tagger.
- LLM-based Tagger: The final section explores using an LLM to perform POS tagging. A simple zero-shot prompting strategy is implemented in `ud_pos_tagger_grok.py` using the Grok API to evaluate the LLM on the same task, specifically on the identified "hard sentences" (a hedged zero-shot sketch follows this list). Further exploration includes comparing pipeline and joint tagging strategies for sentences that also require segmentation.
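For the data-exploration step, a minimal sketch of counting UPOS tags from a CoNLL-U file is shown below. It is an illustration using the `conllu` package and a placeholder file path, not the repository's `count_pos.py` itself.

```python
from collections import Counter
from conllu import parse_incr  # pip install conllu

tag_counts = Counter()
with open("en_ewt-ud-train.conllu", encoding="utf-8") as f:  # placeholder path
    for sentence in parse_incr(f):           # stream sentences one at a time
        for token in sentence:
            tag_counts[token["upos"]] += 1   # UPOS column of each token

for tag, count in tag_counts.most_common():
    print(f"{tag}\t{count}")
```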
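The classical tagger's feature-engineering recipe can be sketched as follows. The exact feature set lives in `ud_pos_tagger_sklearn.ipynb`, so the features, toy data, and hyperparameters below are simplified assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def token_features(sent, i):
    """Per-token feature dict: word shape, lexical info, and neighboring words."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data: one feature dict per token, paired with its gold UPOS tag.
train = [(["I", "saw", "that", "movie"], ["PRON", "VERB", "DET", "NOUN"])]
X = [token_features(words, i) for words, tags in train for i in range(len(words))]
y = [tag for _, tags in train for tag in tags]

tagger = Pipeline([("vec", DictVectorizer()), ("clf", LogisticRegression(max_iter=1000))])
tagger.fit(X, y)
print(tagger.predict([token_features(["that", "movie"], 0)]))
```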
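A zero-shot LLM tagger in the spirit of `ud_pos_tagger_grok.py` might look like the sketch below. It assumes the xAI API is reached through an OpenAI-compatible client; the base URL, model name, and prompt wording are assumptions, not the script's actual contents.

```python
import os
from openai import OpenAI

# Assumed OpenAI-compatible endpoint for Grok; verify base URL and model name.
client = OpenAI(api_key=os.environ["XAI_API_KEY"], base_url="https://api.x.ai/v1")

PROMPT = (
    "Tag each token of the sentence with its Universal Dependencies POS tag "
    "(ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, "
    "PUNCT, SCONJ, SYM, VERB, X). Answer with 'token/TAG' pairs separated by spaces.\n\n"
    "Sentence: {sentence}"
)

def tag_sentence(sentence: str) -> str:
    response = client.chat.completions.create(
        model="grok-2-latest",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(tag_sentence("I know that that man walks like ducks walk."))
```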
This assignment focused on Natural Language Inference (NLI) with a special emphasis on handling explanations and presuppositions. The project explores both classical and LLM-based approaches to this task, using the Adversarial NLI (ANLI) and IMPPRES datasets. The assignment highlights the use of the DSPy library for programming and optimizing LLMs.
The repository is structured into the following key parts:
- NLI with Explanations: The first part involves a practical understanding and implementation of textual entailment on the ANLI dataset. A baseline NLI model using DeBERTa-v3-base-mnli-fever-anli is established, and a second baseline is implemented using an LLM with DSPy (a hedged baseline sketch follows this list). The goal is to empirically analyze how providing relevant explanations can improve NLI performance.
- Presupposition and Implicature Analysis: The second part delves into a pragmatic task: the analysis of presuppositions. The project uses the IMPPRES dataset, which is designed to test the pragmatic inference capabilities of NLI models. The dataset groups related sentence pairs into "paradigms" to verify a model's consistency across linguistic transformations.
- LLM-based Classifier: An improved IMPPRES classifier is implemented using an LLM and DSPy. The implementation focuses on exploiting the consistency signal within the IMPPRES paradigms: the model is optimized with a reward that combines prediction accuracy with consistency across each paradigm (see the metric sketch after this list). The assignment also compares different strategies and analyzes the results.
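A minimal sketch of the DeBERTa NLI baseline is shown below. The Hugging Face Hub ID is inferred from the checkpoint name mentioned above (assumed to be `MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli`); the premise/hypothesis pair is illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
labels = [model.config.id2label[i] for i in range(probs.shape[0])]
print(dict(zip(labels, probs.tolist())))  # entailment / neutral / contradiction scores
```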
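One plausible shape for a reward that blends accuracy with paradigm-level consistency is sketched below; the weighting and the all-or-nothing consistency bonus are illustrative assumptions, not the assignment's exact metric.

```python
def paradigm_reward(gold_labels, pred_labels, consistency_weight=0.5):
    """Score one IMPPRES paradigm: per-pair accuracy plus a bonus that is
    awarded only when every pair in the paradigm is predicted correctly."""
    correct = [g == p for g, p in zip(gold_labels, pred_labels)]
    accuracy = sum(correct) / len(correct)
    consistency = 1.0 if all(correct) else 0.0
    return (1 - consistency_weight) * accuracy + consistency_weight * consistency

# Three of four pairs correct: partial accuracy credit, no consistency bonus.
print(paradigm_reward(
    ["entailment", "neutral", "contradiction", "entailment"],
    ["entailment", "neutral", "contradiction", "neutral"],
))
```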
The repository also includes tutorials on using the DSPy framework to perform tasks like entity extraction, demonstrating how to structure, evaluate, and optimize LLM programs.
This assignment explores the field of pragmatics in Natural Language Processing, focusing on how context contributes to meaning and how to build a Cooperative Question Answering (QA) system that goes beyond literal interpretation. The project uses the PragmatiCQA dataset, which is specifically designed to evaluate a model's ability to perform pragmatic reasoning in conversations. A key aspect of the assignment is the implementation and evaluation of a program using the DSPy framework, together with a reflection on the connection between pragmatic reasoning and the concept of Theory of Mind (ToM) in AI.
The repository is structured around the following core tasks:
- Cooperative QA and Reasoning: The assignment provides a theoretical background on various QA types, including closed, passage-grounded, and open QA, and introduces the concept of Cooperative QA, where a system anticipates a user's intent to provide a more helpful response. This approach is contrasted with simpler methods such as retrieving a literal answer from a text span. The project compares a traditional QA model with a multi-step reasoning approach using an LLM to assess which is more effective for pragmatic tasks (a hedged DSPy sketch follows this list).
- PragmatiCQA Dataset: The work utilizes the PragmatiCQA dataset, which contains 6,873 conversational QA pairs. The dataset is notable for its innovative crowdsourcing methodology, designed to align crowdworker incentives with the goal of collecting high-quality pragmatic data. The dataset's structure includes both `literal` and `pragmatic` answer spans, which are used to evaluate how well a model can provide helpful, non-literal information (see the evaluation sketch after this list).
- Theory of Mind (ToM) Analysis: A central theme of the assignment is to analyze the extent to which a model demonstrates Theory of Mind. This involves a critical reflection on whether the LLM is genuinely inferring a speaker's intent and state of mind, or simply performing a sophisticated form of pattern matching. The final submission includes a written analysis on this topic, supported by examples from the experiments.
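A cooperative QA module over conversation history and grounding text can be sketched in DSPy roughly as follows; the model name, signature fields, and example inputs are assumptions for illustration, not the assignment's actual program.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

class CooperativeQA(dspy.Signature):
    """Answer cooperatively: give the literal answer grounded in the passage,
    then add helpful information the asker is likely to want next."""
    history: str = dspy.InputField(desc="previous turns of the conversation")
    passage: str = dspy.InputField(desc="source text the answer must be grounded in")
    question: str = dspy.InputField()
    literal_answer: str = dspy.OutputField(desc="direct answer taken from the passage")
    pragmatic_answer: str = dspy.OutputField(desc="cooperative elaboration anticipating intent")

cooperative_qa = dspy.ChainOfThought(CooperativeQA)  # multi-step: reason, then answer

result = cooperative_qa(
    history="Q: Who wrote the novel? A: Mary Shelley.",
    passage="Frankenstein was published anonymously in 1818; Shelley's name appeared on the 1823 edition.",
    question="When was it published?",
)
print(result.literal_answer)
print(result.pragmatic_answer)
```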
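To compare a model's answer against the literal and pragmatic reference spans, a simple token-overlap F1 can be used; this is a simplified illustration, not PragmatiCQA's official evaluation script.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and one reference span."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

answer = "It came out anonymously in 1818 and Shelley was credited on the 1823 edition"
literal_spans = ["published anonymously in 1818"]
pragmatic_spans = ["Shelley's name appeared on the 1823 edition"]

# Score against the best-matching span in each set.
print("literal F1:  ", max(token_f1(answer, s) for s in literal_spans))
print("pragmatic F1:", max(token_f1(answer, s) for s in pragmatic_spans))
```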
Thanks to Professor Michael Elhadad for the wonderful lectures.