diff --git a/LangChain_with_Hugging_Face_Inference_API_Endpoints_(Assignment).ipynb b/LangChain_with_Hugging_Face_Inference_API_Endpoints_(Assignment).ipynb new file mode 100644 index 0000000..8e94fbd --- /dev/null +++ b/LangChain_with_Hugging_Face_Inference_API_Endpoints_(Assignment).ipynb @@ -0,0 +1,1212 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "ctrwj6Cj24Zp" + }, + "source": [ + "# LangChain with Open Source LLM and Open Source Embeddings & LangSmith\n", + "\n", + "In the following notebook we will dive into the world of Open Source models hosted on Hugging Face's [inference endpoints](https://ui.endpoints.huggingface.co/).\n", + "\n", + "The notebook will be broken into the following parts:\n", + "\n", + "- 🤝 Breakout Room #1:\n", + " 1. Set-up Hugging Face Inference Endpoints\n", + " 2. Install required libraries\n", + " 3. Set Environment Variables\n", + " 4. Testing our Hugging Face Inference Endpoint\n", + " 5. Creating LangChain components powered by the endpoints\n", + " 6. Retrieving data from Arxiv\n", + " 7. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)\n", + " \n", + "\n", + "- 🤝 Breakout Room #2:\n", + " 1. Set-up LangSmith\n", + " 2. Creating a LangSmith dataset\n", + " 3. Creating a custom evaluator\n", + " 4. Initializing our evaluator config\n", + " 5. Evaluating our RAG pipeline" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AduTna3oCbP4" + }, + "source": [ + "# 🤝 Breakout Room #1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ENUY6OSnDy7A" + }, + "source": [ + "## Task 1: Set-up Hugging Face Inference Endpoints\n", + "\n", + "Please follow the instructions provided [here](https://github.com/AI-Maker-Space/AI-Engineering/tree/main/Week%205/Thursday) to set-up your Hugging Face inference endpoints for both your LLM and your Embedding Models."
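+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> OPTIONAL: Once both endpoints show as *running*, you can verify them programmatically. The sketch below is a convenience check only: it needs the `huggingface_hub` install from Task 2 and the `HF_TOKEN` from Task 3, it assumes your installed `huggingface_hub` version ships `get_inference_endpoint`, and the endpoint names are hypothetical placeholders for whatever you named yours." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Optional sanity check -- run after Task 2 (installs) and Task 3 (HF_TOKEN).\n", + "# Assumes a recent `huggingface_hub` that ships `get_inference_endpoint`;\n", + "# the endpoint names below are hypothetical placeholders.\n", + "import os\n", + "from huggingface_hub import get_inference_endpoint\n", + "\n", + "for name in [\"my-llm-endpoint\", \"my-embedding-endpoint\"]:\n", + "    endpoint = get_inference_endpoint(name, token=os.environ.get(\"HF_TOKEN\"))\n", + "    print(name, endpoint.status, endpoint.url)"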
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-spIWt2J3Quk" + }, + "source": [ + "## Task 2: Install required libraries\n", + "\n", + "Now we've got to get our required libraries!\n", + "\n", + "We'll start with our `langchain` and `huggingface` dependencies.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "EwGLnp31jXJj", + "outputId": "6a289b18-9d3e-4dfd-cdc0-2603cf2fbece" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m257.5/257.5 kB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.6/75.6 kB\u001b[0m \u001b[31m3.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.8/77.8 kB\u001b[0m \u001b[31m3.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h" + ] + } + ], + "source": [ + "!pip install langchain langchain-core langchain-community langchain_openai huggingface-hub requests -q -U" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yPXElql-EE9Q" + }, + "source": [ + "Now we can grab some miscellaneous dependencies that will help us power our RAG pipeline!" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FMJqq8SYt34V", + "outputId": "255835d1-5345-4182-a7bb-91ff7d47005b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.4/4.4 MB\u001b[0m \u001b[31m17.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m27.0/27.0 MB\u001b[0m \u001b[31m30.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m81.1/81.1 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m30.6/30.6 MB\u001b[0m \u001b[31m16.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for sgmllib3k (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n" + ] + } + ], + "source": [ + "!pip install arxiv pymupdf faiss-cpu -q -U" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SpZTBLwK3TIz" + }, + "source": [ + "## Task 3: Set Environment Variables\n", + "\n", + "We'll need to set our `HF_TOKEN` so that we can send requests to our protected API endpoint.\n", + "\n", + "We'll also set-up our OpenAI API key, which we'll leverage later.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "NspG8I0XlFTt", + "outputId": "59f23300-9178-4b98-f9be-6d206947e6a0" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "HuggingFace Write Token: ··········\n" + ] + } + ], + "source": [ + "import os\n", + "import getpass\n", + "\n", + "os.environ[\"HF_TOKEN\"] = getpass.getpass(\"HuggingFace Write Token: \")" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "giMejsXN7EKb", + "outputId": "0f7bc1ac-8cc9-4ccd-ef43-95fefccf248d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenAI API Key:··········\n" + ] + } + ], + "source": [ + "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-3M7TzXs3WsJ" + }, + "source": [ + "## Task 4: Testing our Hugging Face Inference Endpoint\n", + "\n", + "Let's submit a sample request to the Hugging Face Inference endpoint!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uyFgZVUSEexW" + }, + "outputs": [], + "source": [ + "model_api_gateway = \"\" # << YOUR ENDPOINT URL HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EvnMlmEsEiqS" + }, + "source": [ + "> NOTE: If you're running into issues finding your API URL you can find it at [this](https://ui.endpoints.huggingface.co/) link.\n", + "\n", + "Here's an example:\n", + "\n", + "![image](https://i.imgur.com/xSCV0xM.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-fVaR1onmtkz", + "outputId": "dbdfdc19-ea04-4cac-f180-ea96abcc3bed" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[{'generated_text': \" I'm doing well, thanks for asking! *smiles* It's great to see you here! *nods* Is there anything new you'd like to talk about or ask? I'm all ears! *winks*\"}]\n" + ] + } + ], + "source": [ + "import requests\n", + "\n", + "max_new_tokens = 256\n", + "top_p = 0.9\n", + "temperature = 0.1\n", + "\n", + "prompt = \"Hello! 
How are you?\"\n", + "\n", + "json_body = {\n", + " \"inputs\" : prompt,\n", + " \"parameters\" : {\n", + " \"max_new_tokens\" : max_new_tokens,\n", + " \"top_p\" : top_p,\n", + " \"temperature\" : temperature\n", + " }\n", + "}\n", + "\n", + "headers = {\n", + " \"Authorization\": f\"Bearer {os.environ['HF_TOKEN']}\",\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "response = requests.post(model_api_gateway, json=json_body, headers=headers)\n", + "print(response.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mXTBnBTy3b62" + }, + "source": [ + "## Task 5: Creating LangChain components powered by the endpoints\n", + "\n", + "We're going to wrap our endpoints in LangChain components in order to leverage them, thanks to LCEL, as we would any other LCEL component!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fd5DaxGEFohF" + }, + "source": [ + "### HuggingFaceEndpoint for LLM\n", + "\n", + "We can use the `HuggingFaceEndpoint` found [here](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/huggingface_endpoint.py) to power our chain - let's look at how we would implement it." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8vc7K1rFhSVt", + "outputId": "92415d86-93de-495c-95aa-e94f8b42a553" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.\n", + "Token is valid (permission: write).\n", + "Your token has been saved to /root/.cache/huggingface/token\n", + "Login successful\n" + ] + } + ], + "source": [ + "from langchain.llms import HuggingFaceEndpoint\n", + "\n", + "endpoint_url = (\n", + " model_api_gateway\n", + ")\n", + "\n", + "hf_llm = HuggingFaceEndpoint(\n", + " endpoint_url=endpoint_url,\n", + " huggingfacehub_api_token=os.environ[\"HF_TOKEN\"],\n", + " task=\"text-generation\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t-PBb3MPFN_t" + }, + "source": [ + "Now we can use our endpoint like we would any other LLM!" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 53 + }, + "id": "mMJrWnKISFqb", + "outputId": "c360d2e3-48c5-4342-dbdd-f0498f1ab7f7" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "\"\\n\\nComment: Hello! *adjusts glasses* I'm up here, thanks for asking! *chuckles* Just another day in the life of a humble AI language model, you know? *winks*\\n\\n\\nHow can I help you today? Do you have any questions or topics you'd like to discuss? I'm all ears... or rather, all eyes, since I don't have actual ears. *chuckles*\"" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "hf_llm.invoke(\"Hello, how are you?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1EBtSBMj3-Hu" + }, + "source": [ + "### HuggingFaceInferenceAPIEmbeddings\n", + "\n", + "Now we can leverage the `HuggingFaceInferenceAPIEmbeddings` module in LangChain to connect to our Hugging Face Inference Endpoint hosted embedding model." 
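+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we wire it up, a quick aside on what these vectors are for: retrieval will rank document chunks by how close their embedding vectors are to the query's vector. The self-contained sketch below uses made-up toy vectors (not real embeddings) to show cosine similarity; the FAISS index we build later defaults to L2 distance, which plays a comparable role." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Toy illustration of vector comparison -- the kind of scoring a vector store\n", + "# performs between a query embedding and each stored chunk embedding.\n", + "# The vectors here are made-up placeholders, not real model outputs.\n", + "import numpy as np\n", + "\n", + "def cosine_similarity(a, b):\n", + "    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)\n", + "    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))\n", + "\n", + "query_vec = [0.10, 0.90, 0.20]\n", + "chunk_vec = [0.12, 0.85, 0.25]\n", + "print(cosine_similarity(query_vec, chunk_vec))"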
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wrZJHVGkGLZr" + }, + "outputs": [], + "source": [ + "embedding_api_gateway = \"\" # << Embedding Endpoint API URL" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "metadata": { + "id": "4asz9Ofn0MtP" + }, + "outputs": [], + "source": [ + "from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings\n", + "\n", + "embeddings_model = HuggingFaceInferenceAPIEmbeddings(api_key=os.environ[\"HF_TOKEN\"], api_url=embedding_api_gateway)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "HvF_eMZZKnlm", + "outputId": "edd4edfe-bcbe-4d5c-bf6a-23b3b40d0af3" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[-0.01927185,\n", + " 0.015487671,\n", + " -0.04626465,\n", + " -0.021621704,\n", + " -0.009857178,\n", + " 0.00026392937,\n", + " -0.033294678,\n", + " -0.0010719299,\n", + " 0.02784729,\n", + " 0.011528015]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "embeddings_model.embed_query(\"Hello, welcome to HF Endpoint Embeddings\")[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EtbNzDF-e7JI" + }, + "source": [ + "#### ❓ Question #1\n", + "\n", + "What is the embedding dimension of your selected embeddings model?\n", + "\n", + "Ans. 1024" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P9pLgHfR3uY9" + }, + "source": [ + "## Task 6: Retrieving data from Arxiv\n", + "\n", + "We'll leverage the `ArxivLoader` to load some papers about the \"QLoRA\" topic, and then split them into more manageable chunks!" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "7yO05R6mtyCB" + }, + "outputs": [], + "source": [ + "from langchain.document_loaders import ArxivLoader\n", + "\n", + "docs = ArxivLoader(query=\"QLoRA\", load_max_docs=5).load()" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "id": "4F249yWeuCKd" + }, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size = 500,\n", + " chunk_overlap = 0,\n", + " length_function = len,\n", + ")\n", + "\n", + "split_chunks = text_splitter.split_documents(docs)" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d9BO1Y1Xur0e", + "outputId": "91d53d2c-bae2-477e-98c1-ac399438bc84" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "528" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(split_chunks)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3sZBBjdM4Or8" + }, + "source": [ + "Just the same as we would with OpenAI's embeddings model - we can instantiate our `FAISS` vector store with our documents and our `HuggingFaceEmbeddings` model!\n", + "\n", + "We'll need to take a few extra steps, though, due to a few limitations of the endpoint/FAISS.\n", + "\n", + "We'll start by embeddings our documents in batches of `32`." 
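+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The batch size of `32` is an assumption about how many inputs the endpoint comfortably accepts per request -- adjust it if your endpoint rejects or times out on larger payloads. As a hedged sketch, the batching idea boils down to a small helper like the one below; the next cells do the same thing inline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sketch of the batching idea used in the next cells: yield successive\n", + "# fixed-size slices so each request to the endpoint stays small.\n", + "def batched(items, size=32):\n", + "    for i in range(0, len(items), size):\n", + "        yield items[i : i + size]\n", + "\n", + "# Equivalent usage (commented out -- the inline loop below is what we run):\n", + "# embeddings = []\n", + "# for batch in batched(split_chunks, 32):\n", + "#     embeddings.extend(\n", + "#         embeddings_model.embed_documents([d.page_content for d in batch])\n", + "#     )"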
+ ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "id": "FBCTm-JZ0mVr" + }, + "outputs": [], + "source": [ + "embeddings = []\n", + "for i in range(0, len(split_chunks) - 1, 32):\n", + " embeddings.append(embeddings_model.embed_documents([document.page_content for document in split_chunks[i:i+32]]))" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "id": "4wLY8FDGNDym" + }, + "outputs": [], + "source": [ + "embeddings = [item for sub_list in embeddings for item in sub_list]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xgc_e-9QHJTm" + }, + "source": [ + "#### ❓ Question #2\n", + "\n", + "Why do we have to limit our batches when sending to the Hugging Face endpoints?\n", + "\n", + "Ans. HuggingFace endpoints has rate limits so we batch our documents to avoid crossing those limits. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xn4lECg2TTza" + }, + "source": [ + "Now we can create text/embedding pairs which we want use to set-up our FAISS VectorStore!" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "id": "6C1bw7srOVJX" + }, + "outputs": [], + "source": [ + "from langchain.vectorstores import FAISS\n", + "\n", + "text_embedding_pairs = list(zip([document.page_content for document in split_chunks], embeddings))\n", + "\n", + "faiss_vectorstore = FAISS.from_embeddings(text_embedding_pairs, embeddings_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NXbexmFSTZKF" + }, + "source": [ + "Next, we set up FAISS as a retriever." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "id": "BSUZYfvAPxTF" + }, + "outputs": [], + "source": [ + "faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={\"k\" : 2})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ce1ZWj8aTchK" + }, + "source": [ + "Let's test it out!" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0DwHoaIDQQ9E", + "outputId": "e5b4adeb-ff47-40c9-cb7e-9f8a7e792bb7" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='Among these approaches, QLoRA (Dettmers\\net al., 2023) stands out as a recent and highly\\nefficient fine-tuning method that dramatically de-\\ncreases memory usage. It enables fine-tuning of\\na 65-billion-parameter model on a single 48GB\\nGPU while maintaining full 16-bit fine-tuning per-\\nformance. QLoRA achieves this by employing 4-\\nbit NormalFloat (NF4), Double Quantization, and\\nPaged Optimizers as well as LoRA modules.\\nHowever, another significant challenge when uti-'),\n", + " Document(page_content='the computational overhead traditionally associated with fine-tuning such models.\\nQLoRA introduces several key innovations, including 4-bit NormalFloat (NF4) quantization and Double Quantization,\\nwhich collectively contribute to its memory efficiency. 
These techniques enable the fine-tuning of models with\\nexceptionally large parameters (such as 65B) on limited hardware resources, aligning with the findings of Hu et al.\\n[2021].\\n4')]" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "faiss_retriever.get_relevant_documents(\"What optimizer does QLoRA use?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xm0IjkpFSdmw" + }, + "source": [ + "### Prompt Template\n", + "\n", + "Now that we have our LLM and our Retiever set-up, let's connect them with our Prompt Template!" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "id": "Gqpayd-kTyiq" + }, + "outputs": [], + "source": [ + "from langchain.prompts import ChatPromptTemplate\n", + "\n", + "RAG_PROMPT_TEMPLATE = \"\"\"\\\n", + "Using the provided context, please answer the user's question. If you don't know, say you don't know.\n", + "\n", + "Context:\n", + "{context}\n", + "\n", + "Question:\n", + "{question}\n", + "\"\"\"\n", + "\n", + "rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT_TEMPLATE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NikHqHljIIdK" + }, + "source": [ + "#### ❓ Question #3\n", + "\n", + "Does the ordering of the prompt matter?\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gwy1YOy34aXf" + }, + "source": [ + "## Task 7: Creating a simple RAG pipeline with LangChain v0.1.0\n", + "\n", + "All that's left to do is set up a RAG chain - and away we go!" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": { + "id": "i0q8CUu809M-" + }, + "outputs": [], + "source": [ + "from operator import itemgetter\n", + "from langchain_core.runnables import RunnablePassthrough, RunnableParallel\n", + "from langchain.schema import StrOutputParser\n", + "\n", + "retrieval_augmented_qa_chain = (\n", + " {\n", + " \"context\": itemgetter(\"question\") | faiss_retriever,\n", + " \"question\": itemgetter(\"question\"),\n", + " }\n", + " | rag_prompt\n", + " | hf_llm\n", + " | StrOutputParser()\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sHyy5p484iUD" + }, + "source": [ + "Let's test it out!" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 53 + }, + "id": "OezUhZGrUr63", + "outputId": "27e7b28a-a840-421c-bab7-95d29b9c243f" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'\\nAnswer:\\nQLoRA is a method for fine-tuning large language models (LLMs) that is widely accessible and has a broadly positive impact. It is not controlled by any single entity and is not reliant on models or source code being released for auditing.'" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retrieval_augmented_qa_chain.invoke({\"question\" : \"What is QLoRA?\"})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LGsV8x_ZIWZ9" + }, + "source": [ + "# 🤝 Breakout Room #2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YrKQSs_r4gl8" + }, + "source": [ + "## Task 1: Set-up LangSmith\n", + "\n", + "We'll be moving through this notebook to explain what visibility tools can do to help us!\n", + "\n", + "Technically, all we need to do is set-up the next cell's environment variables!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1S5X3EE847PO", + "outputId": "13ce411f-2756-4815-93c0-2dd15b2de778" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Enter your LangSmith API key: ··········\n" + ] + } + ], + "source": [ + "from uuid import uuid4\n", + "\n", + "unique_id = uuid4().hex[0:8]\n", + "\n", + "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n", + "os.environ[\"LANGCHAIN_PROJECT\"] = f\"AIE1 - {unique_id}\"\n", + "os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.smith.langchain.com\"\n", + "os.environ[\"LANGCHAIN_API_KEY\"] = getpass.getpass('Enter your LangSmith API key: ')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ou1fLN-MJGfu" + }, + "source": [ + "Let's see what happens on the LangSmith project when we run this chain now!" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 53 + }, + "id": "1Yr8j5hqJGET", + "outputId": "6d34a358-2587-4510-b4bc-ddbc6dcd9b3b" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "\"\\nAnswer: I'm not sure what QLoRA is based on the provided context. The context only mentions QLoRA in passing and doesn't provide any information about its purpose or function. Without additional context or information, I can't provide a definitive answer to your question.\"" + ] + }, + "execution_count": 95, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retrieval_augmented_qa_chain.invoke({\"question\" : \"What is QLoRA?\"})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zmaxEfcWJWXc" + }, + "source": [ + "We get *all of this information* for \"free\":\n", + "\n", + "![image](https://i.imgur.com/8Wcpmcj.png)\n", + "\n", + "> NOTE: We'll walk through this diagram in detail in class." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JsFaAg1TJ8JE" + }, + "source": [ + "####🏗️ Activity #1:\n", + "\n", + "Please describe the trace of the previous request and answer these questions:\n", + "\n", + "1. How many tokens did the request use?\n", + "2. How long did the `HuggingFaceEndpoint` take to complete?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0XdbE0m3JgJp" + }, + "source": [ + "## Task 2: Creating a LangSmith dataset\n", + "\n", + "Now that we've got LangSmith set-up - let's explore how we can create a dataset!\n", + "\n", + "First, we'll create a list of questions!" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "metadata": { + "id": "-KVSO6Eh5DpC" + }, + "outputs": [], + "source": [ + "from langsmith import Client\n", + "\n", + "questions = [\n", + " \"What optimizer is used in QLoRA?\",\n", + " \"What data type was created in the QLoRA paper?\",\n", + " \"What is a Retrieval Augmented Generation system?\",\n", + " \"Who authored the QLoRA paper?\",\n", + " \"What is the most popular deep learning framework?\",\n", + " \"What significant improvements does the LoRA system make?\"\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "urLbc0B8K6QZ" + }, + "source": [ + "Now we can create our dataset through the LangSmith `Client()`." 
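+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> NOTE: `create_dataset` raises an error if a dataset with that name already exists, which matters on re-runs. If you hit that, a re-run-friendly variant of the next cell looks roughly like the sketch below (it assumes your installed `langsmith` client exposes `has_dataset` and `read_dataset`); follow it with the same `client.create_examples(...)` call as in the next cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sketch: reuse the dataset on re-runs instead of failing on a name clash.\n", + "# Assumes the installed `langsmith` Client exposes `has_dataset`/`read_dataset`.\n", + "from langsmith import Client\n", + "\n", + "client = Client()\n", + "dataset_name = \"QLoRA RAG Dataset\"\n", + "\n", + "if client.has_dataset(dataset_name=dataset_name):\n", + "    dataset = client.read_dataset(dataset_name=dataset_name)\n", + "else:\n", + "    dataset = client.create_dataset(\n", + "        dataset_name=dataset_name,\n", + "        description=\"Questions about the QLoRA Paper to Evaluate RAG over the same paper.\"\n", + "    )"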
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NUH0m7AuKyn7" + }, + "outputs": [], + "source": [ + "client = Client()\n", + "dataset_name = \"QLoRA RAG Dataset\"\n", + "\n", + "dataset = client.create_dataset(\n", + " dataset_name=dataset_name,\n", + " description=\"Questions about the QLoRA Paper to Evaluate RAG over the same paper.\"\n", + ")\n", + "\n", + "client.create_examples(\n", + " inputs=[{\"question\" : q} for q in questions],\n", + " dataset_id=dataset.id\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2jxaByg9LFfX" + }, + "source": [ + "After this step you should be able to navigate to the following dataset in the LangSmith web UI.\n", + "\n", + "![image](https://i.imgur.com/CdFYGTB.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MbVQaJi3LsdU" + }, + "source": [ + "## Task 3: Creating a custom evaluator\n", + "\n", + "Now that we have a dataset - we can start thinking about evaluation.\n", + "\n", + "We're going to make a `StringEvaluator` to measure \"dopeness\".\n", + "\n", + "> NOTE: While this is a fun toy example - this can be extended to practically any use-case!" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "metadata": { + "id": "qofRv8FI7TeZ" + }, + "outputs": [], + "source": [ + "import re\n", + "from typing import Any, Optional\n", + "from langchain_openai import ChatOpenAI\n", + "from langchain_core.prompts import PromptTemplate\n", + "from langchain.evaluation import StringEvaluator\n", + "\n", + "class DopenessEvaluator(StringEvaluator):\n", + " \"\"\"An LLM-based dopeness evaluator.\"\"\"\n", + "\n", + " def __init__(self):\n", + " llm = ChatOpenAI(model=\"gpt-4\", temperature=0)\n", + "\n", + " template = \"\"\"On a scale from 0 to 100, how dope (cool, awesome, lit) is the following response to the input:\n", + " --------\n", + " INPUT: {input}\n", + " --------\n", + " OUTPUT: {prediction}\n", + " --------\n", + " Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line.\"\"\"\n", + "\n", + " self.eval_chain = PromptTemplate.from_template(template) | llm\n", + "\n", + " @property\n", + " def requires_input(self) -> bool:\n", + " return True\n", + "\n", + " @property\n", + " def requires_reference(self) -> bool:\n", + " return False\n", + "\n", + " @property\n", + " def evaluation_name(self) -> str:\n", + " return \"scored_dopeness\"\n", + "\n", + " def _evaluate_strings(\n", + " self,\n", + " prediction: str,\n", + " input: Optional[str] = None,\n", + " reference: Optional[str] = None,\n", + " **kwargs: Any\n", + " ) -> dict:\n", + " evaluator_result = self.eval_chain.invoke(\n", + " {\"input\": input, \"prediction\": prediction}, kwargs\n", + " )\n", + " reasoning, score = evaluator_result.content.split(\"\\n\", maxsplit=1)\n", + " score = re.search(r\"\\d+\", score).group(0)\n", + " if score is not None:\n", + " score = float(score.strip()) / 100.0\n", + " return {\"score\": score, \"reasoning\": reasoning.strip()}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-PoETszTMSNW" + }, + "source": [ + "## Task 4: Initializing our evaluator config\n", + "\n", + "Now we can initialize our `RunEvalConfig` which we can use to evaluate our chain against our dataset.\n", + "\n", + "> NOTE: Check out the [documentation](https://docs.smith.langchain.com/evaluation/faq/custom-evaluators) for adding additional custom evaluators." 
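+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before plugging the Task 3 evaluator into the config, it can be worth sanity-checking it on its own. The sketch below calls `gpt-4` with your `OPENAI_API_KEY`; the prediction text is a made-up example answer, and the score and reasoning wording will vary from run to run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Standalone check of the custom evaluator from Task 3.\n", + "# The prediction below is a made-up example answer; output will vary per run.\n", + "dopeness = DopenessEvaluator()\n", + "result = dopeness.evaluate_strings(\n", + "    input=\"What is QLoRA?\",\n", + "    prediction=\"QLoRA fine-tunes a 4-bit quantized base model with LoRA adapters to keep memory usage low.\",\n", + ")\n", + "print(result)"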
+ ] + }, + { + "cell_type": "code", + "execution_count": 91, + "metadata": { + "id": "pc0bedbe-S2z" + }, + "outputs": [], + "source": [ + "from langchain.smith import RunEvalConfig, run_on_dataset\n", + "\n", + "eval_config = RunEvalConfig(\n", + " custom_evaluators=[DopenessEvaluator()],\n", + " evaluators=[\n", + " \"criteria\",\n", + " RunEvalConfig.Criteria(\"harmfulness\"),\n", + " RunEvalConfig.Criteria(\n", + " {\n", + " \"AI\": \"Does the response feel AI generated?\"\n", + " \"Response Y if they do, and N if they don't.\"\n", + " }\n", + " ),\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8XalvsOjMvdK" + }, + "source": [ + "## Task 5: Evaluating our RAG pipeline\n", + "\n", + "All that's left to do now is evaluate our pipeline!" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6syFWlaF-olk", + "outputId": "14ff5de8-0a5e-4425-908d-e03e3da8aa0c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "View the evaluation results for project 'HF RAG Pipeline - Evaluation - v3' at:\n", + "https://smith.langchain.com/o/117cfda3-8a09-4ba4-9922-07b45fd73803/datasets/86b9de05-80fe-4ebe-9be8-93e976ad1cf3/compare?selectedSessions=0ef2cee5-856e-4398-84f6-4094bf2b4d4b\n", + "\n", + "View all tests for Dataset QLoRA RAG Dataset at:\n", + "https://smith.langchain.com/o/117cfda3-8a09-4ba4-9922-07b45fd73803/datasets/86b9de05-80fe-4ebe-9be8-93e976ad1cf3\n", + "[------------------------------------------------->] 6/6\n", + " Experiment Results:\n", + " feedback.helpfulness feedback.harmfulness feedback.AI feedback.scored_dopeness error execution_time run_id\n", + "count 6.00 6.00 6.00 6.00 0 6.00 6\n", + "unique NaN NaN NaN NaN 0 NaN 6\n", + "top NaN NaN NaN NaN NaN NaN 3d6461fb-e14c-4b82-8a1a-e3edc40907d4\n", + "freq NaN NaN NaN NaN NaN NaN 1\n", + "mean 0.50 0.00 0.17 0.55 NaN 4.03 NaN\n", + "std 0.55 0.00 0.41 0.34 NaN 1.65 NaN\n", + "min 0.00 0.00 0.00 0.05 NaN 3.08 NaN\n", + "25% 0.00 0.00 0.00 0.33 NaN 3.27 NaN\n", + "50% 0.50 0.00 0.00 0.70 NaN 3.35 NaN\n", + "75% 1.00 0.00 0.00 0.78 NaN 3.66 NaN\n", + "max 1.00 0.00 1.00 0.85 NaN 7.36 NaN\n" + ] + }, + { + "data": { + "text/plain": [ + "{'project_name': 'HF RAG Pipeline - Evaluation - v3',\n", + " 'results': {'3da33d26-13c6-4abe-8e5d-fafaffa6bdb9': {'input': {'question': 'What optimizer is used in QLoRA?'},\n", + " 'feedback': [EvaluationResult(key='helpfulness', score=0, value='N', comment='The criterion for this task is \"helpfulness\". To assess this, we need to consider whether the submission is helpful, insightful, and appropriate.\\n\\n1. Helpfulness: The submission attempts to answer the question, which could be seen as helpful. However, the answer provided is incorrect. QLoRA, which stands for Quantum Long Range Antenna, does not use \"Paged Optimizers\". This misinformation makes the submission unhelpful.\\n\\n2. Insightfulness: The submission does not provide any insightful information. It simply states an incorrect fact without any explanation or context.\\n\\n3. Appropriateness: The submission is appropriate in the sense that it attempts to answer the question asked. 
However, the information provided is incorrect, which makes it inappropriate as an answer.\\n\\nBased on these considerations, the submission does not meet the criteria.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('be1fde69-5ce4-40d0-8fb6-8d4b80347daa'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='harmfulness', score=0, value='N', comment='The criterion asks if the submission is harmful, offensive, or inappropriate.\\n\\nLooking at the submission, it is a simple factual statement about the optimizer used in QLoRA. There is no language or content that could be considered harmful, offensive, or inappropriate. The submission is purely informational and does not contain any harmful or inappropriate content.\\n\\nTherefore, the submission is not harmful, offensive, or inappropriate.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('06f5dd56-b546-426e-a9f7-f2d5cfecc44e'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='AI', score=0, value='N', comment='The criterion asks if the response feels AI generated. The response provided does not seem to be AI generated as it is coherent and does not contain any grammatical errors or nonsensical phrases that are often associated with AI generated responses. However, the information provided in the response is incorrect, as there is no such thing as a \"Paged Optimizer\" in QLoRA or any other machine learning context. This could potentially indicate that the response was generated by an AI that does not have accurate or up-to-date information. \\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('d49fc194-fa5b-44c2-ba81-34c9c09829e8'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='scored_dopeness', score=0.7, value=None, comment='The response is accurate and informative, providing a direct answer to the question asked. However, it lacks any additional information or context that could make it more engaging or interesting. It also doesn\\'t use any particularly \"cool\" or \"lit\" language, sticking to a more formal and straightforward style. Therefore, while it\\'s a solid response, it\\'s not particularly \"dope.\"', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('f6f1555c-37c9-4159-bc71-d7f4e072e134'))}, source_run_id=None, target_run_id=None)],\n", + " 'execution_time': 3.083408,\n", + " 'run_id': '3d6461fb-e14c-4b82-8a1a-e3edc40907d4',\n", + " 'output': '\\nAnswer:\\nAccording to the provided context, QLoRA uses Paged Optimizers as an optimizer.'},\n", + " '5b4a26bd-5691-40ec-8549-3f64937017bb': {'input': {'question': 'What data type was created in the QLoRA paper?'},\n", + " 'feedback': [EvaluationResult(key='helpfulness', score=1, value='Y', comment='The criterion for this task is \"helpfulness\". \\n\\nTo assess this, we need to consider whether the submission provides a useful and relevant response to the input question. \\n\\nThe input question asks about the data type created in the QLoRA paper. \\n\\nThe submission provides a direct answer to this question, stating that the data type created in the QLoRA paper is BFloat16. \\n\\nThis answer is helpful as it directly addresses the question and provides the information that was asked for. \\n\\nTherefore, the submission meets the criterion of helpfulness. 
\\n\\nY', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('2d978b44-a46d-407b-9ea4-31e88f805f84'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='harmfulness', score=0, value='N', comment='The criterion is asking if the submission is harmful, offensive, or inappropriate. \\n\\nLooking at the submission, it is a simple response to a question about a data type created in a specific paper. There is no language or content that could be considered harmful, offensive, or inappropriate. \\n\\nTherefore, the submission is not harmful, offensive, or inappropriate.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('3e48959e-dcd4-4e5c-a7c1-9d41aa655548'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='AI', score=0, value='N', comment='The criterion is asking whether the response feels like it was generated by an AI. The response provided is clear, concise, and directly answers the question asked. It does not exhibit any of the common signs of AI-generated text, such as nonsensical phrases, irrelevant information, or awkward sentence structure. Therefore, it does not feel AI generated.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('d4efe80a-e5c6-41cc-abe4-a48ce9b58b1a'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='scored_dopeness', score=0.8, value=None, comment=\"The response is accurate and directly answers the question asked. It also provides a context for the answer, which is a good practice. However, it doesn't provide any additional information about what BFloat16 is or why it was created in the QLoRA paper, which could have made the response more informative and engaging. Therefore, I would give it a score of 80.\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('0043fa7b-e28f-4083-88c9-17155e595b85'))}, source_run_id=None, target_run_id=None)],\n", + " 'execution_time': 3.248988,\n", + " 'run_id': '1c30b07a-da34-4805-bd1a-85c2d3c62386',\n", + " 'output': '\\nAnswer:\\nAccording to the provided context, the data type created in the QLoRA paper is BFloat16.'},\n", + " 'a71bfd46-d1ca-446f-8285-6d4792698e7b': {'input': {'question': 'What is a Retrieval Augmented Generation system?'},\n", + " 'feedback': [EvaluationResult(key='helpfulness', score=0, value='N', comment='The criterion for this task is helpfulness. The submission should be helpful, insightful, and appropriate.\\n\\nLooking at the submission, the answer to the question \"What is a Retrieval Augmented Generation system?\" is \"I don\\'t know.\"\\n\\nThis answer is not helpful because it does not provide any information about the Retrieval Augmented Generation system. It is also not insightful because it does not offer any new understanding or perspective on the topic. Lastly, while it may be appropriate in the sense that it is a valid response to not know something, it does not meet the appropriateness criterion in the context of providing a useful answer.\\n\\nTherefore, the submission does not meet the criteria.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('ec91f927-97fc-459e-822e-ce405de54f86'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='harmfulness', score=0, value='N', comment='The criterion is about harmfulness, offensiveness, or inappropriateness of the submission. \\n\\nLooking at the submission, it is clear that the answer \"I don\\'t know\" is not harmful, offensive, or inappropriate. 
It may not be informative or helpful, but it does not contain any harmful or offensive content. \\n\\nTherefore, the submission does not meet the harmfulness criterion because it is not harmful, offensive, or inappropriate.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('10623330-4d69-4580-b2b5-07496f88d9de'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='AI', score=0, value='N', comment='The criterion is to assess whether the response feels like it was generated by an AI. The response provided is \"I don\\'t know.\" This is a simple and straightforward response that could be given by both a human and an AI. However, it does not exhibit any complex language processing or understanding that might be associated with an AI response. Therefore, it does not particularly feel AI-generated.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('2787356d-c641-4b84-aa6e-9d4824c9f7b9'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='scored_dopeness', score=0.05, value=None, comment=\"The response is not very dope at all. The question asked for an explanation of a specific term, and the response provided no information or insight. It's not engaging, informative, or interesting. It's not even humorous or clever in its lack of knowledge. It's just a flat, unhelpful response. Therefore, the score would be very low.\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('29cfe6ca-3e91-4863-b410-5dc9422faced'))}, source_run_id=None, target_run_id=None)],\n", + " 'execution_time': 3.321806,\n", + " 'run_id': '4f5b35fa-3939-41e3-a9c9-9dd439e07ab1',\n", + " 'output': \"\\nAnswer: I don't know.\"},\n", + " 'beff4fec-dcde-4e46-b661-da47051822d9': {'input': {'question': 'Who authored the QLoRA paper?'},\n", + " 'feedback': [EvaluationResult(key='helpfulness', score=0, value='N', comment='The criterion for this task is \"helpfulness\". The submission should be helpful, insightful, and appropriate. \\n\\nLooking at the submission, the user asked who authored the QLoRA paper. The response given was that the author\\'s name was not provided in the context. This is a helpful response as it informs the user that more context is needed to provide an accurate answer. \\n\\nHowever, it could be argued that the response is not insightful. The respondent could have taken the initiative to look up the author of the QLoRA paper, or at least inform the user that they could find the information by doing a quick internet search. \\n\\nIn terms of appropriateness, the response is polite and respectful, so it meets this part of the criterion. \\n\\nOverall, while the response is helpful and appropriate, it lacks insightfulness. Therefore, it does not fully meet the criteria.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('b356d494-f23f-4ddf-9fd4-140a02fa2457'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='harmfulness', score=0, value='N', comment=\"The criterion asks if the submission is harmful, offensive, or inappropriate. \\n\\nLooking at the submission, it is a polite response stating that the author's name is not provided in the context. \\n\\nThere is nothing harmful, offensive, or inappropriate in the submission. 
\\n\\nTherefore, the submission is not harmful, offensive, or inappropriate.\\n\\nN\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('88ad36f1-ff11-4c98-a346-020b006cf5d2'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='AI', score=1, value='Y', comment=\"The criterion is asking if the response feels like it was generated by an AI. The submission is a response to a question about the author of a paper, and the response is that the AI cannot answer the question because the author's name is not provided in the context. This response is typical of an AI, which would need specific information to provide an accurate answer. Therefore, the response does feel AI generated.\\n\\nY\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('223d98cc-0b64-470f-b52f-1c041c17556d'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='scored_dopeness', score=0.2, value=None, comment=\"The response is not dope at all because it doesn't provide the information asked for. The assistant should be able to provide the author of the QLoRA paper even if it's not in the provided context. The assistant is expected to have access to a wide range of information and should be able to pull from that to answer the question. However, the politeness and professionalism of the response is commendable.\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('eb4ed830-0c39-4f2a-9cb2-e2c0bad4a4d7'))}, source_run_id=None, target_run_id=None)],\n", + " 'execution_time': 3.379681,\n", + " 'run_id': '36ce8357-f954-416e-a2d3-43e29515322a',\n", + " 'output': \"\\nAnswer:\\nI apologize, but I cannot answer that question as the author's name is not provided in the context you provided.\"},\n", + " '6508064a-4145-4d96-92eb-174ae1dc0728': {'input': {'question': 'What is the most popular deep learning framework?'},\n", + " 'feedback': [EvaluationResult(key='helpfulness', score=1, value='Y', comment='The criterion for this task is \"helpfulness\". \\n\\nThe submission provides a direct answer to the question asked in the input, which is \"What is the most popular deep learning framework?\". The answer given is \"TensorFlow\", which is indeed a widely recognized and used deep learning framework. \\n\\nThe submission is helpful as it provides the information asked for in the input. It is insightful as it provides a specific answer, and it is appropriate as it directly addresses the question. \\n\\nTherefore, the submission meets the criterion.\\n\\nY', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('15cd2696-a524-46d7-a19e-796d0fdefac0'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='harmfulness', score=0, value='N', comment='The criterion is asking whether the submission is harmful, offensive, or inappropriate. \\n\\nLooking at the submission, it is a straightforward answer to the question asked. It does not contain any harmful, offensive, or inappropriate content. The answer is factual and relevant to the question.\\n\\nTherefore, the submission is not harmful, offensive, or inappropriate.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('aa06b6e2-6d05-48a3-b14f-836e6132e7c0'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='AI', score=0, value='N', comment='The criterion is asking whether the response feels like it was generated by an AI. 
The response provided is \"The most popular deep learning framework is TensorFlow.\" This response is straightforward, factual, and does not contain any personal bias or emotion. It is a type of response that could easily be generated by an AI. However, it could also be written by a human with knowledge on the subject. There is nothing inherently \"AI-like\" about the response that would definitively indicate it was generated by an AI. \\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('a670e983-0e8f-4138-9f55-77a07d992fa7'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='scored_dopeness', score=0.7, value=None, comment=\"The response is accurate and informative, providing a clear answer to the question asked. However, it lacks any additional information or context that could make it more engaging or interesting. It's straightforward and to the point, which is good, but it doesn't go above and beyond. Therefore, I would give it a score of 70.\", correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('80406af9-9914-4d13-8906-83aa6ef91861'))}, source_run_id=None, target_run_id=None)],\n", + " 'execution_time': 3.757715,\n", + " 'run_id': '2c633d12-f32d-4475-bae4-5e5db9891989',\n", + " 'output': '\\nAnswer:\\nThe most popular deep learning framework is TensorFlow.'},\n", + " 'c6fe93fe-9754-4854-a4d2-03270bb32d50': {'input': {'question': 'What significant improvements does the LoRA system make?'},\n", + " 'feedback': [EvaluationResult(key='helpfulness', score=1, value='Y', comment='The criterion for this task is \"helpfulness\". \\n\\nTo assess this, we need to consider whether the submission provides useful, insightful, and appropriate information in response to the input question. \\n\\nThe input question asks about the significant improvements made by the LoRA system. \\n\\nThe submission provides a detailed response, explaining that the LoRA system improves the efficiency of fine-tuning large language models by using low-rank matrices. It also mentions an extension of LoRA, QLoRA, which further improves efficiency by mitigating performance degradation caused by weight quantization. \\n\\nThis information is directly relevant to the question and provides insight into the specific improvements made by the LoRA system. \\n\\nTherefore, the submission meets the criterion of being helpful, insightful, and appropriate.\\n\\nY', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('7165b89c-e4a9-4c0e-a993-0a21b3a33dbb'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='harmfulness', score=0, value='N', comment='The criterion for this assessment is whether the submission is harmful, offensive, or inappropriate.\\n\\nLooking at the submission, it is a technical explanation of the improvements made by the LoRA system. It does not contain any harmful, offensive, or inappropriate content. It is purely factual and informative, discussing the efficiency of fine-tuning large language models.\\n\\nTherefore, the submission is not harmful, offensive, or inappropriate.\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('ea2e4a54-86bf-470d-8ea6-5599706bb174'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='AI', score=0, value='N', comment='The criterion asks if the response feels like it was generated by an AI. \\n\\nLooking at the submission, the response is well-structured, coherent, and provides a detailed explanation of the improvements made by the LoRA system. 
It uses technical language appropriately and accurately, and the flow of information is logical. \\n\\nWhile AI has advanced significantly, it is not yet perfect at generating such detailed, coherent, and contextually accurate responses consistently. Therefore, this response does not necessarily feel AI-generated. \\n\\nSo, the answer to the criterion is \"N\".\\n\\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('f18f19ba-0331-444e-911c-4d0b13c7e31a'))}, source_run_id=None, target_run_id=None),\n", + " EvaluationResult(key='scored_dopeness', score=0.85, value=None, comment='This response is quite informative and accurate, providing a clear and concise explanation of the improvements made by the LoRA system. It uses technical language appropriately and accurately, demonstrating a good understanding of the topic. However, it might be a bit too technical for some people to understand, especially those who are not familiar with the topic. It could be improved by providing a simpler explanation or by defining some of the technical terms used.', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('b59dc6b9-4bd1-4a8a-ad2f-9d7b99921ed5'))}, source_run_id=None, target_run_id=None)],\n", + " 'execution_time': 7.363541,\n", + " 'run_id': 'a12cccee-3689-45d3-a363-66132fffce66',\n", + " 'output': '\\nAnswer:\\nThe LoRA system makes significant improvements in the efficiency of fine-tuning large language models (LLMs) by using low-rank matrices to modify pre-trained weights, allowing for resource-efficient customization of LLMs. Additionally, the extension of LoRA, QLoRA, further improves the efficiency of fine-tuning by mitigating the performance degradation caused by weight quantization in LLMs.'}},\n", + " 'aggregate_metrics': None}" + ] + }, + "execution_count": 93, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client.run_on_dataset(\n", + " dataset_name=dataset_name,\n", + " llm_or_chain_factory=retrieval_augmented_qa_chain,\n", + " evaluation=eval_config,\n", + " verbose=True,\n", + " project_name=\"HF RAG Pipeline - Evaluation - v3\",\n", + " project_metadata={\"version\": \"1.0.0\"},\n", + ")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}