
Exploratory Data Analysis With Pandas Dataframe Agent

This repository contains the script for performing exploratory data analysis (EDA) using the Pandas Dataframe Agent from LangChain. Instead of doing EDA manually, you give the agent a prompt; it decides which code to run for the analysis and returns the answer. The use case also covers EDA across multiple dataframes.

What you need to follow this tutorial:

Understand how Python works
Understand how LangChain works
OpenAI API key (GPT4 is not required)

What Are LangChain Agents?

Agents use an LLM to determine which actions to take and in what order. They can have access to a series of tools and can decide which ones to call according to given user input (prompt)¹.

Text in, text out. Prompt in, answer out.

There are two types of agents:

action agents and
plan-and-execute agents

Action agents are more straightforward, executing one action at a time.
Plan-and-execute agents first plan which actions to take, and then execute them one at a time.
For small tasks, use action agents.
For longer tasks, use plan-and-execute agents to keep the objective in focus, and then execute each step using action agents.

High-level pseudocode of an Action Agent:

The user gives a prompt and lists available tools
The agent decides which tool to use based on the prompt
The tool is called and returns its output to the agent
The agent decides the next step to take
Iteration stops when the agent has enough information, and the agent responds to the user with an answer
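
The loop above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `run_action_agent`, `scripted_decide`, and the toy `calculator` tool are hypothetical names, and `scripted_decide` stands in for the LLM that would normally choose the next action.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
# A decision is ("call", tool_name, tool_input) or ("finish", "", final_answer).
Decision = Tuple[str, str, str]
History = List[Tuple[str, str]]


def run_action_agent(
    prompt: str,
    tools: Dict[str, Tool],
    decide: Callable[[str, History], Decision],
    max_steps: int = 5,
) -> str:
    """Run one action at a time until the decision function says to finish."""
    history: History = []
    for _ in range(max_steps):
        kind, name, text = decide(prompt, history)
        if kind == "finish":
            return text
        # Call the chosen tool and hand its output back to the agent.
        output = tools[name](text)
        history.append((name, output))
    return "Stopped: step limit reached."


def scripted_decide(prompt: str, history: History) -> Decision:
    """Hypothetical stand-in for the LLM: use the calculator once, then answer."""
    if not history:
        return ("call", "calculator", "6*7")
    return ("finish", "", f"The answer is {history[-1][1]}.")


def calculator(expression: str) -> str:
    """Toy tool that only understands 'a*b' with integer operands."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


toy_tools = {"calculator": calculator}
answer = run_action_agent("What is 6 times 7?", toy_tools, scripted_decide)
print(answer)
```

The `max_steps` cap is a design choice of this sketch: it keeps the loop from running forever if the decision function never finishes.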

High-level pseudocode of a Plan-and-Execute Agent:

The user gives a prompt and lists available tools
The planner lists out the steps to take
The executor goes through the steps and executes the actions one by one
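
The same idea for a plan-and-execute agent, again as a sketch under stated assumptions: `toy_planner` and `toy_executor` are hypothetical stand-ins for the LLM-backed planner and executor, and none of these names come from LangChain.

```python
from typing import Callable, List


def run_plan_and_execute(
    prompt: str,
    planner: Callable[[str], List[str]],
    executor: Callable[[str, List[str]], str],
) -> List[str]:
    """Plan every step up front, then execute the steps one by one."""
    steps = planner(prompt)
    results: List[str] = []
    for step in steps:
        # Each step sees a copy of the results produced so far.
        results.append(executor(step, list(results)))
    return results


def toy_planner(prompt: str) -> List[str]:
    """Hypothetical planner: returns a fixed plan regardless of the prompt."""
    return ["load the data", "count the rows", "report the count"]


def toy_executor(step: str, previous: List[str]) -> str:
    """Hypothetical executor: records the step and how many steps preceded it."""
    return f"done: {step} (after {len(previous)} earlier steps)"


results = run_plan_and_execute("How many rows are there?", toy_planner, toy_executor)
for line in results:
    print(line)
```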

What Are the Different Types of Agents?

Agent types available in LangChain²:

zero-shot-react-description
- This agent uses the ReAct framework to determine which tool to use based solely on the tool’s description.
- Must include tool description.
- Most examples use this agent type.
What is the ReAct framework?³
- Synergize reasoning and acting in language models.
- Reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with and gather additional information from external sources such as knowledge bases or environments.
Example: https://python.langchain.com/en/latest/modules/agents/getting_started.html
react-docstore
- This agent uses the ReAct framework to interact with a docstore.
- Must include a Search tool (to search for a document) and a Lookup tool (to look up a term in the document that was found).
What is a docstore?⁴
- It can be 1) a document in the form of a dictionary loaded in memory, or 2) the Wikipedia API.
Example: https://python.langchain.com/en/latest/modules/agents/agents/examples/react.html
self-ask-with-search
- This agent uses a single tool, which should be named Intermediate Answer. This tool should be able to look up factual answers to questions.
- Uses Google search API as the tool.
- What is the self-ask model?⁵ The model explicitly asks itself (and then answers) follow-up questions before answering the initial question.
Example: https://python.langchain.com/en/latest/modules/agents/agents/examples/self_ask_with_search.html
conversational-react-description
- It uses the ReAct framework to decide which tool to use and uses memory to remember the previous conversation interactions.
- The prompt is designed to make the agent helpful and conversational.
Example: https://python.langchain.com/en/latest/modules/agents/agents/examples/conversational_agent.html

What Are Tools?

Tools are the means by which an agent interacts with the outside world⁶.

Tools categorized by usage:

Data access: ArXiv, PubMed, Wikipedia, Google Places, OpenWeatherMap
Web search: Bing, Brave, DuckDuckGo, Google, Google Serper API, Metaphor Search, SearxNG, Serp API, YouTube
LLMs & ML functions: ChatGPT, Gradio, HuggingFace
Automation: Apify, AWS Lambda API, Shell, File System, IFTTT WebHooks, Python REPL, Twilio, Zapier, GraphQL, Requests, SceneXplain, Wolfram Alpha
Special: Human as a tool

Use case (Part 1): Use the serpapi⁷ tool to ask a question, and the llm-math tool for calculation (with zero shot agent type).
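
A minimal sketch of what Part 1 might look like, assuming the legacy LangChain agent API (`load_tools`, `initialize_agent`, `AgentType`) and that `OPENAI_API_KEY` and `SERPAPI_API_KEY` are set in the environment. It is not necessarily the code in this repository; the function name `build_search_math_agent` is hypothetical, and import paths differ between LangChain versions.

```python
# Tool names as given in the use case description.
TOOL_NAMES = ["serpapi", "llm-math"]


def build_search_math_agent(temperature: float = 0.0):
    """Build a zero-shot ReAct agent with a search tool and a math tool.

    Assumption: legacy LangChain import paths. Newer releases may have
    moved or deprecated these functions.
    """
    # Imported lazily so the module can be loaded without LangChain installed.
    from langchain.agents import AgentType, initialize_agent, load_tools
    from langchain.llms import OpenAI

    llm = OpenAI(temperature=temperature)
    # llm-math needs an LLM to translate the question into an expression.
    tools = load_tools(TOOL_NAMES, llm=llm)
    return initialize_agent(
        tools,
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
    )


# Usage (requires API keys and network access, so it is left commented out):
# agent = build_search_math_agent()
# agent.run("<a question that needs a web search followed by a calculation>")
```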

What Are Toolkits?

Toolkits are groups of tools designed for a specific use case⁸. For example, for an agent to interact with a SQL database in the best way it may need access to one tool to execute queries and another tool to inspect tables.

Pandas Dataframe Agent⁹: an agent used to interact with a pandas dataframe. It is mostly optimized for question answering. This agent calls the Python agent under the hood, which executes LLM generated Python code.

It can be used for a single dataframe or for multiple dataframes, for usage like:

Question and answers as part of exploratory data analysis
Compare differences between two dataframes
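
A hedged sketch of creating the agent for one or several dataframes. `create_pandas_dataframe_agent` is the function this README names; its import location depends on the installed version (older releases expose it from `langchain.agents`, later ones from `langchain_experimental.agents`), and recent releases may additionally require an explicit opt-in flag for code execution. The helper names `agent_input` and `build_dataframe_agent` are hypothetical.

```python
def agent_input(*frames):
    """Return a single dataframe as-is, or several dataframes as a list."""
    if not frames:
        raise ValueError("at least one dataframe is required")
    return frames[0] if len(frames) == 1 else list(frames)


def build_dataframe_agent(*frames, temperature: float = 0.0):
    """Create a pandas dataframe agent over one or more dataframes.

    Assumption: import paths and keyword arguments vary by LangChain version.
    """
    # Imported lazily so the module can be loaded without LangChain installed.
    from langchain.llms import OpenAI

    try:
        from langchain.agents import create_pandas_dataframe_agent
    except ImportError:
        from langchain_experimental.agents import create_pandas_dataframe_agent

    llm = OpenAI(temperature=temperature)
    return create_pandas_dataframe_agent(llm, agent_input(*frames), verbose=True)


# Usage (requires an OpenAI API key, so it is left commented out):
# agent = build_dataframe_agent(df)            # single dataframe
# agent = build_dataframe_agent(df_a, df_b)    # multiple dataframes
# agent.run("How many rows and columns are there?")
```

Because the agent executes LLM-generated Python code, run it only on data and in environments where that is acceptable.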

Use case (Part 2): Exploratory data analysis with data science salaries dataset

Dataset: Salaries of different data science fields from 2020 to 2023¹⁰
Basic data exploration (shape, isnull, columns, nunique)
Multiple-steps data exploration (sort, slice, filter, multiple-step calculations, groupby)
Multiple dataframes (comparison, multiple-step calculations)
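
For checking the agent's answers by hand, the basic exploration steps map to a few pandas calls. The sketch below runs on a tiny made-up frame, not the salaries dataset: `job_title` and the `FT` value appear in this README, while `employment_type`, `salary`, and all row values are illustrative assumptions.

```python
import pandas as pd


def basic_profile(df: pd.DataFrame) -> dict:
    """Collect the basic facts: shape, columns, null counts, unique counts."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "null_counts": df.isnull().sum().to_dict(),
        "unique_counts": df.nunique().to_dict(),
    }


# Toy data for illustration only; these are not rows from the real dataset.
toy = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "Data Engineer", "Data Scientist", None],
        "employment_type": ["FT", "FT", "PT", "FT"],
        "salary": [100, 120, 60, 90],
    }
)

profile = basic_profile(toy)
for key, value in profile.items():
    print(key, value)
```

Note that `nunique()` ignores missing values by default, so a column's unique count can be lower than expected when it contains nulls.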

Observations from the use case:

The pandas dataframe agent is useful for data exploration that involves only Q&A. It is not ready for data manipulation or any action that changes the dataframe: the agent comes up with the code to manipulate the data, but the changes are not applied. For example, replacing values, creating a new dataframe, or saving plots does not work, so it is better to do those manually.
Double-check the answers manually. The use case found and demonstrates two incorrect answers, and you might get a different outcome than mine.
The agent can recognize data values from context without an explicit description, for example FT = Full-time and job_title = job designation.
It does not retain much of the conversation. When a prompt contains more than two steps, it does not handle them perfectly, and something might slip through the cracks.
You can save time on data manipulation by prompting for the code and applying it manually, instead of looking up the code through Google search or the documentation.

Possible extension of the use case: to train a neural net using the Python Agent Toolkit.

Setting Up the Experiment

You can find the code for setting up and running the experiment in the pandas_agent.py file in this repository. Note: You'll need Python 3.9 or higher and an up-to-date version of the LangChain package to use the create_pandas_dataframe_agent function.

Below is a brief description of the key steps in the code:

Load the OpenAI and Serp API tokens from the .env file.
Execute use case 1: ask a question with the serp api and llm-math tools.
Execute use case 2:
- Execute basic data exploration (shape, isnull, columns, nunique)
- Execute multiple-steps data exploration (sort, slice, filter, multiple-step calculations, groupby)
- Compare multiple dataframes (comparison, multiple-step calculations)
Compare the output from Step 2 and Step 3.
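
The first step, loading tokens from the .env file, can be sketched with the standard library alone. This is an illustration that assumes simple `KEY=VALUE` lines; the script itself may well use a dedicated package for this. `load_env_file` is a hypothetical helper, the key values are placeholders, and `OPENAI_API_KEY` and `SERPAPI_API_KEY` are assumed to be the variable names the LangChain wrappers read.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path, environ=None) -> dict:
    """Parse KEY=VALUE lines and copy them into environ (os.environ by default)."""
    target = os.environ if environ is None else environ
    loaded = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    target.update(loaded)
    return loaded


# Demonstration with a temporary file and a plain dict instead of os.environ.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        'OPENAI_API_KEY="sk-placeholder"\n'
        "# comment lines are ignored\n"
        "SERPAPI_API_KEY=serp-placeholder\n",
        encoding="utf-8",
    )
    demo_env = {}
    loaded = load_env_file(env_path, demo_env)

print(sorted(loaded))
```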

Footnotes