Dhanush Submission #2

Open
wants to merge 5 commits into
base: main
50 changes: 50 additions & 0 deletions dhanush submission/README.md
@@ -0,0 +1,50 @@
# Gradient Works Exercise

## Part-1

1. How many companies are in the dataset? <br>
There are 75 companies in the dataset.
2. How many unique URLs are in the dataset? <br>
There are 530 unique URLs in the dataset. Analyzing the URL prefixes shows that they come from 77 different domains. Two of the domains seem to have duplicate URLs.

![Qn_2](assets/Qn_2.png)
3. What is the most common chunk type? <br>
The most common chunk type is `header` with 549 occurrences.

![Qn_3](assets/Qn_3.png)

4. What is the distribution of chunk types by company? <br>
Please refer to the Jupyter notebook under the Notebooks folder for the distribution of chunk types by company.
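
The four questions above can be answered with a few counts over the dataset. The following is a minimal sketch using only the Python standard library, not the code from the notebook. The column names `company`, `url`, and `chunk_type` are assumptions, and the sample rows are made up purely to show the shape of the result.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical column names; adjust them to match the real CSV header.
COMPANY_COL = "company"
URL_COL = "url"
CHUNK_TYPE_COL = "chunk_type"


def summarize(rows):
    """Return the four Part-1 statistics for an iterable of dict rows."""
    rows = list(rows)
    companies = {r[COMPANY_COL] for r in rows}
    urls = {r[URL_COL] for r in rows}
    chunk_counts = Counter(r[CHUNK_TYPE_COL] for r in rows)

    # Count chunk types separately for each company.
    by_company = defaultdict(Counter)
    for r in rows:
        by_company[r[COMPANY_COL]][r[CHUNK_TYPE_COL]] += 1

    return {
        "n_companies": len(companies),
        "n_unique_urls": len(urls),
        "most_common_chunk_type": chunk_counts.most_common(1)[0],
        "chunk_types_by_company": {c: dict(k) for c, k in by_company.items()},
    }


# Tiny made-up sample, not taken from the real dataset.
SAMPLE_CSV = """company,url,chunk_type
acme,https://acme.example/a,header
acme,https://acme.example/a,paragraph
acme,https://acme.example/b,header
globex,https://globex.example/,header
"""

stats = summarize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(stats)
```

To run this against the real file, replace the `io.StringIO(SAMPLE_CSV)` argument with an opened file handle for the dataset.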

## Part-2 RAG

### Architecture Diagram
![Architecture Diagram](assets/Architecture_diagram.png)
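
As a rough illustration of the retrieval step that RAG pipelines generally rely on, the sketch below ranks stored embeddings by cosine similarity to a query embedding. It is not taken from this submission: the function name `top_k_cosine` is hypothetical, the three-dimensional vectors are made up, and the diagram above remains the authoritative description of the actual design. In practice the document matrix would be loaded from the saved NumPy file (for example with `numpy.load`) and the query vector would come from the same embedding model.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return indices and scores of the k rows most similar to query_vec."""
    # Normalize both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Sort descending by score and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Made-up 3-dimensional "embeddings"; real ones have far more dimensions.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = np.array([1.0, 0.1, 0.0])

idx, scores = top_k_cosine(query, docs, k=2)
print(idx, scores)
```

The chunks at the returned indices would then be passed to the language model as context for answering the question.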


## Steps to run the code
1. Create a `.env` file and place your OpenAI and Cohere API keys in it in this format:
```
OPENAI_API_KEY =
COHERE_API_KEY =
```
2. Install all the necessary libraries listed in `requirements.txt`.
```
pip install -r requirements.txt
```
3. Run `chunking.py` first; it converts the HTML content to text and saves the processed CSV file.
4. Run `embedding.py` next to generate embeddings and store them as a NumPy file.
5. The code is also exposed as an API using FastAPI. To start the API server, run the following command inside the `src` folder.
```
uvicorn main:app --reload --port 8080
```
This will start the API server at http://localhost:8080. <br>
6. Run `chat.py` next. It opens Streamlit in your browser, where you can ask questions about the provided CSV file.
```
streamlit run src/chat.py
```
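Step 3 above converts HTML content to text. The sketch below shows one way such a conversion can be done with Python's standard `html.parser` module. It is an illustration only: `chunking.py` may use a different library or approach, and the names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up HTML fragment for demonstration.
sample = "<h1>Pricing</h1><script>var x = 1;</script><p>Plans start at <b>$10</b> per month</p>"
print(html_to_text(sample))
```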
### Demo
![Demo](assets/demo.png)

> [!NOTE]
> The code is also available as a Jupyter notebook under the notebooks folder.
Binary file added dhanush submission/assets/Qn_2.png
Binary file added dhanush submission/assets/Qn_3.png
Binary file added dhanush submission/assets/demo.png