CrawlCraft 🕷️ Webcrawler

This system crawls data from a provided web URL, vectorizes the crawled data, and lets users submit queries to retrieve and visualize relevant information.

Requirements

  • Implement a web crawling mechanism in JavaScript or TypeScript to fetch and extract data from any provided URL or website.
  • Ensure that your solution is capable of handling dynamic content or elements loaded asynchronously on the web page.
  • Implement a method to convert the crawled textual data into vectorized representations. Choose an appropriate vectorization technique (e.g., word embeddings) and store the data in a vector DB.
  • Develop a system where the user submits text queries, which are vectorized with the same technique, and the top 3 most relevant crawled results are returned.

Architecture Diagram

Here's a description of the architecture diagram, incorporating the specified components:

  1. Data Ingestion

    • Puppeteer: Fetches the web page content from the given URL and passes it to the next component.
  2. Tokenization and OpenAI Embeddings

    • LangChain: Tokenizes the text into meaningful units and generates OpenAI embeddings for the tokens.
  3. Embedding Storage

    • Pinecone: Used as the vector database; it stores OpenAI embeddings efficiently and enables fast similarity search.
  4. Query Processing and Retrieval from the DB

    • Query Input: Receives the user's query.
    • Embedding Generation: Generates OpenAI embeddings for the query.
    • Similarity Search: Uses Pinecone to find the top 3 matching embeddings.
  5. Response Generation

    • Data Retrieval: Retrieves the corresponding web page content based on the matched embeddings.
    • OpenAI LLM: Processes the retrieved content and the query, then generates an informative and comprehensive answer.
  6. NodeJs Server

    • Express: The whole pipeline runs on top of an Express server, which provides the interface for using the system. (Minimal code sketches of this flow follow this list.)
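
To make steps 1–3 concrete, here is a minimal sketch of the ingest path (crawl, split, embed, store). It is a hedged illustration rather than the repository's actual code: the ingest function name is hypothetical, and the @langchain/* import paths assume a recent split-package LangChain release (older releases expose the same classes under langchain/...).

  import puppeteer from "puppeteer";
  import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
  import { OpenAIEmbeddings } from "@langchain/openai";
  import { PineconeStore } from "@langchain/pinecone";
  import { Pinecone } from "@pinecone-database/pinecone";

  // Hypothetical ingestion helper: crawl a URL, chunk the text, embed the
  // chunks with OpenAI, and store the vectors in the Pinecone index.
  export async function ingest(url: string): Promise<void> {
    // 1. Data Ingestion: render the page headlessly so asynchronously
    //    loaded content is present before the text is extracted.
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    const text = await page.evaluate(() => document.body.innerText);
    await browser.close();

    // 2. Tokenization: split the crawled text into overlapping chunks.
    const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
    const docs = await splitter.createDocuments([text], [{ source: url }]);

    // 3. Embedding Storage: generate OpenAI embeddings (the API key is read
    //    from OPENAI_API_KEY) and upsert them into Pinecone.
    const embeddings = new OpenAIEmbeddings();
    const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
    const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX!);
    await PineconeStore.fromDocuments(docs, embeddings, { pineconeIndex });
  }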
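
A matching sketch of steps 4 and 5: embed the query, retrieve the top 3 chunks from Pinecone, and let the LLM answer from them. The answer helper and its prompt wording are likewise assumptions, not the project's code.

  import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
  import { PineconeStore } from "@langchain/pinecone";
  import { Pinecone } from "@pinecone-database/pinecone";

  // Hypothetical query helper: embed the user query, run a top-3 similarity
  // search in Pinecone, and have the LLM answer from the retrieved context.
  export async function answer(query: string): Promise<string> {
    const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
    const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX!);

    // 4. Query Processing and Retrieval: the store embeds the query with the
    //    same OpenAI embeddings and returns the 3 most similar chunks.
    const store = await PineconeStore.fromExistingIndex(new OpenAIEmbeddings(), { pineconeIndex });
    const matches = await store.similaritySearch(query, 3);

    // 5. Response Generation: pass the retrieved chunks plus the query to the LLM.
    const context = matches.map((doc) => doc.pageContent).join("\n---\n");
    const llm = new ChatOpenAI();
    const response = await llm.invoke(
      `Answer the question using only the context below.\n\nContext:\n${context}\n\nQuestion: ${query}`
    );
    return typeof response.content === "string" ? response.content : JSON.stringify(response.content);
  }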
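
Step 6 exposes both flows over HTTP. A hypothetical Express wiring, assuming the ingest and answer helpers sketched above live in local modules and that /crawl and /query are the route names (the actual routes may differ):

  import express from "express";
  import { ingest } from "./ingest";   // ingestion sketch above (hypothetical module path)
  import { answer } from "./answer";   // query sketch above (hypothetical module path)

  const app = express();
  app.use(express.json());

  // Crawl and index a URL: POST /crawl { "url": "https://example.com" }
  app.post("/crawl", async (req, res) => {
    await ingest(req.body.url);
    res.json({ status: "indexed" });
  });

  // Ask a question against the indexed content: POST /query { "query": "..." }
  app.post("/query", async (req, res) => {
    res.json({ answer: await answer(req.body.query) });
  });

  app.listen(Number(process.env.PORT ?? 8001));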

Visual Diagram (Conceptual)

View on Eraser

Key Considerations

  • Embedding Dimensionality: Choose appropriate dimensionality for embeddings, balancing accuracy and storage/search efficiency.
  • Similarity Metric: Select a suitable similarity metric for the Pinecone search (e.g., cosine similarity); see the index-creation sketch after this list.
  • Query Understanding: Ensure the LLM can effectively understand and interpret user queries.
  • Answer Generation: Refine LLM prompts and parameters for generating accurate and relevant answers.
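
The first two considerations surface when the Pinecone index is created: its dimension must match the embedding model, and its metric is the one used for search. A minimal sketch with the current Pinecone Node client; the 1536 dimension assumes OpenAI's text-embedding-ada-002, and the serverless cloud/region values are assumptions (the free gcp-starter environment referenced in the .env below is set up differently, e.g. from the Pinecone console).

  import { Pinecone } from "@pinecone-database/pinecone";

  // One-time index setup: the dimension must equal the embedding
  // dimensionality, and the metric is the similarity metric used for search.
  async function createIndex(): Promise<void> {
    const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
    await pinecone.createIndex({
      name: process.env.PINECONE_INDEX ?? "web-crawler",
      dimension: 1536,   // e.g. OpenAI text-embedding-ada-002 (assumption)
      metric: "cosine",  // cosine similarity, as suggested above
      spec: { serverless: { cloud: "aws", region: "us-east-1" } }, // assumption
    });
  }

  createIndex().catch(console.error);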

Build and Run

  1. Prerequisites

    • Make sure you have Node.js >=18.0.0 and npm >=8.0.0 <10.0.0 installed.
    • Create a Pinecone vector database account and generate an API key (pinecone-api-link). Pinecone also offers a free-tier database that can be used without adding a credit card.
    • Create an OpenAI account and generate an API key (openapi-link).
  2. Install node js dependencies

    • Go to the project's root folder and run
    $ npm install
  3. Configuration

    • Create a .env file in the project's root folder using the template below and fill in the API keys. You can tweak the config to your needs.
      ENV=dev
      PORT=8001
      URL=http://127.0.0.1:8001
      OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY
      PINECONE_API_KEY=YOUR_PINECONE_API_KEY
      PINECONE_ENVIRONMENT=gcp-starter
      PINECONE_INDEX=web-crawler
      
  4. Run Tests

    • To run the tests, use
      $ npm run test
  5. Run App

    • To start the Express server, run

      $ npm start

      By default, the server runs on port 8001. Open http://127.0.0.1:8001 in the browser.

    • To run the Express server in dev mode, run

      $ npm run dev

    This is how the UI looks in the browser: Screenshot
