Skip to content

This repository contains a Python script that demonstrates how to perform web scraping using LangChain tools. In this example, we'll be scraping my college website https://www.mietjmu.in to extract information such as page titles, URLs, and content.

Notifications You must be signed in to change notification settings

mohammadshahidbeigh/web--scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

Web Scraping with LangChain Tool

Introduction

This repository contains a Python script that demonstrates how to perform web scraping using LangChain tools. Web scraping is the process of extracting data from websites, and it can be useful for various purposes such as data analysis, content aggregation, and research.

In this example, we'll be scraping my college website (MIET Jammu) to extract information such as page titles, URLs, and content.

Setup

Before running the script, make sure you have Python installed on your system. You'll also need to install the following dependencies:

  • beautifulsoup4: A Python library for pulling data out of HTML and XML files.
  • langchain_community: LangChain community tools for web scraping.

You can install these dependencies using pip:

pip install beautifulsoup4 langchain_community

Usage

  1. Clone the Repository: Clone this repository to your local machine.

  2. Navigate to the Directory: Open a terminal and navigate to the directory where you cloned the repository.

  3. Run the Script: Run the Python script (scrape_website.py) using the following command:

python scrape_website.py

This script will scrape the MIET Jammu website and save the scraped data to a text file (scraped_data.txt).

Customization

You can customize the scraping process by modifying the script according to your requirements. Here are a few customization options:

  • URL: Change the URL in the script to scrape a different website.
  • Extractor Function: Customize the custom_extractor function to extract specific content from the web pages.
  • Formatting: Modify the formatting rules to format the scraped data according to your preferences.

Issues and Feedback

If you encounter any issues or have feedback regarding the scraping process or the script, please open an issue on this repository. We welcome contributions and suggestions for improvement!

About

This repository contains a Python script that demonstrates how to perform web scraping using LangChain tools. In this example, we'll be scraping my college website https://www.mietjmu.in to extract information such as page titles, URLs, and content.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages