
Web Scraper 🤖

Learning Ruby one step at a time...

Important

This is my introduction to Ruby! I have experience with similar languages (e.g. Python, Kotlin, JavaScript), but I had never even seen Ruby syntax before making this, so it may not be my best work 😢.

A wise man once told me, "Learn Ruby, you'll like it."

Overview 🌍

This simple web scraper takes any user-supplied URL, scrapes every hyperlink from that page, and outputs them to a CSV file. It can easily be integrated into a machine learning project to routinely update a CSV dataset: to scrape different content, just change what the Nokogiri doc object looks for, as in the sketch below.
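
For example, to collect image sources instead of hyperlinks, only the CSS selector and the attribute read from each node need to change. Here is a minimal sketch of that idea (the 'img' selector and 'src' attribute are illustrative, not part of this repo's script):

    require 'nokogiri'
    require 'open-uri'

    # Illustrative tweak: grab image URLs instead of hyperlinks.
    # Only the selector ('img') and attribute ('src') differ from the link scraper.
    doc = Nokogiri::HTML(URI.open('https://www.bbc.com/news', "User-Agent" => "Mozilla/5.0"))
    image_urls = doc.css('img').map { |img| img['src'] }.compact
    puts image_urls.first(5)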

How to Use 🔧

  1. Install the necessary gems:

    These pre-built libraries provide the HTML-parsing and CSV-writing functionality the scraper relies on

    gem install nokogiri
    gem install csv
  2. Run the scraper script (a sample command-line invocation is shown after these steps):

     require 'nokogiri'
     require 'open-uri'
     require 'csv'
     require 'uri'

     # The URL and output file can be passed as command-line arguments;
     # otherwise the script asks for a URL interactively.
     url = ARGV[0] || begin
       puts "Please input the URL you want to scrape: "
       gets.chomp
     end
     output_file = ARGV[1] || 'scrapedData.csv'

     begin
       html = URI.open(url, "User-Agent" => "Mozilla/5.0")
       doc = Nokogiri::HTML(html)

       # Keep only anchors whose href is an absolute http(s) URL.
       links = doc.css('a')
       filtered_links = links.select { |link| link['href'] =~ /^http/ }

       CSV.open(output_file, 'wb') do |csv|
         csv << ['Index', 'Title', 'Link']
         filtered_links.each_with_index do |link, index|
           title = link.text.strip.empty? ? "No Title" : link.text.strip
           absolute_link = URI.join(url, link['href']).to_s
           csv << [index + 1, title, absolute_link]
         end
       end

       puts "Total links found: #{filtered_links.size}"
       puts "Links saved to #{output_file}"
     rescue OpenURI::HTTPError => e
       puts "HTTP Error: #{e.message}"
     rescue CSV::MalformedCSVError => e
       puts "CSV Error: #{e.message}"
     rescue StandardError => e
       puts "An error occurred: #{e.message}"
     end

    Here are some sample URLs you could use:

    https://www.bbc.com/news
    https://www.theweathernetwork.com/en
    https://github.com/
    
  3. Check the generated scrapedData.csv file for the scraped hyperlinks.
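
Assuming the script is saved as scraper.rb (the filename here is an assumption; use whatever the script file in this repo is called), it can be run interactively or with the URL and output file passed as arguments:

    # prompts for a URL, writes to scrapedData.csv
    ruby scraper.rb

    # non-interactive: URL and output file from the command line
    ruby scraper.rb https://www.bbc.com/news scrapedData.csv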

Sample Outputs 📊

After running the scraper on a sample URL, your scrapedData.csv might look like this:

Index,Title,Link
1,Audio,https://www.bbc.co.uk/sounds
2,Weather,https://www.bbc.com/weather
3,Newsletters,https://www.bbc.com/newsletters
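
Since the Overview mentions feeding this CSV into other projects, here is a minimal sketch of reading the output back with Ruby's standard CSV library (it assumes only the default scrapedData.csv filename and the header row shown above):

    require 'csv'

    # Iterate over the scraper's output, using the header row for column names.
    CSV.foreach('scrapedData.csv', headers: true) do |row|
      puts "#{row['Index']}: #{row['Title']} -> #{row['Link']}"
    end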

Resources Used 📚