
Web Scraper 🤖

Learning Ruby one step at a time...

Important

This is my introduction to Ruby! I have experience with similar languages (e.g. Python, Kotlin, JavaScript), but I had never even seen Ruby syntax before making this, so it may not be my best work 😢.

A wise man once told me, "Learn Ruby, you'll like it."

Overview 🌍

This simple web scraper takes any user-supplied URL, scrapes every hyperlink from that page, and outputs them to a CSV file. It can easily be integrated into a machine learning project to routinely update a CSV dataset: to scrape different content, just change what the Nokogiri doc object looks for, as in the sketch below.
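
For example, to collect image sources instead of hyperlinks, only the CSS selector and the attribute read from each node need to change. Here is a minimal sketch of that idea (the 'img' selector and 'src' attribute are illustrative, not part of this repo's script):

    require 'nokogiri'
    require 'open-uri'

    # Illustrative tweak: grab image URLs instead of hyperlinks.
    # Only the selector ('img') and attribute ('src') differ from the link scraper.
    doc = Nokogiri::HTML(URI.open('https://www.bbc.com/news', "User-Agent" => "Mozilla/5.0"))
    image_urls = doc.css('img').map { |img| img['src'] }.compact
    puts image_urls.first(5)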

How to Use 🔧

  1. Install the necessary gems:

    These pre-built libraries provide the HTML-parsing and CSV-writing functionality the scraper relies on

    gem install nokogiri
    gem install csv
  2. Run the scraper script (a sample command-line invocation is shown after these steps):

     require 'nokogiri'
     require 'open-uri'
     require 'csv'
     require 'uri'

     # The URL and output file can be passed as command-line arguments;
     # otherwise the script asks for a URL interactively.
     url = ARGV[0] || begin
       puts "Please input the URL you want to scrape: "
       gets.chomp
     end
     output_file = ARGV[1] || 'scrapedData.csv'

     begin
       html = URI.open(url, "User-Agent" => "Mozilla/5.0")
       doc = Nokogiri::HTML(html)

       # Keep only anchors whose href is an absolute http(s) URL.
       links = doc.css('a')
       filtered_links = links.select { |link| link['href'] =~ /^http/ }

       CSV.open(output_file, 'wb') do |csv|
         csv << ['Index', 'Title', 'Link']
         filtered_links.each_with_index do |link, index|
           title = link.text.strip.empty? ? "No Title" : link.text.strip
           absolute_link = URI.join(url, link['href']).to_s
           csv << [index + 1, title, absolute_link]
         end
       end

       puts "Total links found: #{filtered_links.size}"
       puts "Links saved to #{output_file}"
     rescue OpenURI::HTTPError => e
       puts "HTTP Error: #{e.message}"
     rescue CSV::MalformedCSVError => e
       puts "CSV Error: #{e.message}"
     rescue StandardError => e
       puts "An error occurred: #{e.message}"
     end

    Here are some sample URLs you could use:

    https://www.bbc.com/news
    https://www.theweathernetwork.com/en
    https://github.com/
    
  3. Check the generated scrapedData.csv file for the scraped hyperlinks.
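
Assuming the script is saved as scraper.rb (the filename here is an assumption; use whatever the script file in this repo is called), it can be run interactively or with the URL and output file passed as arguments:

    # prompts for a URL, writes to scrapedData.csv
    ruby scraper.rb

    # non-interactive: URL and output file from the command line
    ruby scraper.rb https://www.bbc.com/news scrapedData.csv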

Sample Outputs 📊

After running the scraper on a sample URL, your scrapedData.csv might look like this:

Index,Title,Link
1,Audio,https://www.bbc.co.uk/sounds
2,Weather,https://www.bbc.com/weather
3,Newsletters,https://www.bbc.com/newsletters
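
Since the Overview mentions feeding this CSV into other projects, here is a minimal sketch of reading the output back with Ruby's standard CSV library (it assumes only the default scrapedData.csv filename and the header row shown above):

    require 'csv'

    # Iterate over the scraper's output, using the header row for column names.
    CSV.foreach('scrapedData.csv', headers: true) do |row|
      puts "#{row['Index']}: #{row['Title']} -> #{row['Link']}"
    end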

Resources Used 📚