
Simple ruby script to fetch and download web content of given urls


devmtnaing/web-fetcher


This is a simple Ruby script that fetches and downloads the web content of the given URLs.

Getting started

Usage

Basic command to fetch and download one or more URLs:

ruby fetch.rb http://google.com http://youtube.com
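As a rough illustration of how a command line like the one above could be interpreted, the sketch below separates URLs from flags. The helper name `parse_args` and the returned keys are assumptions made for this example; the actual fetch.rb may parse its arguments differently.

```ruby
# Hypothetical helper (not taken from fetch.rb): split command-line
# arguments into URLs and the two flags described in this README.
def parse_args(argv)
  # Anything starting with "--" is treated as a flag, the rest as URLs.
  flags, urls = argv.partition { |arg| arg.start_with?("--") }
  {
    urls: urls,
    metadata: flags.include?("--metadata"),
    save_assets: flags.include?("--save-assets")
  }
end

options = parse_args(["http://google.com", "http://youtube.com", "--metadata"])
puts options[:urls].length      # number of URLs to fetch
puts options[:metadata]         # whether --metadata was given
```

With this layout, flags can appear anywhere on the command line, before or after the URLs.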

metadata flag

Use --metadata to record and display the number of links, the number of images, and the time of the last fetch.

ruby fetch.rb http://google.com --metadata
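The kind of metadata described above could be gathered roughly as follows. This is a minimal sketch using only the standard library: the helper name `page_metadata`, the hash keys, and the regex-based counting are all assumptions, and the real script may well use an HTML parser instead.

```ruby
require "time"

# Hypothetical helper (not taken from fetch.rb): count <a> and <img>
# tags in an HTML string and record when the page was fetched.
# A regex scan is only an approximation of what an HTML parser would do.
def page_metadata(site, html, fetched_at = Time.now.utc)
  {
    site: site,
    num_links: html.scan(/<a\b[^>]*>/i).length,   # opening <a ...> tags
    images: html.scan(/<img\b[^>]*>/i).length,    # <img ...> tags
    last_fetch: fetched_at.iso8601
  }
end

html = '<a href="/about">About</a> <img src="logo.png"> <a href="/help">Help</a>'
meta = page_metadata("example.com", html)
puts "links: #{meta[:num_links]}, images: #{meta[:images]}"
```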

save-assets flag

Use --save-assets to download assets (images, JS, and CSS) to a local folder. The downloaded HTML then references the assets from that local folder.

ruby fetch.rb http://google.com --save-assets
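One way the rewriting step could work is sketched below: asset references in the HTML are resolved against the page URL, mapped to a path in a local folder, and replaced in place. The helper name `localize_assets`, the `assets` folder name, and the regex-based matching are assumptions for illustration, not the script's actual implementation. The sketch only computes what to download and does not touch the network.

```ruby
require "uri"

# Hypothetical helper (not taken from fetch.rb): rewrite src/href
# attributes on <img>, <script>, and <link> tags to point into
# assets_dir. Returns the rewritten HTML and a hash that maps each
# absolute asset URL to its local path.
def localize_assets(html, base_url, assets_dir)
  downloads = {}
  rewritten = html.gsub(/<(?:img|script|link)\b[^>]*>/i) do |tag|
    # The lookbehind keeps attributes such as data-src from matching.
    tag.sub(/(?<![\w-])(src|href)=(["'])(.*?)\2/i) do
      m = Regexp.last_match
      attr, quote, ref = m[1], m[2], m[3]
      next m[0] if ref.empty? || ref.start_with?("data:")

      absolute = URI.join(base_url, ref).to_s
      name = File.basename(URI(absolute).path)
      next m[0] if name.empty? || name == "/"

      local = File.join(assets_dir, name)
      downloads[absolute] = local
      "#{attr}=#{quote}#{local}#{quote}"
    end
  end
  [rewritten, downloads]
end

page = '<img src="/img/logo.png"><script src="js/app.js"></script>'
html, downloads = localize_assets(page, "http://example.com/", "assets")
puts html
downloads.each { |url, path| puts "#{url} -> #{path}" }
```

Note that this sketch names local files by basename only, so two assets with the same file name in different remote folders would collide; a fuller version would need a collision-free naming scheme.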

Running the script

How to run on a local machine

Install the required gems

bundle install

See the Usage section for the available commands

# Example
ruby fetch.rb http://google.com http://youtube.com

Make it an executable file

chmod +x fetch.rb

# Example
./fetch.rb http://google.com --metadata --save-assets

How to run with Docker

Build a Docker image

docker build -t image_name .

See the Usage section for the available commands

docker run image_name http://google.com http://youtube.com

To inspect the downloaded content, open a shell in a container started from the image.

docker run -it --entrypoint sh image_name

# fetch.rb is already executable
# Example command inside image
./fetch.rb http://google.com --metadata --save-assets

Future development

Limitations

  • Cannot properly fetch web pages built with React or Angular.
  • Cannot properly fetch web pages that rely on JavaScript to fully load their content. This is slightly different from the React/Angular case above.
  • --save-assets does not download base64-encoded images.
  • picture HTML tags are not rendered properly in the downloaded HTML, even though their images have been downloaded.
  • Images from img tags that have data-src but no src cannot be downloaded.
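For the base64 limitation, one possible direction is to detect data URIs and decode them instead of trying to download them. The sketch below is an assumption about how that might look, not something fetch.rb does today; `decode_data_uri` is a hypothetical name, and it uses the core `String#unpack1("m")` method for base64 decoding.

```ruby
# Hypothetical helper (not part of fetch.rb): decode a base64 data URI
# such as "data:image/png;base64,...." into [mime_type, bytes].
# Returns nil for anything that is not a base64 data URI.
def decode_data_uri(uri)
  match = uri.match(%r{\Adata:([\w.+-]+/[\w.+-]+)?(?:;[\w-]+=[^;,]*)*;base64,(.*)\z}m)
  return nil unless match

  mime = match[1] || "text/plain"   # default type when none is given
  bytes = match[2].unpack1("m")     # "m" is the base64 unpack directive
  [mime, bytes]
end

mime, bytes = decode_data_uri("data:image/png;base64,aGVsbG8=")
puts "#{mime}: #{bytes.bytesize} bytes"
```

The decoded bytes could then be written to the local assets folder with `File.binwrite`, using a file extension derived from the MIME type.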

Interesting

  • When fetching Medium, its web content is downloaded locally without problems. However, opening the downloaded HTML renders only a 404 page instead of the content. Domain name issue?

Todo

  • Refactor the code into modules and classes, with tests
  • Address the items listed under Limitations
  • Improve the asset-downloading workflow considerably and refactor it (DRY)
