This is a simple Ruby script that fetches and downloads the web content of the given URLs.
Basic command to fetch and download URLs:
ruby fetch.rb http://google.com http://youtube.com

Use --metadata to record and display the number of links, the number of images, and the last fetched time:
ruby fetch.rb http://google.com --metadata

Use --save-assets to download assets (images, JS, and CSS) to a local folder.
The downloaded HTML content will reference the assets from the local folder.
ruby fetch.rb http://google.com --save-assets

Install the required gems:
bundle install

See the Usage section for the available commands.
# Example
ruby fetch.rb http://google.com http://youtube.com

Make it an executable file:
chmod +x fetch.rb
# Example
./fetch.rb http://google.com --metadata --save-assets

Build a Docker image:
docker build -t image_name .

See the Usage section for the available commands.
docker run image_name http://google.com http://youtube.com

To check the downloaded content, open a shell in a container from the image:
docker run -it --entrypoint sh image_name
# fetch.rb is already executable
# Example command inside image
./fetch.rb http://google.com --metadata --save-assets

- Not able to properly fetch React/Angular-powered web pages.
- Not able to properly fetch web pages that must run JavaScript to fully load their content. This is a bit different from React/Angular web apps.
- --save-assets does not download base64 images.
- picture HTML tags are not rendered properly in the downloaded HTML content, even though the images have already been downloaded.
- Image tags with data-src but without src cannot be downloaded.
- When fetching Medium, its web content is downloaded locally without problems. However, opening the downloaded HTML renders only a 404 page instead of the content. Possibly a domain name issue?
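The data-src and base64 items above could be approached along the following lines. This is only a minimal sketch using the Ruby standard library, not the code in fetch.rb: the helper name image_source and the attribute hash it takes are assumptions made for illustration. It falls back to data-src when src is missing, skips inline data: URIs (which would need decoding rather than downloading), and resolves relative paths against the page URL.

```ruby
require "uri"

# Hypothetical helper (not part of fetch.rb): given the attributes of an
# image tag as a Hash and the URL of the page it came from, return the
# absolute URL to download, or nil when there is nothing to fetch.
def image_source(attributes, base_url)
  # Prefer src, fall back to data-src; ignore blank values.
  raw = %w[src data-src]
        .map { |name| attributes[name].to_s.strip }
        .find { |value| !value.empty? }

  # Inline base64 images carry their bytes in the attribute itself,
  # so there is no remote file to download.
  return nil if raw.nil? || raw.start_with?("data:")

  # Resolve relative paths against the page URL.
  URI.join(base_url, raw).to_s
end
```

For example, image_source({ "data-src" => "/img/a.png" }, "http://example.com/blog/post") resolves to http://example.com/img/a.png, while a src that starts with data: yields nil.
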
- Refactor the code into modules and classes with tests
- Address the limitations listed above
- Improve the asset-downloading workflow and refactor it to remove duplication (DRY)
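As one possible shape for the "modules and classes with tests" item, the figures reported by --metadata could live in a small class that takes an HTML string instead of fetching it, which makes it easy to unit test. The class name PageMetadata and its fields are hypothetical, and the regex-based tag counting is a stand-in chosen to keep the sketch dependency-free; the actual script may well use an HTML parser.

```ruby
require "time"

# Hypothetical class (names are illustrative, not taken from fetch.rb).
# Counts anchor and image tags in an HTML string and records the fetch time.
class PageMetadata
  attr_reader :num_links, :num_images, :last_fetch

  def initialize(html, fetched_at: Time.now.utc)
    # \b keeps <a ...> from matching tags such as <abbr> or <article>.
    @num_links = html.scan(/<a\b[^>]*>/i).size
    @num_images = html.scan(/<img\b[^>]*>/i).size
    @last_fetch = fetched_at
  end

  # Plain Hash, convenient for printing or serialising.
  def to_h
    { num_links: num_links, num_images: num_images, last_fetch: last_fetch.iso8601 }
  end
end
```

Passing fetched_at explicitly keeps tests deterministic; in normal use it defaults to the current UTC time.
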