Author: will
This is a simple scraping script originally created for the SaveWuhan/NewsCoverageOnWuhan repository. The script turns a distracting web view into a clean, formatted Markdown (.md) file.
Node.js is required to run this script.
$ yarn install
or
$ npm install
$ node scrape url [options]
By default, the script uploads images to cloudinary so that each image keeps rendering even if the original work is deleted.
Ideally, you would have a .env file with the following entries specified
# for imgur
export client_id=[your_imgur_client_id]
# for cloudinary
export cloud_name=[your_cloudinary_cloud_name]
export cloud_api_key=[your_cloudinary_cloud_api_key]
export cloud_api_secret=[your_cloudinary_cloud_api_secret]
and run $ source .env before running the script.
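As a quick sanity check before running the script, the following sketch reports which of the variables above are unset for a given host. The helper name `missingCredentials` and the grouping of variables by host are assumptions made for illustration; they are not part of the script itself.

```javascript
// Hypothetical helper: lists which credential variables are unset for a host.
// The variable names come from the .env entries above; the grouping by host
// is an assumption for illustration.
const REQUIRED = {
  imgur: ['client_id'],
  cloudinary: ['cloud_name', 'cloud_api_key', 'cloud_api_secret'],
};

function missingCredentials(host, env = process.env) {
  const names = REQUIRED[host];
  if (!names) {
    throw new Error(`Unknown host: ${host}`);
  }
  // A variable counts as missing when it is undefined or an empty string.
  return names.filter((name) => !env[name]);
}

const missing = missingCredentials('cloudinary');
if (missing.length > 0) {
  console.warn(`Missing cloudinary credentials: ${missing.join(', ')}`);
}
```

If the warning lists any names, either source the .env file again or run the script without image uploading.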
- If you have no imgur or cloudinary credentials, the -n, --no-replace flags can be used to prevent any image uploading. However, if you do wish to contribute to SaveWuhan/NewsCoverageOnWuhan, we require each image to be uploaded to one of these hosting services, which are accessible outside China.
  - example:
    $ node scrape https://mp.weixin.qq.com/s/U4IrYQcPc6G-ce9X5eRE_g -n
- You can specify the image hosting service using the --host flag. Nonetheless, cloudinary is strongly recommended for its reliability and generous rate limits.
  - options: cloudinary, imgur
  - default: cloudinary
  - example:
    $ node scrape https://mp.weixin.qq.com/s/U4IrYQcPc6G-ce9X5eRE_g --host=imgur
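To make the flag semantics above concrete, here is a minimal sketch of how the options could be interpreted. It is not the script's actual parser: `parseOptions` is a hypothetical function, and only the flags, the allowed host values, and the default are taken from this README.

```javascript
// Hypothetical sketch of the option semantics described above.
const HOSTS = ['cloudinary', 'imgur'];

function parseOptions(args) {
  // Defaults: upload images, and use cloudinary as the host.
  const options = { url: null, replace: true, host: 'cloudinary' };
  for (const arg of args) {
    if (arg === '-n' || arg === '--no-replace') {
      options.replace = false; // skip image uploading entirely
    } else if (arg.startsWith('--host=')) {
      const host = arg.slice('--host='.length);
      if (!HOSTS.includes(host)) {
        throw new Error(`Unsupported host: ${host}`);
      }
      options.host = host;
    } else if (!arg.startsWith('-') && options.url === null) {
      options.url = arg; // first positional argument is the page to scrape
    }
  }
  return options;
}

console.log(parseOptions(['https://mp.weixin.qq.com/s/U4IrYQcPc6G-ce9X5eRE_g', '--host=imgur']));
```

Rejecting unknown hosts up front is a design choice of this sketch; it surfaces a mistyped --host value before any page is fetched.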