Web Scraping of TED.com for complete Metadata, Transcript, Audio, Video, Images using Parallel Programming.
Environment: Google Colab with Google Drive without any Hardware Accelerator. Python: 3.6.9
I was looking for an interesting dataset for a personal Data Science project, and I'm a fan of TED. I searched for a TED dataset and found Rounka's, but it is incomplete and outdated. So I scraped the site myself and made the scraper fast using Parallel Programming. It now downloads the Metadata and Transcript of all 4609 Talks on the website* in about 300 seconds. This is the most comprehensive TED Talk dataset, and it includes media files (images, audio, and video) too!
*Scraped on 24-JUN-20. Anyone can rerun the code to scrape the entire TED.com and get the latest dataset in about 5 minutes.
Downloading the media files takes less than 2 hours in total: 2 minutes for the Speaker and Talk photos, 10 minutes for the Audio, and 1.5 hours for the Videos.
TED_Talk.xlsx and TED_Talk.csv contain the Metadata and Transcript. Folder names are intuitive. All media files are named by talk__id, except in the PHOTO__SPEAKER folder, where files are named by the speaker__id of the primary Speaker.
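The naming rule above can be turned into a small lookup helper. This is only a sketch: the function name media_path, the root directory, the "VIDEO" folder name, and the file extensions are hypothetical placeholders, since only PHOTO__SPEAKER, talk__id, and speaker__id are named here. Check the actual folder names and extensions in the dataset before using it.

```python
from pathlib import Path


def media_path(root, folder, row, extension):
    """Build the expected path of one media file for one talk.

    root      -- directory that holds the media folders (assumption)
    folder    -- name of the media folder, e.g. "PHOTO__SPEAKER"
    row       -- mapping with at least "talk__id" and "speaker__id"
    extension -- file extension including the dot (assumption, e.g. ".jpg")
    """
    # Speaker photos are keyed by the primary speaker's id;
    # every other media file is keyed by the talk's id.
    key = "speaker__id" if folder == "PHOTO__SPEAKER" else "talk__id"
    return Path(root) / folder / "{}{}".format(row[key], extension)


# Hypothetical row, as it might be read from TED_Talk.csv.
example_row = {"talk__id": 1, "speaker__id": 7}
speaker_photo = media_path("data", "PHOTO__SPEAKER", example_row, ".jpg")
talk_video = media_path("data", "VIDEO", example_row, ".mp4")  # "VIDEO" is a guess
```

Taking the folder and extension as arguments keeps the helper independent of the exact layout, so only the two calls at the bottom need to change.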
The code shows a way to scrape at scale.
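To illustrate the general idea of parallel scraping, here is a minimal sketch that fetches many pages concurrently with a thread pool from the standard library. It is not the code used to build this dataset: the function names fetch and scrape_all, the worker count, and the timeout are assumptions chosen for the example, and the real scraper may use a different parallelism mechanism. Threads suit this kind of job because the work is dominated by waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch_one=fetch, max_workers=16):
    """Fetch every URL concurrently.

    Returns a dict mapping each URL to its page body, or to the
    exception raised while fetching it, so one bad page does not
    abort the whole run.
    """
    urls = list(urls)  # allow generators; we iterate over the URLs twice

    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception as exc:  # record the failure and keep going
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

A call such as scrape_all(list_of_talk_urls) would return the raw pages, which can then be parsed for metadata and transcripts. When scraping a live site, keep max_workers modest and respect the site's terms of use.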