Usage

A web scraper built with Beautiful Soup that fetches Udemy course information.

Usage

This section shows the basic usage of the script. Be sure to install it before importing it in your file.

As a Module

Udemyscraper contains a UdemyCourse class which can be imported into your file. It takes one argument, query, which is the search query. After creating a UdemyCourse object, call its fetch_course method to fetch the course data.

from udemyscraper import UdemyCourse

course = UdemyCourse('learn javascript')
course.fetch_course()

As a Script

If you do not want to use the module in your own Python file and only need to dump the data, you can invoke the udemyscraper.py file directly with a variety of arguments and options.

To do so, run udemyscraper.py and pass the required arguments.

python3 udemyscraper.py <command>

Here is an example of dumping the data as a JSON file.

python3 udemyscraper.py -d json -q "German course for beginners"
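If the script has to be driven from another Python program, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the -q, -d and -b flags shown in this README. build_command and run_scraper are hypothetical helpers for illustration and are not part of udemyscraper.

```python
import subprocess
import sys


def build_command(query, dump_format=None, browser=None, script="udemyscraper.py"):
    """Assemble the argument list for invoking udemyscraper.py.

    Only flags that appear in this README are used: -q, -d and -b.
    """
    command = [sys.executable, script, "-q", query]
    if dump_format is not None:
        command += ["-d", dump_format]
    if browser is not None:
        command += ["-b", browser]
    return command


def run_scraper(query, **options):
    """Run the script; raises CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(query, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(build_command("German course for beginners", dump_format="json"))
```

Passing the arguments as a list (rather than one shell string) means a query containing spaces or quotes does not need any extra escaping.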

List of Commands

Installation

Virtual Environment

Before installing the dependencies, it is recommended to set up a virtual environment.

You can set up a virtual environment on your machine by installing the virtualenv library, creating an environment, and then activating it.

pip install virtualenv

virtualenv somerandomname

Activating for *nix

source somerandomname/bin/activate

Activating for Windows

somerandomname\Scripts\activate

Dependencies Installation

You are required to install all of the modules listed in the requirements.txt file.

pip install -r requirements.txt

Browser Setup

A browser window may not pop up, because the headless option is enabled so that the entire process takes minimal resources.

This script works with Firefox as well as Chrome.

Chrome (or chromium)

To run this script you need to have Chrome (or Chromium) installed on the machine, as well as the chromedriver binary, which can be downloaded from this page. Make sure that the binary works on your platform/architecture and that the driver version corresponds to the version of the browser you have downloaded.

A Windows binary of the driver, which supports Chrome (or Chromium) 92, is already provided in the repo itself. You can use that, or you can get your specific driver from the link above.

Chrome is the default browser, but you can also select it explicitly by passing in an argument while initializing the class.

mycourse = UdemyCourse(browser_preference="CHROME")

Or you can pass in an argument when using it as a script:

python3 udemyscraper.py -b chrome

Firefox

To run this script with Firefox, you need to have Firefox installed, as well as the geckodriver executable in this directory or on your path. You can download geckodriver from here, or use the one provided with the source code.

To use Firefox instead of Chrome, you can pass in an argument while initializing the class:

mycourse = UdemyCourse(browser_preference="FIREFOX")

Or you can pass in an argument while using udemyscraper.py:

python3 udemyscraper.py -b firefox
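Since both browsers depend on a driver executable being present, it can help to check for it before starting a scrape. The following is a minimal sketch, assuming the driver is either in the current directory or on your path, as described above. find_driver is a hypothetical helper and is not part of udemyscraper.

```python
import shutil
from pathlib import Path


def find_driver(name, search_dir="."):
    """Return the path to a driver executable, or None if it cannot be found.

    Looks in search_dir first, then falls back to the directories on PATH.
    """
    for candidate in (name, name + ".exe"):  # .exe covers Windows binaries
        local = Path(search_dir) / candidate
        if local.is_file():
            return str(local.resolve())
    return shutil.which(name)


if __name__ == "__main__":
    for driver in ("chromedriver", "geckodriver"):
        print(driver, "->", find_driver(driver) or "not found")
```

This only confirms that a file with the expected name exists. It does not verify that the driver version matches the installed browser.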

Suppressing Browser

Headless Disabled	Headless Enabled

19 Seconds	12 Seconds

In the above comparison, the headless run (12 seconds) completed noticeably faster than the run with headless disabled (19 seconds). By suppressing the browser you save not only time but also system resources.

The headless option is enabled by default. If you want to disable it for debugging purposes, you can do so by setting the headless argument to False:

mycourse = UdemyCourse(headless=False)

Or specify the same for udemyscraper.py

python3 udemyscraper.py -h false

Approach

It is fairly easy to scrape most websites; however, some sites are not that scrape-friendly. Scraping sites is, in itself, perfectly legal, but there have been lawsuits against web scraping. Some companies (*cough* Amazon *cough*) consider scraping their websites illegal while scraping other websites themselves. And then there are sites like Udemy that try to prevent people from scraping them.

Using BS4 by itself doesn't give the required results, so I had to use a browser engine through Selenium to fetch the course information. Initially even that didn't work, but then I realised the courses were being fetched asynchronously, so I had to add a bit of delay. As a result, fetching the data can be a bit slow initially.
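The script itself adds a delay before reading the page. As an illustration of a related technique (polling until the content is ready instead of sleeping for a fixed time), here is a minimal sketch in plain Python. wait_until and the content_loaded stand-in are hypothetical names; this is not how udemyscraper is implemented.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a page whose content only appears on the third check.
    attempts = {"count": 0}

    def content_loaded():
        attempts["count"] += 1
        return "course data" if attempts["count"] >= 3 else None

    print(wait_until(content_loaded, timeout=2.0, interval=0.1))
```

Compared with a fixed sleep, polling returns as soon as the content is available and fails with a clear error if it never appears.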

Why not just use Udemy's API?

I thought of that too, after some digging around, as I did not know that such an API existed. However, it requires you to already have a Udemy account. I might add support for this API in the future, but right now I would like to keep things simple. Moreover, this kind of front-end web scraping does not require authentication.

Data

The following tables list all of the data that can be fetched.

Course Class

This is the data of the parent class, which is the course class itself.

Name	Type	Description	Usage
`link`	URL (String)	URL of the course	`course.link`
`title`	String	Title of the course	`course.title`
`headline`	String	The headline usually displayed under the title	`course.headline`
`instructors`	String	Name of the instructor of the course	`course.instructors`
`rating`	Float	Rating of the course out of 5	`course.rating`
`no_of_ratings`	Integer	Number of ratings the course has received	`course.no_of_ratings`
`duration`	String	Duration of the course in hours and minutes	`course.duration`
`no_of_lectures`	Integer	Gives the number of lectures in the course (lessons)	`course.no_of_lectures`
`no_of_sections`	Integer	Gives the number of sections in the course	`course.no_of_sections`
`tags`	List	Is the list of tags of the course (Breadcrumbs)	`course.tags[1]`
`price`	Float	Price of the course in local currency	`course.price`
`student_enrolls`	Integer	Gives the number of students enrolled	`course.student_enrolls`
`language`	String	Gives the language of the course	`course.language`
`objectives`	List	List containing all the objectives for the course	`course.objectives[2]`
`Sections`	List	List containing all the section objects for the course	`course.Sections[2]`
`requirements`	List	List containing all the requirements for the course	`course.requirements`
`description`	String	Gives the description paragraphs of the course	`course.description`
`target_audience`	List	List containing the points under Target Audience heading	`course.target_audience`
`banner`	String	URL for the course banner image	`course.banner`

Section Class

Name	Type	Description	Usage
`name`	String	Returns the name of the section of the course	`course.Sections[4].name`
`duration`	String	The duration of the specific section	`course.Sections[4].duration`
`Lessons`	List	List with all the lesson objects for the section	`course.Sections[4].Lessons[2]`
`no_of_lessons`	Integer	Gives the number of lessons in the particular Section	`course.Sections[4].no_of_lessons`

Lesson Class

Name	Type	Description	Usage
`name`	String	Gives the name of the lesson	`course.Sections[4].Lessons[2].name`
`demo`	Boolean	Whether the lesson can be previewed or not	`course.Sections[4].Lessons[2].demo`
`duration`	String	The duration of the specific lesson	`course.Sections[4].Lessons[2].duration`
`type`	String	Tells what type of lesson it is. (Video, Article, Quiz)	`course.Sections[4].Lessons[2].type`
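The three tables describe a nested structure: a course holds a list of sections, and each section holds a list of lessons. The following sketch walks that structure using the attribute names from the tables. The stand-in objects are built with SimpleNamespace and all values are made up for illustration; a real object would come from calling fetch_course() on a UdemyCourse. previewable_lessons is a hypothetical helper.

```python
from types import SimpleNamespace

# Stand-in data shaped like the tables above. The values (including the
# duration strings) are invented; real data comes from UdemyCourse.fetch_course().
course = SimpleNamespace(
    title="Example course",
    Sections=[
        SimpleNamespace(
            name="Introduction",
            Lessons=[
                SimpleNamespace(name="Welcome", demo=True, duration="02:10", type="Video"),
                SimpleNamespace(name="Setup", demo=False, duration="05:00", type="Article"),
            ],
        ),
        SimpleNamespace(
            name="Basics",
            Lessons=[
                SimpleNamespace(name="Variables", demo=True, duration="08:30", type="Video"),
            ],
        ),
    ],
)


def previewable_lessons(course):
    """Return (section name, lesson name) pairs for lessons with demo == True."""
    return [
        (section.name, lesson.name)
        for section in course.Sections
        for lesson in section.Lessons
        if lesson.demo
    ]


if __name__ == "__main__":
    for section_name, lesson_name in previewable_lessons(course):
        print(f"{section_name}: {lesson_name}")
```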

Output/ Dumping data

Quick Display

When executing the file as a script, this is the default output mode and perhaps the briefest one.

(env) F:\Github\udemy-web-scraper> python udemyscraper.py -q "Learn Python" --quiet -n
===================== Fetched Course ===================== 

Learn Python Programming Masterclass

This Python For Beginners Course Teaches You The Python 
Language Fast. Includes Python Online Training With Python 3

URL: https://udemy.com/course/python-the-complete-python-developer-course/
Instructed by Tim Buchalka
4.5 out of 5 (79,526)
Duration: 64h 33m
469 Lessons and 25 Sections

The quick_display function can also be called when using udemyscraper as a module.

from udemyscraper import quick_display

# Assuming you have already created a course object and fetched the data
quick_display(course)

Converting to Dictionary

The entire course object, including its nested section and lesson objects, can be converted into a dictionary.

from udemyscraper import course_to_dict
# Assuming you have already created a course object and fetched the data
dictionary_course = course_to_dict(course)

Note: This way of returning data does not work when the script is invoked directly, since there is no calling Python code to receive the dictionary.
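To make the idea of a nested object-to-dictionary conversion concrete, here is a minimal recursive sketch. It is not the implementation of course_to_dict; to_plain and the Lesson and Section stand-in classes are hypothetical and only mirror the attribute names used in the tables above.

```python
def to_plain(value):
    """Recursively convert objects, lists and dicts into plain Python data."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    # Any other object: convert its instance attributes one by one.
    return {key: to_plain(item) for key, item in vars(value).items()}


class Lesson:  # hypothetical stand-in for a lesson object
    def __init__(self, name, demo):
        self.name = name
        self.demo = demo


class Section:  # hypothetical stand-in for a section object
    def __init__(self, name, lessons):
        self.name = name
        self.Lessons = lessons


if __name__ == "__main__":
    section = Section("Introduction", [Lesson("Welcome", True)])
    print(to_plain(section))
```

The sketch relies on vars(), so it only handles objects that store their data as ordinary instance attributes.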

Dumping as JSON

Currently, the script can convert the entire course into a dictionary, serialize it as JSON, and dump it to a JSON file. You can do this by calling the course_to_json() function.

from udemyscraper import course_to_json

# Assuming you have already created a course object and fetched the data
course_to_json(course)

This will dump the data to the object.json file in the same directory. You can also specify the name of the file by passing in the corresponding argument:

course_to_json(course, 'course.json')

The object will now be stored in the course.json file.

Here is an example of what the file will look like. (The file has been truncated.)
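The dictionary-to-file step can be illustrated with the standard json module. This is a minimal sketch under the assumption that the course has already been converted to a plain dictionary; dump_json and load_json are hypothetical helpers, not the library's course_to_json, and the default file name simply mirrors the object.json default mentioned above.

```python
import json
import tempfile
from pathlib import Path


def dump_json(data, path="object.json"):
    """Write a dictionary to path as indented UTF-8 JSON and return the path."""
    target = Path(path)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4, ensure_ascii=False)
    return target


def load_json(path):
    """Read a JSON file back into Python data."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Made-up sample data; a temporary directory keeps the working directory clean.
    sample = {"title": "Example course", "rating": 4.5, "tags": ["Development"]}
    with tempfile.TemporaryDirectory() as folder:
        written = dump_json(sample, Path(folder) / "course.json")
        print(load_json(written) == sample)
```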

Dumping as CSV

Not implemented yet.

Dumping as XML

Currently in development. Kindly check out Pull Request #21 for more information and progress.

For Jellyfin users

Jellyfin metadata uses an XML structure for its .nfo files. For images, we only have one resource, which is the poster of the file. It might be possible to write a custom XML structure for Jellyfin. Currently in development.

Contributing

Issues, PRs, and discussions are always welcome, but please open an issue for the feature/code you would be modifying before starting a PR.

There are currently lots of features I would like to add to this script. You can check this page to see the current progress.

For further instructions, do read contributing.md.
