This project's requirements on the user's machine are fairly minimal, but a few baselines must be met:
- The project requires Google Chrome to work.
- The project requires ChromeDriver, maintained by the Chromium team, to be installed in the root directory of the project in order to enable scraping (see Step 2 under Installation Instructions, below).
- The project requires a working installation of Python to scrape new course content. The `requirements.txt` file lists the packages necessary for the script to run. If you plan to scrape new course content into the project's ElasticSearch index, please ensure your Python environment satisfies these requirements. (TODO - Create requirements.txt file for Python packages)
- As the extension is not deployed to the Google Chrome Web Store, it requires a local copy of the codebase on the user's computer (see Step 1 under Installation Instructions, below).
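Until `requirements.txt` exists, a quick pre-flight check can at least confirm which dependencies are importable in your environment. This is only a sketch: the package names below are guesses based on the project's description (Selenium-driven scraping pushed to ElasticSearch), not confirmed project requirements.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Hypothetical dependency list; replace with the real one once
# requirements.txt is added to the repository.
print(missing_packages(["selenium", "elasticsearch"]))
```

Anything printed by the last line would need to be installed (e.g. with `pip`) before running the scraper.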
Installing the extension is quite simple: download the code from GitHub, then activate the extension in Chrome. A step-by-step guide is below:
- Pull the code from GitHub to `desiredDirectory` using your shell:
  ```
  cd desiredDirectory
  git clone https://github.com/christianopperman/CS410_Fall2023_CourseProject_TeamCAHJ.git
  ```
- Install the appropriate ChromeDriver for your computer's environment from this link, unzip it, and move the `Google Chrome for Testing` application to the `CS410_Fall2023_CourseProject_TeamCAHJ` directory created in Step 1, above.
- Open Google Chrome.
- Go to the Extensions page on Google Chrome by following this link.
- Activate Developer Mode by toggling the switch in the upper right corner labeled `Developer mode`.
- Load the extension from the codebase pulled to your computer in Step 1 by clicking the `Load unpacked` button in the top left corner.
- Select the `desiredDirectory/CS410_Fall2023_CourseProject_TeamCAHJ/ChromeExtension` directory in the popup and click `Select`.
- The extension should now be available to you in your Google Chrome Extensions list.
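Before moving on to scraping, you can sanity-check that a ChromeDriver binary actually landed in the project root as Step 2 requires. This is a minimal sketch, assuming the driver keeps its default filename (official builds ship as `chromedriver` / `chromedriver.exe`):

```python
from pathlib import Path

def chromedriver_present(project_root):
    """Return True if a file named like 'chromedriver*' exists in project_root."""
    root = Path(project_root)
    # Assumes the unzipped driver binary keeps its default name prefix.
    return any(p.name.startswith("chromedriver") for p in root.iterdir() if p.is_file())
```

If this returns `False` for your clone of `CS410_Fall2023_CourseProject_TeamCAHJ`, revisit Step 2 of the installation instructions.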
As mentioned in Requirements above, in order to scrape your own Coursera course transcripts into the extension, you will need a working version of Python that satisfies the required packages outlined in the `CourseraTranscriptScraper/requirements.txt` file.
Once you have that, scraping a new course into ElasticSearch is very easy:
- Navigate to `desiredDirectory/CS410_Fall2023_CourseProject_TeamCAHJ/CourseraTranscriptScraper` in your shell.
- Call the course scraper script with the following command line arguments:
  ```
  python scrape_coursera_course.py -c "course_url" -u "coursera_username" -p "coursera_password" [-e] [-o output_path]
  ```
- Required Arguments:
  - `-c` : The link to the landing page of the Coursera course you'd like to scrape
  - `-u` : The username of the Coursera account with access to the course you'd like to scrape
  - `-p` : The password of the Coursera account with access to the course you'd like to scrape
- Optional Arguments:
  - `-e` : A boolean flag. If included, the script will automatically push the scraped course transcriptions to ElasticSearch after saving them to disk; if not included, the transcriptions will be saved to disk but not pushed to ElasticSearch.
  - `-o` : The output path to write the transcriptions to, if you would like to save them under a specific filename.
- Once you run the above command, a window will pop up and automatically log you into Coursera. It is likely that you will be required to complete a CAPTCHA.
- Once you complete the CAPTCHA, return to your shell and press Enter, as prompted.
- The script will begin scraping, as evidenced by the pop-up window navigating between video pages in the course and the `Retrieved` messages in the shell window.
- The script will write any scraped transcriptions to the filepath specified by the `-o` command line argument, if present, and to `subtitles.json` if not.
- If the `-e` flag was passed to the script, the script will automatically push the scraped course's transcriptions to ElasticSearch.
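The command line interface described above can be sketched with `argparse`. The flags and the `subtitles.json` default come from the documentation above, but the long option names and help strings are illustrative assumptions, not taken from the actual script:

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented scrape_coursera_course.py flags."""
    parser = argparse.ArgumentParser(description="Scrape Coursera course transcripts")
    # Required arguments, per the list above.
    parser.add_argument("-c", "--course", required=True,
                        help="Landing page URL of the Coursera course to scrape")
    parser.add_argument("-u", "--username", required=True,
                        help="Username of the Coursera account with course access")
    parser.add_argument("-p", "--password", required=True,
                        help="Password of the Coursera account")
    # Optional arguments: -e pushes to ElasticSearch, -o overrides the output path.
    parser.add_argument("-e", "--elasticsearch", action="store_true",
                        help="Also push scraped transcriptions to ElasticSearch")
    parser.add_argument("-o", "--output", default="subtitles.json",
                        help="Output path for the scraped transcriptions")
    return parser

# Example invocation (hypothetical course URL and credentials):
args = build_parser().parse_args(
    ["-c", "https://www.coursera.org/learn/some-course", "-u", "user", "-p", "pw", "-e"]
)
```

With no `-o` flag, `args.output` falls back to `subtitles.json`, matching the behavior described in the steps above.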