
Goodreads Scraper

These Python scripts can be used to collect book reviews and metadata from Goodreads.

We were motivated to develop this Goodreads Scraper because the Goodreads API is difficult to work with and does not provide access to the full text of reviews. The Goodreads Scraper instead uses the web scraping libraries Beautiful Soup and Selenium to collect data.

We used this Goodreads Scraper to collect data for our article, "The Goodreads ‘Classics’: A Computational Study of Readers, Amazon, and Crowdsourced Amateur Criticism." To allow others to reproduce (approximately) the data we used in the essay, we include a file with 144 Goodreads book IDs for the 144 classics that we analyzed (goodreads_classics.txt). You can use these IDs to collect corresponding reviews and metadata with the Goodreads Scraper as described below.

Note: Updates to the Goodreads website may break this code. We don't guarantee that the scraper will continue to work in the future, but feel free to post an issue if you run into a problem.



Update (Fall 2022)

Goodreads recently updated their book pages with a new layout. We've heard that our scraper is still functioning, but we haven't fully tested or updated the scraper to account for these changes.



What You Need

To run these scripts, you will need Python 3.

You will also need the Python libraries listed in requirements.txt, including Beautiful Soup and Selenium. You can install them all by running pip install -r requirements.txt

Finally, you will need a web browser — either Chrome or Firefox. We have found that the Goodreads Scraper tends to function better with Firefox.



Tutorial

We recommend running these Python scripts from the command line, as the usage instructions below describe. However, we have also created a Jupyter notebook tutorial that demonstrates how to use the Goodreads Scraper scripts. Please note that these scripts may not work consistently from a Jupyter notebook environment and that the tutorial is mostly intended for demonstration purposes.

Scraping Goodreads Book Metadata

You can use the Python script get_books.py to collect metadata about books on Goodreads, such as the total number of Goodreads reviews and ratings, average Goodreads rating, and most common Goodreads "shelves" for each book.

get_books.py

Input

This script takes as input a list of book IDs, stored in a plain text file with one book ID per line. Book IDs are unique to Goodreads and can be found at the end of a book's URL. For example, the book ID for Little Women (https://www.goodreads.com/book/show/1934.Little_Women) is 1934.Little_Women.
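For example, a small helper (illustrative only, not part of the scraper) can pull the book ID out of a Goodreads book URL:

```python
from urllib.parse import urlparse

def book_id_from_url(url):
    """Return the Goodreads book ID, i.e. the last path segment of a book URL."""
    return urlparse(url).path.rstrip("/").split("/")[-1]

print(book_id_from_url("https://www.goodreads.com/book/show/1934.Little_Women"))
# 1934.Little_Women
```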

Output

This script outputs a JSON file for each book with the following information:

  • book ID and title (combined, as in the book's URL, e.g. 1934.Little_Women)
  • book ID
  • book title
  • year the book was first published
  • format information
  • cover image
  • publication type
  • author
  • number of pages in the book
  • genres
  • total number of ratings
  • total number of reviews
  • average rating
  • rating distribution

This script also outputs an aggregated JSON file with information about all the books that have been scraped. To output an aggregated CSV file in addition to a JSON file, use the flag --format CSV.
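As a quick sanity check on the scraped metadata, the average rating should be consistent with the rating distribution. A minimal sketch, using a hypothetical record whose field names are illustrative rather than the scraper's exact schema:

```python
# A minimal, hypothetical record mimicking one book's scraped metadata;
# the field names here are illustrative, not the scraper's exact schema.
record = {
    "book_id": "1934.Little_Women",
    "num_ratings": 10,
    "average_rating": 4.0,
    "rating_distribution": {"5": 4, "4": 3, "3": 2, "2": 1, "1": 0},
}

# Recompute the average rating from the distribution as a consistency check.
dist = record["rating_distribution"]
total_stars = sum(int(stars) * count for stars, count in dist.items())
total_ratings = sum(dist.values())
recomputed = total_stars / total_ratings

assert total_ratings == record["num_ratings"]
assert abs(recomputed - record["average_rating"]) < 0.01
print(recomputed)  # 4.0
```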

Usage

python get_books.py --book_ids_path your_file_path --output_directory_path your_directory_path --format your_format

format can be set to JSON (default) or CSV.

Example

python get_books.py --book_ids_path most_popular_classics.txt --output_directory_path goodreads_project/classic_book_metadata --format CSV



Scraping Goodreads Book Reviews

You can use the Python script get_reviews.py to collect reviews and review metadata about books on Goodreads, including the text of the review, star rating, username of the reviewer, number of likes, and categories or "shelves" that the user has tagged for the book.

get_reviews.py

Input

This script takes as input a list of book IDs, stored in a plain text file with one book ID per line. Book IDs are unique to Goodreads and can be found at the end of a book's URL. For example, the book ID for Little Women (https://www.goodreads.com/book/show/1934.Little_Women) is 1934.Little_Women.

Output

This script outputs a JSON file for each book with the following information:

  • book ID and title (combined, as in the book's URL)
  • book ID
  • book title
  • review URL
  • review ID
  • date of the review
  • star rating given by the reviewer
  • username of the reviewer
  • text of the review
  • number of likes the review received from other users
  • shelves to which the reviewer added the book

This script also outputs an aggregated JSON file with information about all the reviews for all the books that have been scraped. To output an aggregated CSV file in addition to a JSON file, use the flag --format CSV.

Goodreads only allows the first 10 pages of reviews to be shown for each book. There are 30 reviews per page, so you should expect a maximum of 300 reviews per book. By default, the reviews are sorted by their popularity. They can also be sorted chronologically to show either the newest or oldest reviews.
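Because each run caps out at roughly 300 reviews, one way to gather more unique reviews is to run the scraper once per sort order and merge the results, deduplicating by review ID. A minimal sketch of the merge step, with hypothetical in-line records standing in for the scraper's output:

```python
# Reviews returned by runs with different --sort_order values (hypothetical
# records; real output also carries the URL, date, rating, shelves, etc.).
popular = [{"review_id": "r1", "text": "Loved it."}, {"review_id": "r2", "text": "A classic."}]
newest = [{"review_id": "r2", "text": "A classic."}, {"review_id": "r3", "text": "Just finished."}]

# Merge the runs, keeping the first occurrence of each review ID.
merged = {}
for review in popular + newest:
    merged.setdefault(review["review_id"], review)

print(sorted(merged))  # ['r1', 'r2', 'r3']
```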

We also apply a filter so that only English-language reviews are collected.

Usage

python get_reviews.py --book_ids_path your_file_path --output_directory_path your_directory_path --browser your_browser_name --sort_order your_sort_order --rating_filter your_rating_filter --format your_format

sort_order can be set to default, newest, or oldest.

rating_filter can be omitted or set to any number in the range 1-5.

browser can be set to chrome or firefox.

format can be set to JSON (default) or CSV.

Example

python get_reviews.py --book_ids_path most_popular_classics.txt --output_directory_path goodreads_project/classic_book_reviews --sort_order default --rating_filter 5 --browser chrome



Test

You can run the provided test script to check that everything is working correctly.

./test_scripts.sh

This will create a directory called test-output in which you'll find the scraped books and reviews.



Extracting Goodreads Book IDs

The get_book_ids.py script extracts Goodreads book IDs based on specified criteria. Its usage and functionality are described below.

Usage

To use the get_book_ids.py script, follow these steps:

  1. Parsing command-line arguments: The script is configured through command-line arguments. Run it with the following command:

    python get_book_ids.py [-c {yes,no}] [-id LIST_ID] [-t {yes,no}]
  2. Custom Scraping:

    • By default, the script performs random scraping of Goodreads book IDs from various collections.
    • Optionally, you can enable custom scraping mode using the -c or --custom-scrap argument. If custom scraping is enabled, you need to specify the list ID using the -id or --list-id argument. For example:
      python get_book_ids.py -c yes -id 12345
    • The --list-id argument must be provided when custom scraping is 'yes'. If not provided or if the ID is not an integer, an error message will be displayed.
  3. Database Conversion:

    • After scraping book IDs, you have the option to convert the collected data from the SQLite database (books_id.db) to a text file (book_ids.txt).
    • Use the -t or --txt-convert argument to specify whether to convert the database to a text file. For example:
      python get_book_ids.py -t yes
    • If the database file (books_id.db) does not exist, an error message will be displayed.
  4. Running the Script:

    • Once you have set your desired options, execute the script to perform the scraping and database conversion tasks. For example:
      python get_book_ids.py
  5. Output:

    • The script writes the scraped book IDs and related information to the database and, if specified, additionally to a text file containing the IDs. This text file can then be passed to get_books.py for further scraping.
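The database-to-text conversion in step 3 can be sketched with Python's built-in sqlite3 module. The schema below (table and column names) is an assumption for illustration; the real books_id.db may differ:

```python
import sqlite3

# Sketch of the database-to-text conversion step, using an in-memory
# database; the schema of books_id.db is an assumption here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (book_id TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?)",
    [("1934.Little_Women",), ("2657.To_Kill_a_Mockingbird",)],
)

# Dump every stored ID to book_ids.txt, one per line, ready for get_books.py.
ids = [row[0] for row in conn.execute("SELECT book_id FROM books")]
with open("book_ids.txt", "w") as f:
    f.write("\n".join(ids) + "\n")
conn.close()
```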

Example

Here's an example of how to use the script for custom scraping and database conversion:

python get_book_ids.py -c yes -id 12345 -t yes

Credits

This code was written by Maria Antoniak and Melanie Walsh. The code is licensed under a GNU General Public License v3.0.

If you use this scraper, we'd love to hear about your project and how you use the code.

If you use this scraper as part of an academic publication, you can credit us by citing the following paper.

Walsh, Melanie, and Maria Antoniak. "The Goodreads ‘Classics’: A Computational Study of Readers, Amazon, and Crowdsourced Amateur Criticism." Journal of Cultural Analytics 4 (2021): 243-287.

We used a function written by Omar Einea, licensed under GPL v3.0, for the Goodreads review sorting.
