sci-fi-books-evolution

Overview

Inspired by this project (video, page) about how the plots of sci-fi movies and series evolved, and by this comment, I decided to do a similar analysis for sci-fi novels.

You can read a detailed account of the data processing and the analysis of the results here.

In short, I scraped data on thousands of sci-fi books from Goodreads lists and shelves and from the books' Wikipedia articles, cleaned and reduced the data, selected the top 200 novels per decade (or all the novels, for decades with fewer than 200), and fed them to GPT-5 via the OpenAI API (in earlier attempts, I used GPT-4o and Gemini 2.0 Flash), asking many plot-related questions. I then aggregated the results into figures to see how things changed over time.

What the code does

1. Scraping the data

I intended to scrape data directly from the science fiction shelf on Goodreads, but that didn't work: because of some error, the shelf only shows books up to page 25. I got around that by downloading the pages and reading them locally.

1_Goodreads_scraper_SHELF.py reads the HTML files in the Data/Saved_pages folder, finds the links to the book pages, and downloads each book's title, author, publication year, number of pages, series membership, average rating, number of ratings, genres, synopsis, and the longer of the first two reviews. At the end, it saves everything in the sci-fi_books_SHELF.csv file.

You will have to download the HTML files and put them in the Data/Saved_pages folder by hand. I didn't include them here because they take up too much space.
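As a rough illustration of the "read saved pages, collect book links" step, here is a minimal sketch that uses only the standard library. It is not the code in 1_Goodreads_scraper_SHELF.py. The class and function names are hypothetical, and the `/book/show/` URL pattern is an assumption about how book links appear in the saved pages.

```python
from html.parser import HTMLParser
from pathlib import Path


class BookLinkParser(HTMLParser):
    """Collects href values that look like links to book pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: book pages have "/book/show/" in their URL.
        if "/book/show/" in href and href not in self.links:
            self.links.append(href)


def book_links_from_html(html):
    """Return the unique book links found in one HTML document, in order."""
    parser = BookLinkParser()
    parser.feed(html)
    return parser.links


def book_links_from_folder(folder):
    """Return the unique book links found in every .html file of a folder."""
    links = []
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="ignore")
        for link in book_links_from_html(html):
            if link not in links:
                links.append(link)
    return links
```

Calling `book_links_from_folder("Data/Saved_pages")` would then give the list of book pages to download, assuming the saved files use the `.html` extension.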

2_Goodreads_scraper_LISTS.py does the same, but starting from a list of URLs for the first pages of thematic book lists. It saves its progress in a JSON file and, at the end, everything in the sci-fi_books_LISTS.csv file.
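Saving progress in a JSON file lets a long scraping run resume after an interruption. The sketch below shows one way to do that. The layout of the progress file and the `scrape_one` callable are assumptions made for illustration, not the format used by 2_Goodreads_scraper_LISTS.py.

```python
import json
from pathlib import Path


def load_progress(path):
    """Load the saved progress, or start fresh if there is none."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text(encoding="utf-8"))
    return {"done_urls": [], "books": []}


def save_progress(path, progress):
    """Write to a temporary file first so a crash cannot leave a half-written file."""
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(progress, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)


def scrape_lists(list_urls, scrape_one, progress_path):
    """Scrape each list URL once, checkpointing after every list.

    scrape_one is a hypothetical callable: URL -> list of book dicts.
    """
    progress = load_progress(progress_path)
    for url in list_urls:
        if url in progress["done_urls"]:
            continue  # already scraped in a previous run
        progress["books"].extend(scrape_one(url))
        progress["done_urls"].append(url)
        save_progress(progress_path, progress)
    return progress["books"]
```

Running `scrape_lists` a second time with the same progress file skips every URL that was already completed.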

All the scraped and processed data are stored in the Data/Brute folder.

2. Cleaning the data

3_Data_reducer_FILTERED.py merges the data files into sci-fi_books_BRUTE.csv and cleans them (removing duplicates, bad data, and books not of interest), creating the sci-fi_books_FILTERED.csv file.
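The merge-and-deduplicate idea can be sketched as follows. The column names (`title`, `author`, `ratings`) and the rule for choosing between duplicates are assumptions made for illustration. The actual script may identify and resolve duplicates differently.

```python
def merge_and_dedupe(*tables):
    """Merge lists of row dicts, keeping one row per (title, author) pair."""
    best = {}
    for table in tables:
        for row in table:
            # Normalise case and spacing so trivial variants count as duplicates.
            key = (row["title"].strip().lower(), row["author"].strip().lower())
            # Assumption: when duplicates disagree, keep the row with more ratings.
            if key not in best or int(row["ratings"]) > int(best[key]["ratings"]):
                best[key] = row
    return list(best.values())
```

The rows could come from `csv.DictReader` applied to the SHELF and LISTS files.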

4_Data_reducer_TOP.py selects the top 200 novels per decade (by the number of user ratings) and saves them as sci-fi_books_TOP.csv. It also creates the sci-fi_books_TEST.csv file, with just a small selection of the novels to test the AI performance.
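The selection rule described above (rank by number of ratings within each decade, keep at most 200) can be sketched like this. The field names `year` and `ratings` are hypothetical stand-ins for whatever columns the CSV files actually use.

```python
from collections import defaultdict


def top_per_decade(books, n=200):
    """Keep the n most-rated books of each decade (all of them if fewer than n)."""
    by_decade = defaultdict(list)
    for book in books:
        decade = (int(book["year"]) // 10) * 10  # 1965 -> 1960
        by_decade[decade].append(book)

    selected = []
    for decade in sorted(by_decade):
        ranked = sorted(by_decade[decade], key=lambda b: int(b["ratings"]), reverse=True)
        selected.extend(ranked[:n])
    return selected
```

Slicing with `ranked[:n]` covers the "fewer than 200" case without a special branch, since a slice never fails on a short list.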

If this is your first time running everything, you can proceed to step 3. But if you have already run everything to the end and are now adding books from the scrapers (or deleting books via the filtering), the set of books in the top 200 per decade may change. Some rows in the AI answers CSV may then need to be removed, and new books may need to be processed. In that case, run 5_Data_fixer.py now.
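The reconciliation described here comes down to two set differences between the books currently in the top sample and the books that already have answers. The sketch below assumes each book can be identified by a key such as a URL or a (title, author) pair. The function name and the choice of key are hypothetical.

```python
def plan_fix(top_keys, answered_keys):
    """Compare the current top sample with the books that already have answers."""
    top, answered = set(top_keys), set(answered_keys)
    return {
        "to_drop": sorted(answered - top),     # answered, but no longer in the top sample
        "to_process": sorted(top - answered),  # in the top sample, but not yet answered
    }
```

For example, `plan_fix(["a", "b", "c"], ["b", "c", "d"])` reports that "d" should be dropped from the answers and "a" still needs to be sent to the model.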

Many of the novels in the top sample have a Wikipedia article with a plot section that gives much more detail about the plot than the Goodreads synopses and reviews, so those sections are preferable.

6_Wikipedia_plot_scraper.py searches Wikipedia for the novels listed in sci-fi_books_TOP.csv and creates the sci-fi_books_TOP_Wiki.csv and sci-fi_books_TEST_Wiki.csv files with the plot sections found in the articles.
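To show what "finding the plot section" can involve, here is a minimal sketch that pulls a plot section out of an article's wikitext, where section headings look like `== Plot ==`. This assumes the article is available as wikitext. The real script may work on rendered HTML or use a Wikipedia client library instead, and the set of heading names below is illustrative.

```python
import re

# A wikitext heading: the same run of "=" on both sides of the title.
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

# Assumption: plot sections usually carry one of these titles.
PLOT_TITLES = {"plot", "plot summary", "synopsis"}


def extract_plot(wikitext):
    """Return the text of the first plot-like section, or None if there is none."""
    headings = list(HEADING.finditer(wikitext))
    for i, match in enumerate(headings):
        if match.group(2).strip().lower() not in PLOT_TITLES:
            continue
        level = len(match.group(1))
        end = len(wikitext)
        # The section ends at the next heading of the same or a higher level,
        # so deeper subsections stay inside the plot text.
        for later in headings[i + 1:]:
            if len(later.group(1)) <= level:
                end = later.start()
                break
        return wikitext[match.end():end].strip()
    return None
```

Returning `None` for articles without a plot section makes it easy to fall back to the Goodreads synopsis for those novels.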

All the files are stored in the Data/Filtered folder.

3. Prompting the LLM

7_AI_asker_AI_ANSWERS.py reads the sci-fi_books_TOP_Wiki.csv file (or sci-fi_books_TEST_Wiki.csv) and, for every novel in the file, sends a prompt with the novel's data to GPT-5 through the OpenAI API, parses the text answer, and saves it in the sci-fi_books_AI_ANSWERS.csv file in the Data/Answers folder.
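The prompt-then-parse loop can be sketched without tying it to a particular client library. In the sketch below, the prompt wording, the `<number>: <answer>` reply format, and the `ask` callable (which would wrap the actual API call) are all assumptions made for illustration. The real prompt and answer format used by the script are not shown here.

```python
def build_prompt(book, questions):
    """Assemble one prompt from a book's data and a list of questions."""
    lines = [
        f"Title: {book['title']}",
        f"Author: {book['author']}",
        f"Plot: {book['plot']}",
        "",
        "Answer each question on its own line as '<number>: <answer>'.",
    ]
    lines += [f"{i}. {q}" for i, q in enumerate(questions, start=1)]
    return "\n".join(lines)


def parse_answer(text, n_questions):
    """Turn '<number>: <answer>' lines into a dict, failing loudly on gaps."""
    answers = {}
    for line in text.splitlines():
        number, sep, answer = line.partition(":")
        if sep and number.strip().isdigit():
            answers[int(number.strip())] = answer.strip()
    missing = [i for i in range(1, n_questions + 1) if i not in answers]
    if missing:
        raise ValueError(f"missing answers for questions {missing}")
    return answers


def process_book(book, questions, ask):
    """ask is a hypothetical callable: prompt text -> model reply text."""
    reply = ask(build_prompt(book, questions))
    return parse_answer(reply, len(questions))
```

Raising an error on a missing answer, rather than saving a partial row, keeps malformed replies out of the answers file so they can be retried.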

The Data/Variability_in_Answers folder already contains 15 answer files, which the 9_Figure_maker.py script uses to estimate how much the model's answers vary for each novel and question.
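One simple way to quantify this variability is the share of runs that agree with the most common answer for each novel and question. This is an illustrative measure, not necessarily the one used in the figure script, and the data layout below (one dict per run, keyed by novel and question) is an assumption.

```python
from collections import Counter


def answer_agreement(runs):
    """Return, for each (novel, question) key, the share of runs giving the modal answer.

    runs is a list of dicts mapping (novel, question) -> answer, one dict per run.
    """
    collected = {}
    for run in runs:
        for key, answer in run.items():
            collected.setdefault(key, []).append(answer)

    agreement = {}
    for key, answers in collected.items():
        most_common_count = Counter(answers).most_common(1)[0][1]
        agreement[key] = most_common_count / len(answers)
    return agreement
```

A value of 1.0 means every run gave the same answer. Lower values flag questions where the model is unstable for that novel.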

8_AI_asker_AI_ANSWERS_GENDER.py reads the sci-fi_books_TOP.csv file and, for every distinct author in the file, sends a prompt with the author's name to GPT-5, receives the author's gender, and saves it in the sci-fi_books_AI_ANSWERS_GENDER.csv file.

You will need a working API key set in your environment and credits in your OpenAI account for these scripts to work.
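A small check at startup avoids discovering a missing key halfway through a run. The sketch below assumes the key lives in the `OPENAI_API_KEY` environment variable, which is the name the official OpenAI Python library reads by default. Whether the scripts here rely on that default is an assumption, and the function name is hypothetical.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the asker scripts")
    return key
```

Calling `require_api_key()` before the first request makes the failure immediate and explains how to fix it.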

Older results and some code for GPT-4o and Gemini 2.0 Flash are stored in separate folders (Old_GPT-4o and Old_Gemini_flash_2.0).

4. Plotting the results

9_Figure_maker.py reads the sci-fi_books_AI_ANSWERS.csv and sci-fi_books_AI_ANSWERS_GENDER.csv files, and the files in the Data/Variability_in_Answers folder, and makes figures from them, saving all of them in the Figures folder.
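Before plotting, the per-novel answers have to be aggregated over time. A minimal sketch of that aggregation, assuming each row carries a `year` field and one column per question (both hypothetical names), could look like this:

```python
from collections import Counter, defaultdict


def share_per_decade(rows, question):
    """For one question, return the share of each answer within each decade."""
    counts = defaultdict(Counter)
    for row in rows:
        decade = (int(row["year"]) // 10) * 10
        counts[decade][row[question]] += 1

    shares = {}
    for decade in sorted(counts):
        total = sum(counts[decade].values())
        shares[decade] = {answer: n / total for answer, n in counts[decade].items()}
    return shares
```

The resulting dict maps each decade to answer shares that sum to 1, which is a convenient shape for a stacked bar or line chart.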

Example of a figure

The figure below is one example of the output.

ferdesmello/sci-fi-books-evolution
