Contributors

Members: Akshay Kumar, Emilia Dwyer, Zhuo (Lily) Wang, Muskan Jain
Section: 002

Description

Mahalo is designed for publishers and writers looking for guidance on how to describe their book on Amazon to garner the most positive reviews. It is meant to be a preliminary tool to facilitate initial brainstorming sessions when constructing a new book’s landing page. The tool produces word clouds by rating category and topic for the best-selling books. Categories and topics are generated from a static dataframe provided by Kaggle and a corresponding scrape of Amazon data, keyed by each book's unique ASIN.

Branch Setup

Master:
- FinalCode.ipynb: final code to run
- AZscraping.ipynb: working file before saving out as FinalCode.ipynb
- ##.png: exported WordCloud images
- batch#.csv: batches of the AZ scrape that were merged into the final df_processed.csv
- requirements.txt: package requirements for FinalCode.ipynb
- amazon_com_extras.csv: Kaggle dataframe

Run Instructions

Use the 'FinalCode.ipynb' file on the master branch.

You have a paramount decision to make that will define who you are as a human. You can either hit Ctrl+Enter or Shift+Enter to run code cells. Choose wisely.

You do not need to run the code from "Combine in all AZscrape batches 1-7" and above, as this scraping and merging was already performed and saved to df_processed.csv. The steps are as follows:

1. Change the patha variable in "Read in processed df" to the location where you downloaded the input file (df_processed.csv).
2. Begin running at the markdown cell reading "Read in processed df".
3. Learn about super cool discoveries.
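
As a rough illustration of the path setup in the first step, the sketch below checks that the input file exists before loading it. It uses the standard-library csv module rather than whatever call the notebook itself makes, and the helper name `load_processed` is hypothetical, not part of FinalCode.ipynb.

```python
import csv
from pathlib import Path


def load_processed(path):
    """Load df_processed.csv as a list of row dicts (hypothetical helper).

    Raises a descriptive error if the path variable still points at the
    wrong location, which is the most likely failure on a fresh checkout.
    """
    csv_path = Path(path).expanduser()
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Input file not found at {csv_path}; update the path variable "
            "to where you downloaded df_processed.csv"
        )
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Example usage (adjust the location to your own download folder):
# rows = load_processed("~/Downloads/df_processed.csv")
# print(len(rows), "rows loaded")
```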

The following is a high-level overview of how the code runs:

1. Read in the downloaded Kaggle df and instantiate a new file in your directory via the variable my_csvfile.
2. Run the scraper over the entire Kaggle df and save the results, including Amazon's data, to the new file (note: the team ran the scrape in batches and merged all files at the end).
3. Clean the processed csv df to prepare it for analysis.
4. Produce a WordCloud for the entire df.
5. Use the cleaned data to plot a gamma distribution for 'Rating'.
6. To get a sense of the data, produce a bar graph of the top and bottom 15 publishers.
7. Find the proportion of books in each category with a pie chart.
8. Run topic modeling for each category (df grouped by rating), three topics per category, with a WordCloud representation.
9. Create an LSI model from the entire df.
10. Create an LDA model from the entire df to compute conditional probabilities for each topic's word set.
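
The per-category topic modeling relies on grouping books by rating. The sketch below shows one way that grouping could look. The Key Takeaways only state that category 1 is a rating of 4.5 or higher and that category 5 is a rating between 2.5 and 3, so the half-star bins in between are an assumption, as are the function names, which do not come from the notebook.

```python
from collections import defaultdict

# (lower bound, category) pairs. Only category 1 (>= 4.5) and category 5
# (2.5 to 3) are stated in this README; the middle bins are assumed.
RATING_BINS = [(4.5, 1), (4.0, 2), (3.5, 3), (3.0, 4), (2.5, 5)]


def rating_category(rating):
    """Return the assumed category for a rating, or None below 2.5."""
    for lower_bound, category in RATING_BINS:
        if rating >= lower_bound:
            return category
    return None  # the README does not say how lower ratings are handled


def group_by_category(books):
    """Group (description, rating) pairs into {category: [descriptions]}."""
    grouped = defaultdict(list)
    for description, rating in books:
        category = rating_category(rating)
        if category is not None:
            grouped[category].append(description)
    return dict(grouped)


# Illustrative input only; these are not rows from df_processed.csv.
sample_books = [
    ("A sweeping family saga", 4.7),
    ("A history of national parks", 4.5),
    ("A short political thriller", 2.8),
]
grouped_descriptions = group_by_category(sample_books)
```

Each list of descriptions could then be passed to a topic model separately, which is what produces one set of word clouds per category.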

Key Takeaways

Books rated between 4 and 5 shared the following common words, which authors may want to use in their book descriptions (especially if they plan to sell on Amazon):
- American, People, Family, Year, Reader, Novel
The distribution of book ratings follows a Gamma distribution.
After reviewing the publisher data, we found that:
- Wiley, Simon and Schuster, St. Martin's Press and Harper Collins each have more than 10 titles to their name
- Quite a few publishers had an average rating of 5, such as 1984 Publishing, Advantage Media Group, Archway and Beltz, but this is because the ratings were not weighted.
As indicated by our pie chart, the distribution of books over categories is as follows:
- 51.8% of the books we analyzed belonged to category 1, i.e. Rating >= 4.5
Our tool exports the word cloud per topic per category as an image, which you can use to identify words for descriptions that may earn a better rating. In our case, we used our categories to plot the distribution of words over 3 topics.
Using Latent Semantic Indexing, our tool can suggest a rating category for a given book description. In our case we used 2 descriptions, which were assigned categories 5 and 1 respectively, i.e. a rating between 2.5 and 3 and a rating between 4.5 and 5.
Finally, we ran the LDA model on the entire dataset (disregarding categories) to identify words distributed over 3 topics (words appear in stemmed form, e.g. "Stori" for "story"):
- Topic 1 : Time, World, Book, Stori, Life, Histori, Year, Power, American, Polit, Photograph
- Topic 2 : Book, Life, Time, World, Stori, Live, Love, Like, Work, Author, Bestsel, Best
- Topic 3 : Book, Time, Stori, World, Nation, Park, York, Life, Love, Year, Power, Famili, American
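
The publisher takeaway above notes that unweighted averages let publishers with very few titles reach a perfect 5. One common remedy, which the project did not implement, is to shrink each publisher's mean toward the global mean in proportion to how few titles it has (a Bayesian-style average). The sketch below is an illustration under that assumption; the function name, the `prior_weight` parameter, and the sample data are all hypothetical.

```python
from collections import defaultdict


def shrunk_publisher_ratings(rows, prior_weight=10):
    """Return {publisher: mean rating shrunk toward the global mean}.

    rows is an iterable of (publisher, rating) pairs. prior_weight acts
    like a number of "virtual" titles rated at the global mean, so a
    publisher with one 5-star title no longer outranks a publisher with
    many highly rated titles.
    """
    rows = list(rows)
    global_mean = sum(rating for _, rating in rows) / len(rows)

    totals = defaultdict(lambda: [0.0, 0])  # publisher -> [rating sum, count]
    for publisher, rating in rows:
        totals[publisher][0] += rating
        totals[publisher][1] += 1

    return {
        publisher: (rating_sum + prior_weight * global_mean) / (count + prior_weight)
        for publisher, (rating_sum, count) in totals.items()
    }


# Made-up ratings for illustration; not taken from the scraped data.
sample_rows = (
    [("tiny_pub", 5.0)]
    + [("large_pub", 4.8)] * 12
    + [("other_pub", 3.6)] * 12
)
weighted_scores = shrunk_publisher_ratings(sample_rows)
```

With this sample, `tiny_pub` has the highest raw average, but `large_pub` ranks above it once the scores are weighted, because twelve titles outweigh the prior far more than one title does.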
