
Mahalo is designed for publishers/writers looking for guidance on how to describe their book on Amazon to garner the most positive reviews. It is meant to be a preliminary tool to facilitate initial brainstorming sessions when constructing a new book’s landing page. The tool provides Word Clouds printed by category and topic for the best selling…


thatIslily/NLTK_PythonProject_Mahalo

Contributors

Members: Akshay Kumar, Emilia Dwyer, Zhuo (Lily) Wang, Muskan Jain
Section: 002

Description

Mahalo is designed for publishers and writers looking for guidance on how to describe their book on Amazon to garner the most positive reviews. It is meant to be a preliminary tool to facilitate initial brainstorming sessions when constructing a new book’s landing page. The tool provides Word Clouds by category and topic for the best-selling books. Categories and topics are generated from a static dataframe provided by Kaggle and a corresponding scrape of Amazon data, matched by the unique key ASIN.
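
As a rough illustration of matching the Kaggle rows to the scraped rows by ASIN, here is a minimal, dependency-free sketch. The column names (`asin`, `title`, `rating`) and the sample rows are made up for the example and are not taken from the notebook, which may well do this with a dataframe merge instead.

```python
def join_by_asin(kaggle_rows, scraped_rows, key="asin"):
    """Inner-join two lists of dicts on a shared ASIN column.

    The column name "asin" is an assumption for this sketch.
    """
    # Index the scraped rows by ASIN for constant-time lookup.
    scraped_by_asin = {row[key]: row for row in scraped_rows}
    joined = []
    for row in kaggle_rows:
        extra = scraped_by_asin.get(row[key])
        if extra is None:
            continue  # no scrape result for this ASIN, so skip the book
        # Scraped fields are added to (and override) the Kaggle fields.
        joined.append({**row, **extra})
    return joined


# Made-up rows, purely for illustration.
kaggle = [
    {"asin": "B000000001", "title": "Example A"},
    {"asin": "B000000002", "title": "Example B"},
]
scraped = [{"asin": "B000000001", "rating": "4.6"}]

joined = join_by_asin(kaggle, scraped)
print(joined)
```

Books without a scrape result are dropped here; keeping them with empty scraped fields would be an equally reasonable choice.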

Branch Setup

  • Master:
    • FinalCode.ipynb: final code to run
    • AZscraping.ipynb: working file before saving out as FinalCode.ipynb
    • ##.png: contains exported WordCloud images
    • batch#.csv: batches of AZ scrape that were merged into final df_processed.csv
    • requirements.txt: package requirements for FinalCode.ipynb
    • amazon_com_extras.csv: Kaggle dataframe

Run Instructions

Use the 'FinalCode.ipynb' file on the master branch.

You have a paramount decision to make that will define who you are as a human. You can either hit Ctrl+Enter or Shift+Enter to run code cells. Choose wisely.

You do not need to run the code from "Combine in all AZscrape batches 1-7" and above, as this scraping and merger was already performed and saved out in df_processed.csv. Steps as follows:

  • Change the patha variable in "Read in processed df" to the location where you downloaded the input file (df_processed.csv)
  • Begin running at the Markdown cell reading "Read in processed df"
  • Learn about super cool discoveries

The following is a high level overview of how the code runs:

  1. Read in the downloaded Kaggle df and instantiate a new file in your directory in the variable my_csvfile
  2. Run the scraper for the entirety of the Kaggle df and save to a new file that includes Amazon's data (note: the team ran in batches and merged all files in the end)
  3. Using the processed csv, clean the df to prepare for analysis
  4. Produce a WordCloud for the entire df
  5. Using the cleaned data, present a plot of the gamma distribution for 'Rating'
  6. To get a sense of the data, produce a bar graph for the top and bottom 15 publishers
  7. Find the proportion of books in each category through a pie chart analysis
  8. Run topic modeling for each category (df grouped by rating), with three topics per category and a WordCloud representation
  9. Create an LSI model from the entire df
  10. Create an LDA model from the entire df to compute conditional probabilities for the topic word set
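
A word cloud sizes each word by how often it occurs, so the counting behind the WordCloud steps above can be sketched without any plotting library. This is a minimal standard-library sketch, not the notebook's code: the tiny stopword set, the `top_words` helper, and the sample descriptions are all assumptions for illustration, and the notebook most likely relies on NLTK and a word cloud package instead.

```python
import re
from collections import Counter

# A deliberately tiny stopword set for the sketch; a real run would use a
# fuller list such as the one shipped with NLTK.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on",
             "with", "this", "that", "it", "about"}


def top_words(descriptions, n=5):
    """Return the n most frequent non-stopword tokens across descriptions."""
    counts = Counter()
    for text in descriptions:
        # Lowercase and keep alphabetic runs only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(n)


# Made-up descriptions, purely for illustration.
sample = [
    "A family novel about an American family",
    "The novel of the year",
]
result = top_words(sample, n=2)
print(result)
```

The resulting (word, count) pairs are the kind of input a word cloud renderer typically accepts as frequencies.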

Key Takeaways

  1. Books rated between 4 and 5 had the following common words, which people should generally consider using in their book descriptions (especially if they plan to sell their books on Amazon):
    • American, People, Family, Year, Reader, Novel
  2. The distribution of book ratings follows a Gamma distribution.
  3. After reviewing the data for the publishers, we realized that:
    • Wiley, Simon and Schuster, St. Martin's Press and Harper Collins all have more than 10 titles to their name
    • Quite a few publishers had an average rating of 5, such as 1984 Publishing, Advantage Media Group, Archway and Beltz, but this was because the ratings were not weighted.
  4. As indicated by our pie chart, the distribution of books over categories is as follows:
    • Based on the data we analyzed, 51.8% of books belonged to category 1, i.e. Rating >= 4.5
  5. Our tool exports the word cloud per topic per Category as an image, which you can use to identify words for descriptions that may get you a better rating. In our case, we used our categories to plot the distribution of words over 3 topics.
  6. Using Latent Semantic Indexing, our tool is able to suggest a rating Category for a given book description (in our case we used 2 descriptions, which were assigned categories 5 and 1 respectively, i.e. a rating between 2.5 and 3, and a rating between 4.5 and 5).
  7. Finally, we ran the LDA model on the entire dataset (disregarding categories) to identify words distributed over 3 topics:
    • Topic 1 : Time, World, Book, Stori, Life, Histori, Year, Power, American, Polit, Photograph
    • Topic 2 : Book, Life, Time, World, Stori, Live, Love, Like, Work, Author, Bestsel, Best
    • Topic 3 : Book, Time, Stori, World, Nation, Park, York, Life, Love, Year, Power, Famili, American
