From 3594659ca55dfbf326e73f33fa3e4db032475732 Mon Sep 17 00:00:00 2001
From: "Stephen Howell (MSFT)" <38020233+stephen-howell@users.noreply.github.com>
Date: Fri, 25 Jun 2021 00:51:03 +0100
Subject: [PATCH] Moving last section to start of next lesson for flow

---
 6-NLP/4-Hotel-Reviews-1/README.md | 103 +-----------------------------
 1 file changed, 1 insertion(+), 102 deletions(-)

diff --git a/6-NLP/4-Hotel-Reviews-1/README.md b/6-NLP/4-Hotel-Reviews-1/README.md
index 052b6b5db3..7876ce93d5 100644
--- a/6-NLP/4-Hotel-Reviews-1/README.md
+++ b/6-NLP/4-Hotel-Reviews-1/README.md
@@ -380,108 +380,7 @@ Treat the following questions as coding tasks and attempt to answer them without
 
 You may have noticed that there are 127 rows that have both "No Negative" and "No Positive" values for the columns `Negative_Review` and `Positive_Review` respectively. That means that the reviewer gave the hotel a numerical score, but declined to write either a positive or negative review. Luckily this is a small amount of rows (127 out of 515738, or 0.02%), so it probably won't skew our model or results in any particular direction, but you might not have expected a data set of reviews to have rows with no reviews, so it's worth exploring the data to discover rows like this.
 
-### Modifying the dataframe
-
-Now that you've explored the dataset, you can see some issues with it. Some columns are are filled with useless information, others are just incorrect. If they are correct, it's unclear how they were calculated, and answers cannot be independently verified by your own calculations.
-
-Next, you will add columns that will be useful later, change the values in other columns, and drop certain columns completely.
-
-Follow these steps in order:
-
-1. `Hotel_Name`, `Hotel_Address`, `lat` (latitude), `lng` (longitude)
-
-   1. Drop lat and lng
-
-   2. Replace Hotel_Address values with the following values (if the address contains the same of the city and the country, change it to just the city and the country).
-
-      These are the only cities and countries in the dataset:
-
-      Amsterdam, Netherlands
-
-      Barcelona, Spain
-
-      London, United Kingdom
-
-      Milan, Italy
-
-      Paris, France
-
-      Vienna, Austria
-
-      ```python
-      def replace_address(row):
-          if "Netherlands" in row["Hotel_Address"]:
-              return "Amsterdam, Netherlands"
-          elif "Barcelona" in row["Hotel_Address"]:
-              return "Barcelona, Spain"
-          elif "United Kingdom" in row["Hotel_Address"]:
-              return "London, United Kingdom"
-          elif "Milan" in row["Hotel_Address"]:
-              return "Milan, Italy"
-          elif "France" in row["Hotel_Address"]:
-              return "Paris, France"
-          elif "Vienna" in row["Hotel_Address"]:
-              return "Vienna, Austria"
-
-      # Replace all the addresses with a shortened, more useful form
-      df["Hotel_Address"] = df.apply(replace_address, axis = 1)
-      # The sum of the value_counts() should add up to the total number of reviews
-      print(df["Hotel_Address"].value_counts())
-      ```
-
-      Now you can query country level data:
-
-      ```python
-      display(df.groupby("Hotel_Address").agg({"Hotel_Name": "nunique"}))
-      ```
-
-      | Hotel_Address | Hotel_Name |
-      | ---------------------: | ---------: |
-      | Amsterdam, Netherlands | 105 |
-      | Barcelona, Spain | 211 |
-      | London, United Kingdom | 400 |
-      | Milan, Italy | 162 |
-      | Paris, France | 458 |
-      | Vienna, Austria | 158 |
-
-
-
-2. Hotel Meta-review columns: `Average_Score`, `Total_Number_of_Reviews`, `Additional_Number_of_Scoring`
-
-* Drop `Additional_Number_of_Scoring`
-* Replace `Total_Number_of_Reviews` with the total number of reviews for that hotel that are actually in the dataset
-
-* Replace `Average_Score` with our own calculated score
-
-  ```python
-  # Drop `Additional_Number_of_Scoring`
-  df.drop(["Additional_Number_of_Scoring"], axis = 1, inplace=True)
-  # Replace `Total_Number_of_Reviews` and `Average_Score` with our own calculated values
-  df.Total_Number_of_Reviews = df.groupby('Hotel_Name').transform('count')
-  df.Average_Score = round(df.groupby('Hotel_Name').Reviewer_Score.transform('mean'), 1)
-  ```
-
-**Review columns**
-
-- Drop `Review_Total_Negative_Word_Counts`, `Review_Total_Positive_Word_Counts`, `Review_Date` and `days_since_review`
-- Keep `Reviewer_Score`, `Negative_Review`, and `Positive_Review` as they are,
-- Keep `Tags`
-  - We'll be doing some additional filtering operations on the tags in the next lesson.
-
-**Reviewer columns**
-
-- Drop `Total_Number_of_Reviews_Reviewer_Has_Given`
-- Keep `Reviewer_Nationality`
-
-Finally, save the dataset as it is now with a new name.
-
-```python
-df.drop(["Review_Total_Negative_Word_Counts", "Review_Total_Positive_Word_Counts", "days_since_review", "Total_Number_of_Reviews_Reviewer_Has_Given"], axis = 1, inplace=True)
-
-# Saving new data file with calculated columns
-print("Saving results to Hotel_Reviews_Filtered.csv")
-df.to_csv(r'Hotel_Reviews_Filtered.csv', index = False)
-```
+Now that you have explored the dataset, in the next lesson you will filter the data and add some sentiment analysis.
 
 ---
 ## 🚀Challenge