diff --git a/bookpages/(Page 3) Prediction model.Rmd b/bookpages/(Page 3) Prediction model.Rmd deleted file mode 100644 index 6d179f1..0000000 --- a/bookpages/(Page 3) Prediction model.Rmd +++ /dev/null @@ -1,447 +0,0 @@ ---- -jupyter: - jupytext: - text_representation: - extension: .Rmd - format_name: rmarkdown - format_version: '1.2' - jupytext_version: 1.15.2 - kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -# Page 3 Data exploration -### Introduction - - -#### Data sourcing -The Dataset for this project contains listing data for Airbnb for London, up to September 2022. The dataset used is a precleaned dataset from Kaggle, that was sourced from http://insideairbnb.com. - -#### Cleaning -The dataset comes in a relatively clean state, as such cleaning is a straight forward task with minor alterations made mostly for clarity and tailoring the dataset to the projects needs. -(For specifics please see go to the function and examine the function "data_setup()" found in "\projtools\data_tools.py") - -Overview --Setting the 'id' column to the pandas index --Renaming columns for clarity --Removing special characters --Dropping unnecessary columns --Converts prices from $ to £ --Filling NaN values --Reorders columns to group related data - - -#### Description -The columns contain data pertaining to the property, host, location and reviews. Please see the table below for a list of the relevant columns and their corresponding explanation. There are 48 Columns and 69351 rows/entries. - - -```{python} -from projtools import data_tools as dt -from projtools import plotting_tools as pt -from projtools import quartile_prices as qp -from projtools import k_best_neighbours_prediction as k -from projtools import ttest_for_errors as ttest -data = dt.data_setup() -data.head() -``` - -```{python} -var1 = 'bedrooms' -var2 = 'maximum_nights_avg_bookings' -var3 = 'host_listings_count' -varZ = 'price_per_night_£' -location = 'Kensington and Chelsea' -ID_Z = 15400 -``` - -```{python} -k.prediction_func_prediction(var1, var2, var3, varZ, ID_Z) -``` - -```{python} -k.prediction_func_location(var1, var2, var3, varZ, ID_Z, location) -``` - -```{python} -this is error -``` - -```{python} - -``` - -```{python} - -``` - -```{python} - -``` - -```{python} - -``` - -```{python} - -``` - -```{python} - -``` - -```{python} -#Run these cells to get started -from projtools import data_tools as dt -from projtools import plotting_tools as pt -from projtools import quartile_prices as qp -``` - -```{python} -#Full pandas dataframe -data = dt.data_setup() -data.head() -``` - -### Descriptive Stats - -```{python} -data.describe() -``` - -### Standout Stats - -1. **Hosting Listings**: - - There is a significant difference between the maximum and the upper quartile, indicating a concentration of listings among a few hosts. - - {cite}`temperton_2020`, reported that data from London City Hall in May 2019 showed 1% of Airbnb hosts were responsible for 15% of active listings on Airbnb. Since this dataset is from September 2022, we can compare this figure to assess changes. - -2. **Calculated Host Listings Count**: - - A slight difference exists between the calculated host listings count and calculated host listings count for entire homes, suggesting that most properties are entire homes rather than individual rooms. - - According to {cite}`temperton_2020`, there were "80,770 properties in London listed on Airbnb." However, this dataset contains 69,351 entries, suggesting a potential decline in listings. Airbnb claim in the article that each listing may represent multiple rooms within a single property, impacting the total property count thus the reported count is unreliable. - -3. **Maximum Nights & Maximum Nights Average Bookings**: - - Despite London's 90-day stay limit, the data indicates that the second quartile for maximum nights corresponds to exactly one year. - - The first quartile appears to be around 90 days, but the second and third quartiles extend to just over three years. - - Data compiled by Camden Council and cited by {cite}`temperton_2020` suggests that 48% of properties in camden exceed the 90-day legal limit. The dataset implies a potentially higher percentage across London. - -4. **Price per Night £**: - - The mean price per night is £165.66, with the 1st and 2nd quartiles at £51.41 and £93.48, respectively, suggesting the availability of more affordable properties. - - According to City Hall, properties in London listed on the short-term rental market make an average of £109 per night. In contrast, if they were rented out to long-term tenants, they could yield an average of £58 per night {cite}`temperton_2020`. - - -### Further things to explore from "Airbnb has devoured London – and here’s the data that proves it" Temperton 2020 - -5. **Location**: Westminster is the most affected area in london by short-term rentals, in January 2020 there were 8,836 short-term rental listings in the borough. - -```{python} -data['neighbourhood_location'].value_counts().head(5) - -``` - -This statement is corroborated by the data, while there are less rentals in Westminster than in January 2020 Westminster is the most popular borough for listings. - - -## Exploration - - - - - -### Host listing concentration -**Statement**: In 2019 1% of hosts accounted for 15% of listings. -**Question**: What percentage of listings do the top 1% of hosts account for in 2022? - -```{python} -#Method - -#Finding the how many listings each host has in London -host_listing_counts = data.groupby('host_id')['calculated_host_listings_count'].first() - -# Setting total listings to the length of the dataset -total_listings = len(data) - -# Identify top 1% of hosts by listing -top_1_percent_threshold = host_listing_counts.quantile(0.99) -top_1_percent_hosts_listings = host_listing_counts[host_listing_counts >= top_1_percent_threshold] - -# Calculate total listings accounted for by the top 1% and calculate this as a percentage -total_listings_top_1_percent_hosts = top_1_percent_hosts_listings.sum() -percent_listings_top_1 = (total_listings_top_1_percent_hosts / total_listings) * 100 - -# Count the number of hosts with 1 listing and percentage -hosts_with_1_listing = (host_listing_counts == 1).sum() -percentage_hosts_with_1_listing = (hosts_with_1_listing / total_listings) * 100 -``` - -```{python} -#Graph -pt.Distribution_of_Listings_Among_Hosts(host_listing_counts) -#Results -print(f"{percentage_hosts_with_1_listing:.2f}% of hosts have 1 listing") -print(f"The Top 1% of Hosts have over {top_1_percent_threshold} listings") -print(f"The max number of listings for a single host is : {top_1_percent_hosts_listings.max()}") - -#Graph -pt.Market_Share_by_Host_Groups(host_listing_counts, top_1_percent_threshold, total_listings) -#Results -print(f"The Top 1% of hosts control {percent_listings_top_1:.2f}% of the market") -``` - -**Statement :** In 2019 1% of hosts accounted for 15% of listings. -**Question :** What percentage of listings do the top 1% of hosts account for in 2022? - -In 2022 1% of hosts accounted for 17% of the London market on Airbnb this is a 2% increase from the reports from city hall in 2019. Additionally we can see 54% of hosts have only 1 listing. - -To be in the Top 1% you need to have over 9 listings, however there is a large spread within the Top 1%. The minimum number of listings within the top 1% is 9 and the maximum number is 285. The scale of these operations is not comparable, thus it is worth grouping the hosts by the scale of operation. - -```{python} -#Grouping hosts by scale of operation -host_listings_group = dt.Categorize_Host_Listings(host_listing_counts) - -# Counting the number of hosts in each group and percentage -host_listings_group_counts = host_listings_group.value_counts() -host_listings_group_percentages = (host_listings_group_counts / len(host_listing_counts)) * 100 - -#Adding as a column to the main df -data = dt.Add_host_group(data, host_listings_group) - -#Results -for category in host_listings_group_counts.index: - count = host_listings_group_counts[category] - percentage = host_listings_group_percentages[category] - print(f"{category} {count} : {percentage:.2f}%") - -``` - -### Number of properties in London listed on Airbnb - -**Statement :** The number of listings does not represent how much housing stock is listed on Airbnb -**Question :** How many unique locations are listed on Airbnb? - -```{python} -#Method -#Create a column combining latitude and longitude -data['coords'] = list(zip(data['latitude'], data['longitude'])) - -#Count the number of unique combinations of coordinates -unique_properties = data['coords'].nunique() - -#Results -print(f"Estimated number of properties from longitude and latitude: {unique_properties}") -print(f"Number of listings in London on Airbnb: {len(data)}") - -``` - -This method works out how many unique combinations there are of latitude and longitude within the Dataset. The result suggests that there are marginally lower unique properties that total listings on the platform, therefore the claim by Airbnb that the listings are inflated appears to be false. - - -### Number of properties being used for short term rental and long term rental -**Statement :** The maximum number of nights legally allowed for short term rental in london is 90 days. {cite}`temperton_2020` More than 10,000 Airbnb listings in London are in breach of the city’s 90-day limit on short-term rentals. -**Question** How many listings are in breach of this limit? - -```{python} -#Method -# Count listings with 'maximum_nights' over 90 days -listings_over_90_days_max_nights = (data['maximum_nights'] > 90).sum() - -# Count listings with 'maximum_nights_avg_bookings' over 90 days -listings_over_90_days_avg_bookings = (data['maximum_nights_avg_bookings'] > 90).sum() - -# Calculate percentages -percent_over_90_days_max_nights = (listings_over_90_days_max_nights / total_listings) * 100 -percent_over_90_days_avg_bookings = (listings_over_90_days_avg_bookings / total_listings) * 100 - -# Ignore entries in 'maximum_nights_avg_bookings' over 1500 to remove anomalous results for legibility -max_night_avg_bookings = data[data['maximum_nights_avg_bookings'] <= 1500] - -# Count listings with 'maximum_nights_avg_bookings' under 90 days between 90 and 365 and over 365 -under_90_days = (data['maximum_nights_avg_bookings'] <= 90).sum() -between_91_and_365_days = ((data['maximum_nights_avg_bookings'] > 90) & (data['maximum_nights_avg_bookings'] <= 365)).sum() -above_365_days = (data['maximum_nights_avg_bookings'] > 365).sum() - -``` - -```{python} -#Graph -pt.Plot_Average_Max_Nights_Booked_Distribution(max_night_avg_bookings) - -#Results -print(f"Listings with maximum nights over 90 days: {listings_over_90_days_max_nights} {percent_over_90_days_max_nights:.2f}%") -print() -print("GRAPH DATA") -print(f"Listings with average maximum nights booked over 90 days: {listings_over_90_days_avg_bookings} {percent_over_90_days_avg_bookings:.2f}%") -print() -print(f"Listings with maximum nights less than 90 days: {under_90_days}") -print(f"Listings with maximum nights between 90 and 365 days: {between_91_and_365_days}") -print(f"Listings with maximum nights greater than 365 days: {above_365_days}") -``` - -The data here suggests that approximately 2/3rds of listings are in breach of the 90 day limit, and that 3/4s of the listings on the platform are on average booked for over 90 days. This seems to suggest that in 2022 Airbnb is predominantly being used for long term rental rather than short term rental as the platform is intended. - -The graphs first peak is at 10 days at about 10000. Roughly 17500 properties out of the total fall below the 90 day mark, which is the cut off point for short term rental. The next peak is around 1 year with approximately 9500 and the final and largest peak is aroud 3 years roughly being 38,000. This corroborates the claims that Airbnb and the short term rental market is being used as an alternative to the housing rental market. - -We can now add whether a property is being used as a short term or long term to the data. - -```{python} -#Adding Long term rental column -data = dt.Add_long_term_rental(data) -``` - -### Price per night, how much are people being charged on the platform? -**Statement :** It is more lucrative to rent out property on Airbnb than it is on the housing market. -**Question :** How much does it cost to rent a property long term on Airbnb compared to the housing market? - -```{python} -#Method -# Filter the data for listings with maximum nights over 90 days but less than 1500 -long_term_listings = data[(data['maximum_nights_avg_bookings'] > 90) & (data['maximum_nights_avg_bookings'] < 1500)] - -# Calculate the mean price for long_term_listings -mean_price = long_term_listings['price_per_night_£'].mean() - -``` - -```{python} -#Graph -pt.Price_Per_Night_Distribution(long_term_listings) -#Results -print(f'Mean price per night: £{mean_price:.2f}') -``` - -By mean average, it does seem to be more expensive to rent on Airbnb than it is to rent on the housing market. The average price per night for the rental housing market was £85.57 per night {cite}statista-london-rental, whereas the average price per night on Airbnb for the same time period was double that at £179.29 per night. - -Interestingly it also appears that the largest spread of prices is at the 1 year and 3 year mark. - - -### Invetigation of the 1 year (365days) and 3 year 1 month (1125 day) let duration -**Statement** The graph shows higher prices are charged at 365 and 1125 days - -**Question** Is this actually the case or is this misrepresentation and why might prices be higher if they are? - - -Not only does the graph above clearly highlight the improper use of the platform to offer long term rentals rather than short term holiday lets but it also suggests that hosts listing for durations of 1 year and 3 years and 1 month, were charging significantly more than those at any other duration. - -Hovewever the graph deosnt appear to suggest that all the listings that have longer durations charge more. It only appears that it is exactly 365 and 1125 were this phenomena occurs. - -Thus, we did some prelimenary analysis and ivestigation to see if this really is the case. - -```{python} -# to begin we filtered the data to get the price data for 1 year and 3 years 1 month - -long_term_listings = data[(data['maximum_nights_avg_bookings'] > 90) & (data['maximum_nights_avg_bookings'] < 1500)] - -one_yr = long_term_listings[long_term_listings['maximum_nights_avg_bookings'] == 365] -one_yr_price_nights = one_yr[['price_per_night_£']] - -three_yr = long_term_listings[long_term_listings['maximum_nights_avg_bookings'] == 1125] -three_yr_price_nights = three_yr[['price_per_night_£']] - -# we needed to also do this for the data around the year and three year point to haves some comparise data - -listings_around_one_year = long_term_listings[(long_term_listings['maximum_nights_avg_bookings'] > 330) & (long_term_listings['maximum_nights_avg_bookings'] < 400) & (long_term_listings['maximum_nights_avg_bookings'] != 365)] -listings_around_three_year = long_term_listings[(long_term_listings['maximum_nights_avg_bookings'] > 1090) & (long_term_listings['maximum_nights_avg_bookings'] < 1160) & (long_term_listings['maximum_nights_avg_bookings'] != 1125)] - -# to see if there was a difference or in the data on average we found the means of 1, 3 and around each of these years - -one_year_mean = one_yr_price_nights['price_per_night_£'].mean() -three_yr_one_month_mean = three_yr_price_nights['price_per_night_£'].mean() -around_one_mean = listings_around_one_year['price_per_night_£'].mean() -around_three_year_one_month_mean = listings_around_three_year['price_per_night_£'].mean() -print("One year mean", one_year_mean) -print("Three year one month mean", three_yr_one_month_mean) -print("Around one year mean", around_one_mean) -print("Around three year one month mean", around_three_year_one_month_mean) -print("One year diff.", one_year_mean - around_one_mean) -print("Three year diff.", three_yr_one_month_mean - around_three_year_one_month_mean) -``` - -This brief analysis brigns an interesting condradiction and also suprising result - -On the one hand it suggests that the listigns listings for a year are more expensive than those around it agreeign with our intitial obseravation. On the other hand the calculations also reveal that the propeties listed for three years were listed at substantially lower prices on avergae than those around the 3 year mark. - -To see what was really going on other statistical methods needed to be employed as the high outliers observed on the graph above could be significantly influencing the mean results. - -To do this we employed the use of quatiles to represent how the data is spread within each group - -```{python} -# for clarification of the working within this funciton that provide the graph below look at the quartile_prices.py file in projtools - -qp.plot_scatter_quartiles_multiple_dfs([one_yr, three_yr, listings_around_one_year, listings_around_three_year], 'price_per_night_£', ['365 Days', 'Around 365', '1125 Days', 'Around 1125']) -qp.plot_scatter_quartiles_multiple_dfs_whole([one_yr, three_yr, listings_around_one_year, listings_around_three_year], 'price_per_night_£', ['365 Days', 'Around 365', '1125 Days', 'Around 1125']) -qp.plot_scatter_quartiles_multiple_dfs_zoomed([one_yr, three_yr, listings_around_one_year, listings_around_three_year], 'price_per_night_£', ['365 Days', 'Around 365', '1125 Days', 'Around 1125']) -``` - -## General observations: - -### Data Spread -- There is an increasing spread of data as you move from the first quartile (Q1) to the fourth quartile (Q4). This suggests that the variability in price per night is greater at higher price levels. - -### Price Range -- In Q1, the prices are tightly grouped and lower, indicating that a significant number of listings are available at lower prices. -- In Q2 and Q3, the prices are more spread out, showing a greater variation in pricing. -- In Q4, the prices are very spread out, with some listings being priced significantly higher than others. This is where the luxury or high-end listings are likely to be. - -## Comparison between the data groups: - -### One Year Listings -- This group seems to have a higher concentration of points across all quartiles, suggesting a wide range of prices for one-year listings. -- The mean price being the highest in this group indicates that, on average, listings for exactly one year are more expensive than listings for durations that are around one year or three years one month. - -### Around One Year Listings -- The distribution of prices for listings around one year appears slightly lower than exactly one year, aligning with your analysis that the mean price for this group is lower than the one-year listings. -- This group also has a wide spread in Q4, but not as many high-priced outliers as the one-year listings, suggesting that while there are premium options, they are not as extreme in price. - -### Three Years One Month Listings -- The three year one month listings appear to have a more concentrated distribution in the lower quartiles (Q1 and Q2) and fewer points in the higher quartiles (Q3 and Q4). This indicates that the 1125 day listings are generally cheaper than one-year listings. -- The lower mean price suggests that longer-term listings are priced more competitively, potentially to attract tenants willing to commit to a longer stay. - -### Around Three Year One Month Listings -- The distribution of prices for listings around 1125 seems to be slightly higher than exactly three years, which may indicate a little more variability in the prices for listings that don't commit to an exact 1125 day let. -- There's also a presence in the higher quartiles, but not as pronounced as the one-year listings. These lisitngs woudl also include the 3 year term which could attract higher prices due to its more circular nature. - -## What does this all suggest - -From these observations, it seems there is a trend where the exact one-year term listings are priced higher on average, which could be due to a variety of factors such as higher demand for one-year leases or perceived convenience. In contrast, the longer-term 1125 day listings are generally less expensive, which could be an incentive to attract longer-term tenants, offering a discount for the commitment to a longer stay. - -The data suggests that there is a premium associated with the percieved wholeness and convenience of exact one year terms, while longer or less exact terms seem to offer more competitive pricing. This could reflect a range of different strategies by landlords or property managers letting for different durations of time. - -## Refletction on intial hypotheses - -Our intial refelection on the data provided on the graph was that the prices of the listings for 365 day and 1125 day lets where significantly higher than all of the other listings. However, our analysis proved this to be wrong and it was in fact the case that although on the 1-year litings where slighlty higher than around 1-year listings this was not the case for the 1125 day let. The 1125 day let seemed to have more variation in the values which is too be expected as it has the larger number of listings in this group, however they were most concentrated to the lower end and therfore were on averge more competitively priced. We also came to the conclusion that landlords were able to charge higher higher prices for the 365 day term because of the percieved wholeness of the value and convenience of the exact one year term. - - - - -# Summary - -```{python} - -``` - -Airbnb is being used in london predominantly for long term rentals and it is using up a significant portion of the housing stock to do this. There are a variety of different actors on the platform with the majority being individuals who only have 1 property, however there are individuals/groups who use the platform as a large scale property businesses. - -Further avenues to explore will involve --Predicting which properties are likely to be used for long term and short term rentals --Predicting the price of a Listing - -```{python} - -``` - -```{python} - -``` - -```{python} - -``` - -```{python} - -``` - -```{python} - -```