Portugal Hotels Booking Demand

Summary

This Jupyter notebook analyzes a single dataset covering two hotels in Portugal. The goal is to find trends or patterns in how guests book at either hotel, in order to find ways to reduce the number of canceled bookings. A machine learning model is also developed to predict whether a guest will cancel their booking before checking in.

The dataset contains a resort hotel in the Algarve region of southern Portugal and a city hotel in the capital, Lisbon. According to the paper the data comes from, it was extracted directly from the hotels' Property Management System (PMS) SQL databases. The paper is "Hotel Booking Demand Datasets" by Nuno Antonio, Ana Almeida, and Luis Nunes, published in Data in Brief, Volume 22, February 2019, and is available at https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5.

The dataset can also be found on Kaggle at https://www.kaggle.com/jessemostipak/hotel-booking-demand

Import Python Libraries and Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

from matplotlib import rcParams
rcParams['figure.figsize'] = 10,8
sns.set_theme()

Reading and cleaning data

df = pd.read_csv('hotel_bookings.csv')
df.head()
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

The output above shows four columns with missing values (children, country, agent, and company); these are addressed below.
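
As a quicker way to spot the incomplete columns than reading the full `df.info()` listing, the missing values can be counted per column. The sketch below uses a small toy frame with made-up values as a stand-in for the bookings data; the name `toy` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings frame (values are made up for illustration)
toy = pd.DataFrame({
    'children': [0.0, np.nan, 1.0],
    'country': ['PRT', None, 'GBR'],
    'adults': [2, 1, 2],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `df` instead of `toy` should list only the columns that need cleaning.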


df['country'].isna().value_counts()
False    118902
True        488
Name: country, dtype: int64

The agent column is dropped entirely, since it only holds an integer ID for the booking agent, with no information about the agency the agent works for or its country of origin.

Four rows had a NaN value in the 'children' column. These bookings are assumed to have had no children, so the value was set to zero and the column's dtype was converted to 'int64'.

Rows without a country of origin were dropped, as this data seems questionable (though these guests may well be from Portugal and simply did not enter a country of origin). Only 488 rows were dropped.

One row has an 'adr' value greater than 4000, meaning that the average daily rate (the sum of all lodging transactions divided by the total number of staying nights) was above €4,000. This single row is an extreme outlier and was dropped.

Some bookings had no adults registered. This data was likely filled in incompletely, as with the rows missing a country of origin, so these rows were dropped as well.

Lastly, all null values in the 'company' column were filled with the integer value zero.

# Drop the agent ID column
df.drop('agent', axis=1, inplace=True)
# Treat missing 'children' values as zero children
df['children'] = df['children'].fillna(0).astype('int64')
# Drop rows with no country of origin
df.dropna(subset=['country'], inplace=True)

# Drop the single extreme 'adr' outlier and bookings with no adults
df.drop(df[df['adr'] > 4000].index, inplace=True)
df.drop(df[df['adults'] == 0].index, inplace=True)
# Fill missing company IDs with zero
df['company'] = df['company'].fillna(0)

EDA & Visualizing hotel data

We will begin by talking about each column in more depth:

  • hotel- Either the resort hotel (Algarve) or the city hotel (Lisbon).
  • is_canceled- Whether the guest canceled the booking before checking in (1) or not (0).
  • lead_time- Number of days between the date the booking was made and the expected arrival date.
  • arrival_date_year- The year the guest is expected to arrive at a hotel, from 2015 to 2017.
  • arrival_date_month- The month of the year the guest is expected to arrive at a hotel.
  • arrival_date_week_number- The week number (52 weeks in a year) the guest is expected to arrive at a hotel.
  • arrival_date_day_of_month- The day of the month the guest is expected to arrive at a hotel.
  • stays_in_weekend_nights- The number of nights the guest is going to stay during the weekend.
  • stays_in_week_nights- The number of nights the guest is going to stay during the week.
  • adults- Number of adults booked to stay in the room for the duration of their time in the hotel.
  • children- Number of children booked to stay in the room for the duration of their time in the hotel.
  • babies- Number of babies booked to stay in a room for the duration of their time in the hotel.
  • meal- Type of meal booked. Categories are presented in standard hospitality meal packages:
    • Undefined/SC – no meal package
    • BB – Bed & Breakfast
    • HB – Half board (breakfast and one other meal – usually dinner)
    • FB – Full board (breakfast, lunch and dinner)
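  • country- Country of origin. Categories are ISO 3166 country codes, mostly in the three-letter alpha-3 form (for example PRT or GBR). The source paper gives the format as ISO 3155–3:2013.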
  • market_segment- Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”.
  • distribution_channel-Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”.
  • is_repeated_guest- Value indicating if the booking name was from a repeated guest (1) or not (0).
  • previous_cancellations- Number of previous bookings that were cancelled by the customer prior to the current booking.
  • previous_bookings_not_canceled- Number of previous bookings not cancelled by the customer prior to the current booking.
  • reserved_room_type- Code of room type reserved. Code is presented instead of designation for anonymity reasons.
  • assigned_room_type- Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
  • booking_changes- Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.
  • deposit_type- Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
    • No Deposit – no deposit was made
    • Non Refund – a deposit was made in the value of the total stay cost
    • Refundable – a deposit was made with a value under the total cost of stay.
  • agent- ID of the travel agency that made the booking.
  • company- ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons.
  • days_in_waiting_list- Number of days the booking was in the waiting list before it was confirmed to the customer.
  • customer_type- Type of booking, assuming one of four categories:
    • Contract - when the booking has an allotment or other type of contract associated to it
    • Group – when the booking is associated to a group
    • Transient – when the booking is not part of a group or contract, and is not associated to other transient booking
    • Transient-party – when the booking is transient, but is associated to at least other transient booking
  • adr- Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights.
  • required_car_parking_spaces- Number of car parking spaces required by the customer.
  • total_of_special_requests- Number of special requests made by the customer (e.g. twin bed or high floor).
  • reservation_status- Reservation last status, assuming one of three categories:
    • Canceled – booking was canceled by the customer
    • Check-Out – customer has checked in but already departed
    • No-Show – customer did not check-in and did inform the hotel of the reason why
  • reservation_status_date- Date at which the last status was set. This variable can be used in conjunction with reservation_status to understand when the booking was canceled or when the customer checked out of the hotel.
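
To make the adr definition in the list above concrete, here is a minimal worked example. The function name `average_daily_rate` and the transaction amounts are hypothetical; the dataset only ships the final adr value, not the individual transactions.

```python
def average_daily_rate(lodging_transactions, staying_nights):
    """Sum of all lodging transactions divided by the number of staying nights."""
    if staying_nights <= 0:
        raise ValueError("staying_nights must be positive")
    return sum(lodging_transactions) / staying_nights


# A hypothetical three-night stay billed at 80, 80 and 95 per night
adr = average_daily_rate([80.0, 80.0, 95.0], staying_nights=3)
print(adr)  # 255 / 3 = 85.0
```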

Resort & City Hotel

Figure 1 below shows that there are many more city hotel bookings than resort hotel bookings in this dataset.

sns.countplot(x='hotel', data=df, palette='Set2');
plt.title('Number of bookings for Resort and City hotel')
txt_1='Fig.1 - Resort hotel is in Algarve region of Portugal and the city hotel is in Lisbon, the capital of Portugal'
plt.figtext(0.5, -0.1, txt_1, wrap=True, horizontalalignment='center', fontsize=12);

png

round(df['is_canceled'].mean()*100, 2)
37.04
sns.countplot(x='is_canceled', data=df);
plt.title('Number of bookings canceled for both hotels')
txt_2='Fig.2 - About 37.04% of bookings were canceled, from both hotels, before the guests checked in.'
plt.figtext(0.5, -0.1, txt_2, wrap=True, horizontalalignment='center', fontsize=12);

png

round(df[df['hotel'] == 'City Hotel']['is_canceled'].mean()*100, 2)
41.73
sns.countplot(x='is_canceled', data=df[df['hotel'] == 'City Hotel'], palette='rocket');
plt.title('Number of bookings canceled for Lisbon Hotel')
txt_3='Fig.3 - About 41.73% of bookings were canceled for Lisbon hotel before the guests checked in.'
plt.figtext(0.5, -0.1, txt_3, wrap=True, horizontalalignment='center', fontsize=12);

png

round(df[df['hotel'] == 'Resort Hotel']['is_canceled'].mean()*100, 2)
27.76
sns.countplot(x='is_canceled', data=df[df['hotel'] == 'Resort Hotel'], palette='mako');
plt.title('Number of bookings canceled for Algarve Hotel')
txt_4='Fig.4 - About 27.76% of bookings were canceled for Algarve hotel before the guests checked in.'
plt.figtext(0.5, -0.1, txt_4, wrap=True, horizontalalignment='center', fontsize=12);

png

Most of the bookings in this dataset are for the city hotel (Lisbon), which also has the higher cancellation rate: about 41.73% of its bookings were canceled before check-in, compared with about 27.76% for the resort hotel (Algarve). Note that these are shares of all bookings; dividing canceled by non-canceled bookings instead gives the larger ratios of 71.61 and 38.43 canceled bookings per 100 kept. The difference suggests that guests looking for a city stay shop around across multiple hotels, while guests booking a resort hotel are slightly more committed. That said, data from more city hotels and more resort hotels would be needed to confirm this theory.
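
When comparing cancellation figures between the two hotels, it helps to keep two quantities apart: the cancellation rate (canceled bookings divided by all bookings, which for a 0/1 column is simply its mean) and the ratio of canceled to non-canceled bookings. The sketch below shows both on a small made-up frame; `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the bookings frame (values are made up for illustration)
toy = pd.DataFrame({
    'hotel': ['City Hotel'] * 4 + ['Resort Hotel'] * 4,
    'is_canceled': [1, 1, 0, 0, 1, 0, 0, 0],
})

# Cancellation rate: share of all bookings that were canceled, in percent
overall_rate = toy['is_canceled'].mean() * 100
rate_by_hotel = toy.groupby('hotel')['is_canceled'].mean() * 100

# Ratio of canceled to non-canceled bookings, in percent (a different quantity)
counts = toy['is_canceled'].value_counts()
canceled_per_100_kept = counts[1] * 100 / counts[0]

print(rate_by_hotel)
print(overall_rate, canceled_per_100_kept)
```

Here 3 of 8 bookings are canceled, so the rate is 37.5%, while the ratio of canceled to kept bookings is 3 to 5, or 60 per 100.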

Countries Analysis

Now we will investigate which countries the most bookings come from, and whether guests from some countries are more likely to cancel.

df['country'].value_counts().head(10).plot.bar();
plt.title('Number of Bookings from Top 10 Countries for Both Hotels')
txt_5='Fig.5 - Most of the bookings are clearly from Portugal'
plt.figtext(0.5, -0.01, txt_5, wrap=True, horizontalalignment='center', fontsize=12);

png

We can see in Figure 5 that most bookings come from the host country, Portugal. Brazil is the only country in the top 10 that is not in Europe. The table below gives the share of canceled bookings for each of these countries; Portugal has the highest, at 56.76%.

tmp = df.groupby('country')['is_canceled'].sum()/df.groupby('country')['is_canceled'].count()
tmp.sort_values(ascending=False).loc[df['country'].value_counts().head(10).index.values]
country
PRT    0.567580
GBR    0.202313
FRA    0.185813
ESP    0.254271
DEU    0.167102
ITA    0.353945
IRL    0.246291
BEL    0.202494
BRA    0.372514
NLD    0.182426
Name: is_canceled, dtype: float64
df[df['hotel'] == 'City Hotel']['country'].value_counts().head(10).plot.bar();
plt.title('Number of Bookings from Top 10 Countries - Lisbon')
txt_6='Fig.6 - City hotel bookings from top 10 countries.'
plt.figtext(0.5, -0.01, txt_6, wrap=True, horizontalalignment='center', fontsize=12);

png

We can see in Figure 6 that most guests are also from Portugal when staying in Lisbon.

tmp = df[df['hotel'] == 'City Hotel'].copy()
tmp_ = tmp.groupby('country')['is_canceled'].sum()/tmp.groupby('country')['is_canceled'].count()
tmp_.sort_values(ascending=False).loc[tmp['country'].value_counts().head(10).index.values]
country
PRT    0.650777
FRA    0.195870
DEU    0.176170
GBR    0.294407
ESP    0.288017
ITA    0.378986
BEL    0.219382
BRA    0.405724
USA    0.264633
NLD    0.206329
Name: is_canceled, dtype: float64
df[df['hotel'] == 'Resort Hotel']['country'].value_counts().head(10).plot.bar();
plt.title('Number of Bookings from Top 10 Countries - Algarve')
txt_7='Fig.7 - Resort hotel bookings from top 10 countries staying.'
plt.figtext(0.5, -0.01, txt_7, wrap=True, horizontalalignment='center', fontsize=12);

png

We can see in Figure 7 that most guests are also from Portugal when staying in the Algarve.

tmp = df[df['hotel'] == 'Resort Hotel'].copy()
tmp_ = tmp.groupby('country')['is_canceled'].sum()/tmp.groupby('country')['is_canceled'].count()
tmp_.sort_values(ascending=False).loc[tmp['country'].value_counts().head(10).index.values]
country
PRT    0.422086
GBR    0.130779
ESP    0.215116
IRL    0.199446
FRA    0.131056
DEU    0.121363
CN     0.135211
NLD    0.108949
USA    0.150313
ITA    0.174292
Name: is_canceled, dtype: float64
sns.set(rc={'figure.figsize':(20,16)})
df_tmp = df[df['arrival_date_year'] == 2015]
sns.barplot(x='arrival_date_month',y='lead_time',hue='is_canceled',data=df_tmp);
plt.title('Lead time for All Bookings for Each Month for the Year 2015')
txt_8='Fig.8 - Mean lead time (days booked in advance) by arrival month in 2015, split by whether the booking was canceled (1) or not (0).'
plt.figtext(0.5, 0.05, txt_8, wrap=True, horizontalalignment='center', fontsize=12);

png

df_tmp.groupby('arrival_date_month')[['is_canceled','lead_time']].mean().sort_values(by='is_canceled', ascending=False)
arrival_date_month  is_canceled   lead_time
July                   0.455664  126.365545
August                 0.412174   99.364457
September              0.408733  123.068253
October                0.348851  102.595650
December               0.335517   52.683793
November               0.209483   48.476724
df_tmp = df[df['arrival_date_year'] == 2016]
sns.barplot(x='arrival_date_month',y='lead_time',hue='is_canceled',data=df_tmp);
plt.title('Lead time for All Bookings for Each Month for the Year 2016')
txt_9='Fig.9 - Mean lead time (days booked in advance) by arrival month in 2016, split by whether the booking was canceled (1) or not (0).'
plt.figtext(0.5, 0.05, txt_9, wrap=True, horizontalalignment='center', fontsize=12);

png

df_tmp.groupby('arrival_date_month')[['is_canceled','lead_time']].mean().sort_values(by='is_canceled', ascending=False)
arrival_date_month  is_canceled   lead_time
October                0.406736  140.018620
June                   0.396780  120.053409
April                  0.381014   86.188379
September              0.375627  149.689578
November               0.368682   91.964350
December               0.363114   90.105016
August                 0.360902  121.638306
May                    0.350348  114.914197
February               0.346383   39.144672
July                   0.327988  123.523506
March                  0.308271   57.713659
January                0.251467   32.959819
df_tmp = df[df['arrival_date_year'] == 2017]
sns.barplot(x='arrival_date_month',y='lead_time',hue='is_canceled',data=df_tmp);
plt.title('Lead time for All Bookings for Each Month for the Year 2017')
txt_10='Fig.10 - Mean lead time (days booked in advance) by arrival month in 2017, split by whether the booking was canceled (1) or not (0).'
plt.figtext(0.5, 0.05, txt_10, wrap=True, horizontalalignment='center', fontsize=12);

png

df_tmp.groupby('arrival_date_month')[['is_canceled','lead_time']].mean().sort_values(by='is_canceled', ascending=False)
arrival_date_month  is_canceled   lead_time
May                    0.437510  120.224933
April                  0.434852  103.667789
June                   0.431911  136.141491
July                   0.373424  152.974026
August                 0.368731  137.798579
January                0.341350   53.410768
March                  0.337710   82.825288
February               0.327076   56.514801
first_year = len(df[df['arrival_date_year'] == 2015])
second_year = len(df[df['arrival_date_year'] == 2016])
third_year = len(df[df['arrival_date_year'] == 2017])

print('Number of bookings at either hotel with an arrival date in 2015 is {}'.format(first_year))
print('Number of bookings at either hotel with an arrival date in 2016 is {}'.format(second_year))
print('Number of bookings at either hotel with an arrival date in 2017 is {}'.format(third_year))
Number of bookings at either hotel with an arrival date in 2015 is 21863
Number of bookings at either hotel with an arrival date in 2016 is 56435
Number of bookings at either hotel with an arrival date in 2017 is 40604

The bookings are not equally distributed across the years: 2016 is the only year with data from all 12 months, while 2015 covers July to December and 2017 covers January to August.
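
The month coverage per year can also be checked directly rather than read off the plots. The sketch below counts distinct arrival months and bookings per year on a made-up frame; `toy` is a hypothetical stand-in, and the same `groupby` should work on `df`.

```python
import pandas as pd

# Toy stand-in for the bookings frame (values are made up for illustration)
toy = pd.DataFrame({
    'arrival_date_year': [2015, 2015, 2016, 2016, 2016, 2017],
    'arrival_date_month': ['July', 'August', 'January', 'February', 'January', 'March'],
})

# For each year: how many distinct months appear, and how many bookings there are
coverage = toy.groupby('arrival_date_year')['arrival_date_month'].agg(
    months_covered='nunique',
    bookings='count',
)
print(coverage)
```

A year with fewer than 12 in `months_covered` is only partially represented, which matters when comparing yearly totals.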

sns.displot(x='lead_time', hue='is_canceled', multiple="stack", kde=True, data=df[df['arrival_date_year'] == 2015], height=10, aspect=16/10);

png

sns.displot(x='lead_time', hue='is_canceled', multiple="stack", kde=True, data=df[df['arrival_date_year'] == 2016], height=10, aspect=16/10);

png

sns.displot(x='lead_time', hue='is_canceled', multiple="stack", kde=True, data=df[df['arrival_date_year'] == 2017], height=10, aspect=16/10);

png

sns.boxplot(x='assigned_room_type',y='adr',data=df);

png

df_adr = df.copy()
df_adr['adr_adj'] = df_adr['adr']/(df_adr['adults']+df_adr['children'])
df_adr['adr_adj_wb'] = df_adr['adr']/(df_adr['adults']+df_adr['children']+df_adr['babies'])

df_adr.drop(df_adr[df_adr['adr_adj']>400].index.values,axis=0,inplace=True)
sns.boxplot(x='assigned_room_type',y='adr_adj',data=df_adr);

png

has_company = df['company'].fillna(0) != 0  # True when a company ID is attached to the booking
sns.countplot(data=df[has_company], x="is_canceled")

png

sns.countplot(data=df[~has_company], x="is_canceled")

png

df[has_company].is_canceled.value_counts() # bookings made on behalf of a company or organization
0    5435
1    1167
Name: is_canceled, dtype: int64
round(df[has_company].is_canceled.mean()*100, 2)
17.68

We see above that only 17.68% of bookings are canceled when they are registered to a company or organization.

df[~has_company].is_canceled.value_counts() # bookings with no company attached, assumed to be personal travel
0    69016
1    42890
Name: is_canceled, dtype: int64
round(df[~has_company].is_canceled.mean()*100, 2)
38.33

Bookings with no company attached, which are assumed here to be personal vacations, are canceled 38.33% of the time.

Dummy Variables

df_columns = ['hotel','arrival_date_month','meal','country','market_segment','distribution_channel',
              'reserved_room_type','assigned_room_type','deposit_type','customer_type']
df_ = df.copy()
df_.drop(['reservation_status','reservation_status_date'],axis=1,inplace=True)
data = pd.get_dummies(df_, prefix=df_columns, columns=df_columns)
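
The cell above one-hot encodes the categorical columns with `pd.get_dummies`, so that each category becomes its own indicator column named `<prefix>_<category>`. The sketch below shows the effect on a tiny made-up frame; `toy` and `toy_columns` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-in with one numeric and two categorical columns (made-up values)
toy = pd.DataFrame({
    'lead_time': [7, 13, 342],
    'hotel': ['Resort Hotel', 'City Hotel', 'Resort Hotel'],
    'meal': ['BB', 'HB', 'BB'],
})
toy_columns = ['hotel', 'meal']

# Numeric columns are kept as-is; each categorical column is replaced by
# one indicator column per category
encoded = pd.get_dummies(toy, prefix=toy_columns, columns=toy_columns)
print(list(encoded.columns))
```

Each row has exactly one active indicator per original categorical column, which is why a column such as country, with many distinct values, adds many features to the model input.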

Models

X_train, X_test, Y_train, Y_test= train_test_split(data.drop('is_canceled',axis=1), data['is_canceled'], random_state=42, test_size=0.2)

Random Forest Classifier

rf = RandomForestClassifier()
parameter_rf = {
    'n_estimators':[10,50,100,150,200],
    'criterion':('gini','entropy'),
    'max_depth':[None,1,2,3,4,5],
    'min_samples_split':[2,3,4],
    'min_samples_leaf':[1,2,3]
}

clf_rf = GridSearchCV(rf, parameter_rf, cv=5, verbose=10, n_jobs=-1)
clf_rf.fit(X_train, Y_train)
rf_tmp = RandomForestClassifier()
#cv_results = cross_validate(rf_tmp, X_train, Y_train, cv=5, verbose=10, n_jobs=-1)
rf_tmp.fit(X_train,Y_train)
RandomForestClassifier()
rf_tmp.score(X_test,Y_test)
0.8906843304362501
feats = {}
for feature, importance in zip(data.drop('is_canceled',axis=1).columns, rf_tmp.feature_importances_):
    feats[feature] = importance

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'importances'}).sort_values(by='importances', ascending=False)
importances.head(10)
importances
lead_time 0.102637
deposit_type_Non Refund 0.080077
adr 0.071284
deposit_type_No Deposit 0.057994
country_PRT 0.057188
arrival_date_day_of_month 0.053326
total_of_special_requests 0.052536
arrival_date_week_number 0.046571
stays_in_week_nights 0.037800
previous_cancellations 0.028475

Logistic Regression

lr = LogisticRegression()
"""parameter_lr = {
    'penalty':('l2', 'none'),
    'tol':[1e-5,1e-4,1e-3],
    'C':[0.1,1.0,2.0],
    'solver':('lbfgs','sag','saga'),
    'max_iter':[1000]
}"""
parameter_lr = {
    'penalty':['none'], 
    'tol':[1e-4],
    'C':[1.0],
    'solver':['sag'],
    'max_iter':[1000]
}

clf_lr = GridSearchCV(lr, parameter_lr, cv=5, verbose=10, n_jobs=-1)
clf_lr.fit(X_train, Y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  5.5min remaining:  8.2min
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:  5.5min remaining:  3.6min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  5.5min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  5.5min finished
/Users/DavidH/anaconda2/envs/py382/lib/python3.8/site-packages/sklearn/linear_model/_sag.py:329: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn("The max_iter was reached which means "





GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [1.0], 'max_iter': [1000], 'penalty': ['none'],
                         'solver': ['sag'], 'tol': [0.0001]},
             verbose=10)
clf_lr.best_params_
{'C': 1.0, 'max_iter': 1000, 'penalty': 'none', 'solver': 'sag', 'tol': 0.0001}
clf_lr.best_score_
0.804579898626818
clf_lr.best_estimator_.score(X_test,Y_test)
0.8094675554805502
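
The ConvergenceWarning in the output above is a common symptom of fitting the 'sag' solver on features with very different scales (here, for example, lead_time next to 0/1 dummy columns). One possible remedy, not tried in this notebook, is to standardize the features before the logistic regression. The sketch below is only an illustration on synthetic data from `make_classification`; it uses scikit-learn's default L2 penalty rather than the unpenalized setting above, and its accuracy says nothing about the hotel data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is put on a much larger scale
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale inside a pipeline so the scaler is fit on the training data only
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='sag', max_iter=1000, random_state=42),
)
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(round(test_accuracy, 3))
```

Because the scaler sits inside the pipeline, the same object could be passed to GridSearchCV in place of `lr`, with parameter names prefixed by `logisticregression__`.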