Skip to content

kev065/dsc-introduction-to-keras-lab

Repository files navigation

Keras - Lab

Introduction

In this lab you'll once again build a neural network, but this time you will be using Keras to do a lot of the heavy lifting.

Objectives

You will be able to:

  • Build a neural network using Keras
  • Evaluate performance of a neural network using Keras

Required Packages

We'll start by importing all of the required packages and classes.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from sklearn import preprocessing
from keras.preprocessing.text import Tokenizer
from keras import models
from keras import layers
from keras import optimizers

Load the data

In this lab you will be classifying bank complaints available in the 'Bank_complaints.csv' file.

# Import data
df = None

# Inspect data
print(df.info())
df.head()

As mentioned earlier, your task is to categorize banking complaints into various predefined categories. Preview what these categories are and what percent of the complaints each accounts for.

# Your code here

Preprocessing

Before we build our neural network, we need to do several preprocessing steps. First, we will create word vector counts (a bag of words type representation) of our complaints text. Next, we will change the category labels to integers. Finally, we will perform our usual train-test split before building and training our neural network using Keras. With that, let's start munging our data!

One-hot encoding of the complaints

Our first step again is to transform our textual data into a numerical representation. As we saw in some of our previous lessons on NLP, there are many ways to do this. Here, we'll use the Tokenizer() class from the preprocessing.text sub-module of the Keras package.

As with our previous work using NLTK, this will transform our text complaints into word vectors. (Note that the method of creating a vector is different from our previous work with NLTK; as you'll see, word order will be preserved as opposed to a bag of words representation). In the below code, we'll only keep the 2,000 most common words and use one-hot encoding.

# As a quick preliminary, briefly review the docstring for keras.preprocessing.text.Tokenizer
Tokenizer?
# ⏰ This cell may take about thirty seconds to run

# Raw text complaints
complaints = df['Consumer complaint narrative'] 

# Initialize a tokenizer 
tokenizer = Tokenizer(num_words=2000) 

# Fit it to the complaints
tokenizer.fit_on_texts(complaints) 

# Generate sequences
sequences = tokenizer.texts_to_sequences(complaints) 
print('sequences type:', type(sequences))

# Similar to sequences, but returns a numpy array
one_hot_results= tokenizer.texts_to_matrix(complaints, mode='binary') 
print('one_hot_results type:', type(one_hot_results))

# Useful if we wish to decode (more explanation below)
word_index = tokenizer.word_index 

# Tokens are the number of unique words across the corpus
print('Found %s unique tokens.' % len(word_index)) 

# Our coded data
print('Dimensions of our coded results:', np.shape(one_hot_results)) 

Decoding Word Vectors

As a note, you can also decode these vectorized representations of the reviews. The word_index variable, defined above, stores the mapping from the label number to the actual word. Somewhat tediously, we can turn this dictionary inside out and map it back to our word vectors, giving us roughly the original complaint back. (As you'll see, the text won't be identical as we limited ourselves to top 2000 words.)

Python Review / Mini Challenge

While a bit tangential to our main topic of interest, we need to reverse our current dictionary word_index which maps words from our corpus to integers. In decoding our one_hot_results, we will need to create a dictionary of these integers to the original words. Below, take the word_index dictionary object and change the orientation so that the values are keys and the keys values. In other words, you are transforming something of the form {A:1, B:2, C:3} to {1:A, 2:B, 3:C}.

# Your code here
reverse_index = None

Back to Decoding Our Word Vectors...

comment_idx_to_preview = 19
print('Original complaint text:')
print(complaints[comment_idx_to_preview])
print('\n\n')

# The reverse_index cell block above must be complete in order for this cell block to successively execute 
decoded_review = ' '.join([reverse_index.get(i) for i in sequences[comment_idx_to_preview]])
print('Decoded review from Tokenizer:')
print(decoded_review)

Convert the Products to Numerical Categories

On to step two of our preprocessing: converting our descriptive categories into integers.

product = df['Product']

# Initialize
le = preprocessing.LabelEncoder() 
le.fit(product)
print('Original class labels:')
print(list(le.classes_))
print('\n')
product_cat = le.transform(product)  

# If you wish to retrieve the original descriptive labels post production
# list(le.inverse_transform([0, 1, 3, 3, 0, 6, 4])) 

print('New product labels:')
print(product_cat)
print('\n')

# Each row will be all zeros except for the category for that observation 
print('One hot labels; 7 binary columns, one for each of the categories.') 
product_onehot = to_categorical(product_cat)
print(product_onehot)
print('\n')

print('One hot labels shape:')
print(np.shape(product_onehot))

Train-test split

Now for our final preprocessing step: the usual train-test split.

random.seed(123)
test_index = random.sample(range(1,10000), 1500)

test = one_hot_results[test_index]
train = np.delete(one_hot_results, test_index, 0)

label_test = product_onehot[test_index]
label_train = np.delete(product_onehot, test_index, 0)

print('Test label shape:', np.shape(label_test))
print('Train label shape:', np.shape(label_train))
print('Test shape:', np.shape(test))
print('Train shape:', np.shape(train))

Building the network

Let's build a fully connected (Dense) layer network with relu activation in Keras. You can do this using: Dense(16, activation='relu').

In this example, use two hidden layers with 50 units in the first layer and 25 in the second, both with a 'relu' activation function. Because we are dealing with a multiclass problem (classifying the complaints into 7 categories), we use a use a 'softmax' classifier in order to output 7 class probabilities per case.

# Initialize a sequential model
model = None

# Two layers with relu activation



# One layer with softmax activation 

Compiling the model

Now, compile the model! This time, use 'categorical_crossentropy' as the loss function and stochastic gradient descent, 'SGD' as the optimizer. As in the previous lesson, include the accuracy as a metric.

# Compile the model

Training the model

In the compiler, you'll be passing the optimizer (SGD = stochastic gradient descent), loss function, and metrics. Train the model for 120 epochs in mini-batches of 256 samples.

Note:Your code may take about one to two minutes to run.

# Train the model 
history = None

Recall that the dictionary history has two entries: the loss and the accuracy achieved using the training set.

history_dict = history.history
history_dict.keys()

Plot the results

As you might expect, we'll use our matplotlib for graphing. Use the data stored in the history_dict above to plot the loss vs epochs and the accuracy vs epochs.

# Plot the loss vs the number of epoch
# Plot the training accuracy vs the number of epochs

It seems like we could just keep on going and accuracy would go up!

Make predictions

Finally, it's time to make predictions. Use the relevant method discussed in the previous lesson to output (probability) predictions for the test set.

# Output (probability) predictions for the test set 
y_hat_test = None

Evaluate Performance

Finally, print the loss and accuracy for both the train and test sets of the final trained model.

# Print the loss and accuracy for the training set 
results_train = None
results_train
# Print the loss and accuracy for the test set 
results_test = None
results_test

We can see that the training set results are really good, and the test set results seem to be even better. In general, this type of result will be rare, as train set results are usually at least a bit better than test set results.

Summary

Congratulations! In this lab, you built a neural network thanks to the tools provided by Keras! In upcoming lessons and labs we'll continue to investigate further ideas regarding how to tune and refine these models for increased accuracy and performance.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published