This project demonstrates how to build and train a Long Short-Term Memory (LSTM) neural network to predict the next word in a sequence. The model is trained on a small corpus of text from a Frequently Asked Questions (FAQ) page.
The primary goal of this project is to create a language model that can understand the context of a given sequence of words and predict the most likely word to follow. This is a fundamental task in Natural Language Processing (NLP) with applications in text autocompletion, chatbots, and language translation.
The process involves:
- Text Preprocessing: Cleaning and preparing the raw text data.
- Tokenization: Converting words into numerical representations.
- Sequence Generation: Creating input-output pairs for the model to learn from.
- Model Building: Designing a sequential model with Embedding and LSTM layers.
- Training: Training the model on the prepared data.
- Prediction: Using the trained model to generate new words based on an input seed text.
Technologies used:
- Python 3.x
- TensorFlow & Keras: For building and training the neural network.
- NumPy: For numerical operations and data manipulation.
- Tokenization: The entire text corpus is processed using `tensorflow.keras.preprocessing.text.Tokenizer`. This builds a vocabulary of all unique words and assigns a unique integer index to each word (a combined sketch of these preprocessing steps appears after this list).
- Sequence Generation: The model is trained to predict a word based on the words that came before it. To create training samples, we iterate through each sentence in the corpus and create n-gram sequences. For a sentence like "what is the course fee", the following sequences are generated:
  - `[what, is]`
  - `[what, is, the]`
  - `[what, is, the, course]`
  - `[what, is, the, course, fee]`
- Padding: Neural networks require inputs of a fixed length. Since our generated sequences have varying lengths, we use `pad_sequences` to pad them with zeros at the beginning (`padding='pre'`). All sequences are padded to the length of the longest sequence in the dataset.
- Splitting Features and Labels: For each padded sequence, the last word is treated as the label (y), and the words preceding it are the features (X).
  - Example: For the sequence `[0, 0, what, is, the, course]`, `X = [0, 0, what, is, the]` and `y = [course]`.
- One-Hot Encoding: The target variable `y` is categorical (one word out of the entire vocabulary). It is converted into a one-hot encoded vector using `to_categorical`.
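The preprocessing steps above map onto a handful of Keras calls. The following is a minimal sketch, not the notebook's exact code; `faq_text` is a tiny placeholder corpus standing in for the full FAQ text.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Tiny placeholder corpus; the notebook uses the full FAQ text instead.
faq_text = ["what is the course fee", "what is the course duration"]

# Tokenization: build the vocabulary and assign an integer index to each word.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(faq_text)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding token (index 0)

# Sequence generation: every n-gram prefix of every sentence becomes a sample.
sequences = []
for line in faq_text:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        sequences.append(token_list[:i + 1])

# Padding: pad to the longest sequence, zeros at the front (padding='pre').
max_len = max(len(seq) for seq in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding='pre')

# Splitting features and labels: last word is the label, the rest are features.
X, y = padded[:, :-1], padded[:, -1]

# One-hot encoding of the labels over the whole vocabulary.
y = to_categorical(y, num_classes=vocab_size)
```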
A Sequential model is built with the following layers:
- Embedding Layer: `Embedding(vocab_size, 100, input_length=max_len-1)`
  - This layer converts the integer-encoded word indices into dense vectors of a fixed size (100 in this case). It helps the model capture semantic relationships between words.
  - `vocab_size` is the total number of unique words plus one (for the padding token). `input_length` is the length of the input sequences (`max_len - 1`).
- LSTM Layers: `LSTM(150, return_sequences=True)` followed by `LSTM(150)`
  - These are the core recurrent layers that process the sequence of word embeddings. They are capable of learning long-term dependencies in the data.
  - Note: When stacking LSTM layers, the first LSTM layer must have `return_sequences=True` so that it outputs a 3D tensor (the full sequence of hidden states) for the next LSTM layer to process.
- Dense Layer: `Dense(vocab_size, activation='softmax')`
  - This is the final output layer. It has `vocab_size` neurons, one for each word in the vocabulary.
  - The `softmax` activation function outputs a probability distribution over the entire vocabulary, indicating the likelihood of each word being the next word.
- Compilation: The model is compiled using the `adam` optimizer and `categorical_crossentropy` as the loss function, which is suitable for multi-class classification problems.
- Fitting: The model is trained by calling `model.fit(X, y, epochs=100)`. It learns to minimize the loss by adjusting its internal weights over 100 epochs.
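As a minimal sketch of the compile-and-fit step, assuming the `model`, `X`, and `y` built in the earlier steps:

```python
# Compile and train; `model`, `X`, and `y` come from the preprocessing and model-building steps above.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X, y, epochs=100, verbose=1)
```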
Make sure you have Python installed. Then, install the required libraries: `pip install tensorflow numpy`
- Clone the repository: `git clone https://github.com/prakhar14-op/next-word-predictor-using-lstm.git`, then `cd next-word-predictor-using-lstm`
- Open the Jupyter Notebook (`lstm_project.ipynb`) and run the cells sequentially.
- The final cells in the notebook demonstrate how to use the trained model to predict the next 10 words for a given seed text.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
# 283 is vocab_size and 56 is max_len - 1 for this corpus
model.add(Embedding(283, 100, input_length=56))
# Add return_sequences=True to the first LSTM layer so it passes the full sequence to the next one
model.add(LSTM(150, return_sequences=True))
model.add(LSTM(150))
model.add(Dense(283, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```
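The notebook's prediction cells are not reproduced here; the sketch below shows one way to generate the next 10 words from a seed text, assuming the `tokenizer`, `max_len`, and trained `model` from the steps above (the `predict_next_words` helper is illustrative, not part of the repository).

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_next_words(seed_text, num_words=10):
    """Greedily append the most likely next word `num_words` times."""
    text = seed_text
    # Reverse lookup from integer index back to word.
    index_to_word = {index: word for word, index in tokenizer.word_index.items()}
    for _ in range(num_words):
        # Encode and pad the current text the same way the training data was prepared.
        token_list = tokenizer.texts_to_sequences([text])[0]
        token_list = pad_sequences([token_list], maxlen=max_len - 1, padding='pre')
        # Pick the word with the highest softmax probability.
        predicted_index = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        text += ' ' + index_to_word.get(predicted_index, '')
    return text

print(predict_next_words("what is the", num_words=10))
```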