- Problem Description
- Data Source & Description
- Deep Learning Algorithm for Talking Points
- Build the Neural Network
- Generated Talking Points
- Sentiment Analysis on Stock Data
- Converting the final result into CSV file
- Future Improvements
We gathered the data for training our model from Kaggle's "US Financial News Articles" dataset.
The dataset is rich with metadata: each article carries its source, publication time, author details, and related images. It is well suited to text analysis and, combined with a related entity dataset, could yield strong results.
The main ZIP file contains five folders, one for each month of 2018 from January to May.
The JSON files contain articles selected on the following criteria:
- News publishers: Bloomberg.com, CNBC.com, reuters.com, wsj.com, fortune.com
- Language: English only
- Country: United States
- News category: Financial news only
The source for the articles (archive source: https://Webhose.io/archive)
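As a rough sketch, the monthly folders can be walked and each JSON article loaded as shown below. The field names `title` and `text` are assumptions based on the Webhose format and may differ in your copy of the dataset.

```python
import json
from pathlib import Path

def load_articles(root_dir):
    """Walk the extracted monthly folders and yield one (title, body) pair per JSON file."""
    for json_path in Path(root_dir).rglob("*.json"):
        with open(json_path, encoding="utf-8") as f:
            article = json.load(f)
        # 'title' and 'text' are the fields we care about for text generation
        yield article.get("title", ""), article.get("text", "")

# Example: collect the raw article bodies for later tokenization
texts = [text for _, text in load_articles("us-financial-news-articles")]
```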
The deep-learning algorithm is implemented in Python using the PyTorch library.
Word embedding is a collective term for models that learn to map words or phrases in a vocabulary to vectors of numerical values. These vectors are called embeddings, and a neural network can learn them.
For example, let's say the word "stock" is encoded as the integer 958; to get the hidden-layer values for "stock", we simply take the 958th row of the embedding weight matrix. This process is known as an embedding look-up.
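A minimal sketch of an embedding look-up in PyTorch (the vocabulary size and token id here are made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size = 20000      # number of unique tokens (illustrative)
embedding_dim = 200     # size of each embedding vector

embedding = nn.Embedding(vocab_size, embedding_dim)

# Suppose "stock" was encoded as the integer 958; the look-up simply
# returns row 958 of the embedding weight matrix.
stock_id = torch.tensor([958])
stock_vector = embedding(stock_id)   # shape: (1, 200)
```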
The Word2Vec algorithm finds much more efficient representations by learning vectors that also carry semantic information about the words. For example, consider the excerpts below from BBC coverage of the corona pandemic's impact on the US economy -
"Investors fear the spread of the corona pandemic will destroy economic growth."
"More than 30m people have become unemployed due to the corona pandemic."
"Oil prices have crashed since demand for oil has dried up, as corona pandemic lockdowns across the world have kept people inside."
Words that show up in similar contexts, such as "destroy economic growth", "unemployed", and "oil prices have crashed", will have vectors near each other. Different words will be further away from one another, and relationships can be represented by distance in vector space. An example is shown below -
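Once such vectors are learned, closeness in the vector space can be measured with cosine similarity. A small sketch of that idea (the embedding layer and token ids here are hypothetical placeholders, not trained values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical trained embedding layer and word-to-id mapping.
embedding = nn.Embedding(20000, 200)
vocab_to_int = {"crashed": 120, "unemployed": 3050}   # illustrative ids

vec_a = embedding(torch.tensor([vocab_to_int["crashed"]]))
vec_b = embedding(torch.tensor([vocab_to_int["unemployed"]]))

# After training, words used in similar contexts end up with similar
# vectors, so their cosine similarity is close to 1.
print(F.cosine_similarity(vec_a, vec_b).item())
```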
We have implemented an RNN using PyTorch's Module class along with LSTM. The following functions complete our RNN model -
- __init__ - the initialization function
- init_hidden - initializes an LSTM hidden state
- forward - the forward propagation function
The output of this model is the last batch of word scores after a complete sequence has been processed. That is, for each input sequence of words, we only want to output the word scores for a single, most likely, next word.
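A minimal sketch of what such a module could look like, assuming an embedding layer feeding a stacked LSTM and a fully connected output layer (layer names and details are illustrative, not the exact project code):

```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim,
                 hidden_dim, n_layers, dropout=0.5):
        super().__init__()
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x, hidden):
        embeds = self.embedding(x)                   # (batch, seq_len, embedding_dim)
        lstm_out, hidden = self.lstm(embeds, hidden)
        scores = self.fc(lstm_out[:, -1])            # word scores for the last step only
        return scores, hidden

    def init_hidden(self, batch_size):
        # the LSTM hidden state is a tuple of (hidden state, cell state)
        weight = next(self.parameters()).data
        h = weight.new_zeros(self.n_layers, batch_size, self.hidden_dim)
        c = weight.new_zeros(self.n_layers, batch_size, self.hidden_dim)
        return (h, c)
```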
Our neural network architecture, similar to a character-wise RNN, is depicted below -
The train function gives us the ability to set the number of epochs, the learning rate, and other parameters.
Below we're using an Adam optimizer and cross entropy loss since we are looking at word class scores as output. We calculate the loss and perform backpropagation, as usual!
A couple of details about training:
- Within the batch loop, we detach the hidden state from its history; this time setting it equal to a new tuple variable because an LSTM has a hidden state that is a tuple of the hidden and cell states.
- We use clip_grad_norm_ to help prevent exploding gradients (see the sketch after this list).
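A sketch of a single training step that follows those details, assuming the RNN class sketched above (the function name and loop structure are illustrative):

```python
import torch.nn as nn

def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden, clip=5):
    """One training step: forward pass, loss, backprop, gradient clipping."""
    # detach the hidden state from its history so gradients do not flow
    # back through the whole training set; an LSTM hidden state is a
    # (hidden, cell) tuple, so we detach both parts into a new tuple
    hidden = tuple(h.detach() for h in hidden)

    optimizer.zero_grad()
    output, hidden = rnn(inp, hidden)
    loss = criterion(output, target)
    loss.backward()

    # clip gradients to help prevent them from exploding
    nn.utils.clip_grad_norm_(rnn.parameters(), clip)
    optimizer.step()

    return loss.item(), hidden
```

The Adam optimizer and cross-entropy loss are created as shown after the hyper-parameter table below.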
Following is the list of hyper-parameter values we used while training our model -
S.No. | Hyper-parameter | Description | Value |
---|---|---|---|
1. | sequence_length | length of a sequence | 10 |
2. | batch_size | batch size | 128 |
3. | num_epochs | number of epochs to train for | 10 |
4. | learning_rate | learning rate for the Adam optimizer | 0.001 |
5. | vocab_size | number of unique tokens in our vocabulary | len(vocab_to_int) |
6. | output_size | desired size of the output | vocab_size |
7. | embedding_dim | embedding dimension; smaller than vocab_size | 200 |
8. | hidden_dim | hidden dimension of our RNN | 250 |
9. | n_layers | number of layers/cells in our RNN | 2 |
10. | show_every_n_batches | number of batches at which the network prints training progress | 500 |
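As a sketch, wiring those values into the model and optimizer from the earlier sketches could look like this (vocab_to_int is the word-to-id dictionary built during preprocessing):

```python
import torch

# hyper-parameters from the table above
sequence_length = 10
batch_size = 128
num_epochs = 10
learning_rate = 0.001

vocab_size = len(vocab_to_int)   # number of unique tokens
output_size = vocab_size
embedding_dim = 200
hidden_dim = 250
n_layers = 2
show_every_n_batches = 500

rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = torch.nn.CrossEntropyLoss()
```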
We need to feed the model a prime word on the basis of which the model then generates the talking points. For example -
Upon feeding the model the word "corona", the model generates the following talking points -
corona. m. est
** the dollar slipped 0. 2 percent to 3. 5 percent.
in the past decade, the dow jones industrial average was up 0. 2 percent at $1, 326. 00 a barrel.
a weaker u. s. dollar slipped from a basket of major currencies in 2018.
the dow jones industrial average index was down 0. 3 percent to the lowest level in the past five months, ” the official said.
the ministry is not immediately available to comment
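A rough sketch of how priming-based generation could work, assuming the trained rnn from above and vocab_to_int / int_to_vocab look-up dictionaries (the sampling details are illustrative, not the exact project code):

```python
import torch
import torch.nn.functional as F

def generate(rnn, prime_word, vocab_to_int, int_to_vocab,
             seq_length=10, gen_length=100):
    """Generate text one word at a time, starting from a prime word."""
    rnn.eval()
    # start with a window filled with the prime-word id
    current_seq = torch.full((1, seq_length), vocab_to_int[prime_word],
                             dtype=torch.long)
    generated = [prime_word]

    with torch.no_grad():
        for _ in range(gen_length):
            hidden = rnn.init_hidden(1)
            scores, hidden = rnn(current_seq, hidden)

            # sample the next word from the top-k most likely candidates
            probs = F.softmax(scores, dim=1).squeeze(0)
            top_p, top_i = probs.topk(5)
            next_id = top_i[torch.multinomial(top_p / top_p.sum(), 1)].item()

            generated.append(int_to_vocab[next_id])
            # slide the window forward by one word
            current_seq = torch.cat(
                [current_seq[:, 1:], torch.tensor([[next_id]])], dim=1)

    return " ".join(generated)

# talking_points = generate(rnn, "corona", vocab_to_int, int_to_vocab)
```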
Sentiment analysis on stock data can be used to one's advantage. For an extreme example of how social media influences the stock market, take a look at Kylie Jenner's tweet about Snapchat.
Shortly after the message was posted online, the price of Snap, Snapchat's parent company, fell by 8.5 percent.
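One simple way to attach a sentiment score to each generated talking point is a lexicon-based scorer such as NLTK's VADER. This is a sketch of that idea, not necessarily the exact approach used here:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

talking_points = [
    "the dollar slipped 0.2 percent to 3.5 percent.",
    "a weaker u.s. dollar slipped from a basket of major currencies in 2018.",
]

# the compound score ranges from -1 (most negative) to +1 (most positive)
scores = [(tp, analyzer.polarity_scores(tp)["compound"]) for tp in talking_points]
for text, score in scores:
    print(f"{score:+.3f}  {text}")
```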
After generating the talking points and predicting their sentiment, we save the entire result to a CSV file so that it can be served through an API.
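A minimal sketch of writing the combined result out with pandas, reusing the scores list from the sentiment sketch above (the column names and file name are illustrative):

```python
import pandas as pd

# pair each generated talking point with its sentiment score
results = pd.DataFrame(scores, columns=["talking_point", "sentiment"])

# persist to CSV so a downstream API can serve the rows directly
results.to_csv("talking_points_with_sentiment.csv", index=False)
```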
We need to address the following points for better model accuracy -
- A better dataset focused on stock news
- Training on larger volumes of data
- Training for more epochs
- Exploring BiLSTM