Releases: zhongkaifu/RNNSharp
RNNSharp 2.1.0.0 release
RNNSharp
RNNSharp is a toolkit of deep recurrent neural networks that is widely used for many different kinds of tasks, such as sequence labeling, sequence-to-sequence modeling and so on. It's written in C# and based on .NET Framework 4.6 or above.
This page introduces what RNNSharp is, how it works and how to use it. To get the demo package, you can visit the release page.
Overview
RNNSharp supports many different types of deep recurrent neural network (aka DeepRNN) structures.
For network structure, it supports forward RNN and bi-directional RNN. A forward RNN only considers the historical information before the current token, whereas a bi-directional RNN considers both historical information and future information.
For hidden layer structure, it supports LSTM and Dropout layers. Compared to BPTT, LSTM is very good at keeping long-term memory, since it has gates to control information flow. Dropout adds noise during training in order to avoid overfitting.
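To make the gating idea concrete, here is a minimal sketch of a single LSTM cell step in C#. It is illustrative only and is not RNNSharp's implementation; the weight layout, the missing bias terms and the activation choices are simplifying assumptions.

```csharp
using System;
using System.Linq;

static class LstmSketch
{
    static double Sigmoid(double v) => 1.0 / (1.0 + Math.Exp(-v));

    // Multiply a weight matrix (one row per output unit) by a vector.
    static double[] MatVec(double[][] w, double[] v) =>
        w.Select(row => row.Zip(v, (a, b) => a * b).Sum()).ToArray();

    // One LSTM step: gates decide what to forget, what to write and what to expose.
    // Each gate's weights are applied to the concatenation of the input x and the previous hidden state.
    public static (double[] h, double[] c) Step(
        double[] x, double[] hPrev, double[] cPrev,
        double[][] wForget, double[][] wInput, double[][] wOutput, double[][] wCell)
    {
        double[] z = x.Concat(hPrev).ToArray();

        double[] f = MatVec(wForget, z).Select(Sigmoid).ToArray();  // forget gate
        double[] i = MatVec(wInput, z).Select(Sigmoid).ToArray();   // input gate
        double[] o = MatVec(wOutput, z).Select(Sigmoid).ToArray();  // output gate
        double[] g = MatVec(wCell, z).Select(Math.Tanh).ToArray();  // candidate memory

        var c = new double[cPrev.Length];
        var h = new double[cPrev.Length];
        for (int k = 0; k < c.Length; k++)
        {
            c[k] = f[k] * cPrev[k] + i[k] * g[k]; // keep part of the old memory, add new memory
            h[k] = o[k] * Math.Tanh(c[k]);        // expose a filtered view of the memory cell
        }
        return (h, c);
    }
}
```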
In terms of output layer structure, simple, softmax, sampled softmax and recurrent CRF[1] layers are supported. Softmax is the traditional type which is widely used in many kinds of tasks. Sampled softmax is especially suited to tasks with a large output vocabulary, such as sequence generation tasks (sequence-to-sequence models). The simple type is usually used together with the recurrent CRF. The recurrent CRF computes the CRF output for the entire sequence based on the simple outputs and the tag transitions. For offline sequence labeling tasks, such as word segmentation, named entity recognition and so on, the recurrent CRF has better performance than softmax, sampled softmax and linear CRF.
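As an illustration of how a CRF layer turns per-token tag scores and a tag transition matrix into a single best tag sequence, here is a hedged Viterbi decoding sketch in C#. The class name, method name and score layout are assumptions for this sketch, not RNNSharp's API.

```csharp
using System;
using System.Linq;

static class CrfSketch
{
    // Illustrative Viterbi decoding over per-token tag scores plus a tag transition matrix.
    // emissions[t][y]  : score of tag y at position t (e.g. the "Simple" output layer result).
    // transitions[a][b]: score of moving from tag a to tag b.
    public static int[] Decode(double[][] emissions, double[][] transitions)
    {
        int len = emissions.Length, tags = emissions[0].Length;
        var score = new double[len][];
        var backPtr = new int[len][];

        score[0] = (double[])emissions[0].Clone();
        for (int t = 1; t < len; t++)
        {
            score[t] = new double[tags];
            backPtr[t] = new int[tags];
            for (int y = 0; y < tags; y++)
            {
                double best = double.NegativeInfinity;
                int bestPrev = 0;
                for (int p = 0; p < tags; p++)
                {
                    double s = score[t - 1][p] + transitions[p][y];
                    if (s > best) { best = s; bestPrev = p; }
                }
                score[t][y] = best + emissions[t][y];
                backPtr[t][y] = bestPrev;
            }
        }

        // Trace back from the best-scoring final tag.
        var path = new int[len];
        path[len - 1] = Array.IndexOf(score[len - 1], score[len - 1].Max());
        for (int t = len - 1; t > 0; t--)
            path[t - 1] = backPtr[t][path[t]];
        return path;
    }
}
```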
Here is an example of a deep bi-directional RNN-CRF network. It contains 3 hidden layers, 1 native RNN output layer and 1 CRF output layer.
Here is the inner structure of one bi-directional hidden layer.
Here is the neural network for the sequence-to-sequence task. The "TokenN" tokens come from the source sequence, and "ELayerX-Y" are the auto-encoder's hidden layers. The auto-encoder is defined in the feature configuration file. <s> always marks the beginning of the target sentence, and "DLayerX-Y" are the decoder's hidden layers. The decoder generates one token at a time until </s> is generated.
Supported Feature Types
RNNSharp supports many different feature types, and the following sections introduce how these features work.
Template Features
Template features are generated from templates. Given templates and a corpus, these features can be generated automatically. In RNNSharp, template features are sparse features: if a feature exists for the current token, its value is 1 (or the feature frequency), otherwise it is 0. They are similar to CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool that generates this type of feature.
In the template file, each line describes one template, which consists of a prefix, an id and a rule-string. The prefix indicates the template type. So far, RNNSharp supports only U-type features, so the prefix is always "U". The id is used to distinguish different templates, and the rule-string is the feature body.
Unigram
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,0]/%x[0,0]
U05:%x[0,0]/%x[1,0]
U06:%x[-1,0]/%x[1,0]
U07:%x[-1,1]
U08:%x[0,1]
U09:%x[1,1]
U10:%x[-1,1]/%x[0,1]
U11:%x[0,1]/%x[1,1]
U12:%x[-1,1]/%x[1,1]
U13:C%x[-1,0]/%x[-1,1]
U14:C%x[0,0]/%x[0,1]
U15:C%x[1,0]/%x[1,1]
The rule-string has two parts: one is a constant string, and the other is a variable. The simplest variable format is "%x[row,col]". Row specifies the row offset between the current focus token and the token used to generate the feature, and col specifies the absolute column position in the corpus. Variable combinations are also supported, for example "%x[row1,col1]/%x[row2,col2]". When the feature set is built, each variable is expanded into a specific string. Here is an example of training data for a named entity task.
Word | Pos | Tag |
---|---|---|
! | PUN | S |
Tokyo | NNP | S_LOCATION |
and | CC | S |
New | NNP | B_LOCATION |
York | NNP | E_LOCATION |
are | VBP | S |
major | JJ | S |
financial | JJ | S |
centers | NNS | S |
. | PUN | S |
! | PUN | S |
p | FW | S |
' | PUN | S |
y | NN | S |
h | FW | S |
44 | CD | S |
University | NNP | B_ORGANIZATION |
of | IN | M_ORGANIZATION |
Texas | NNP | M_ORGANIZATION |
Austin | NNP | E_ORGANIZATION |
According to the above templates, and assuming the current focus token is “York NNP E_LOCATION”, the features below are generated:
U01:New
U02:York
U03:are
U04:New/York
U05:York/are
U06:New/are
U07:NNP
U08:NNP
U09:VBP
U10:NNP/NNP
U11:NNP/VBP
U12:NNP/VBP
U13:CNew/NNP
U14:CYork/NNP
U15:Care/VBP
Although the generated strings of U07 and U08, and of U11 and U12, are the same, we can still distinguish them by their id strings.
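The expansion can be pictured as a small lookup over the record's token matrix. The sketch below is only illustrative and is not TFeatureBin.exe's actual implementation; the class name, regular expression and boundary padding are assumptions.

```csharp
using System.Text.RegularExpressions;

static class TemplateSketch
{
    // Expand one template (e.g. "U04:%x[-1,0]/%x[0,0]") for the token at position 'focus'.
    // 'sentence[row][col]' holds the corpus columns of one record.
    public static string Expand(string template, string[][] sentence, int focus)
    {
        return Regex.Replace(template, @"%x\[(-?\d+),\s*(\d+)\]", match =>
        {
            int row = focus + int.Parse(match.Groups[1].Value);
            int col = int.Parse(match.Groups[2].Value);
            // Out-of-range rows are padded with a placeholder here; the real padding
            // strategy used by TFeatureBin.exe is not described on this page.
            if (row < 0 || row >= sentence.Length) return "B_" + row;
            return sentence[row][col];
        });
    }
}
// e.g. Expand("U04:%x[-1,0]/%x[0,0]", sentence, focusOfYork) returns "U04:New/York"
```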
Context Template Features
Context template features are template features combined with context. For example, if the context setting is "-1,0,1", the feature combines the features of the current token with those of its previous token and its next token. For instance, if the sentence is "how are you" and the current token is "are", the generated feature set will be {Feature("how"), Feature("are"), Feature("you")}.
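A context template feature can therefore be pictured as collecting the expanded features of every offset in the context window. The sketch below reuses the illustrative TemplateSketch.Expand helper from the previous example and is likewise an assumption, not RNNSharp's implementation.

```csharp
using System.Collections.Generic;

static class ContextTemplateSketch
{
    // Collect the template features of every offset in the context window (e.g. -1, 0, 1)
    // around 'focus' into one feature set for the current token.
    public static List<string> Build(string[] templates, string[][] sentence,
                                     int focus, int[] context)
    {
        var features = new List<string>();
        foreach (int offset in context)
            foreach (string template in templates)
                features.Add(TemplateSketch.Expand(template, sentence, focus + offset));
        return features;
    }
}
```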
Pretrained Features
RNNSharp supports two types of pretrained features: one is embedding features, and the other is auto-encoder features. Both of them represent a given token by a fixed-length vector, and both are dense features in RNNSharp.
Embedding features are trained from an unlabeled corpus by the Txt2Vec project, and RNNSharp uses them as static features for each given token. Auto-encoder features, in contrast, are trained by RNNSharp itself and can then be used as dense features for other trainings. Note that the token granularity of the pretrained features should be consistent with the training corpus of the main training, otherwise some tokens will not match any pretrained feature.
Like template features, embedding features also support context: all features of the given context positions are combined into a single embedding feature. Auto-encoder features do not support context yet.
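Here is a hedged sketch of how the context positions can be combined into one dense feature by concatenating the embedding of each position; the lookup dictionary, zero-vector fallback and method name are assumptions of this sketch, not RNNSharp's or Txt2Vec's API.

```csharp
using System.Collections.Generic;

static class EmbeddingContextSketch
{
    // Concatenate the embedding vectors of the tokens at the given context offsets
    // (e.g. -1, 0, 1) around 'focus' into one dense feature vector of length context.Length * dim.
    public static float[] Build(string[] tokens, int focus, int[] context,
                                IReadOnlyDictionary<string, float[]> embeddings, int dim)
    {
        var result = new List<float>(context.Length * dim);
        foreach (int offset in context)
        {
            int pos = focus + offset;
            float[] vector = (pos >= 0 && pos < tokens.Length
                              && embeddings.TryGetValue(tokens[pos], out var v))
                             ? v
                             : new float[dim]; // out-of-range or unknown token -> zero vector (assumption)
            result.AddRange(vector);
        }
        return result.ToArray();
    }
}
```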
Run Time Features
Compared with the other features, which are generated offline, this feature is generated at run time. It uses the output of previous tokens as a run time feature for the current token. This feature is only available for forward RNN; bi-directional RNN does not support it.
Source Sequence Encoding Feature
This feature is only for the sequence-to-sequence task. In that task, RNNSharp encodes the given source sequence into a fixed-length vector and then passes it as a dense feature when generating the target sequence.
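A minimal sketch of the idea, assuming the final hidden state of a recurrent encoder is used as the fixed-length vector; the delegate and class names are hypothetical and are not RNNSharp's API.

```csharp
using System;

static class EncoderSketch
{
    // Fold the source sequence into one fixed-length vector by running a recurrent step
    // over every source token and keeping the final hidden state. 'encodeStep' stands in
    // for one hidden-layer update (e.g. an LSTM step).
    public static double[] EncodeSource(
        double[][] sourceTokenVectors,
        Func<double[], double[], double[]> encodeStep,
        int hiddenSize)
    {
        var hidden = new double[hiddenSize];
        foreach (var tokenVector in sourceTokenVectors)
            hidden = encodeStep(tokenVector, hidden);
        return hidden; // passed to the decoder as a dense feature at every step
    }
}
```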
Configuration File
The configuration file describes the model structure and features. In the console tool, use the -cfgfile parameter to specify this file. Here is an example for a sequence labeling task:
#Working directory. It is the parent directory of the relative paths below.
CURRENT_DIRECTORY = .
#Model type. Sequence labeling (SEQLABEL) and sequence-to-sequence (SEQ2SEQ) are supported.
MODEL_TYPE = SEQLABEL
#Model direction. Forward and BiDirectional are supported
MODEL_DIRECTION = BiDirectional
#Model file path
MODEL_FILEPATH = Data\Models\ParseORG_CHS\model.bin
#Hidden layer settings. LSTM and Dropout layers are supported. Here are examples of these layer types.
#Dropout: "Dropout:0.5" -- the dropout ratio is 0.5 and the layer size is the same as the previous layer's.
#If the model has more than one hidden layer, the settings of each layer are separated by commas. For example:
#"LSTM:300, LSTM:200" means the model has two LSTM layers. The first layer size is 300, and the second layer size is 200.
HIDDEN_LAYER = LSTM:200
#Output layer settings. Simple, softmax and sampled softmax are supported. Here is an example of sampled softmax:
#"SampledSoftmax:20" means the output layer is sampled softmax layer and its negative sample size is 20.
#"Simple" means the output is raw result from output layer. "Softmax" means the result is based on "Simple" result and run softmax.
OUTPUT_LAYER = Simple
#CRF layer settings
#If this option is true, the output layer type must be "Simple".
CRF_LAYER = True
#The file name for template feature set
TFEATURE_FILENAME = Data\Models\ParseORG_CHS\tfeatures
#The context range for the template feature set. In the example below, the context is the current token, the next token and the token after next.
TFEATURE_CONTEXT = 0,1,2
#The feature weight type. Binary and Freq are supported
TFEATURE_WEIGHT_TYPE = Binary
#Pretrained features type: 'Embedding' and 'Autoencoder' are supported.
#For 'Embedding', the pretrained model is trained by Txt2Vec and is a word embedding model.
#For 'Autoencoder', the pretrained model is trained by RNNSharp itself. You need to train an auto encoder-decoder model with RNNSharp first, and then use that pretrained model for your task.
PRETRAIN_TYPE = Embedding
#The following settings are for pretrained model in 'Embedding' type.
#The embedding model generated by Txt2Vec (https://github.com/zhongkaifu/Txt2Vec). If it is in raw text format, use WORDEMBEDDING_RAW_FILENAME instead of WORDEMBEDDING_FILENAME as the keyword.
WORDEMBEDDING_FILENAME = Data\WordEmbedding\wordvec_chs.bin
#The context range of word embedding. In the example below, the context is the previous token, the current token and the next token.
#If more than one token is combined, this feature can use a lot of memory.
WORDEMBEDDING_CONTEXT = -1,0,1
#The column index to which the word embedding feature is applied
WORDEMBEDDING_COLUMN = 0
#The following setting is for pretrained model in 'Autoencoder' type.
#The feature configuration file for pretrained...
RNNSharp v2.0.0.0 Release Page
RNNSharp
RNNSharp is a toolkit of deep recurrent neural networks that is widely used for many different kinds of tasks, such as sequence labeling. It's written in C# and based on .NET Framework 4.6 or above.
This page introduces what RNNSharp is, how it works and how to use it. To get the demo package, please visit the release page and download the package.
Overview
RNNSharp supports many different types of deep recurrent neural network (aka DeepRNN) structures. In terms of historical memory, it supports BPTT (BackPropagation Through Time) and LSTM (Long Short-Term Memory) structures. In terms of output layer structure, RNNSharp supports a native output layer and recurrent CRFs[1]. In addition, RNNSharp also supports forward RNN and bi-directional RNN structures.
For BPTT and LSTM, BPTT-RNN is usually called a "simple RNN", since the structure of its hidden layer node is very simple. It is not good at preserving long-term historical memory. LSTM-RNN is more complex than BPTT-RNN, since its hidden layer node has an inner structure which helps it preserve very long-term historical memory. In general, LSTM has better performance than BPTT on longer sequences.
For native RNN output, many experiments and applications have shown that it achieves better results than traditional algorithms, such as MEMM, for online sequence labeling tasks, such as speech recognition, auto suggestion and so on.
For RNN-CRF, based on the native RNN outputs and their transitions, we compute the CRF output for the entire sequence. Compared with native RNN, RNN-CRF has better performance for many different types of offline sequence labeling tasks, such as word segmentation, named entity recognition and so on. With a similar feature set, it also has better performance than linear CRF.
For bi-directional RNN, the output result combines the results of both the forward RNN and the backward RNN. It usually has better performance than a single-directional RNN.
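As a hedged sketch of the idea, the per-token outputs of the two directions can be combined element-wise; whether RNNSharp sums or concatenates them is an assumption here, not something this page states.

```csharp
static class BiRnnSketch
{
    // Illustrative per-token combination of forward and backward hidden states.
    // Element-wise summation is an assumption of this sketch; concatenation is the
    // other common choice.
    public static double[][] Combine(double[][] forward, double[][] backward)
    {
        var combined = new double[forward.Length][];
        for (int t = 0; t < forward.Length; t++)
        {
            combined[t] = new double[forward[t].Length];
            for (int k = 0; k < forward[t].Length; k++)
                combined[t][k] = forward[t][k] + backward[t][k];
        }
        return combined;
    }
}
```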
Here is an example of a deep bi-directional RNN-CRF network. It contains 3 hidden layers, 1 native RNN output layer and 1 CRF output layer.
Here is the inner structure of one bi-directional hidden layer.
Supported Feature Types
RNNSharp supports four types of feature set: template features, context template features, run time features and word embedding features. These features are controlled by the configuration file, and the following sections introduce what these features are and how to use them in the configuration file.
Template Features
Template features are generated from templates. Given templates and a corpus, the features are generated automatically. The template feature is a binary feature: if the feature exists for the current token, its value is 1, otherwise it is 0. They are similar to CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool that generates this type of feature.
In the template file, each line describes one template, which consists of a prefix, an id and a rule-string. The prefix indicates the template type. So far, RNNSharp supports only U-type features, so the prefix is always "U". The id is used to distinguish different templates, and the rule-string is the feature body.
Unigram
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,0]/%x[0,0]
U05:%x[0,0]/%x[1,0]
U06:%x[-1,0]/%x[1,0]
U07:%x[-1,1]
U08:%x[0,1]
U09:%x[1,1]
U10:%x[-1,1]/%x[0,1]
U11:%x[0,1]/%x[1,1]
U12:%x[-1,1]/%x[1,1]
U13:C%x[-1,0]/%x[-1,1]
U14:C%x[0,0]/%x[0,1]
U15:C%x[1,0]/%x[1,1]
The rule-string has two parts: one is a constant string, and the other is a macro. The simplest macro format is "%x[row,col]". Row specifies the row offset between the current focus token and the token used to generate the feature, and col specifies the absolute column position in the corpus. Combined macros are also supported, for example "%x[row1,col1]/%x[row2,col2]". When the feature set is built, each macro is replaced with a specific string. Here is an example of training data:
Word | Pos | Tag |
---|---|---|
! | PUN | S |
Tokyo | NNP | S_LOCATION |
and | CC | S |
New | NNP | B_LOCATION |
York | NNP | E_LOCATION |
are | VBP | S |
major | JJ | S |
financial | JJ | S |
centers | NNS | S |
. | PUN | S |
! | PUN | S |
p | FW | S |
' | PUN | S |
y | NN | S |
h | FW | S |
44 | CD | S |
University | NNP | B_ORGANIZATION |
of | IN | M_ORGANIZATION |
Texas | NNP | M_ORGANIZATION |
Austin | NNP | E_ORGANIZATION |
According to the above templates, and assuming the current focus token is “York NNP E_LOCATION”, the features below are generated:
U01:New
U02:York
U03:are
U04:New/York
U05:York/are
U06:New/are
U07:NNP
U08:NNP
U09:VBP
U10:NNP/NNP
U11:NNP/VBP
U12:NNP/VBP
U13:CNew/NNP
U14:CYork/NNP
U15:Care/VBP
Although the generated strings of U07 and U08, and of U11 and U12, are the same, we can still distinguish them by their id strings.
In the feature configuration file, the keyword TFEATURE_FILENAME is used to specify the file name of the template feature set in binary format.
Context Template Features
Context template features are template features combined with context. For example, if the setting is "-1,0,1", this feature combines the features of the current token with those of its previous token and its next token. For instance, if the sentence is "how are you" and the current token is "are", the generated feature set will be {Feature("how"), Feature("are"), Feature("you")}.
In the feature configuration file, the keyword TFEATURE_CONTEXT is used to specify the tokens' context range for this feature.
Word Embedding Features
Word embedding features describe the features of a given token. They are very useful when we only have a small labeled corpus but lots of unlabeled corpus. This feature is generated by the Txt2Vec project: with lots of unlabeled corpus, Txt2Vec is able to generate a vector for each token. Note that the token granularity of the word embedding features and the RNN training corpus should be consistent, otherwise tokens in the training corpus cannot be matched with the feature. For more detailed information about how to generate word embedding features, please visit the Txt2Vec homepage.
In RNNSharp, this feature also supports context: all features of the given context positions are combined into a single word embedding feature.
In the feature configuration file, there are three keywords: WORDEMBEDDING_FILENAME specifies the encoded word embedding data file name generated by Txt2Vec, WORDEMBEDDING_CONTEXT specifies the token's context range, and WORDEMBEDDING_COLUMN specifies the column index in the corpus to which the feature is applied.
Run Time Features
Compared with the other features, which are generated offline, this feature is generated at run time. It uses the output of previous tokens as a run time feature for the current token. This feature is only available for forward RNN; bi-directional RNN does not support it.
In the feature configuration file, the keyword RTFEATURE_CONTEXT is used to specify the context range of this feature.
Feature Configuration File
The file contains the configuration items for all kinds of features. They have been introduced in the sections above, and here is an example. In the console tool, use the -ftrfile parameter to specify the feature configuration file.
#The file name of template feature set
TFEATURE_FILENAME:tfeatures
#The context range of the template feature set. In the example below, the context is the current token, the next token and the token after next
TFEATURE_CONTEXT: 0,1,2
#The word embedding data file name generated by Txt2Vec
WORDEMBEDDING_FILENAME:word_vector.bin
#The context range of word embedding. In the example below, the context is the previous token, the current token and the next token
WORDEMBEDDING_CONTEXT: -1,0,1
#The column index for word embedding feature
WORDEMBEDDING_COLUMN: 0
#The context range of the run time feature. In the example below, RNNSharp uses the output of the previous token as a run time feature for the current token
RTFEATURE_CONTEXT: -1
Training file format
A training corpus contains many records that describe what the model should learn. Each record contains one or more tokens, and each token has one or more feature dimensions to describe it.
In the training file, each record is represented as a matrix and ends with an empty line. In the matrix, each row describes one token and its features, and each column represents a feature in one dimension. Across the entire training corpus, the number of columns must be fixed.
When RNNSharp encodes the corpus, if the column size is N, then according to the template file, the first N-1 columns are used as input data for binary feature set generation and model training. The Nth column (aka the last column) is the answer for the current token, which the model should output.
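A minimal sketch of reading such a corpus: records are separated by blank lines, the first N-1 columns of each row become features and the last column becomes the answer tag. The reader below is illustrative and assumes whitespace-separated columns; it is not RNNSharp's actual loader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CorpusReaderSketch
{
    // One token: its feature columns (the first N-1 columns) and its answer tag (the last column).
    public sealed class Token
    {
        public string[] Features;
        public string Tag;
    }

    // Read records from a whitespace-separated corpus file; a blank line ends a record.
    public static IEnumerable<List<Token>> ReadRecords(string path)
    {
        var sentence = new List<Token>();
        foreach (string line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                if (sentence.Count > 0) { yield return sentence; sentence = new List<Token>(); }
                continue;
            }
            string[] cols = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            sentence.Add(new Token
            {
                Features = cols.Take(cols.Length - 1).ToArray(), // input columns
                Tag = cols[cols.Length - 1]                      // answer column
            });
        }
        if (sentence.Count > 0)
            yield return sentence;
    }
}
```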
Here is an example (a bigger training example file is available in the release section, where you can view and download it):
Word | Pos | Tag |
---|---|---|
! | PUN | S |
Tokyo | NNP | S_LOCATION |
and | CC | S |
New | NNP | B_LOCATION |
York | NNP | E_LOCATION |
are | VBP | S |
major | JJ | S |
financial | JJ | S |
centers | NNS | S |
. | PUN | S |
! | PUN | S |
p | FW | S |
' | PUN | S |
y | NN | S |
h | FW | S |
44 | CD | S |
University | NNP | B_ORGANIZATION |
of | IN | M_ORGANIZATION |
Texas | NNP | M_ORGANIZATION |
Austin | NNP | E_ORGANIZATION |
In the above example, the output answer is designed in "POS_TYPE" format: POS means the position of the term in the chunk or named entity, and TYPE means the output type of the term.
The example is for labeling named entities in records. It has two records, and each token has three columns. The first column is the term of a token, the second column is the corresponding token's pos-tag, and the third column indicates the named entity type of the token. The first and the second ...