Training takes too long!! #34
Hi @andy-soft, for your labeling task, how many categories do you want to label? Could you please share the configuration file you are using with me? Then I can estimate whether the current performance is reasonable. Currently, RNNSharp doesn't support GPU training. It supports CPU training with SIMD instructions only, so you need a powerful CPU with a modern SIMD instruction set, such as AVX or AVX2. I have used RNNSharp for sequence labeling tasks on inflectional languages such as English, for POS tagging, named entity recognition and so on. Usually the number of labeling categories is no more than 50. If there are too many categories, it will definitely hurt performance, and you should optimize them, for example by splitting them into a few basic units for labeling. If it's really hard to reduce their number, you could use SampledSoftmax as the output layer type: for each token, it randomly samples some categories, plus the categories in the current labeling sentence, for training instead of the entire category set.
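For reference, the output layer type is set in the RNNSharp configuration file; a minimal sketch using the sampled softmax option described later in this thread (the sample size of 20 is just an example value):

#Sampled softmax output layer; negative sample size is 20
OUTPUT_LAYER = SampledSoftmax:20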
Hello, thanks for the reply.
Actually I was only testing the labels on the English NER labels (I think there are only 5-6 entity types, and with the BIO labels there are at most 12-15 labels). My level of classification is about the same, maybe 20 labels at most (B, I, O, S, with 10-15 entity types).
The problem is the many labels on each word; the variability is huge, more than 900 different POS labels (EAGLES 2 version).
I can sub-sample them, creating a two-level problem, but I don't know how the LSTM will behave on this, or even how to compose the output layer. The problem is too complex; I hope to discover how to combine so many labels and POS parts into one problem, as I told you before. But thank you for answering and for such good programming.
I used another word2vec encoder from Google and got severely low performance, then I refactored it, but something went wrong: the resulting system trains well, but the cosine distance is always too close to one, and I could not find the problem I might have introduced. I sped the thing up 20 times by programming some parts well. Maybe I can also help you; I am a fairly skilled C# programmer (30 years programming).
& thanks again!
It would be really appreciated if you could contribute to RNNSharp. :) For word2vec, you can try my version: https://github.com/zhongkaifu/Txt2Vec It has higher performance than the original word2vec and supports incremental training. For "the problem is the many labels on each word; the variability is huge, more than 900 different POS labels (EAGLES 2 version)", could you please give a specific example? Sorry, I don't understand it.
Hi Zhongkai,
I tried your Txt2Vec version, and its performance is similar to (even slightly lower than) word2vec (a C# port I've modified). The threaded part and the logger may slow the system down, and even the double[] vector calculus makes it a bit slower, I guess.
I will send you my modified version; I have made some optimizations manually inside the code (well commented).
The 900-label problem comes from the fact that Spanish, like many European languages, swallows many aspects of a statement into the word itself: number, gender, diminutives, augmentatives, and in the case of verbs even person and mood are inside the morphology of each word. Worst of all, people use many prefixes to modify only the semantics of a word, which results in a new word for the vocabulary if you take only the written form rather than the decomposed one, which I can split using my libraries based on a huge Spanish word corpus (>300Mw) I've collected over the last 12 years. Each prefix and suffix adds semantic information to the word, and I guess this can be used to train several "aspects" of an RNN, allowing better comprehension of the phrase for NLU-based systems.
For example, the way a named entity like a place is prepended, and the way a sentence construct treats it, is similar to an organization but slightly different from a proper name or person. But a person is also addressed inside a sentence as something different, and the "semantic" properties of the verbs involved, as well as the adjectives, carry information to determine whether an adjective, a simple noun, or a pronoun is referring to a person. So anaphora detection should be done inside this "smarter" named entity detector I am seeking to build, maybe with several parallel stages trained on "nameability" or "placeability" (sorry for the OOV terms, but this is what I mean).
For example, in Spanish the word "hiperrecontrabuenísimo" is an OOV (not in conventional dictionaries), but for a native speaker its meaning is clear from the prefix+suffix+root: "hiper" (augmentative), "recontra" (another augmentative), "buen" (good) + "ísimo" (another augmentative). This word, like many others of its type, is used in colloquial chat/conversation but never ends up in any dictionary!
So my idea is to train a network capable of extracting sense relationships embedded inside syntax relations, like the relation between a verb (root) and a direct object (root) with semantic features.
The tags have a string representation; just search for the EAGLES format. It's an extended POS tag set, much more complete than English tag sets (Penn Treebank and the like). The length is variable and they look like this: NCMS for Common Noun, Masculine, Singular.
You can imagine the thousands of combinations for each POS class, as well as the sub-classifications.
Just this; I hope you understood my ideas. If you have any idea or question, it will be addressed and responded to quickly!
best regards
Andrés
Hi Andrés, thanks for your detailed explanation. It's really helpful.
For a word like "hiperrecontrabuenísimo", you could label its decomposed parts instead of the composite label, for example:
hiper \t S_Aug1
recontra \t S_Aug2
buen \t S_CorePart
ísimo \t S_Aug3
So the label "Aug1Aug2CorePartAug3" is split into four basic tags. Or you could try character-level labeling, such as:
h \t B_Aug1
(and so on for the remaining characters)
In this way, it will significantly reduce the number of output categories. Thanks
In addition, did you try the latest RNNSharp code (checked out from the master branch)? It's much faster than the released version, since I have not updated the release package yet.
No, I'll try tomorrow,
thanks
I am still trying to understand all you told me,
but too tired now!
Until tomorrow!
I am training a NER for Spanish with a small corpus (8k), and it takes too long and gets stuck; the token error rate stops improving at a bit less than 9%:
info,31/5/2017 3:02:41 p. m. Progress = 8K/8,323K
info,31/5/2017 3:02:41 p. m. Train cross-entropy = 0,320236720036713
info,31/5/2017 3:02:41 p. m. Error token ratio = 7,50999862854771%
info,31/5/2017 3:02:41 p. m. Error sentence ratio = 64,1830228778597%
info,31/5/2017 3:03:06 p. m. Iter 33 completed
info,31/5/2017 3:03:06 p. m. Sentences = 8323, time escape =
00:09:20.6314274s, speed = 14,8457606784532
info,31/5/2017 3:03:06 p. m. In training: log probability =
-27240,7618759363, cross-entropy = 0,321621874395561, perplexity =
1,24973470831842
info,31/5/2017 3:03:06 p. m. Verify model on validated corpus.
info,31/5/2017 3:03:06 p. m. Start validation ...
info,31/5/2017 3:03:24 p. m. In validation: error token ratio =
7,13251598951747% error sentence ratio = 66,4469347396177%
info,31/5/2017 3:03:24 p. m. In training: log probability =
-5154,61463169968, cross-entropy = 0,313802466020867, perplexity =
1,24297946835413
It has been training since yesterday; when will it stop?
This was the bat file:
SET CorpusPath=.\Data\Corpus\NER_ES
SET ModelsPath=.\Data\Models\NER_ES
SET BinPath=..\Bin
REM Build template feature set from training corpus
%BinPath%\TFeatureBin.exe -mode build -template %CorpusPath%\template.txt
-inputfile %CorpusPath%\train.txt -ftrfile %ModelsPath%\tfeatures -minfreq 1
REM Encoding LSTM-BiRNN-CRF model
%BinPath%\RNNSharpConsole.exe -mode train -trainfile %CorpusPath%\train.txt
-validfile %CorpusPath%\valid.txt -cfgfile .\config_ner_enu.txt -tagfile
%CorpusPath%\tags.txt -alpha 0.1 -maxiter 0 -savestep 200K
I can also send you the training samples.
Should I buy a CUDA card?!
best
thanks
According to the RNN output lines, you are still using an older RNNSharp. Please sync the latest source code (not the released demo package, since I have not updated it yet), build it, and train your model. It's okay to send me the training example, configuration file and command line you ran.
Thanks Zhongkai,
I'll download the latest code, build it ASAP, and tell you the result.
Also I noticed some of my mistakes: I missed an absolute reference in the configuration files, pointing towards the English version "xxx_enu", and I need to train a Txt2Vec model, so now I replaced it as:
WORDEMBEDDING_FILENAME = D:\RNNSharpDemoPackage\WordEmbedding\wordvec_es.bin
And I am going to generate an embedding file for Spanish as well!
Andrés
PS:
The bad run (based on English resources.. bad!) finished its training 2 days later and generated a 1.0 GB model file. I think these lengthy files need to be pruned somehow; it's not practical to have a 1.0 GB parameter file! The resulting model would be memory and resource hungry. Might it not be useful for production?
Hi, now I am training on real words, with the new routines (just downloaded).
I trained on a Spanish corpus of 380 megabytes of raw text using your Txt2Vec and created the *.bin for the training, then redirected all the documents in the configuration, corrected the files with the new syntax from GitHub, and started to train, but it is still running after over 1 day. When should it stop?
Wouldn't it be worth being able to interrupt the training and resume later, using some kind of console or whatever? If I abort the training, I lose all the work done!
Any clue? (I guessed it would stop after the 20th iteration, but the show still goes on...)
The config file is this:
-----------------------------------------------------------------------
#Working directory
CURRENT_DIRECTORY = .
#Model type. Sequence labeling (SEQLABEL) and sequence-to-sequence (SEQ2SEQ) are supported.
MODEL_TYPE = SEQLABEL
#!Model direction. Forward and BiDirectional are supported
#!MODEL_DIRECTION = BiDirectional
#Network type. Four types are supported:
#For sequence labeling tasks, we could use: Forward, BiDirectional, BiDirectionalAverage
#For sequence-to-sequence tasks, we could use: ForwardSeq2Seq
#BiDirectional type concatenates outputs of the forward layer and backward layer as the final output
#BiDirectionalAverage type averages outputs of the forward layer and backward layer as the final output
NETWORK_TYPE = BiDirectional
#Model file path
MODEL_FILEPATH = Data\Models\NER_ES\model.bin
#Hidden layer settings. LSTM and Dropout are supported. Here are examples of these layer types
#Dropout: Dropout:0.5 -- Drop out ratio is 0.5
#If the model has more than one hidden layer, each layer's settings are separated by a comma. For example:
#"LSTM:300, LSTM:200" means the model has two LSTM layers. The first layer size is 300, and the second layer size is 200
HIDDEN_LAYER = LSTM:200
#Output layer settings. Simple, softmax and sampled softmax are supported. Here is an example of sampled softmax:
#"SampledSoftmax:20" means the output layer is a sampled softmax layer and its negative sample size is 20
#"Simple" means the final result is the raw output of the layer.
#"Softmax" means the final result is based on the "Simple" layer and runs softmax
OUTPUT_LAYER = Simple
#CRF layer settings
CRF_LAYER = True
#The file name for the template feature set
TFEATURE_FILENAME = Data\Models\NER_ES\tfeatures
#The context range for the template feature set. Below, the context is the current token, the next token and the token after that
TFEATURE_CONTEXT = 0,1,2
TFEATURE_WEIGHT_TYPE = Binary
PRETRAIN_TYPE = Embedding
#The word embedding data file name generated by Txt2Vec (https://github.com/zhongkaifu/Txt2Vec)
WORDEMBEDDING_FILENAME = Data\WordEmbedding\wordvec_es.bin
#The context range for word embedding.
WORDEMBEDDING_CONTEXT = 0
#The column index applied word embedding feature
WORDEMBEDDING_COLUMN = 0
#The run time feature
#RTFEATURE_CONTEXT: -1,-2,-3
----------------------------------------------------------------
The bat file is here:
SET CorpusPath=.\Data\Corpus\NER_ES
SET ModelsPath=.\Data\Models\NER_ES
SET BinPath=..\Bin
REM Encoding LSTM-BiRNN-CRF model
%BinPath%\RNNSharpConsole.exe -mode train -trainfile %CorpusPath%\train.txt
-validfile %CorpusPath%\valid.txt -cfgfile .\config_ner_es.txt -tagfile
%CorpusPath%\tags.txt -alpha 0.1 -maxiter 0 -savestep 200K
-----------------------------------------------------------------------------------------------------------
Since I have set maxiter to 0, will it stop when the system no longer improves its training?
How long does your trainer take on the English corpus from the downloaded sample, and on what kind of PC (RAM, processor, OS)?
Actually my process is consuming 1.9 GB of RAM at 100% CPU (2 cores) on a Windows 10 x64 dual-core G2020 (2.66 GHz Pentium), not any beast or numeric workhorse, and no GPU installed.
The process now shows these readings in the log file:
info,2/6/2017 2:58:13 p. m. End 28 iteration. Time duration =
00:13:43.3734915
info,2/6/2017 2:58:13 p. m.
info,2/6/2017 2:58:13 p. m. Verify model on validated corpus.
info,2/6/2017 2:58:29 p. m. Progress = 1K/1,517K
info,2/6/2017 2:58:29 p. m. Error token ratio = 3,60980155684684%
info,2/6/2017 2:58:29 p. m. Error sentence ratio = 47,7%
info,2/6/2017 2:58:37 p. m. End model verification.
info,2/6/2017 2:58:37 p. m.
info,2/6/2017 2:58:37 p. m.
info,2/6/2017 2:58:37 p. m. Start to training 29 iteration. learning rate =
0,00078125
info,2/6/2017 3:01:26 p. m. Progress = 2K/8,323K
info,2/6/2017 3:01:26 p. m. Error token ratio = 0,0864513976309284%
info,2/6/2017 3:01:26 p. m. Error sentence ratio = 1,65%
info,2/6/2017 3:01:26 p. m. Progress = 2K/8,323K
info,2/6/2017 3:01:26 p. m. Error token ratio = 0,0864513976309284%
info,2/6/2017 3:01:26 p. m. Error sentence ratio = 1,65%
info,2/6/2017 3:06:03 p. m. Progress = 5K/8,323K
info,2/6/2017 3:06:03 p. m. Error token ratio = 0,091533180778032%
info,2/6/2017 3:06:03 p. m. Error sentence ratio = 1,78%
info,2/6/2017 3:06:03 p. m. Progress = 5K/8,323K
info,2/6/2017 3:06:03 p. m. Error token ratio = 0,091533180778032%
info,2/6/2017 3:06:03 p. m. Error sentence ratio = 1,78%
info,2/6/2017 3:07:35 p. m. Progress = 6K/8,323K
info,2/6/2017 3:07:35 p. m. Error token ratio = 0,0901244410551492%
info,2/6/2017 3:07:35 p. m. Error sentence ratio = 1,81666666666667%
info,2/6/2017 3:09:01 p. m. Progress = 7K/8,323K
info,2/6/2017 3:09:01 p. m. Error token ratio = 0,0972418293405895%
info,2/6/2017 3:09:01 p. m. Error sentence ratio = 1,9%
info,2/6/2017 3:09:01 p. m. Progress = 7K/8,323K
info,2/6/2017 3:09:01 p. m. Error token ratio = 0,0972418293405895%
info,2/6/2017 3:09:01 p. m. Error sentence ratio = 1,9%
info,2/6/2017 3:10:41 p. m. Progress = 8K/8,323K
info,2/6/2017 3:10:41 p. m. Error token ratio = 0,100398246377297%
info,2/6/2017 3:10:41 p. m. Error sentence ratio = 1,9125%
info,2/6/2017 3:11:11 p. m. End 29 iteration. Time duration =
00:12:34.1796782
info,2/6/2017 3:11:11 p. m.
info,2/6/2017 3:11:11 p. m. Verify model on validated corpus.
info,2/6/2017 3:11:27 p. m. Progress = 1K/1,517K
info,2/6/2017 3:11:27 p. m. Error token ratio = 3,53827053870236%
info,2/6/2017 3:11:27 p. m. Error sentence ratio = 46,4%
info,2/6/2017 3:11:34 p. m. End model verification.
info,2/6/2017 3:11:34 p. m.
info,2/6/2017 3:11:34 p. m.
info,2/6/2017 3:11:34 p. m. Start to training 30 iteration. learning rate =
0,00078125
First of all, your CPU has only two cores; this is the main reason why training is slow. Secondly, I don't know if your CPU supports the AVX and AVX2 instructions used by SIMD to speed up training. You could share the first few log lines with me, and I will take a look. Finally, you could set TFEATURE_CONTEXT=0 to reduce the number of sparse features and speed up training.
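Concretely, that means changing one line in the configuration file shown above, so template features are extracted from the current token only instead of a three-token window:

#The context range for the template feature set; 0 = current token only (fewer sparse features, faster training)
TFEATURE_CONTEXT = 0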
It just finished running; I'll attach the complete log file.
My CPU is a G2020; here is the full spec:
[image: Inline image 1]
I will probably make a Forms application to control the training run, to visualize it and eventually stop it and/or change parameters on the run; maybe this would give better control for people who, like me, have slow CPUs.
On the other hand, have you seen the approach taken by the people at Facebook with fastText? They claim to have lowered training time by orders of magnitude by means of subsampling; I attach the link to the git (maybe I can port it to C#, improving the overall methods). I plan to use this for a real-world NER application in Spanish, which, as I told you before, has a very rich inflectional morphology, so the words are agglutinative: unlike Chinese, they contain lots of semantic, grammatical and modal information inside their structure, generating a huge number of inflected word forms based on only a small set of root words. This makes a trainer choke if it doesn't use this information. I have built a morphological analyzer which is able to "strip down" words into a large set of variables: some are semantic, some are gender, number, root, prefixes, suffixes, tense/person/mood/colloquial in the case of verbs, among many others. Is there a way to train an RNN with these sparse features representing the same word and position? I guess the feature extractor should be tailored; I can do this!
Also I guess a NER chunker could benefit from a multi-stage classifier: one stage to detect the entity boundaries and another to classify the entity; once you get the segment, you can re-classify it with another, better and simpler in-segment classifier.
For example, for time-related entities the variability in Spanish is so huge that probably any classifier would go nuts trying to set the boundaries, and there will also never be enough samples of these named entities, due to the multiple variations, to train a complete system.
I will try to do this; even with other named entities the problem is very similar.
Thanks for your collaboration.
If you want to take a look at my work in NLP, I can send a PDF in English (but I don't want it public on this thread) to a private email address.
best regards
Andrés
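To make the sparse-feature idea above concrete, here is a hypothetical sketch of how decomposed morphological information could be laid out as extra tab-separated columns next to each token in a CRF-style training file, with the tag kept in the last column; the column names, values and exact layout are assumptions for illustration only, not RNNSharp's documented format:

hiperrecontrabuenísimo \t hiper+recontra \t buen \t ísimo \t AQ \t O
Madrid \t - \t madrid \t - \t NP \t B-LOC

Each extra column could then be exposed to the model through the template feature set (the tfeatures built by TFeatureBin.exe in the scripts above), so the classifier would see prefix/root/suffix features rather than only the raw surface form.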
Hi Andrés, it would be really appreciated if you would like to contribute to the RNNSharp project. :) I cannot see your inline image for the CPU G2020. According to the information at http://www.cpu-world.com/CPUs/Pentium_Dual-Core/Intel-Pentium%20G2020.html, it seems this CPU doesn't support AVX and AVX2 instructions, so RNNSharp cannot emit SIMD instructions to speed things up.
Hi Zhongkai Fu,
I'll see about getting an Intel Core i5-3570 ASAP, if the speedup is worth it (at least 2 more cores!). An i7 is much the same, as hyperthreading does not help too much; each thread gets only half a time slot!!!
I am also looking at the new AMD Ryzen 7 chips, but I still fear their compatibility with MS .NET technologies for now; there have been some awkward reports!
One more question:
Do you instruct RNNSharp explicitly to issue SIMD instructions, or is this handled by the internal .NET CLR JIT compiler? (I guess the latter.)
I have to check whether the .NET runtime reports anything about this!
I had assumed (as on many information boards) that Ivy Bridge series chips were AVX compatible, but it seems Intel has crippled this one in the chip, to make their Core i5/i7/i9 series shine, as they do to tailor our budgets.
thanks for the info!
Andrés
I'm using System.Vectors, which is a component of .NET Core, to emit SIMD instructions (AVX and AVX2) for RNNSharp. If that AMD CPU supports these AVX instructions, RNNSharp can leverage them as well.
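As a quick way to see whether the runtime will actually use SIMD on a given machine, here is a minimal C# sketch using the System.Numerics vector types (the same mechanism the comment above refers to); it is only an illustration, not code taken from RNNSharp:

using System;
using System.Numerics;

class SimdCheck
{
    static void Main()
    {
        // True only when the JIT can map Vector<T> operations to SIMD instructions (e.g. AVX2/SSE)
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
        // Number of floats processed per SIMD operation (e.g. 8 with AVX2, 4 with SSE)
        Console.WriteLine($"Vector<float>.Count: {Vector<float>.Count}");

        // A tiny vectorized dot product, the kind of inner loop SIMD speeds up in RNN training
        float[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
        float[] b = { 8, 7, 6, 5, 4, 3, 2, 1 };
        float dot = 0f;
        int i = 0;
        for (; i <= a.Length - Vector<float>.Count; i += Vector<float>.Count)
            dot += Vector.Dot(new Vector<float>(a, i), new Vector<float>(b, i));
        for (; i < a.Length; i++)   // scalar tail for any leftover elements
            dot += a[i] * b[i];
        Console.WriteLine($"Dot product: {dot}");
    }
}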
Hi there, I just got a CPU with 16 cores and 128 GB of RAM. Ready to train hard!!
Cool! I recently introduced MKL into Seq2SeqSharp and got a significant improvement in performance; if you like, you could try it in RNNSharp.
Great, which are the files to download to test it?
Also, did you try to make it into a service, so it can work as a REST endpoint or something else (to be consumed from other apps)?
BTW I am going to send you some improvements for Txt2Vec soon (speed and flexibility).
& thanks
I just started training the English SeqClassif (NER) sample from your demo: a 143 MB flat text file, 2.2M words.
Hello, I was wondering what the training times are for the demonstrations.
I just tried the English sequence labeler, and it took 1 hour to process 10% of the corpus! (Is this normal?)
It's known that deep learning is CPU hungry; I have only 2 cores and 8 GB of RAM (sorry).
Do I need to change the PC, or acquire a CUDA card to help with the computing?
Is there a way to stop learning manually, or programmatically after reaching a certain error rate?
I am also wondering if you ever tried sequence labeling on highly inflectional languages (like Spanish), which have a lot of inflectional power (complexity). The words as whole strings are useless: the vocabulary explodes into >300M words, and the "examples" found in text become too sparse. Even with negative sampling you never get certain combinations, because most verbs have over 200 versions of themselves (inflections), including tense, person, gender, plurality, mood, etc. So there is a need to train on higher-level features without losing the "semantic" sense. Do you think this could be possible, for example by decomposing the words (by means of controlled independent lemmatization) into parts/chunks (prefix, root, suffix, as well as modal information and semantic features of the parts)? My intuition is that this might lower the training cost and maybe improve the generalization power with a less extensive corpus, like capturing higher-level syntax rules and, along the way, generating semantic content constraints (maybe even some common sense)...
It's just a question, in theory!