Implementation details #1

Closed
thinline72 opened this issue Dec 28, 2018 · 15 comments

@thinline72

Hi Guillaume,

I'm also pretty excited about this paper and the suggested approach. I've gone through your implementation as well as your questions on Stack Overflow:
https://stackoverflow.com/questions/53876974/how-is-hashing-implemented-in-sgnn-self-governing-neural-networks

Sorry, I don't have enough reputation to comment on your Stack Overflow question and I don't have the right answers either; I just want to discuss it.

  1. ... on words' characters to obtain a featurized representation of words in a text ...

To me it seems like they used the whole text at the character level, not the word level. So for each example there is one feature vector that is then passed to the projections. That said, your implementation is quite interesting as well, because it still allows the use of RNNs/CNNs and keeps the skip-gram/n-gram ordering.

  1. ... it doesn't seem to yield a binary output ...

If I understand the paper correctly, the binary output comes from the sign of the dot product between the feature vector F and a generated projection vector P. Assuming both vectors are normalized (unit length, and they point out that LSH produces unit vectors), the dot product would fall in the range [-1, 1] (cosine similarity), and its sign would give us the binary output.

What do you think about it? Does my understanding make sense?

I'm still not quite sure how those P vectors are obtained via LSH though.
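Something like this minimal sketch is what I have in mind, assuming the P vectors are just random hyperplanes (which may not be exactly how the authors set up their LSH):

```python
import numpy as np

rng = np.random.RandomState(42)

d = 14        # bits per projection function (d in the paper)
n_dim = 1000  # dimensionality of the feature vector F (made up here)

P = rng.randn(d, n_dim)  # d random hyperplanes, i.e. the "P vectors"
F = rng.rand(n_dim)      # some feature vector for one text example

bits = (P @ F >= 0).astype(np.int8)  # sign of each dot product -> d-bit code
print(bits)
```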

@thinline72
Author

Actually, I haven't found any mention that the F vector should be normalized, so there is probably no restriction here and we can just compute the dot product + sign.

@thinline72
Author

BTW, the paper that inspired the authors contains many more details regarding the projection functions: https://arxiv.org/abs/1708.00630

@thinline72
Author

Looks like this RandomBinaryProjections impl might be used: https://github.com/pixelogik/NearPy/blob/master/nearpy/hashes/randombinaryprojections.py

@guillaume-chevalier
Owner

@thinline72 Thanks for the link! 🎉 Will read soon!

Yet, I'm hesitant on this:

it seems like they used the whole text on character-lvl, not word-lvl

They mention skip-gram. Could it be that they did it like I do, but that the regression goal was formulated like word2vec's skip-gram (training with negative sampling), with some "nearest words vs. random/far words"? Just throwing ideas out here. I wonder how they use the "trainer network" in the paper you just linked (I haven't read it yet). Because they can't map the words back to their original form, they could be training directly with a softmax on cosine similarity (as an attention mechanism), like what's done in Matching Networks.

For now I'm somewhat satisfied with the code I've written, but I'd still like to fully understand the original paper. My goal is to use this word-level "word projector" within a self-attention mechanism (like the Transformer in AIAYN), and then to perform sentence-level skip-gram training with a Matching-Network-like regression head. When I'm done with this, I'll upload the big neural net to this repo (and perhaps rename the repo); for now that repo is still private.

@thinline72
Author

Hi @guillaume-chevalier ,
Thanks for the response and for the links; I really like the idea in the Matching Networks one. I'll definitely play with it.

it seems like they used the whole text on character-lvl, not word-lvl
They mention skip-gram

Sorry, it looks like I explained it incorrectly. I actually meant the text level, like in classical approaches. So there would not be a sequence of tokens/words/vectors, but just one sparse vector for the whole text example.

I also think that for skip-grams they used the old classic ones, like this implementation in NLTK.
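For example, something like this (just an illustration of classic skip-grams with nltk.util.skipgrams, not necessarily their exact feature extraction):

```python
from nltk.util import skipgrams

tokens = "the quick brown fox".split()
# 2-grams with up to 1 skipped token in between:
print(list(skipgrams(tokens, 2, 1)))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'brown'),
#  ('quick', 'fox'), ('brown', 'fox')]
```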

So it looks like their preprocessing and architecture might be reproduced by using the following steps:

  1. CountVectorizer with a custom analyzer that uses skip-grams (but it looks like it's OK to use any other features). That would give us F.
  2. RandomBinaryProjections (T=80, d=14 in the paper), so 80 different projections, each producing a 14-dimensional binary vector. That would give us P. Actually, it might be just one RandomBinaryProjections with T*d. It looks like they split it into T of them because of the memory restrictions when working with mobile devices.
  3. Concatenation of all binary vectors.
  4. 2 FullyConnected layers of size 256.
  5. Last classification layer with softmax/sigmoid activation.

So the workflow looks pretty similar to something well-known like TfidfVectorizer + SVD + NN, just adjusted to the restrictions of mobile devices.
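Roughly, steps 1-3 could look like the sketch below (my own interpretation with made-up sizes; plain numpy stands in for NearPy's RandomBinaryProjections, and the two FC layers plus softmax of steps 4-5 would sit on top of `projected` in any DL framework):

```python
import numpy as np
from nltk.util import skipgrams
from sklearn.feature_extraction.text import CountVectorizer

def skipgram_analyzer(text):
    # Character-level 3-grams with up to 2 skips, as one possible featurization.
    return [''.join(g) for g in skipgrams(list(text), 3, 2)]

texts = ["hello world", "how are you", "hello there"]

# Step 1: one sparse feature vector F per whole text example.
vectorizer = CountVectorizer(analyzer=skipgram_analyzer)
F = vectorizer.fit_transform(texts).toarray().astype(np.float32)

# Step 2: T=80 random projections of d=14 bits each (stacked into one matrix).
T, d = 80, 14
rng = np.random.RandomState(0)
planes = rng.randn(T * d, F.shape[1])

# Step 3: project, take signs, concatenate -> (n_examples, T*d) binary matrix.
projected = (F @ planes.T > 0).astype(np.float32)
print(projected.shape)  # (3, 1120)
```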

@guillaume-chevalier
Owner

guillaume-chevalier commented Dec 31, 2018

Very interesting. I've finally read the previous supporting paper, thanks for the shout-out. Here are my thoughts after reading it. To sum up, I now think that the projections are really at the word level instead of at the sentence level. This is for two reasons:

  1. they use a hidden layer size of only 256 to represent words neurally (whereas sentence representations would be quite a bit bigger), and
  2. they seem to use an LSTM on top of the ProjectionNet (SGNN) to model short sentences in their benchmarks, which would mean the ProjectionNet doesn't encode at the sentence level but at a lower level (probably words).

Here is my full review:

On 80*14 vs. 1*1120 projections:

  • I thought the set of 80 projection functions was not for time performance, but rather to take the union of potentially different features. Either way, I think one projection function of 1120 entries would take about as much time to compute as 80 functions of 14 entries (80*14 = 1120) - please correct me if I'm wrong.

On the hidden layer size of 256:

  • I find it peculiar that the size of their FullyConnected layers is only 256. I'd expect 300 for word-level neural representations and more like 2000 for sentence-level neural representations. This leads me to think that the projection layer is at the word level with char features and not at the sentence level with char features.

On the benchmark against a nested RNN (see section "4.3 Semantic Intent Classification") in the previous supporting paper:

  • They say "We use an RNN sequence model with multilayer LSTM architecture (2 layers, 100 dimensions) as the baseline and trainer network. The LSTM model and its ProjectionNet variant are also compared against other baseline systems [...]". The fact that they phrase their experiment as "The LSTM model and its ProjectionNet" leads me to think that they pre-tokenized the texts into words and that the projection layer is applied at the word level on skip-gram char features. This would go in the same direction as the fact that they use a hidden layer (FullyConnected) size of only 256 rather than something greater than or equal to roughly 1000.

On teacher-student model training:

  • They seem to use a regular NN as a crutch to assist the projection layer's top layers in reshaping information correctly. They even train the teacher at the same time as the student SGNN, which is something I hadn't yet seen compared to regular teacher-student setups. I'd find it simpler to use Matching Networks directly, which would be much simpler than setting up student-teacher learning. I'm not sure how their "graph structured loss functions" work - I still assume that they'd need to train the whole thing like in word2vec with skip-gram or CBOW (but here with the new type of skip-gram training procedure instead of the char feature-extraction skip-gram). I wonder why they did things in such a complicated way. Matching Networks (a.k.a. a cosine similarity loss, a.k.a. self-attention queries dotted with either attention keys or values before a softmax) directly with negative sampling seems so much simpler.

@thinline72
Author

Hi @guillaume-chevalier ,

Sorry, I haven't read the supporting paper in depth. I only checked how Randomized Binary Projections are implemented. So yeah, they might have worked at the word or character level, and it totally makes sense to do so. My point was that in the Self-Governing Neural Networks for On-Device Short Text Classification paper it's at the text level. That's clear from the description and architecture (Figure 1).

On 80*14 vs. 1*1120 projections:

Based on Randomized Binary Projections, creating 80 random matrices of size 14*inp_dim (and concatenating the results after applying them) or creating just one random matrix of size 1120*inp_dim are the same thing. After all, it's just about having 80*14*inp_dim or 1120*inp_dim random numbers. But the latter would require much more memory to compute the hashing (as it would involve a multiplication with a much bigger matrix) than 80 computations with smaller matrices. So 80*14 is more appropriate for devices with limited RAM. That's why I think they used 80*14.
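A tiny numpy check of that equivalence (my own illustration, not code from either paper):

```python
import numpy as np

inp_dim = 1000
rng = np.random.RandomState(0)
F = rng.rand(inp_dim)

# One big 1120 x inp_dim random matrix...
big = rng.randn(80 * 14, inp_dim)
bits_big = (big @ F > 0).astype(np.int8)

# ...gives exactly the same 1120-bit code as 80 small 14 x inp_dim matrices
# applied one at a time (only ~14*inp_dim floats needed in memory per step).
small = big.reshape(80, 14, inp_dim)
bits_small = np.concatenate([(m @ F > 0).astype(np.int8) for m in small])

assert np.array_equal(bits_big, bits_small)
```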

@guillaume-chevalier
Owner

Reading your reply and the SGNN paper again, you seem right about the 80*14 thing. The next thing would be to understand exactly how they did the skip-gram, but from what I've read I don't have enough information yet. For now, at least, I'm fine with using n-grams; I somehow like it that way, although I won't really compare n-grams vs. skip-grams, at least in the short term. I also still have doubts about my usage of random projections; I hope it works similarly enough to the one you linked, despite the fact that I don't obtain booleans. I haven't found the time to do detailed readings on LSH, although I liked this video (and the next one, titled #9, in the playlist), and the fact that such sparse random hashing preserves the cosine similarity in a lower-dimensional space.

Overall, thank you for the discussion - I'm glad you linked me the previous supporting paper ProjectionNet and that we exchanged on those!

@guillaume-chevalier
Owner

Just got an idea.

Instead of using a Matching Networks loss, I'll use a Cosine Proximity loss which is simpler.

I'd compute the similarity of the B sentences to each other so as to have a BxB matrix of similarities between -1 and 1. The target would be a block diagonal matrix in which every block has inner values of 1, and nearby elements outside a block take the value 0. I'd have blocks because this would be skip-gram, but I'd compare multiple windows (blocks) of sentences to other blocks instead of having the regular skip-gram's block-vs.-nothing. So I'd have a blocks-vs.-blocks training setting. What do you think @thinline72?
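To make that concrete, roughly this (pure numpy, made-up sizes, not actual code from the repo):

```python
import numpy as np

B, dim, window = 6, 32, 2  # 6 sentences, blocks (windows) of 2 sentences
rng = np.random.RandomState(0)

# Pretend sentence embeddings coming out of the encoder, L2-normalized.
E = rng.randn(B, dim)
E /= np.linalg.norm(E, axis=1, keepdims=True)

# B x B cosine similarity matrix, values in [-1, 1].
sims = E @ E.T

# Block diagonal target: 1 inside each window of sentences, 0 elsewhere.
target = np.zeros((B, B))
for start in range(0, B, window):
    target[start:start + window, start:start + window] = 1.0
print(target)
```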

One thing that bothers me is that the cosine similarities aren't passed through a softmax or anything. For example, I could take a binary cross-entropy loss instead of a cosine proximity loss, but I don't know how the two would compare. I feel that a cosine proximity loss wouldn't perform well. I then read this thread, which validates my thinking. Would you have a loss to suggest?

@thinline72
Author

Hi @guillaume-chevalier ,

I think this paper, Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs, might be useful for you, although I'm not 100% sure 🙂

@guillaume-chevalier
Owner

Hi again @thinline72, I tested with only 1 hasher and things were quite a bit faster. See:
#2

That's an interesting paper you just linked. However, for the time being, I finally settled on a binary cross-entropy loss (bxent) on a block diagonal matrix of similarities. Because the similarity is from -1 to 1, I remap it to 0 to 1 before sending it to the bxent loss (+=1, then /=2).
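In numpy terms, the remap + bxent is roughly this (a simplified sketch, not the exact code from the repo):

```python
import numpy as np

def similarity_bxent(sims, target, eps=1e-7):
    # Remap cosine similarities from [-1, 1] to [0, 1]: (sims + 1) / 2.
    probs = np.clip((sims + 1.0) / 2.0, eps, 1.0 - eps)
    # Element-wise binary cross-entropy against the block diagonal target.
    bxent = -(target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs))
    return bxent.mean()
```

With a BxB cosine similarity matrix and a block diagonal target like in the sketch a few comments above, `similarity_bxent(sims, target)` gives the scalar loss.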

The whole code for the SGNN+Transformer+SimilarityBXENT will soon become public here:
https://github.com/guillaume-chevalier/NLP-TP3

@thinline72
Author

thinline72 commented Jan 15, 2019

This is amazing work, well done! And thanks for mentioning me as well.

So I think this issue might be closed. Feel free to reopen or contact me via email if you want to discuss anything else 🙂

@guillaume-chevalier
Owner

@thinline72 Hey, I just found this out: they also explore the diagonal matrix, here with every word of a sentence related to each other, and they also explore learning at different levels: https://arxiv.org/pdf/1808.08949.pdf
This paper is cited in the abstract of BERT, which is a special thing to do in an abstract, I think.

@thinline72
Author

Great! Thank you for sharing @guillaume-chevalier

@glicerico

Thanks for your discussion guys, I gave some comments as a response to the original question here: https://stackoverflow.com/questions/53876974/how-is-hashing-implemented-in-sgnn-self-governing-neural-networks/65546274#65546274
