Fine gateloop doesn't use the param gateloop_use_heinsen #87
@MarcusLoppe if you think gateloop helps, let's add it!
The gateloop actually works, and the Heinsen version works even better.

float32:
- coarse_pre_gateloop_depth = 2, coarse_pre_gateloop_depth = 0, coarse_pre_gateloop_depth = 2 → Epoch 15 average loss: 1.742983629359281
- fine gateloop_use_heinsen = True → Epoch 15 average loss: 0.84913022887898

float16:
- fine gateloop_use_heinsen = True, coarse_pre_gateloop_depth = 2 → Epoch 15 average loss: 0.733044471651475
- fine gateloop_use_heinsen = False, coarse_pre_gateloop_depth = 2 → Epoch 15 average loss: 1.1578134642565314
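(For reference, a minimal sketch of where these flags live when constructing the model. Only `coarse_pre_gateloop_depth` and `gateloop_use_heinsen` are taken from this thread; the remaining kwargs follow README-style usage and may not match the exact signature.)

```python
# illustrative sketch, not a definitive config
from meshgpt_pytorch import MeshAutoencoder, MeshTransformer

autoencoder = MeshAutoencoder(num_discrete_coors = 128)

transformer = MeshTransformer(
    autoencoder,
    dim = 512,
    max_seq_len = 768,
    coarse_pre_gateloop_depth = 2,   # gateloop layers preceding the coarse stage (quoted in this thread)
    gateloop_use_heinsen = True      # use the Heinsen associative scan inside gateloop (quoted in this thread)
)
```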
@MarcusLoppe wow, thanks! that's surprising because the tokens for the fine transformer aren't very long at all
@lucidrains I actually get much better training progression if I rearrange
@MarcusLoppe that was mostly for the hierarchical transformer (the fine stage attends to each token within each face), but for gateloop you are right, it could make sense to just have it act on the whole sequence |
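(A small einops sketch of the per-face vs. whole-sequence layouts being discussed here; the shapes and the gateloop call are illustrative, not taken from the repo.)

```python
import torch
from einops import rearrange

b, nf, n, d = 2, 100, 6, 512                # hypothetical: batch, faces, tokens per face, model dim
fine_tokens = torch.randn(b * nf, n, d)     # fine stage normally attends within each face: '(b nf) n d'

# fold the faces back into one long sequence so a gateloop layer could act globally
whole_seq = rearrange(fine_tokens, '(b nf) n d -> b (nf n) d', b = b)
# whole_seq = gateloop_block(whole_seq)     # hypothetical full-sequence gateloop call

# restore the per-face layout expected by the fine transformer
fine_tokens = rearrange(whole_seq, 'b (nf n) d -> (b nf) n d', nf = nf)
```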
I replaced the line L1657 with:

But for the fine gateloop, it already processes the whole sequence since it rearranges to

A little off topic, but you were right that MeshGPT would be a stepping stone for 3D mesh generation, since there have been many autoregressive 3D projects/papers thanks to you ❤️ Have you looked into the MeshGPT clone repos MeshAnything and pivotmesh to see if they have any interesting changes?
@MarcusLoppe haha, well the real OG would be Yawar Siddiqui for all the clever tricks he discovered for encoding the mesh. But also you, for helping with all the training, feedback, and debugging! I don't think this repository would have ever taken off and spawned downstream work without your help.

No, not yet, but I will gather up a few and read them, probably in a month or two.
Very true :) Although they said they haven't released the source code because they are still getting legal approval. Using 3k triangles would result in a 9k token length; hopefully the PPL score won't get too high :)
pretty sure I'll see your name on some paper within a year 🤣 |
Hey, what would you say if we were to remove the rearranging so the fine_decoder takes the whole sequence as input? I could make a pull request, but I'm not quite sure how to fix the cache code; it looks like I can just remove:
Using
Using:
@MarcusLoppe hey Marcus, I think it is difficult to get the caching right. What about just increasing the number of gateloop layers in the coarse stage?
Is there any particular reason why you want to use (b nf)? I had some issues training on 9k labels and the loss got stuck around 0.2 (with 0.001 increments per epoch), but once I implemented (nf n) the progress got a lot faster and I was able to get to 0.001 loss.

I'd rather not, since the gateloop layers have some compute cost; using 2 fine & 2 coarse increases the epoch time by around 25%.
ah, yeah, I can revisit this at a later date.

Added the ability for full-sequence gateloops just preceding the fine transformer; easiest thing that can be done for now.
If I do some testing and get the cache to work, would you implement it, or do you doubt whether it's better? :)

I'll give it a go; however, you wrote the wrong layer name, you are calling the same coarse gateloop twice.
@MarcusLoppe haha, no I have no doubts! If you can get it working, I would be happy to merge it in.

Oops, let me do a quick fix.
@MarcusLoppe set you up with a test, btw. As long as this passes, you are golden.
That was a bit more complicated than I thought, I'll need to have a think on that. But I tested using 2x post coarse gateloop and that made the performance slower and worse :/
Could I get some advice before raising an issue? All the new AI GPUs have loads of fp16 compute; for example, if I train using float32 on 8 H100s it takes 6 hrs per epoch (2.5B dataset), but if I use fp16 it takes 3 hrs since they have 2x the compute in fp16.

Some lessons I've learned are that the coarse adaptive RMSNorm and the gateloop (very fast when using Heinsen) make the loss go NaN very fast. The reason why the gateloop makes it go NaN is probably the mix of high-precision math expressions and dtypes (complex64).

I've only got a little more than a week left to use the H100s, so instead of dragging my ass through debugging the code for a week, I figured I might ask you :)
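(Not from the repo, but one common way to sidestep this kind of NaN is to keep the numerically sensitive sub-block, e.g. the Heinsen scan with its complex64 math, in float32 while the rest of the model runs under autocast. A minimal sketch; `Float32Island` is a hypothetical wrapper name.)

```python
import torch
from torch import nn

class Float32Island(nn.Module):
    """Hypothetical wrapper: run a numerically sensitive block (e.g. a gateloop
    layer using the Heinsen associative scan / complex64) in float32, even when
    the surrounding model is trained under fp16/bf16 autocast."""

    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x, *args, **kwargs):
        orig_dtype = x.dtype
        # temporarily disable autocast so everything inside runs in full precision
        with torch.autocast(device_type = 'cuda', enabled = False):
            out = self.block(x.float(), *args, **kwargs)
        return out.to(orig_dtype)
```

Whether this actually resolves the Heinsen path here is untested; it is just a starting point under those assumptions.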
Hey, congrats on the success of AlphaFold3 :) I was quite surprised when they mentioned you in the "Last Week in AI" podcast.

So I have some news of my own: I managed to train the autoencoder & transformers using only 1 quantizer on meshes with 700-1000 triangles. So for instance, for an 800-triangle mesh, MeshXL would require a sequence length of 7200, but this needs only 2400 tokens! <3 I found out that the persistent memory (num_mem_kv) & 64 sos tokens were quite useful when dealing with higher-detailed meshes.

There is an issue regarding the text conditioning. I'd imagine that you implemented text_condition_cond_drop_prob to make it more robust. This causes issues when generating the mesh: if you provide it the text "chair" and it has never been trained on that exact embedding, only on the label "a chair", it will fail even if you have trained with text_condition_cond_drop_prob. Please let me know if you have any ideas :) ?

Here are some tests. I've tried many different models, but paraphrase-multilingual-MiniLM-L12-v2 seems to handle small syntax differences best, though not semantically different words as well.
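(A quick way to sanity-check how far apart the text conditioner places paraphrased labels is to compare their embeddings directly. A small sketch using the standard sentence-transformers API; the model name is the one mentioned above, the labels are made up.)

```python
# illustrative probe: how close are the embeddings of paraphrased labels?
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

labels = ['chair', 'a chair', 'an office chair', 'sofa']  # made-up test labels
emb = model.encode(labels, convert_to_tensor = True, normalize_embeddings = True)

# pairwise cosine similarities between the label embeddings
sims = util.cos_sim(emb, emb)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f'{labels[i]!r} vs {labels[j]!r}: {sims[i, j].item():.3f}')
```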