Exact command to reproduce the curve in MOD is Vibe? #10
Comments
I trained a 300M model on a single A6000 (from Paperspace Gradient) with bf16-mixed precision.
First, install this repo: git clone it and run pip install -e . I'll add a README here.
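For reference, a minimal sketch of what a single-GPU bf16-mixed run with Lightning Fabric can look like (the model and data here are stand-ins, not the repo's actual training script):

```python
# Minimal sketch, not the repo's training script: single-GPU bf16-mixed setup
# with Lightning Fabric; the model and batch below are placeholders.
import torch
import lightning as L

fabric = L.Fabric(accelerator="cuda", devices=1, precision="bf16-mixed")
fabric.launch()

model = torch.nn.Linear(1024, 1024)  # stand-in for the ~300M-parameter model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

# One training step: fabric.backward replaces loss.backward()
x = torch.randn(8, 1024, device=fabric.device)
loss = model(x).pow(2).mean()
fabric.backward(loss)
optimizer.step()
optimizer.zero_grad()
```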
Thank you so much! Yeah, I've pretty much read your code related to MoD. One concern for me is that the dataset is implemented as an iterator object, so I am not sure whether Lightning Fabric would handle this correctly in a multi-GPU setup, since we would need a distributed sampler.
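To illustrate the concern, here is a hedged sketch of one way an iterator-style dataset could be sharded by rank instead of relying on a DistributedSampler; the class and names are illustrative, not from the repo:

```python
# Hedged sketch: an IterableDataset has no DistributedSampler, so each rank
# must skip the chunks belonging to other ranks itself. Names are illustrative.
import torch
from torch.utils.data import IterableDataset

class ShardedTokenStream(IterableDataset):
    def __init__(self, token_chunks, rank: int, world_size: int):
        self.token_chunks = token_chunks
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        # Round-robin sharding: rank r yields chunks r, r + world_size, ...
        for i, chunk in enumerate(self.token_chunks):
            if i % self.world_size == self.rank:
                yield torch.tensor(chunk)

# e.g. with Lightning Fabric:
#   ShardedTokenStream(chunks, fabric.global_rank, fabric.world_size)
```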
We are taking a softmax over a long sequence length, so most values at the other end will be close to zero.
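A quick toy illustration of this point (toy numbers, not the repo's router code): a softmax taken over the full sequence dimension spreads its mass thinly, so the average weight is 1/seq_len and most tokens end up near zero on long sequences.

```python
# Toy illustration: softmax over the sequence dimension of router logits
# gives an average weight of 1/seq_len, so most weights are near zero.
import torch

seq_len = 4096
router_logits = torch.randn(seq_len)            # one logit per token
weights = torch.softmax(router_logits, dim=-1)  # sums to 1 over the sequence

print(weights.mean().item())                    # ~1/seq_len ≈ 2.4e-4
print((weights < 1e-3).float().mean().item())   # fraction of near-zero weights
```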
@joey00072
Since MoD only selects a very small fraction of the tokens for computing attention, I have concerns about model performance in some extreme cases, such as very short inputs.
I think topK should have a minimum value, like 10, or seq_len when seq_len < 10. When faced with long text, this behaves somewhat like sparse attention or a sliding window, which I think is acceptable. I would like to hear your thoughts.
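A sketch of what I mean, assuming capacity is computed as a ratio of seq_len; the function and parameter names here are hypothetical, not the repo's implementation:

```python
# Hypothetical sketch of a capacity floor for MoD routing (not the repo's code):
# clamp top-k so very short inputs still route a sensible number of tokens.
import torch

def mod_topk_indices(router_logits: torch.Tensor,
                     capacity_ratio: float = 0.125,
                     min_tokens: int = 10) -> torch.Tensor:
    """router_logits: (batch, seq_len). Returns indices of tokens kept per sequence."""
    seq_len = router_logits.size(-1)
    k = int(seq_len * capacity_ratio)
    k = max(k, min_tokens)   # floor suggested above
    k = min(k, seq_len)      # never more tokens than the sequence has
    return torch.topk(router_logits, k, dim=-1).indices

# e.g. for a 6-token input with 12.5% capacity, k becomes 6 instead of 0
idx = mod_topk_indices(torch.randn(2, 6))
```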
Hi Joey,
Thank you for such wonderful open-source work!
Could you share the exact command to reproduce the curve in your "MoD is Vibe" blog post? For example, did you use DDP, and how many GPUs?