Support for LLMLingua #4823
Hi! I was attempting to see if llama.cpp could be supported in LLMLingua (prompt compression) via llama-cpp-python, but it looks like attention masks are required. Attention masks are supported in transformers, and supporting them here would enable more projects to work with llama.cpp.

I think this might be worth pursuing in order to use LLMLingua in downstream projects, since CPU and partial-GPU prompt processing is quite slow and adds up for longer passages. Additionally, perhaps implementing LLMLingua's methods in llama.cpp is worth considering?

Comments
There is support for creating a custom attention mask by utilizing the position and the sequence id of the tokens in a batch: tokens from a given sequence id attend only to tokens from the same sequence id at smaller positions. Using this, one can construct most (if not all) possible attention masks. One of the better demonstrations of this is in the lookahead decoding example. I skimmed through the LLMLingua paper and it seems that attention masking is required for evaluating the segments of the original prompt. If that is the only case that requires attention masking, then I think this is trivially supported: just assign different sequence ids to the different segments.
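As an illustration of the mechanism described in the comment above, here is a minimal C sketch against llama.cpp's llama_batch API that places two prompt segments in different sequence ids, so that neither segment attends to the other. The add_token helper, decode_segments, and the segment arrays are hypothetical names used only for illustration; llama_batch_init, llama_decode, llama_batch_free, and the batch fields are taken from llama.h.

```c
// Minimal sketch: give each prompt segment its own sequence id so that
// tokens from one segment never attend to tokens from the other.
#include <stdbool.h>
#include "llama.h"

// Hypothetical helper: append one token to the batch with an explicit
// position and sequence id.
static void add_token(struct llama_batch * batch, llama_token id,
                      llama_pos pos, llama_seq_id seq, bool want_logits) {
    const int i = batch->n_tokens;

    batch->token   [i]    = id;
    batch->pos     [i]    = pos;
    batch->n_seq_id[i]    = 1;      // this token belongs to exactly one sequence
    batch->seq_id  [i][0] = seq;
    batch->logits  [i]    = want_logits;

    batch->n_tokens++;
}

// seg0/seg1 are assumed to be already-tokenized prompt segments of lengths n0/n1.
static void decode_segments(struct llama_context * ctx,
                            const llama_token * seg0, int n0,
                            const llama_token * seg1, int n1) {
    // one seq_id slot per token is enough here (third argument is n_seq_max)
    struct llama_batch batch = llama_batch_init(n0 + n1, 0, 1);

    for (int i = 0; i < n0; i++) add_token(&batch, seg0[i], i, 0, false);
    for (int i = 0; i < n1; i++) add_token(&batch, seg1[i], i, 1, i == n1 - 1);

    // with different seq_ids, segment 1 does not attend to segment 0 (and vice versa)
    llama_decode(ctx, batch);

    llama_batch_free(batch);
}
```

If LLMLingua only needs to score each segment independently of the others, this kind of per-segment sequence id assignment should be enough to reproduce the required attention mask without any new API.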
@ggerganov Since attention masks are already implemented, I can either close this issue or rename it to "Support for LLMLingua" if this is worth pursuing.
Hi everybody, I'm really interested in LLMLingua support in llama-cpp-python, since it's the most efficient way of reducing the context size (up to 7x). I have a use case ready to test it, so please feel free to ask me to test whenever you are ready.
Would also be interested in this; pretty useful.
def get_prompt(instruction, new_result_str_modified, new_system_prompt=DEFAULT_SYSTEM_PROMPT):
This is giving me:
Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
I'm running this on Streamlit Cloud; can someone help me solve this? Thanks!
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |