Support for LLMLingua #4823

Closed
TechnotechGit opened this issue Jan 8, 2024 · 7 comments
Labels
enhancement (New feature or request), stale

Comments

@TechnotechGit

Hi! I was attempting to see whether llama.cpp could be used as a backend for LLMLingua (prompt compression) via llama-cpp-python, but it looks like attention masks are required. Attention masks are supported in transformers, and supporting them here seems like it would enable more projects to work with llama.cpp.

I think this might be worth pursuing in order to use LLMLingua in downstream projects, since CPU and partial-GPU prompt processing is quite slow and adds up for longer passages. Additionally, implementing LLMLingua's method directly in llama.cpp might be worth considering.

@TechnotechGit added the enhancement label Jan 8, 2024
@ggerganov
Owner

There is support for creating a custom attention mask by utilizing the position and the sequence id of the tokens in a batch. The implementation is that tokens from a given sequence id attend only to tokens from the same sequence id with smaller position. Using this, one can construct most (if not all) possible attention masks. One of the better demonstrations of this is in the lookahead decoding example.
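To make that rule concrete, here is a minimal sketch in plain Python/NumPy (not the llama.cpp API; the batch contents are made up for illustration) of the attention mask implied by per-token positions and sequence ids:

    import numpy as np

    # Hypothetical batch of five tokens: two independent segments decoded together.
    pos    = np.array([0, 1, 2, 0, 1])   # per-token positions
    seq_id = np.array([0, 0, 0, 1, 1])   # per-token sequence ids

    # Token i may attend to token j iff both share a sequence id and j's
    # position does not exceed i's (the causal constraint).
    mask = (seq_id[:, None] == seq_id[None, :]) & (pos[None, :] <= pos[:, None])

    print(mask.astype(int))
    # [[1 0 0 0 0]
    #  [1 1 0 0 0]
    #  [1 1 1 0 0]
    #  [0 0 0 1 0]
    #  [0 0 0 1 1]]

The two segments form independent causal blocks, which is exactly the "attend only within your own sequence id" rule described above.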

I skimmed through the LLMLingua paper and it seems that attention masking is required for evaluating the segments of the original prompt:

[Two figures from the LLMLingua paper illustrating the segment-wise evaluation of the original prompt]

If that is the only case that requires attention masking, then I think this is trivially supported - just assign different sequence ids to the different segments.
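As a hedged illustration of that suggestion (plain Python, not the llama.cpp batch API; the helper and token ids below are hypothetical), splitting a prompt into segments and giving each its own sequence id would look roughly like this:

    # Hypothetical helper: lay out already-tokenized segments in one batch,
    # giving each segment its own sequence id with positions restarting at 0,
    # so the segments are scored independently in a single decode call.
    def layout_segments(segments):
        batch = []  # (token, pos, seq_id) triples
        for seq, tokens in enumerate(segments):
            for pos, tok in enumerate(tokens):
                batch.append((tok, pos, seq))
        return batch

    segments = [[101, 102, 103], [201, 202], [301, 302, 303, 304]]  # dummy token ids
    for tok, pos, seq in layout_segments(segments):
        print(f"token={tok} pos={pos} seq_id={seq}")

Combined with the mask rule above, this layout yields the block-diagonal attention pattern that segment-wise evaluation needs.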

@TechnotechGit
Author

@ggerganov Since attention masks are already implemented, I can either close this issue or rename it to "Support for LLMLingua" if that is worth pursuing.

@ggerganov changed the title from "Support for attention masks" to "Support for LLMLingua" Jan 11, 2024
@xcottos

xcottos commented Jan 11, 2024

Hi everybody, I'm really interested in LLMLingua support in llama-cpp-python, since it's the most efficient way of reducing the context size (up to 7x). I have a use case ready for testing, so please feel free to ask me to test it whenever you are ready.

@pathquester

I would also be interested in this; it would be pretty useful.

@sathyapriyaa-sketch

from llmlingua import PromptCompressor

# B_SYS/E_SYS, B_INST/E_INST and DEFAULT_SYSTEM_PROMPT are the usual
# Llama-2 chat-format markers, defined elsewhere in the app.
def get_prompt(instruction, new_result_str_modified, new_system_prompt=DEFAULT_SYSTEM_PROMPT):
    new_prompt = new_system_prompt + new_result_str_modified
    llm_lingua = PromptCompressor(device='cpu')
    new_prompt = llm_lingua.compress_prompt(new_prompt, instruction="", question="", target_token=200)
    SYSTEM_PROMPT = B_SYS + new_prompt + E_SYS
    prompt_template = B_INST + SYSTEM_PROMPT + instruction + E_INST
    return prompt_template

This is giving me "Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx". I'm running this on Streamlit Cloud; can someone help me solve this? Thanks!
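A likely cause, assuming the llmlingua package behaves as documented: PromptCompressor's device argument is named device_map and defaults to "cuda", so the snippet above may still try to load the model on a GPU. A minimal CPU-only sketch (note also that compress_prompt returns a dict, not a string):

    from llmlingua import PromptCompressor

    # device_map (not device) selects where the model loads; "cpu" avoids the CUDA path.
    llm_lingua = PromptCompressor(device_map="cpu")

    result = llm_lingua.compress_prompt(
        "some long prompt ...", instruction="", question="", target_token=200)
    compressed = result["compressed_prompt"]  # the compressed text itself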

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024
@github-actions
Contributor

github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 4, 2024