-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add retrieval
example
#6193
add retrieval
example
#6193
Conversation
3f7be5c
to
eb760a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the question about batch
and u_batch
, let's ask @ggerganov
examples/retrieval/retrieval.cpp
Outdated
for (auto & chunk : chunks) { | ||
auto inp = ::llama_tokenize(ctx, chunk.textdata, true, false); | ||
if (inp.size() > n_batch) { | ||
inp.resize(n_batch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure about this line. Will it drops tokens if inp.size() > n_batch
? If that's the case, that means we drop information from embedding, which will cause the output embedding to become inaccurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we'll have to restrict the chunk size to fit the n_batch parameter somehow. This presents the identical issue to setting a configurable maximum chunk length: How should we handle situations where separators fail to appear within the maximum length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe reversed, we can force the user to give sufficient n_batch
(and n_ubatch
) before running the app. AFAIK that's because with embeddings, we use non-causal models the requires prompt to be processed in one batch.
My plan is maybe after tokenize all chunks, you can see if any chunk have tokens.size() > n_batch
, then raise an error and exit the program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can see if any chunk have tokens.size() > n_batch, then raise an error and exit the program.
This seems like the best option for now 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't it make more sense to determine params.n_batch
based on the largest chunk after tokenization? It should not be a user-provided parameter in this example - just set it to the largest chunk size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that can be a solution. I'm just not sure if that requires re-creating a new llama_context
, because the prior llama_init_from_gpt_params
call already used params.n_batch
and params.n_ubatch
Also maybe you can look at the implementation of langchain's CharacterTextSplitter, which also support overlap chunks. We can use single separator for now, assuming that the input dataset is single-language. However the overlap chunk is quite important I think, as it yield a better result in RAG. |
Hey @ngxson, thanks for the feedback! 😄 Just pushed some quick updates before hitting the hay. I'll get to the rest tomorrow.
Totally agree with aiming for a comprehensive RAG example down the line. But for now, based on @ggerganov's insights in #5692, it seemed best to keep things simple. |
I mentioned CharacterTextSplitter because the split behavior that we're having now is very similar to that. I agree that having overlap is not very important at this stage, but I still consider it to be a simple thing but yet effective, so maybe we'll add this in another PR. I'm also currently working on a retrieval example on wllama. During implementation, I noticed that some models need prefix for document embedding and query embedding. Maybe not really important to care about at this stage, but I just list it here so we don't forget. For example, with bge models, you need to add Also here is my test dataset if you need. It's just wiki page of Albert Einstein, but the data is quite varied, so that will be a very real-life test case. |
For embeddings, |
Thanks for the information 🙇♂️ , since batch_size defaults to 2048, which is larger than ubatch_size at 512, I'll set ubatch_size equal to batch_size right from the start so that we don't have explicitly provide the option every time. +) IIUC, we should apply the same approach to the embedding example, right? 🤔 |
Yes, the embedding.cpp example is a bit out-dated I think, for now let's set n_ubatch == n_batch for simplification. |
I'll get the embedding example fixed likewise after this is merged 😃 |
* add `retrieval` example * add README * minor fixes * cast filepos on print * remove use of variable sized array * store similarities in separate vector * print error on insufficient batch size * fix error message printing * assign n_batch value to n_ubatch * fix param definitions * define retrieval-only parameters in retrieval.cpp * fix `--context-file` option to be provided multiple times for multiple files * use vector for `query_emb` * add usage description in README * fix merge conflict * fix usage printing * remove seed setting * fix lint * increase file read buffer size * retrieval : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add `retrieval` example * add README * minor fixes * cast filepos on print * remove use of variable sized array * store similarities in separate vector * print error on insufficient batch size * fix error message printing * assign n_batch value to n_ubatch * fix param definitions * define retrieval-only parameters in retrieval.cpp * fix `--context-file` option to be provided multiple times for multiple files * use vector for `query_emb` * add usage description in README * fix merge conflict * fix usage printing * remove seed setting * fix lint * increase file read buffer size * retrieval : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add `retrieval` example * add README * minor fixes * cast filepos on print * remove use of variable sized array * store similarities in separate vector * print error on insufficient batch size * fix error message printing * assign n_batch value to n_ubatch * fix param definitions * define retrieval-only parameters in retrieval.cpp * fix `--context-file` option to be provided multiple times for multiple files * use vector for `query_emb` * add usage description in README * fix merge conflict * fix usage printing * remove seed setting * fix lint * increase file read buffer size * retrieval : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
resolves #5692
retrieval
example--context-files
: multiple files that are to-be embedded--chunk-size
: minimum size of each text chunk to be embedded--chunk-separator
: STRING to divide chunks by. newline by defaultI don't think I put enough thought into naming these parameters, so proposals are welcome
the new
retrieval
example can be tested as belowwhich chunks & embeds all given files and starts a loop requesting query inputs:
and on query input, top k chunks are shown along with file name, chunk position within file and original text
+)
I'd like opinion on two subjects regarding the chunking functionality
--chunk-separator . 。
chunks input by either . or 。)+) (help wanted)
I have an ongoing problem rn where the below assertion fails when
-ub 2048
option is not given. It seems related to the recent parallelism support which I haven't had the chance to look deeply into, so any advice about how to solve this is very welcome.