diff --git a/documents/references.bib b/documents/references.bib
index 352eb5a..935bea1 100644
--- a/documents/references.bib
+++ b/documents/references.bib
@@ -1,5 +1,18 @@
 # Related work
 
+## RAG
+
+#
+@misc{lewis2021retrievalaugmentedgenerationknowledgeintensivenlp,
+  title={Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks},
+  author={Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich Küttler and Mike Lewis and Wen-tau Yih and Tim Rocktäschel and Sebastian Riedel and Douwe Kiela},
+  year={2021},
+  eprint={2005.11401},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2005.11401},
+}
+
 ## Differential Privacy
 
 #
diff --git a/documents/report.md b/documents/report.md
index cb26e41..d56b5e6 100644
--- a/documents/report.md
+++ b/documents/report.md
@@ -23,7 +23,7 @@ bibliography: references.bib
 # Introduction
 
-Retrieval-Augmented Generation (RAG) has become a leading approach to enhance the capabilities of Large Language Models (LLMs) by supplying them with up-to-date and pertinent information. This method is particularly valuable in environments where knowledge bases are rapidly evolving, such as news websites, social media platforms, or scientific research databases. By integrating fresh context, RAG helps mitigate the risk of "hallucinations"—instances where the model generates plausible but factually incorrect information—and significantly improves the overall quality and relevance of the responses generated by the LLM.
+Retrieval-Augmented Generation (RAG, [@lewis2021retrievalaugmentedgenerationknowledgeintensivenlp]) has become a leading approach to enhance the capabilities of Large Language Models (LLMs) by supplying them with up-to-date and pertinent information. This method is particularly valuable in environments where knowledge bases are rapidly evolving, such as news websites, social media platforms, or scientific research databases. By integrating fresh context, RAG helps mitigate the risk of "hallucinations"—instances where the model generates plausible but factually incorrect information—and significantly improves the overall quality and relevance of the responses generated by the LLM.
 
 However, incorporating external documents into the generation process introduces substantial privacy concerns. When these documents are included in the input prompt for the LLM, there is no foolproof way to ensure that the generated response will not accidentally reveal sensitive or confidential data [@qi2024followinstructionspillbeans]. This potential for inadvertent data exposure can lead to serious breaches of privacy and presents significant ethical challenges. For instance, if an LLM is used in a healthcare setting and it accidentally includes patient information from an external document in its response, it could violate patient confidentiality and legal regulations.
 
@@ -91,21 +91,42 @@ DP-RAG is made of 2 main components:
 * A method to collect documents related to the question in a way that does not prevent its output from being used in a DP mechanism.
 * A method to use the collected documents to prompt an LLM and produce a response with DP guarantees.
 
-To understand the need for these components, let's describe what RAG is usually made of, and the assumptions we make for its private variant (DP-RAG).
+To understand the need for these components, let's describe what RAG is usually made of (see also [@lewis2021retrievalaugmentedgenerationknowledgeintensivenlp]) and introduce some notation.
 
-A LLM: $\mathcal{L}$ is a function, taking some text, in the form of a sequence of tokens: $x = \left<x_0, x_1, \ldots, x_{n-1}\right>$ as input and outputing a probability distribution of the next token $x_n$ conditional on $x$:
-$$\mathcal{L}(y, x) = \mathcal{L}(y, \left<x_0, x_1, \ldots, x_{n-1}\right>) = \Pr(x_n = y | \mathcal{L}, x_0, x_1, \ldots, x_{n-1})$$
+An LLM $\mathcal{L}$ is a function taking some text, in the form of a sequence of tokens $x = \left<x_1, x_2, \ldots, x_n\right>$, as input and outputting a probability distribution of the next token $x_{n+1}$ conditional on $x$:
+$$\mathcal{L}(y, x) = \mathcal{L}(y, \left<x_1, x_2, \ldots, x_n\right>) = \Pr(x_{n+1} = y | \mathcal{L}, x_1, x_2, \ldots, x_n)$$
 
-We assume we have a set of $N$ documents: $d_1, d_2, \ldots, d_N$ containing domain specific knowledge. These documents are also sequences of tokens: $d_i = \left<x^i_1, x^i_2, \ldots, x^i_{n_i}\right>$. We also assume these documents are *privacy sensitive*, and make the relatively strong assumption that each document relates to only one individual that we call *privacy unit* (PU)[^2].
+We assume we have a set of $N$ documents $D = \left\{d_1, d_2, \ldots, d_N\right\} \subset \mathcal{D}$ containing domain-specific knowledge. These documents are also sequences of tokens: $d_i = \left<x^i_1, x^i_2, \ldots, x^i_{n_i}\right>$ (we will also write $\left<\cdot, \ldots, \cdot\right>$ for the concatenation of two token sequences, or for a sequence with a single token).
 
-[^2]: Such structuration of documents by privacy unit can sometime be achieved by cutting documents and groupping all the content relative to one PU in one document.
+We also assume we have a similarity function $S: \mathcal{D}^2 \to [-1, 1]$ whose value is close to 1 when two documents are very similar, close to 0 when they are independent, and close to -1 when they convey opposite meanings. $S$ will be the cosine similarity between embeddings of the documents, computed by a function $E$ mapping them to a $d$-dimensional vector space $\mathbb{R}^d$:
+$$S(d_i, d_j) = \frac{\left<E(d_i), E(d_j)\right>}{\|E(d_i)\|_2 \|E(d_j)\|_2}$$
+
+When receiving a query in the form of a sequence of tokens $q = \left<q_1, q_2, \ldots, q_m\right>$, the similarity between $q$ and each document is computed, and the top $k$ documents in terms of similarity are collected:
+$$d_{i_1}, d_{i_2}, \ldots, d_{i_k} \text{ with } S(q, d_{i_1}) \geq S(q, d_{i_2}) \geq \ldots \geq S(q, d_{i_N})$$
+
+Then a new query $q_{RAG}$ is built by concatenating the original query $q$ with the top $k$ documents and other elements (the operation is denoted $\left<\cdot, \ldots, \cdot\right>_{RAG}$):
+$$q_{RAG} = \left<q, d_{i_1}, d_{i_2}, \ldots, d_{i_k}\right>_{RAG}$$
+
+The augmented query is then sent to the LLM to compute the distribution of the next token (the first token of the response):
+$$\mathcal{L}(r_1, \left<q, d_{i_1}, d_{i_2}, \ldots, d_{i_k}\right>_{RAG})$$
+
+The token is generated by sampling according to this distribution (or proportionally to some power $1/T$ of it), or by selecting the mode of the distribution (the most likely token, which is the limit when $T$ goes to $0$).
+
+The tokens of the response are then generated one by one in an auto-regressive manner (each generated response token is concatenated to the input sequence):
+$$\mathcal{L}(r_{i+1}, \left<q_{RAG}, r_1, r_2, \ldots, r_i\right>)$$
+
+![A broad picture of how RAG works](figures/noDP-RAG.svg){ width=100mm }
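+
+To make the retrieval step concrete, here is a minimal Python sketch of document collection and prompt construction. It is illustrative only: `embed` stands in for whatever model computes the embeddings $E$, and the prompt template is just one arbitrary choice of $\left<\cdot, \ldots, \cdot\right>_{RAG}$.
+
+```python
+import numpy as np
+
+def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
+    # S(d_i, d_j) = <E(d_i), E(d_j)> / (||E(d_i)||_2 ||E(d_j)||_2)
+    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
+
+def top_k_documents(query: str, documents: list[str], embed, k: int) -> list[str]:
+    # Collect the k documents most similar to the query.
+    q_emb = embed(query)
+    similarities = [cosine_similarity(q_emb, embed(d)) for d in documents]
+    ranked = sorted(range(len(documents)), key=lambda i: similarities[i], reverse=True)
+    return [documents[i] for i in ranked[:k]]
+
+def build_rag_query(query: str, retrieved: list[str]) -> str:
+    # One possible <., ..., .>_RAG operation: concatenate the retrieved
+    # documents and the original query with a little boilerplate.
+    context = "\n\n".join(retrieved)
+    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
+```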
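+
+The decoding loop can be sketched in the same spirit; here `next_token_logits` is a placeholder for one forward pass of $\mathcal{L}$ over a token sequence, and `temperature` plays the role of $T$ above.
+
+```python
+import numpy as np
+
+def generate(q_rag: list[int], next_token_logits, eos_token: int,
+             temperature: float = 1.0, max_tokens: int = 256) -> list[int]:
+    # Auto-regressive decoding: each response token r_{i+1} is drawn from
+    # L(., <q_RAG, r_1, ..., r_i>) and appended to the sequence.
+    rng = np.random.default_rng()
+    response: list[int] = []
+    for _ in range(max_tokens):
+        logits = np.asarray(next_token_logits(q_rag + response))
+        if temperature == 0.0:
+            token = int(np.argmax(logits))  # mode of the distribution (limit T -> 0)
+        else:
+            # softmax(logits / T) samples proportionally to the power 1/T
+            # of the next-token distribution.
+            weights = np.exp((logits - logits.max()) / temperature)
+            token = int(rng.choice(len(weights), p=weights / weights.sum()))
+        response.append(token)
+        if token == eos_token:
+            break
+    return response
+```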
+
+In the private variant of the problem (DP-RAG), we also assume the documents are *privacy sensitive*, and make the additional assumption that each document relates to only one individual, which we call the *privacy unit* (PU)[^3].
 
+[^3]: Such a structuring of documents by privacy unit can sometimes be achieved by splitting documents and grouping all the content relative to one PU into a single document.
 
 ## Differential Privacy and its application to RAG
 
 A (randomized) algorithm $\mathcal{A}$ provides $(\epsilon, \delta)$-Differential Privacy *if and only if*, for all events $S$ and all neighboring datasets $D_0$ and $D_1$, we have:
 $$\Pr[\mathcal{A}(D_0) \in S] \leq e^{\epsilon} \Pr[\mathcal{A}(D_1) \in S] + \delta$$
 
-This means that for datasets that differ by one individual, neighboring datasets, the algorithm's outputs are not different in a statistically significant manner. This property guarantees that no bit of information can be learned. See [@dwork2014algorithmic] for more background on DP.
+This means that for datasets differing by one individual (*neighboring* datasets), the algorithm's outputs cannot be distinguished in a statistically significant way. This property guarantees that essentially no information about any single individual can be learned from the output. See [@dwork2014algorithmic] for more background on DP.
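+
+For intuition, here is a sketch of a textbook $(\epsilon, 0)$-DP mechanism, the Laplace mechanism applied to a counting query (standard material from [@dwork2014algorithmic], not yet specific to RAG). The count has sensitivity 1, since adding or removing one individual changes it by at most 1, so Laplace noise of scale $1/\epsilon$ suffices.
+
+```python
+import numpy as np
+
+def dp_count(predicate_values: list[bool], epsilon: float, rng=None) -> float:
+    # Counting query with sensitivity 1: noise of scale 1/epsilon gives
+    # (epsilon, 0)-DP, i.e. Pr[A(D0) in S] <= exp(epsilon) * Pr[A(D1) in S]
+    # for any neighboring datasets D0 and D1.
+    if rng is None:
+        rng = np.random.default_rng()
+    return sum(predicate_values) + rng.laplace(loc=0.0, scale=1.0 / epsilon)
+```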
 
-![A broad picture of how RAG works](figures/noDP-RAG.svg){ width=100mm }
diff --git a/documents/report.pdf b/documents/report.pdf
index 8109c9f..5f71d32 100644
Binary files a/documents/report.pdf and b/documents/report.pdf differ