From 76b22ec9bfc67d58c52bb1f9dc951693d524fc20 Mon Sep 17 00:00:00 2001 From: Artem Kozlov Date: Sun, 14 Jul 2024 12:19:28 +0200 Subject: [PATCH 1/6] Improvements for sec 5 & 10 --- .../Magic_shaperoint_design.md | 142 ++++++++++++++---- 1 file changed, 111 insertions(+), 31 deletions(-) diff --git a/Design_Doc_Examples/Magic_shaperoint_design.md b/Design_Doc_Examples/Magic_shaperoint_design.md index 76c0f85..c94377d 100644 --- a/Design_Doc_Examples/Magic_shaperoint_design.md +++ b/Design_Doc_Examples/Magic_shaperoint_design.md @@ -360,9 +360,8 @@ Considering the minimal variability among the documents and the sufficient cover 1. File Type Handler: Differentiate and handle file types accordingly, as PDFs and images may require additional processing steps. 2. Text Extraction: Deploy a customized OCR solution designed to handle non-text elements. -3. Text Preprocessing: Remove unwanted characters, whitespace, or any artifacts. -4. Markdown Formatting: Ensure that the extracted content is formatted correctly according to markdown standards. -5. Error Management & Spell Checking: Integrate an error handler and a spell checker to maintain data quality. +3. Markdown Formatting: Ensure that the extracted content is formatted correctly according to markdown standards. +4. Error Management & Spell Checking: This part ensures extraction logging and raises awareness for the maintainer that some documents might not be reliable. #### Retrieval-Augmented Generation Framework @@ -423,7 +422,7 @@ A basic RAG system consists of the following components components: - DB-indexing 2. Retrieval Layer: - Embedder - - DB simularity search + - DB simularity search. This part is usually provided by the same tools utilized for indexing. 3. Chat Service: - Manages chat context - Prompt template constructor: supports diaologs for clarification @@ -434,11 +433,23 @@ A basic RAG system consists of the following components components: - Provides a dialogue mode for user interaction. - User Feedback: Collects user input to continuously refine the system. -We have opted to develop an in-house embedder while utilizing API calls to vendor-based LLMs. +##### Locating the components -In-house embedder: +###### Embedder + +A good embedder should define the retrieval layer's ability to: + +- Store content representations efficiently +- Encode the content and queries to be semantically similar +- Capture nuanced details due to domain-specific content + +Considering that the retrieval layer is the first step in the pipeline, we do believe it could become a bottleneck in performance. This is because if the context is provided incorrectly, there is less ability for the upstream generation model to improve upon the irrelevant context. + +Taking these factors into account, we might be ready to explore and evaluate the performance of various encoders at our disposal. We seek to avoid being restricted by any particular design solution, ensuring that there is room for continuous enhancement over time. With this perspective, we will consider the potential of implementing an in-house embedding solution. + +Here are the benefits we highlight: - Provides potential for improving this critical component without vendor lock-in -- Offers deterministic behavior +- Provides control over versioning, determinism, availability (not going to be depricated) - Does not require us to provide per-token costs - Could potentially benefit from interaction data enhancements @@ -446,23 +457,49 @@ Drawbacks: - Development and maintenance costs. 
- Per-token costs may not be as optimized as those of larger companies. -API-based LLMs: +When it comes to generation levels, considering the number of users and the app economy, there is no clear evidence that the company would like to invest in training or fine-tuning custom LLMs. Therefore, it might be beneficial to keep in mind the use of vendor-based API-accessible LLMs. -- LLMs are continually improving, particularly in few-shot learning capabilities. We don't want to invest in LLM traning. +Here are the potential benefits: +- LLMs are continually improving, particularly in few-shot learning capabilities - Competitive market dynamics are driving down the cost of API calls over time - Switching vendors involves minimal effort since it only requires switching APIs, allowing for potential utilization of multiple vendors. Drawbacks: - Less control over the responses - Data privacy (though not a significant concern) +- There is a possibility of service denial from a vendor on account of policy-related issues, such as content restrictions or economic sanctions -We have also selected an open-source framework, LlamaIndex for RAG, which supports the aforementioned design choice and offers many capabilities out of the box, including: +#### Framework Selection +When considering a framework, we would like it to support the following features: 1. Document storage 2. Index storage 3. Chat service 4. Modular design for document extraction that supports custom modules -5. Built-in logging and monitoring capabilities +5. Modular design for retrieval and generation that can utilize both local and vendor-based solutions +6. Built-in logging and monitoring capabilities + +We will compare a couple of popular frameworks that might suit our needs: LlamaIndex and LangChain. + +Here are some resources that summarize the differences between the two frameworks: +1. [LlamaIndex vs LangChain: Haystack – Choosing the Right One](https://www.linkedin.com/pulse/llamaindex-vs-langchain-haystack-choosing-right-one-subramaniam-yvere/) +2. [LlamaIndex vs LangChain: Key Differences](https://softwaremind.com/blog/llamaindex-vs-langchain-key-differences/) +3. [LangChain vs LlamaIndex: Main Differences](https://addepto.com/blog/langchain-vs-llamaindex-main-differences/) + +| Feature/Aspect | LangChain | LlamaIndex | +|-------------------------|--------------------------------------------------|-------------------------------------------------| +| **Main Purpose** | Various tasks | Querying and retrieving information using LLMs | +| **Modularity** | High, allows swapping of components | Average, yet sufficient for our current design | +| **Workflow Management** | High, supports managing chains of models/prompts | Average, primarily focused on querying | +| **Integration** | High: APIs, databases, etc. | Average: APIs, data sources, but customizable | +| **Tooling** | Debugging, monitoring, optimization | Debugging, monitoring | +| **LLM Flexibility** | Supports various LLMs (local/APIs) | Supports various LLMs (local/APIs) | +| **Indexing** | No primary focus on indexing | Core feature, creates indices for data | +| **Query Interface** | Complex workflows | Straightforward | +| **Optimization** | Optimization of LLM applications | Optimized for the retrieval of relevant data | +| **Ease of Use** | Challenging | Easy | + +Given the pros and cons listed above, it appears that LlamaIndex provides all the features we are looking for, combined with an ease of use that could reduce development and maintenance costs. 
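+To make this choice more concrete, below is a minimal sketch of what the baseline ingestion and querying flow could look like with LlamaIndex. The directory path, the sample question, and `similarity_top_k` are illustrative assumptions, and the exact import layout differs between LlamaIndex releases (the snippet follows the `llama_index.core` layout), so this is a sketch under those assumptions rather than the final implementation.
+
+```python
+from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
+
+# Ingestion: load the markdown files produced by the extraction pipeline.
+documents = SimpleDirectoryReader("data/markdown_docs").load_data()
+
+# Indexing: build a vector index; the embedding model and LLM come from the
+# global settings, so an in-house embedder could be plugged in here later.
+index = VectorStoreIndex.from_documents(documents)
+
+# Retrieval + generation: an LLM-backed query engine over the index.
+query_engine = index.as_query_engine(similarity_top_k=3)
+response = query_engine.query("Which requirements does document X define for data retention?")
+print(response)
+```
+
+The same index object can also back a chat engine (`index.as_chat_engine()`), which maps onto the dialogue mode required by the Chat Service.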
Additionally, LlamaIndex offers enterprise cloud versions of the platform. If our solution evolves towards a simpler design, we might want to move to the paid cloud version if it makes economical sense.
 
 ### **VI. Error analysis**
 
@@ -857,38 +894,81 @@ The system has latency and feedback based switchings, which reroutes requests to
 
 ### XI. Monitoring
 
-#### Logging
+#### Engineering Logging & Monitoring
 
-1. **Ingestion Layer**: Every step of the ETL pipeline for document extraction must be fully logged to ensure the process is reproducible and help issue resolution.
+1. **Ingestion Layer**:
+   - Process and I/O timings
+   - Code errors
 
-2. **Retrieval**: Logging should save the details of each query, including the tokenizer used, the document context found within a particular document version, and any other relevant metadata that could aid in future analyses.
+2. **Retrieval**:
+   - **Embedder**: Monitor preprocessing time, embedding model time, and utilization instances of the embedding model.
+   - **Database (DB)**: Monitor the time taken for each retrieval operation and DB utilization.
+
+3. **Generation**:
+   - **LLM**: Monitor latency, cost, error rates, uptime, and the volume of generated content to predict scaling needs.
 
-3. **Chat History**: Storing all chat history is crucial for a thorough analysis and debugging process, providing valuable insights into user interactions and system performance
+#### ML Logging & Monitoring
 
-#### Monitoring
+1. **Ingestion Layer**
+   - Every step of the ETL pipeline for document extraction must be fully logged to ensure the process is reproducible and to help with issue resolution
+   - Statistics for documents during ingestion should be monitored, including word count, character distribution, document length, paragraph length, detected languages, and the percentage of tables or images
+   - Monitor the preprocessing layer to surface documents that were not ingested or that contain too many errors
 
-1. **Ingestion Layer**: statistics for documents during ingestion should be monitored, including word count, character distribution, document length, paragraph length, detected languages, and the percentage of tables or images
+2. **Retrieval**:
+   - Log the details of each query, including the tokenizer used, the document context found within a particular document version, and other relevant metadata for future analyses
+   - Keep track of the indexes found and their similarity scores
 
-2. **Retirement**:
-   - **Embedder**: Monitor preprocessing time, embedding model time, and utilization instances of the embedding model
-   - **Database (DB)**: Keep track of the indixes found, similarity scores, and the time taken for each retrieval operation
-
-3. **Augmented Generation**: Quality of generated content through user feedback, cost and latency. Furthermore, monitor the volume of generated content to predict scaling needs.
-
-4. **System Health Metrics**: Implement continuous monitoring of system health metrics such as CPU usage, memory usage, disk I/O, network I/O, error rates, and uptime to ensure the system is functioning optimally.
+3. **Chat History**: Storing all chat history is crucial for a thorough analysis and debugging process, providing valuable insights into user interactions and system performance
 
-5. **Alerting Mechanisms**: Build an alerting mechanisms for any anomalies or exceeded thresholds based on the metrics being monitored.
+4. **Augmented Generation**:
+   - Quality of generated content through user feedback
+5. 
**Alerting Mechanisms**: Have an alerting mechanism in place for any anomalies or exceeded thresholds based on the metrics being monitored.
 
 #### Tooling
 
-1. **For RAG operations - Langfuse callback**.
-
-2. **For System Health Metrics, Ingestion Layer - Prometheus & Grafana**: Prometheus is an open-source system monitoring and alerting toolkit. Grafana is used to visualize the data collected by Prometheus.
-
-3. **Code error reports - Sentry.io**: Sentry is a widely-used error tracking tool that helps developers monitor, fix, and optimize application performance.
+1. **For RAG Operations - Langfuse Callback**:
+   - Integrates with LlamaIndex
+   - Supports measuring the quality of the model through user feedback, both explicit and implicit
+   - Calculates costs, latency, and total volume
+   - For more information about analytics capabilities, see: [Langfuse Analytics Overview](https://langfuse.com/docs/analytics/overview)
+
+2. **For System Health Metrics, Ingestion Layer, Alerting - Prometheus & Grafana**:
+   - Prometheus is an open-source system monitoring and alerting toolkit
+   - Grafana is used to visualize the data collected by Prometheus
+   - Since LLM logging is stored within Langfuse, there is no need to build an additional solution for this
+
+   **Why not a standard ELK stack?**
+
+   For more details, please read this great blog post: [Prometheus-vs-ELK](https://www.metricfire.com/blog/prometheus-vs-elk/)
+
+   | Feature/Aspect                | Prometheus                                   | ELK (Elasticsearch, Logstash, Kibana)               |
+   |-------------------------------|----------------------------------------------|-----------------------------------------------------|
+   | **Primary Use Case**          | Metrics collection and monitoring            | Log management, analysis, and visualization          |
+   | **Data Type**                 | Numeric time series data                     | Various data types (numeric, string, boolean, etc.)  |
+   | **Database Model**            | Time-series DB                               | Search engine with inverted index                    |
+   | **Data Ingestion Method**     | Pull-based metrics collection via HTTP       | Log collection from various sources using Beats and Logstash |
+   | **Data Retention**            | Short-term (default 15 days, configurable)   | Long-term                                            |
+   | **Visualization Tool**        | Grafana                                      | Kibana                                               |
+   | **Alerting**                  | Integrated with Prometheus                   | Extensions                                           |
+   | **Operational Complexity**    | Lower (single-node)                          | Higher (clustering)                                  |
+   | **Scalability**               | Limited horizontal scaling                   | High horizontal and vertical scalability             |
+   | **Setup and Configuration**   | Simple                                       | Complex                                              |
+
+   **Pros for this solution:**
+   1. Metric-Focused Monitoring: Prometheus is optimized for collecting and analyzing time-series data, making it ideal for tracking metrics
+   2. Ease of Setup and Configuration: Prometheus's pull-based model simplifies the setup process
+   3. Operational Simplicity: It can be operated without a large, dedicated team to manage it
+   4. Real-Time Alerts and Querying: Prometheus provides a powerful query language (PromQL) and supports real-time alerting
+
+   **Cons:**
+   1. Limited horizontal scaling.
+   2. Limited log data retention. This might become a problem if we change the RAG framework and want to store ML logs elsewhere.
+
+3. **Code error reports - Sentry.io**:
+   - Sentry is a widely-used error tracking tool that helps developers monitor, fix, and optimize application performance
+   - We may choose between the self-hosted version and the paid cloud version in the future.
 
-4. 
**For alerting mechanism - Prometheus Alertmanager**: Alertmanager handles alerts sent by Prometheus servers and takes care of deduplicating, grouping, and routing them to the correct receiver.
 
 ### **XII. Serving and inference**
 

From b45facdcfe4302e3ad892b77ea5f5fdaea5787e9 Mon Sep 17 00:00:00 2001
From: anna-gulik <43071140+anna-gulik@users.noreply.github.com>
Date: Sat, 13 Jul 2024 19:04:55 +0200
Subject: [PATCH 2/6] Update section 2 metrics and losses

---
 .../Magic_shaperoint_design.md | 53 ++++++++++++-------
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/Design_Doc_Examples/Magic_shaperoint_design.md b/Design_Doc_Examples/Magic_shaperoint_design.md
index c94377d..a410a74 100644
--- a/Design_Doc_Examples/Magic_shaperoint_design.md
+++ b/Design_Doc_Examples/Magic_shaperoint_design.md
@@ -116,7 +116,10 @@ Every month:
 The task could be split into independent subtasks: data extraction (OCR) and data retrieval and answer generation. These parts can be evaluated independently to prioritize improvements based on the source of errors as well as overall solution performance.
 
 ***Data Extraction Metrics:***
-It’s reasonable to measure OCR quality separately, as in the case of poor OCR quality, an accurate result can’t be reached.
+
+Prerequisites: A dataset of scanned documents and the corresponding ground-truth texts. (As a workaround, readable documents could be scanned manually, which gives both the scanned image and the ground-truth text values.)
+
+It’s reasonable to measure OCR quality separately, as in the case of poor OCR quality, an accurate result can’t be reached. In the first stage of the project we can skip this step and compare high-level metrics on markup docs vs. scanned images; dedicated data extraction metrics are calculated only if there is a significant difference in those numbers.
 
 **a. Word Error Rate**
 
@@ -132,41 +135,55 @@ As it is important to extract table-structured data as well, the percentage of i
 
 ***Retrieval Metrics:***
 
+Prerequisites: A dataset of queries collected from experts and a list of the N most relevant chunks for each query.
+
 **d. Recall@k**
 
-It determines how many relevant results from all existing relevant results for the query are returned in the retrieval step, where K is the number of results considered, a hyperparameter of the model. If the answer will be generated based only on one document, Mean Reciprocal Rank (MRR) could be used.
+It determines how many relevant results from all existing relevant results for the query are returned in the retrieval step, where K is the number of results considered, a hyperparameter of the model.
+
+**e. Normalized discounted cumulative gain (NDCG)**
+NDCG compares the ranking returned by the retriever with the ideal ranking: DCG is the sum of the relevance scores of the first N results, each discounted by the logarithm of its rank position, and NDCG is the DCG divided by the DCG of the ideal ordering (IDCG), so a perfect ranking scores 1.
 
 ***Answer Generation Metrics:***
 
-**e. Answer Relevance Score**
+**f. Average Relevance Score**
 
-Measures how well the generated answers match the context and query. It could be a binary value (relevant/irrelevant) or a score from 1 to 5.
+Measures how well the generated answers match the context and query.
+There are several approaches to calculating this metric:
+- automatically, with an evaluation framework (a detailed description is provided in section IV.
Validation Schema)
+- with other LLMs as evaluators (paper to consider: https://arxiv.org/pdf/2305.06311)
+- manually, based on experts' output (the approach is provided in section IX. Measuring and reporting)
 
-**f. Hallucination Rate**
+
+**g. Hallucination Rate**
 
-As one of the requirements is to avoid hallucinating, it is possible to calculate the percentage of incorrect or fabricated information in the generated answers.
+As one of the requirements is to avoid hallucinating, it is possible to calculate the percentage of incorrect or fabricated information in the generated answers.
 
-**g. Clarification Capability**
+How to calculate:
 
-As one of the requirements is the ability to automatically request more details if an insufficient answer is generated, the average number of interactions or follow-up questions needed to clarify or correct an answer could be calculated to measure clarification capability. This metric helps to check the system’s ability to provide comprehensive answers initially or minimize the number of interactions needed for detalization.
+- Manually: prepare a dataset of queries (including queries that have no answer in the documents) + expected responses; calculate by comparing the expected response to the provided one
+- Fine-tune smaller LLMs to detect hallucinations
+- Add guardrails (https://github.com/NVIDIA/NeMo-Guardrails) - this not only improves responses, but also helps to count how many times the model attempts to hallucinate
 
-**h. Consistency**
+**h. Clarification Capability**
 
-Measures the consistency of answers across different versions of the same or related document/same or related query. Consistent answers indicate reliability and stability of the information provided.
+Prerequisites: A dataset of queries (ideally with unambiguous answers) + expected responses, and domain experts to evaluate the metric manually.
+
+As one of the requirements is the ability to automatically request more details if an insufficient answer is generated, the average number of interactions or follow-up questions needed to clarify or correct an answer could be calculated to measure clarification capability, together with the average relevance of follow-up questions. This metric helps to check the system’s ability to provide comprehensive answers initially or to minimize the number of interactions needed to reach the required level of detail.
 
-Metrics to pick:
-A lot of metrics were provided, but it’s a good idea to go in the reverse direction: start from the more general and dive deeper into partial ones only when necessary.
-~~Online metrics of interest during A/B tests are:~~
+*Metrics to pick:*
 
-- ~~Click-through Rate: The ratio of users who click on a retrieved document to the total number of users who view the retrieval results. Higher CTR indicates more relevant retrievals.~~
-- ~~Time to Retrieve (TTR)~~
-- ~~User Satisfaction~~
+A lot of metrics were provided, but it’s a good idea to go in the reverse direction: start from the more general and dive deeper into partial ones only when necessary.
+*TODO*: build a hierarchy of metrics
 
-**ii. Loss Functions**
+*Online metrics of interest during A/B tests are (more details are provided in section IX. Measuring and reporting):*
+- Time to Retrieve (TTR)
+- Average Relevance Score
+- Average number of clarification questions
+- Average time of dialogue
 
-As the task consists of several steps, the loss function shall consider all of them. 
An appropriate loss function that combines both retrieval and generation objectives might involve a combination of retrieval loss (e.g., ranking loss for retrieval) and generation loss (e.g., language modeling loss for generation). For example, it could be the sum of Cross Entropy Loss and Contrastive Loss.

From 58dd06e171969881d8f9312b95ca15bb0563f536 Mon Sep 17 00:00:00 2001
From: Danil
Date: Sun, 14 Jul 2024 00:18:28 +0200
Subject: [PATCH 3/6] Update Magic_shaperoint_design.md

---
 .../Magic_shaperoint_design.md | 25 ++++++++++++-------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/Design_Doc_Examples/Magic_shaperoint_design.md b/Design_Doc_Examples/Magic_shaperoint_design.md
index a410a74..de98868 100644
--- a/Design_Doc_Examples/Magic_shaperoint_design.md
+++ b/Design_Doc_Examples/Magic_shaperoint_design.md
@@ -816,6 +816,9 @@ The report should include:
 ### **i. Embeddings Database**
 
 This is one of the core components in the system for efficient document search and retrival.
+It consists of:
+1. Vector representations of the files uploaded or created by users.
+2. Chat communications and response ratings, structured with fields for user queries, responses, timestamps, ratings and session id.
 
 **i.i. Embeddings Generation**
 
@@ -825,20 +828,24 @@ This is one of the core components in the system for efficient document search a
 
 **i.ii. Database Features**
 
-- A cloud and scalable database.
+A scalable cloud database, e.g. Pinecone. It is designed to scale horizontally to handle large volumes of embeddings.
 - Supports nearest neighbor search, using cosine or Euclidean similarity.
+- Supports filtering based on metadata.
 - The following fields are stored for further mapping with Documents Storage:
     1. Document ID
    2. Version number
-    3. Metadata storage (document title, author, creation date)
-    4. Embeddings representation
+    3. Document's Metadata (document title, author, creation date)
+    4. Model's Metadata (Baseline / Main embedding tool, model's release version)
+    5. Embeddings representation
-- Embeddings, metadata, and queries are encrypted to ensure security. Strict access control managment.
+- Embeddings, metadata, and queries are encrypted to ensure security. Strict access control management.
+
+**Database quantified requirements:**
+- Should return top-10 nearest neighbors within 100ms for up to 1 million vectors.
+- Should support at least 1000 Queries Per Second for nearest neighbor searches on a dataset of 1 million vectors.
 
 ### **ii. Documents Storage**
 
-A scalable cloud service (e.g. AWS S3) for scalable storage and files managment for the following data types:
-1. Original files uploaded by clients, including their versions. The service returns a URL (Document ID) for each uploaded file, which is stored in the embeddings database metadata.
-2. Chat communications and response ratings, structured as JSON objects with fields for user queries, responses, timestamps and ratings.
+A scalable cloud service (e.g. AWS S3) for storage and file management of the original files uploaded by clients, including their versions. The service returns a URL (Document ID) for each uploaded file, which is stored in the embeddings database metadata.
 
 ### **iii. Chat UI**
 
 An intuitive and responsive interface for clients to query and receive results.
 
 **Features.**
     1. Clients can upload new documents, which automatically triggers embedding generation and storage.
    2. 
Offers positive/negative feedback on answers.
     3. Clients can report offesnive or unproper responses, which triggers another LLM as a fallback scenario.
     4. Allows to save a chat history and responses for future reference.
 
@@ -878,7 +885,7 @@ Below are provided events, when a corresponding API action gets triggered, while
 
 **Embeddings Management.**
 - Generate embeddings for a new document
-- Update embeddings for a document version change
+- Update embeddings for a document version change, keeping the previous embeddings that correspond to the same session id.
 
 **Chat Session Management.**
 - Start a new chat session

From 1a64d885f72eb2046ab6987db5537cfd62df3cfd Mon Sep 17 00:00:00 2001
From: pfilipovich <102610383+pfilipovich@users.noreply.github.com>
Date: Sun, 14 Jul 2024 10:54:26 +0200
Subject: [PATCH 4/6] Section 1, fack missmatching levels

---
 Design_Doc_Examples/Magic_shaperoint_design.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Design_Doc_Examples/Magic_shaperoint_design.md b/Design_Doc_Examples/Magic_shaperoint_design.md
index de98868..40c52b6 100644
--- a/Design_Doc_Examples/Magic_shaperoint_design.md
+++ b/Design_Doc_Examples/Magic_shaperoint_design.md
@@ -42,7 +42,7 @@ Client expect answers to be:
 - Fast
   - First token within 1 minute.
 - Trustfull
-  - Limited hallucinations or 'extended' answers. At least 95% of the answers should not contain fact missmatching.
+  - Limited hallucinations or 'extended' answers. At least 95% of the answers should not contain fact mismatches of level 1 and level 2 (described below).
   - In case if clients would have any doubts, they won't need to proofread whole document to resolve the uncertainty.
 - Interactive
   - Ability to provide more details/follow-up questions if the answer is insufficient.
@@ -60,6 +60,14 @@ Client wants to select documents:
 - Explicit, using filters.
 - Implicit, through dialogue.
 
+Fact mismatch levels:
+1. Fact Presence
+- Numbers / Terms / Facts in the answer were not present in the document
+2. Fact Integrity
+- Numbers / Terms / Facts in the answer were present in the document but had a different context and meaning
+3. Reasoning
+- Numbers / Terms / Facts in the answer were correct and had the same context and meaning, but the conclusion was off
+
 Use case examples:
 
 We will categorize the use cases as follows:

From 32774651b9d1e3ee88c54f54c6e83f81d6afc1ec Mon Sep 17 00:00:00 2001
From: Artem Kozlov
Date: Sun, 14 Jul 2024 15:45:20 +0200
Subject: [PATCH 5/6] Resolve comments from Pavel

---
 .../Magic_shaperoint_design.md | 58 ++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/Design_Doc_Examples/Magic_shaperoint_design.md b/Design_Doc_Examples/Magic_shaperoint_design.md
index 40c52b6..c6410da 100644
--- a/Design_Doc_Examples/Magic_shaperoint_design.md
+++ b/Design_Doc_Examples/Magic_shaperoint_design.md
@@ -462,6 +462,23 @@ A basic RAG system consists of the following components components:
 
 ###### Embedder
 
+**Granularity of embeddings**
+
+There are multiple options for embeddings granularity, i.e., a vector could represent:
+- a document `[level 0]`
+- an article `[level 1]`
+- a paragraph `[level 2]`
+- a sentence `[level 3]`
+
+Of course, we could cover all the layers by having separate embeddings for each layer and deciding which one to use based on the context, for example:
+- search without specifying a particular document -> `[level 0]`
+- user selects a document and asks if there is a particular section in it -> `[level 1]`
+...
+
+This approach might bring high accuracy, but it's complex and costly to implement. For the baseline solution, we would like to start with a single embedding representation. Based on the most common use case, this would be a paragraph encoding, because according to our analysis, the answers to most problems can be found within the context of a single paragraph.
+
+**Embedder: design choice**
+
 A good embedder should define the retrieval layer's ability to:
 
 - Store content representations efficiently
@@ -482,6 +499,7 @@ Drawbacks:
 
 - Development and maintenance costs.
 - Per-token costs may not be as optimized as those of larger companies.
+
 
 When it comes to generation levels, considering the number of users and the app economy, there is no clear evidence that the company would like to invest in training or fine-tuning custom LLMs. Therefore, it might be beneficial to keep in mind the use of vendor-based API-accessible LLMs.
 
@@ -494,6 +512,43 @@ Drawbacks:
 - Data privacy (though not a significant concern)
 - There is a possibility of service denial from a vendor on account of policy-related issues, such as content restrictions or economic sanctions
 
+#### Bridging the Qualitative Gap
+
+Currently, the baseline description lacks modules to ensure the solution meets quality criteria, specifically in areas such as hallucination mitigation and tolerance against misuse. To address these gaps, we propose using guardrails for quality assurance. This includes a retry strategy and a fallback mechanism designed to enhance reliability and robustness.
+
+##### Baseline QA Framework
+
+Here is an example of an algorithm we might utilize.
+The fallback strategy could involve calling multiple LLMs simultaneously. The Guardrails would then evaluate these parallel answers to select the best one that meets quality standards. This approach increases the likelihood of obtaining a satisfactory response without significant delay.
+
+The complexity might be increased or decreased depending on the metrics we obtain for the baseline, but this is something we need to keep in mind while choosing the framework in advance.
+
+***Algorithm***
+
+**Input:** Request from user
+**Output:** Response to user
+
+1. **Primary Answer Generation**
+   1.1 `main_answer` ← obtain answer from main process
+
+2. 
**Guardrails Evaluation** + 2.1 `guardrail_result` ← evaluate `main_answer` with Guardrails + 2.2 If `guardrail_result` is satisfactory: + 2.2.1 Return `main_answer` to user + 2.3 Else: + 2.3.1 `time_remaining` ← check remaining response time + 2.3.2 If `time_remaining` is sufficient to invoke fallback model: + 2.3.2.1 `fallback_answer` ← obtain answer from fallback pipeline + 2.3.2.2 `fallback_guardrail_result` ← evaluate `fallback_answer` with Guardrails + 2.3.2.3 If `fallback_guardrail_result` is satisfactory: + 2.3.2.3.1 Return `fallback_answer` to user + 2.3.2.4 Else: + 2.3.2.4.1 Return `override_response` to user + 2.3.3 Else: + 2.3.3.1 Return `override_response` to user + +**End Algorithm** + #### Framework Selection When considering a framework, we would like it to support the following features: @@ -516,7 +571,7 @@ Here are some resources that summarize the differences between the two framework | **Main Purpose** | Various tasks | Querying and retrieving information using LLMs | | **Modularity** | High, allows swapping of components | Average, yet sufficient for our current design | | **Workflow Management** | High, supports managing chains of models/prompts | Average, primarily focused on querying | -| **Integration** | High: APIs, databases, etc. | Average: APIs, data sources, but customizable | +| **Integration** | High: APIs, databases, guardrails, etc. | Average: APIs, data sources, guardrails | | **Tooling** | Debugging, monitoring, optimization | Debugging, monitoring | | **LLM Flexibility** | Supports various LLMs (local/APIs) | Supports various LLMs (local/APIs) | | **Indexing** | No primary focus on indexing | Core feature, creates indices for data | @@ -526,6 +581,7 @@ Here are some resources that summarize the differences between the two framework Given the pros and cons listed above, it appears that LlamaIndex provides all the features we are looking for, combined with an ease of use that could reduce development and maintenance costs. Additionally, LlamaIndex offers enterprise cloud versions of the platform. If our solution evolves towards a simpler design, we might want to move to the paid cloud version if it makes economical sense. + ### **VI. Error analysis** **i. Learning Curve Analysis** From 75b7c19203546ab48af46b3388526e75bdd2c955 Mon Sep 17 00:00:00 2001 From: Artem Kozlov Date: Sun, 14 Jul 2024 15:49:18 +0200 Subject: [PATCH 6/6] Fix --- Design_Doc_Examples/Magic_shaperoint_design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Design_Doc_Examples/Magic_shaperoint_design.md b/Design_Doc_Examples/Magic_shaperoint_design.md index c6410da..dd886dc 100644 --- a/Design_Doc_Examples/Magic_shaperoint_design.md +++ b/Design_Doc_Examples/Magic_shaperoint_design.md @@ -519,7 +519,7 @@ Currently, the baseline description lacks modules to ensure the solution meets q ##### Baseline QA Framework Here is an example of an alogirthm we might utilise. -The fallback strategy could involve calling multiple LLMs simultaneously. The Guardrails would then evaluate these parallel answers to select the best one that meets quality standards. This approach increases the likelihood of obtaining a satisfactory response without significant delay. +The fallback strategy could involve calling another or multiple LLMs. The Guardrails would evaluate these answers to select the best one that meets quality standards. This approach increases the likelihood of obtaining a satisfactory response. 
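+
+As a rough illustration, the sketch below mirrors the numbered algorithm of this section in code. The pipeline callables, the guardrail check, and the latency-budget constants are placeholder assumptions for whatever the chosen framework and guardrails library provide, not a committed implementation.
+
+```python
+import time
+
+RESPONSE_BUDGET_S = 60    # assumed overall budget ("first token within 1 minute")
+FALLBACK_LATENCY_S = 15   # assumed worst-case latency of the fallback pipeline
+OVERRIDE_RESPONSE = "Sorry, I could not find a reliable answer in the selected documents."
+
+def answer_with_guardrails(request, main_pipeline, fallback_pipeline, passes_guardrails):
+    """Sketch of the fallback flow: main answer, guardrails check, optional retry."""
+    started = time.monotonic()
+    main_answer = main_pipeline(request)            # 1. primary answer generation
+    if passes_guardrails(main_answer):              # 2.1-2.2 guardrails evaluation
+        return main_answer
+    time_remaining = RESPONSE_BUDGET_S - (time.monotonic() - started)
+    if time_remaining > FALLBACK_LATENCY_S:         # 2.3.2 enough budget left for a retry
+        fallback_answer = fallback_pipeline(request)
+        if passes_guardrails(fallback_answer):      # 2.3.2.2-2.3.2.3
+            return fallback_answer
+    return OVERRIDE_RESPONSE                        # 2.3.2.4 / 2.3.3 safe override response
+```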
The complexity might be increased or decreased depending on the metrics we obtain for the baseline, but this is something we need to keep in mind while choosing the framework in advance.