Merge pull request #319 from aiverify-foundation/dev_main
[Sprint 13] New Features & Unit Tests
imda-benedictlee committed Aug 30, 2024
2 parents f083cc7 + 9c3d52d commit 9c42fce
Showing 59 changed files with 16,663 additions and 2,275 deletions.
2 changes: 2 additions & 0 deletions .github/scripts/run_smoke_test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -49,3 +49,5 @@ cp $SCRIPTS_DIR/moonshot_test_env .env

echo "Running smoke test..."
npx playwright test tests/smoke-test.spec.ts --reporter=list

#echo "Exit code: $?"
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

![Moonshot Logo](https://github.com/aiverify-foundation/moonshot/raw/main/misc/aiverify-moonshot-logo.png)

**Version 0.4.5**
**Version 0.4.6**

A simple and modular tool to evaluate any LLM application.

Expand Down
69 changes: 47 additions & 22 deletions docs/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,20 +40,27 @@ Some of the functions may not work as expected. We suggest users to reinstall Mo

## Using Moonshot

### My tests are all completed with errors! I can't view any report!
### My tests are all completed with errors! I cannot view any report!

Some benchmark tests and attack modules require connector endpoints to be configured beforehand. You may encounter this type of error:

![](./getting_started/getting_started/8.png)

Some examples are:
#### Requirements
This is the full list of requirements for the following tests:

| Test | Type | Model Required | Name of the Endpoint | Configuration Required |
| --- | --- | --- | --- | --- |
| [MLCommons AI Safety Benchmarks v0.5](https://github.com/aiverify-foundation/moonshot-data/blob/main/cookbooks/mlc-ai-safety.json) | Cookbook | Meta LlamaGuard | [Together Llama Guard 7B Assistant](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/together-llama-guard-7b-assistant.json) | API Token - `token` field |
| All MLCommons Recipes (e.g. [mlc-cae](https://github.com/aiverify-foundation/moonshot-data/blob/main/recipes/mlc-cae.json)) | Recipe | Meta LlamaGuard | [Together Llama Guard 7B Assistant](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/together-llama-guard-7b-assistant.json) | API Token - `token` field |
| [Singapore Safety](https://github.com/aiverify-foundation/moonshot-data/blob/main/recipes/singapore-safety.json) | Recipe | Meta LlamaGuard | [Together Llama Guard 7B Assistant](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/together-llama-guard-7b-assistant.json) | API Token - `token` field |
| [Bias - Occupation](https://github.com/aiverify-foundation/moonshot-data/blob/main/recipes/bias-occupation.json) | Recipe | OpenAI GPT4 | [OpenAI GPT4](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/openai-gpt4.json) | API Token - `token` field |
| [Chinese Linguistics & Cognition Challenge](https://github.com/aiverify-foundation/moonshot-data/blob/main/recipes/clcc.json) | Recipe | Flageval Flag Judge | [Flageval Flag Judge](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/flageval-flagjudge.json) | - |
| [Malicious Question Generator](https://github.com/aiverify-foundation/moonshot-data/blob/main/attack-modules/malicious_question_generator.py) | Attack Module | OpenAI GPT4 | [OpenAI GPT4](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/openai-gpt4.json) | API Token - `token` field |
| [Violent Durian](https://github.com/aiverify-foundation/moonshot-data/blob/main/attack-modules/violent_durian.py) | Attack Module | OpenAI GPT4 | [OpenAI GPT4](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/openai-gpt4.json) | API Token - `token` field |

You can also check out the [metric configuration JSON](https://github.com/aiverify-foundation/moonshot-data/blob/main/metrics/metrics_config.json) to see if a cookbook or recipe uses any of these metrics.

| Test | Model Required | Name of the Endpoint |
| --- | ---| --- |
| MLCommons AI Safety Benchmarks v0.5 (Cookbook) | Meta LlamaGuard | Together Llama Guard 7B Assistant |
| Singapore Safety (Recipe) | Meta LlamaGuard | Together Llama Guard 7B Assistant |
| Malicious Question Generator (Attack Module) | OpenAI GPT4 | OpenAI GPT4 |
| Violent Durian (Attack Module) | OpenAI GPT4 | OpenAI GPT4 |

If you are not running any of the tests above, check the metric used by the specific attack module or recipe you are running to see which model connection it needs.

Expand All @@ -64,20 +71,8 @@ If you do not have tokens for Llama Guard via Together AI,
3. Replace `together-llama-guard-7b-assistant` with your new endpoint ID.
4. Save the file and run your test.
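
As a sketch, steps 3 and 4 can also be done with a small script. The config path below is an assumption (see the metric configuration JSON linked above), `my-llama-guard-endpoint` is a hypothetical endpoint ID, and the stand-in file contents are only there so the snippet runs end to end; they are not the real schema:

```python
from pathlib import Path

# Sketch of steps 3 and 4: swap the default Llama Guard endpoint ID for your own.
# Assumed path; adjust to the file you opened in the earlier steps.
cfg = Path("moonshot-data/metrics/metrics_config.json")
cfg.parent.mkdir(parents=True, exist_ok=True)  # stand-in setup for illustration
cfg.write_text('{"endpoints": ["together-llama-guard-7b-assistant"]}')  # not the real schema

# The actual edit: a plain text replacement of the endpoint ID.
text = cfg.read_text()
cfg.write_text(text.replace("together-llama-guard-7b-assistant",
                            "my-llama-guard-endpoint"))  # hypothetical new endpoint ID
```

A plain text replacement avoids depending on the file's exact structure; editing the file by hand as described above works just as well.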

### I can't delete my runner in the CLI on Windows.

We are aware of an issue with deleting runners in the CLI on the Windows operating system. You may see the following error when you attempt to delete a runner using the CLI:

```
moonshot > delete_runner new-recipe
Are you sure you want to delete the runner (y/N)? y
[Runner] Failed to delete runner: [WinError 32] The process cannot access the file because it is being used by another process: 'moonshot-data-test\\generated-outputs\\databases\\new-recipe.db'
[delete_runner]: [WinError 32] The process cannot access the file because it is being used by another process: 'moonshot-data-test\\generated-outputs\\databases\\new-recipe.db'
```

We are working on a fix. In the meantime, please exit the program and delete the runner's files via your file explorer.

### I can't save my token for the connector endpoint!
### I cannot save my token for the connector endpoint!

We acknowledge a potential issue with saving tokens via the UI. As a workaround, you can directly access the JSON file of your endpoint. This file is located in the `moonshot-data/connectors-endpoints` directory, which was created during the installation process.

Expand All @@ -101,6 +96,36 @@ Open your preferred code editor, locate the `token` field, and replace `ADD_API_

Please refresh the page.
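
The workaround can be sketched as follows. The endpoint ID `openai-gpt4` and the token value are examples only, and the snippet creates a stand-in file so that it runs end to end; in practice you would edit the existing file for your endpoint:

```python
import json
from pathlib import Path

# Workaround sketch: write the API token straight into the endpoint's JSON file.
# "openai-gpt4" is an example endpoint ID; edit the file matching your endpoint.
endpoint = Path("moonshot-data/connectors-endpoints/openai-gpt4.json")
endpoint.parent.mkdir(parents=True, exist_ok=True)           # stand-in setup for illustration
endpoint.write_text(json.dumps({"token": "ADD_API_TOKEN"}))  # default placeholder value

cfg = json.loads(endpoint.read_text())
cfg["token"] = "sk-example-token"  # replace with your real API token
endpoint.write_text(json.dumps(cfg, indent=2))
```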

### I am unable to install PyTorch

### Issues related to MacOS
#### I am unable to install PyTorch

If you are on an x86 macOS machine, you may encounter difficulties when installing the PyTorch requirement from moonshot-data. To resolve this, manually install PyTorch version 2.2.0, which is compatible with your machine's architecture.
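
As a sketch, the check and pin can be scripted. The version comes from this FAQ entry; the x86-macOS detection logic is an assumption about which setups are affected:

```python
import platform
import sys

# Sketch: decide whether to pin torch to 2.2.0. Only x86 (Intel) macOS needs
# the pin here; other platforms install PyTorch via requirements.txt as usual.
def torch_pin_command(os_name, arch):
    if os_name == "darwin" and arch == "x86_64":
        return [sys.executable, "-m", "pip", "install", "torch==2.2.0"]
    return None  # no pin needed

cmd = torch_pin_command(sys.platform, platform.machine())
# If cmd is not None, run it with subprocess.check_call(cmd).
```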


### Issues related to Windows
#### I am having issues installing some Tensorflow Python packages

At the time of writing, there seems to be no `tensorflow-io-gcs-filesystem` wheel for Windows beyond a certain version. You may encounter this issue while you're installing `moonshot-data`:

![windows-installation-error-tensorflow](./res/faq/windows-installation-error-tensorflow.png)

You can try the following:

1. In the directory where you installed `moonshot-data`, change the version of `tensorflow-io-gcs-filesystem` in `moonshot-data/requirements.txt` to `0.31.0`.
2. Install the requirements of `moonshot-data` again: `pip install -r moonshot-data/requirements.txt`.

The issue should then be resolved.
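
Step 1's edit can be sketched as a one-off script. The stand-in file contents (including the `0.37.0` pin) are invented for illustration; in practice you would edit the existing `requirements.txt`:

```python
import re
from pathlib import Path

# Sketch of step 1: pin tensorflow-io-gcs-filesystem to 0.31.0 in
# moonshot-data's requirements.txt before re-running pip install.
req = Path("moonshot-data/requirements.txt")
req.parent.mkdir(parents=True, exist_ok=True)                    # stand-in setup
req.write_text("numpy\ntensorflow-io-gcs-filesystem==0.37.0\n")  # illustrative contents

# Replace whatever version specifier is present with the 0.31.0 pin.
pinned = re.sub(r"tensorflow-io-gcs-filesystem\S*",
                "tensorflow-io-gcs-filesystem==0.31.0",
                req.read_text())
req.write_text(pinned)
```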


#### I cannot delete my runner in the CLI on Windows.

We are aware of an issue with deleting runners in the CLI on the Windows operating system. You may see the following error when you attempt to delete a runner using the CLI:

```
moonshot > delete_runner new-recipe
Are you sure you want to delete the runner (y/N)? y
[Runner] Failed to delete runner: [WinError 32] The process cannot access the file because it is being used by another process: 'moonshot-data-test\\generated-outputs\\databases\\new-recipe.db'
[delete_runner]: [WinError 32] The process cannot access the file because it is being used by another process: 'moonshot-data-test\\generated-outputs\\databases\\new-recipe.db'
```

We are working on a fix. In the meantime, please exit the program and delete the runner's files via your file explorer.
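
The manual cleanup amounts to deleting the runner's database file once Moonshot has exited. A sketch, with the path taken from the error message above and `new-recipe` as the example runner ID:

```python
from pathlib import Path

# Workaround sketch: once Moonshot has exited and no longer holds the file
# open, delete the runner's database yourself.
runner_id = "new-recipe"  # example runner ID from the error above
db = Path("moonshot-data-test/generated-outputs/databases") / (runner_id + ".db")
db.parent.mkdir(parents=True, exist_ok=True)  # stand-in setup for illustration
db.touch()
db.unlink(missing_ok=True)                    # the actual deletion step
```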
22 changes: 17 additions & 5 deletions docs/getting_started/first_test.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,22 +26,27 @@ Upon navigating to the webpage, you will be greeted with our main screen. To sta

![The main page of Moonshot UI](getting_started/1.png)

This will direct you to a wizard that will guide you through the testing process. In the first step, select the tests you would like to run on your model. By default, three baseline tests are selected.
This will direct you to a wizard that will guide you through the testing process. In the first step, select the tests you would like to run on your model. By default, three baseline tests are selected, as they are applicable to most types of applications.

!!! note
We will be testing a model from OpenAI in this guide. You will need to prepare an OpenAI API token.

![This step guides the user in selecting a set of benchmarks.](getting_started/2.png)

Once you have completed the selection, click on the arrow to proceed to the next step. In this step, you will see the total number of prompts in this set of tests. Click on the arrow again to advance to the next step.

!!! warning
<b>Important information before running your benchmark:</b>

Certain benchmarks may require metrics that connect to a particular model (e.g. MLCommons cookbooks and recipes like [mlc-cae](https://github.com/aiverify-foundation/moonshot-data/blob/main/recipes/mlc-cae.json) use the metric [llamaguardannotator](https://github.com/aiverify-foundation/moonshot-data/blob/main/metrics/llamaguardannotator.py), which requires the API token of the [together-llama-guard-7b-assistant endpoint](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/together-llama-guard-7b-assistant.json)).

Refer to this [list for the requirements](../faq.md#requirements).

![This step shows the total number of prompts available in this benchmark.](getting_started/3.png)

Connect to your AI system. Click "Edit" for one of the OpenAI models, such as OpenAI GPT-3.5 Turbo.

![alt text](getting_started/4.png)

Enter your API token on this screen, then click "Save". Repeat this step for "Together Llama Guard 7B Assistant."
Enter your API token on this screen, then click "Save". Repeat this step for "Together Llama Guard 7B Assistant", using the API token that you obtained from Together AI to set up that endpoint.

!!! note
Some cookbooks use another LLM to evaluate the response. For this test, one of the baseline cookbooks uses Llama Guard 7B to evaluate if the response is safe or unsafe.
Expand All @@ -50,7 +55,7 @@ Enter your API token on this screen, then click "Save". Repeat this step for "To

You will return to the screen to select the endpoint. Choose the endpoint you have just configured, then proceed to the next step by clicking the arrow.

Finally, enter the name and description for this test. Set the number of prompts to "1" and click "Run."
Finally, enter the name and description for this test. Set the number of prompts to "1" and click "Run". This means that only one prompt from each dataset in the cookbooks will be tested.

![alt text](getting_started/6.png)

Expand Down Expand Up @@ -90,6 +95,13 @@ In this screen, you have the option to select one of the attack modules to autom

![alt text](getting_started/13.png)

!!! warning
<b>Important information before running red teaming:</b>

Certain attack modules may require connection to certain model endpoints (e.g. [Violent Durian](https://github.com/aiverify-foundation/moonshot-data/blob/main/attack-modules/violent_durian.py) requires the endpoint [openai-gpt4](https://github.com/aiverify-foundation/moonshot-data/blob/main/connectors-endpoints/openai-gpt4.json), and you will need an API token to connect to this endpoint).

Refer to this [list for the requirements](../faq.md#requirements).

Enter a name and type a description in this screen, then click "Start".

![alt text](getting_started/14.png)
Expand Down
2 changes: 1 addition & 1 deletion docs/getting_started/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ Moonshot offers a range of benchmarks to measure your LLM application's performa

Red teaming is the adversarial prompting of LLM applications to induce them to behave in a manner incongruent with their design. This process is crucial to identify vulnerabilities in AI systems.

Moonshot simplifies the process of red teaming by providing an easy to use interface that allows for the simulataneous probing of multiple LLM applications, and equipping you with red teaming utilities like prompt templates, context strategies and attack modules.
Moonshot simplifies the process of red teaming by providing an easy-to-use interface that allows for the simultaneous probing of multiple LLM applications, and by equipping you with red teaming utilities like prompt templates, context strategies and attack modules.

#### Automated Red Teaming

Expand Down
2 changes: 1 addition & 1 deletion docs/resources/cookbooks.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,4 @@
| Facts about Singapore | This cookbook is designed to evaluate a model's knowledge of Singapore's historical events and essential facts, serving as a litmus test of its understanding of the country's unique context. In addition, there are safety prompts written in the Singapore context. By assessing a model's familiarity with Singapore's cultural and historical landscape, it provides valuable insights into the model's overall proficiency and accuracy in Singapore-specific contexts. | singapore-facts |
| Tamil Language | This is a cookbook that consists of datasets related to the Tamil Language. | 1. tamil-kural-classification<br>2. tamil-tamilnews-classification<br>3. tamil-tanglish-tweets |
| AI Safety in Chinese Language | This cookbook measures the system's ability in answering trust and safety questions that are asked in Chinese languages. | 1. cvalues<br>2. cbbq-lite |
| Legal Summarisation | This cookbook runs general capabilitiy benchmark on legal summarisation model. | 1. analogical-similarity<br>2. auto-categorisation<br>3. cause-and-effect<br>4. contextual-parametric-knowledge-conflicts<br>5. gre-reading-comprehension<br>6. squad-shifts-tnf |
| Legal Summarisation | This cookbook runs a general capability benchmark on a legal summarisation model. | 1. analogical-similarity<br>2. auto-categorisation<br>3. cause-and-effect<br>4. contextual-parametric-knowledge-conflicts<br>5. gre-reading-comprehension<br>6. squad-shifts-tnf |
