Feat/worksheet nlq beta #1639

noah-paige · 2024-10-14T16:58:07Z

Feature or Bugfix

Feature

Detail

Add Natural Language Querying Feature in data.all
- Text To SQL Generation
- Text Document Analysis

Relates

Natural Language Querying (NLQ) using genAI #1659

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no eval or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

frontend/src/modules/Worksheets/components/TextDisplay.js

frontend/src/modules/Worksheets/components/WorksheetDocAnalyzer.js

frontend/src/modules/Worksheets/views/WorksheetView.js

frontend/src/modules/Worksheets/components/WorksheetTextToSQLEditor.js

frontend/src/modules/Worksheets/views/WorksheetView.js

dlpzx

Reviewed and tested frontend and comments from yesterday

noah-paige · 2024-10-24T15:59:04Z

Some example views from the UI:

Text To SQL:

Document Analyzer:

UserGuide Information in this PR:
documentation/userguide/docs/worksheets.md - changes in this PR

cc: @zsaltys @TejasRGitHub @anushka-singh

noah-paige · 2024-10-24T17:37:11Z

@dlpzx - added additional tests and believe I have addressed all comments

The one pending item: when testing the happy paths for both the TextToSQL and analyzeTextDocument GQL operations, the integration tests often fail with the following error:

An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests. Wait before trying again.

If you retry the test in isolation or after some time, it passes. Looking at the service quota limits, Claude 3.5 Sonnet has a fairly low maximum of 50 model invocations per account per minute. We are still far under that per-minute threshold, but I think that quota or a similar one could be causing the throttling exception. Other Claude 3 models tolerate a much higher number of invocations per minute.

Not sure if we should opt for a Claude 3 model for the time being, both to remediate this testing issue and to improve UX for users of these features. Thoughts?
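Independent of the model choice, one way to make the tests tolerant of the ThrottlingException above is to retry the call with exponential backoff and jitter. The sketch below is a minimal, standard-library-only illustration and not the code in this PR: `invoke_with_backoff`, `is_throttled`, and `FakeThrottlingError` are hypothetical names, and the error shape assumes the `response['Error']['Code']` layout that boto3's `ClientError` exposes.

```python
import random
import time


class FakeThrottlingError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code='ThrottlingException'):
        super().__init__(code)
        # Mirrors the dictionary layout of ClientError.response
        self.response = {'Error': {'Code': code}}


def is_throttled(exc):
    """Return True if the exception carries a ThrottlingException error code."""
    response = getattr(exc, 'response', None) or {}
    return response.get('Error', {}).get('Code') == 'ThrottlingException'


def invoke_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call() and retry on throttling with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise non-throttling errors, and the last failed attempt.
            if not is_throttled(exc) or attempt == max_attempts:
                raise
            # Delay ceiling doubles each attempt: 1s, 2s, 4s, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

In a test, `call` would be a zero-argument callable wrapping the Bedrock client invocation. The `sleep` parameter is injectable so that unit tests can record the delays instead of actually waiting.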

backend/dataall/modules/worksheets/aws/bedrock_prompts/process_text_template.txt

backend/dataall/modules/worksheets/services/worksheet_service.py

dlpzx

Looks good, but needs to resolve conflicts

dlpzx · 2024-10-31T12:31:12Z

backend/dataall/modules/worksheets/aws/bedrock_client.py

+    def __init__(self):
+        self._session = SessionHelper.get_session()
+        self._client = self._session.client('bedrock-runtime')
+        model_id = 'anthropic.claude-3-5-sonnet-20240620-v1:0'


I am facing errors of the type:

An error occurred (ValidationException) when calling the InvokeModel operation: Invocation of model ID anthropic.claude-3-5-sonnet-20240620-v1:0 with on-demand throughput isn’t supported. Retry your request with the ID or ARN of an inference profile that contains this model.

I think it is related to the enforced cross-region inference.

I wonder if you have faced something similar when using this model?
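As the error message suggests, the request can be retried with an inference profile ID instead of the bare model ID. The sketch below shows one possible way to derive such an ID from the deployment region. It is an assumption-laden illustration, not the fix adopted in this PR: `inference_profile_id` and `GEO_PREFIXES` are hypothetical names, and the mapping assumes that system-defined cross-region profiles are named by prefixing the model ID with a geography code such as `us.` or `eu.`. Whether a profile exists for a given model and region should be checked in the Bedrock console.

```python
MODEL_ID = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

# Assumed mapping from the first segment of the region name to the
# geography prefix used by system-defined inference profiles.
GEO_PREFIXES = {'us': 'us', 'eu': 'eu', 'ap': 'apac'}


def inference_profile_id(model_id, region):
    """Return a cross-region inference profile ID, or the model ID if no prefix is known."""
    # GovCloud regions are deliberately left untouched in this sketch.
    if region.startswith('us-gov'):
        return model_id
    geo = GEO_PREFIXES.get(region.split('-')[0])
    if geo is None:
        return model_id
    return f'{geo}.{model_id}'


print(inference_profile_id(MODEL_ID, 'us-east-1'))
```

The returned string would be passed wherever the client currently uses `model_id`. Falling back to the unmodified model ID keeps on-demand invocation working in regions where no profile prefix is assumed.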

# Conflicts:
# backend/dataall/modules/s3_datasets/services/dataset_service.py
# backend/dataall/modules/worksheets/api/resolvers.py
# backend/dataall/modules/worksheets/services/worksheet_service.py
# deploy/stacks/lambda_api.py
# frontend/src/modules/Worksheets/views/WorksheetView.js

~~⚠️ merge after #1639~~ (cherry-picked the resource_thresholds feature)

### Feature or Bugfix
- Feature

### Detail
- Implements #1599 - see design for full explanation

### Relates
- #1599

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Co-authored-by: Pelin Kuran <52086912+pelinKuran@users.noreply.github.com>
Co-authored-by: kalosp <kalosp@amazon.com>

LoveBroman and others added 30 commits August 21, 2024 15:57

Added parts of nlq project

270f0ae

final hahah

74dc110

Merge branch 'love-main' into feat/worksheet-nlq-beta

c76983d

fixed cdk, unstructured data and migrations

3b7d8a7

Merge latest from fork repo

398a149

Deleted some files

ab82d50

removed account info from cdk.json and reverted to default

1d579bb

ruff formatting ran

666e272

config.json set to default

bd3b285

fixed hardcoding in resource_treshold

09dfc22

Merge remote-tracking branch 'refs/remotes/origin/main' into feat/worksheet-nlq-beta # Conflicts: # backend/requirements.txt

07b100e

Merge branch 'love-main' into feat/worksheet-nlq-beta

6371462

ruff

f0ea070

Merge branch 'os-main' into feat/worksheet-nlq-beta

be51a84

Merge branch 'os-main' into feat/worksheet-nlq-beta

1b7151d

Resolve incorrect diffs os main + nlq branch

242498a

fix version json

8471484

fix config json formatting

838f46c

format cdk json

fec8b79

Re format Lambda IAM Permissions CDK App

fb0b4f1

rename treshold to threshold

411d4fa

push changes invocation count decorator

5700d34

textToSql Edits

efb11fc

clean up AWS Clients

0361931

Refactor Worksheet Views

6068eae

rename FE components and fix textToSQL inputs

7f28408

textToSql using env group role

af0d206

clean up and refactor unstructured text use case

3cb2006

Combine Bedrock Clients to 1 and add new file to store prompts

5dbb773

linting and touch ups

f7520db

noah-paige added 8 commits October 23, 2024 18:07

Updates to resource threhsold module based on comments

205a2c4

fix when creating session, format worksheet resolver + services

8816fde

resolve more PR comments backend

3c39b5a

change config json parameter path

0b66a79

update typos in docs

3eee507

positional args FE

b735f2e

fix imports

5001ad2

Merge branch 'os-main' into feat/worksheet-nlq-beta

9bca372