feat: Amazon Bedrock Support ⛰️ #9

Open: JGalego wants to merge 11 commits into main

JGalego commented Aug 6, 2024

This PR adds support for embedding models provided by Amazon Bedrock.

Tested with Amazon Titan Embeddings Text and Cohere Embed.

The implementation includes ad-hoc functions that sign requests with AWS Signature Version 4 (SigV4).
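
For readers unfamiliar with SigV4, the core of the scheme is an HMAC-SHA256 key-derivation chain followed by a signature over a "string to sign". The sketch below uses only the Python standard library. It is not the code from this PR; the function names are hypothetical, the credentials, date, and region are placeholders, and the signing service name `bedrock` is an assumption.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of msg under key."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key: date -> region -> service -> 'aws4_request'."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def build_string_to_sign(amz_date: str, scope: str, canonical_request: str) -> str:
    """Assemble the string to sign from the hashed canonical request."""
    hashed = hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()
    return "\n".join(["AWS4-HMAC-SHA256", amz_date, scope, hashed])


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Return the hex signature that goes into the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder values for illustration only (not real credentials).
key = derive_signing_key("EXAMPLE-SECRET", "20240806", "us-east-1", "bedrock")
scope = "20240806/us-east-1/bedrock/aws4_request"
to_sign = build_string_to_sign("20240806T111000Z", scope, "POST\n/model/...\n")
signature = sign(key, to_sign)
```

The canonical request passed to `build_string_to_sign` is abbreviated here; a real one also carries the canonical query string, headers, signed-header list, and payload hash.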

Example:

.bail on
.mode table
.header on

.timer on

-- Load extensions
.load ./dist/debug/rembed0
.load ../sqlite-vec/dist/vec0

-- Initialize rembed clients
insert into
  temp.rembed_clients(name, options)
values 
  ('amazon.titan-embed-text-v1', 'bedrock'),
  ('amazon.titan-embed-text-v2:0', 'bedrock'),
  ('cohere.embed-multilingual-v3', 'bedrock');
  
-- Debug (extensions)
select
  *
from
  pragma_function_list
where
  builtin != 1
and (
  name like 'rembed%'
or
  name like 'vec%'
);

-- Debug (models)
with models as (
  select
    name as model_id
  from
    temp.rembed_clients
)
select
  model_id,
  length(rembed(model_id, 'where is the love?'))/4 as embedding_dimension
from
  models;

-- Debug (model + inference options)
select substr(
  hex(
    rembed(
      'cohere.embed-multilingual-v3',
      'Where is the love?',
      json('{"input_type": "search_query", "truncate": "END"}')
    )
  ),
  1, 100
);

-- Create articles table
create table articles(
  headline text
);

-- Populate table with random headlines
insert into articles values
  ('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),
  ('The jury has been selected in Hunter Biden''s gun trial'),
  ('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),
  ('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),
  ('An Epoch Times executive is facing money laundering charge');


-- Build a vector table for the embeddings
create virtual table vec_articles using vec0(
  headline_embeddings float[1536]
);

-- Embed headlines
insert into vec_articles(rowid, headline_embeddings)
  select rowid, rembed('amazon.titan-embed-text-v1', headline)
  from articles;

-- Debug
select substr(hex(headline_embeddings), 1, 100) from vec_articles;

-- Run a query
.parameter set :query 'firearm courtroom'

with matches as (
  select
    rowid,
    distance
  from vec_articles
  where headline_embeddings match rembed('amazon.titan-embed-text-v1', :query)
  order by distance
  limit 3
)
select
  headline,
  distance
from matches
left join articles on articles.rowid = matches.rowid;

Output:

+--------------------------------------------------------------+------------------+
|                           headline                           |     distance     |
+--------------------------------------------------------------+------------------+
| The jury has been selected in Hunter Biden's gun trial       | 22.7055225372314 |
+--------------------------------------------------------------+------------------+
| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 27.4765567779541 |
| ted to gambling and theft                                    |                  |
+--------------------------------------------------------------+------------------+
| Larry Allen, a Super Bowl champion and famed Dallas Cowboy,  | 29.5273017883301 |
| has died at age 52                                           |                  |
+--------------------------------------------------------------+------------------+
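
Two details in the example above may not be obvious: `length(...)/4` gives the embedding dimension because each component is a 4-byte float, and the `distance` column looks like a Euclidean (L2) distance. The Python sketch below illustrates both ideas outside SQLite. It assumes little-endian float32 blobs and L2 distance (neither is confirmed by this PR), and the helper names are hypothetical.

```python
import math
import struct


def decode_f32_blob(blob: bytes) -> list:
    """Unpack a blob of packed float32 values (assumed little-endian)."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length must be a multiple of 4 bytes")
    dimension = len(blob) // 4  # same arithmetic as length(...)/4 in the SQL
    return list(struct.unpack(f"<{dimension}f", blob))


def l2_distance(a, b) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, rows, k=3):
    """Brute-force nearest neighbours: rows maps rowid -> vector."""
    scored = [(rowid, l2_distance(query, vec)) for rowid, vec in rows.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Toy 2-dimensional vectors instead of real 1536-dimensional embeddings.
rows = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [1.0, 1.0]}
nearest = top_k([0.0, 0.0], rows, k=2)
```

A vector index replaces the linear scan in `top_k`, but the ordering criterion stays the same.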

JGalego marked this pull request as ready for review on August 6, 2024, 11:10.

JGalego commented Aug 7, 2024

Tested multimodal queries with Amazon Titan Multimodal Embeddings G1. The example below requires the sqlean crypto, fileio, and text extensions.

🖼️ Data: images.zip

(Image: sqlite_vec_multimodal_bedrock)

Example:

/**************
 * MULTIMODAL *
 **************/

-- Load sqlean extensions
.load ../sqlean/dist/crypto
.load ../sqlean/dist/fileio
.load ../sqlean/dist/text

-- Add Titan Embeddings Multimodal
insert into
  temp.rembed_clients(name, options)
values
  ('amazon.titan-embed-image-v1', 'bedrock');

-- Create images table
create table multimodal(
  input_text text,
  input_image text  -- base64-encoded
);

-- Download images
.shell wget https://github.com/user-attachments/files/16530358/images.zip && unzip images.zip

-- Populate table with images
insert into multimodal
  select
    text_split(text_split(name, '/', -1), '.', 1),
    encode(fileio_read(name), 'base64')
  from
    fileio_ls('./images')
  where
    name like '%.jpg';

-- Build a vector table for the multimodal embeddings
create virtual table vec_multimodal using vec0(
  embeddings float[1024]
);

-- Embed text + images
insert into vec_multimodal(rowid, embeddings)
  select
    rowid,
    rembed(
      'amazon.titan-embed-image-v1',
      input_text,
      json(
        concat(
          '{"inputImage": "',
          input_image,
          '"}'
        )
      )
    )
  from
    multimodal;

-- Run a multimodal query
.parameter set :query_image './images/puppy.jpg'

with matches as (
  select
    rowid,
    distance
  from vec_multimodal
  where embeddings match
    rembed(
      'amazon.titan-embed-image-v1',
      text_split(text_split(:query_image, '/', -1), '.', 1),
      json(
        concat(
          '{"inputImage": "',
          encode(fileio_read(:query_image), 'base64'),
          '"}'
        )
      )
    )
  order by distance
  limit 3
)
select
  input_text,
  distance
from matches
left join
  multimodal
on
  multimodal.rowid = matches.rowid;

Output:

+------------+-------------------+
| input_text |     distance      |
+------------+-------------------+
| puppy      | 0.0               |
| kitty      | 0.604697287082672 |
| toy_train  | 0.807757079601288 |
+------------+-------------------+
Run Time: real 0.787 user 0.015932 sys 0.001098
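
The example above builds the `{"inputImage": ...}` options by string concatenation, which is safe here because the standard Base64 alphabet contains no quotes or backslashes. As a rough illustration of the same two steps (derive a label from the file name, then wrap the Base64 image in JSON), here is a Python sketch that uses a JSON serializer instead. The helper names are hypothetical and nothing below is part of this PR.

```python
import base64
import json


def label_from_path(path: str) -> str:
    """Mirror text_split(text_split(name, '/', -1), '.', 1):
    take the last path component, then the part before the first dot."""
    return path.rsplit("/", 1)[-1].split(".", 1)[0]


def image_options(image_bytes: bytes) -> str:
    """Return the JSON options string carrying the Base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"inputImage": encoded})


# Tiny stand-in payload instead of a real JPEG file.
options = image_options(b"abc")
label = label_from_path("./images/puppy.jpg")
```

Serializing with `json.dumps` avoids malformed JSON if the value ever contains characters that need escaping, at the cost of an extra step compared to `concat`.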

JGalego commented Aug 7, 2024

Added a fix for model IDs that contain characters requiring percent-encoding, such as colons (:) and forward slashes (/). For a similar issue, see curl/curl#11794.

The idea is to escape the model ID once to build the URL path (amazon.titan-embed-text-v2:0 --> amazon.titan-embed-text-v2%3A0), and then escape that result again when generating the canonical request and signature (amazon.titan-embed-text-v2%3A0 --> amazon.titan-embed-text-v2%253A0).

Note that a colon can appear more than once in a model ID, e.g. cohere.embed-english-v3:0:512.
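
The two escaping passes can be sketched in Python with `urllib.parse.quote`. This is only an illustration of the idea described above, not the code in this PR, and `escape_model_id` is a hypothetical helper name.

```python
from urllib.parse import quote


def escape_model_id(model_id: str):
    """Return (path_segment, canonical_segment) for a Bedrock model ID.

    path_segment is percent-encoded once and used in the request URL;
    canonical_segment is encoded a second time for the SigV4 canonical request.
    """
    # safe="" so that ':' and '/' are encoded as well.
    path_segment = quote(model_id, safe="")
    canonical_segment = quote(path_segment, safe="")
    return path_segment, canonical_segment


print(escape_model_id("amazon.titan-embed-text-v2:0"))
# ('amazon.titan-embed-text-v2%3A0', 'amazon.titan-embed-text-v2%253A0')
print(escape_model_id("cohere.embed-english-v3:0:512"))
# ('cohere.embed-english-v3%3A0%3A512', 'cohere.embed-english-v3%253A0%253A512')
```

Because every occurrence is encoded, model IDs with several colons need no special handling.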

JGalego commented Nov 2, 2024

@asg017 would you mind reviewing this PR?
