docs: add docs v1 and improve readme #64

Merged 2 commits on Nov 7, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -185,3 +185,7 @@ cipherstash-proxy.toml
release/

.mise.*

# jupyter notebook
.ipynb_checkpoints
__pycache__
687 changes: 265 additions & 422 deletions README.md

Large diffs are not rendered by default.

25 changes: 0 additions & 25 deletions cipherstash-proxy/cipherstash-proxy.toml.example

This file was deleted.

11 changes: 0 additions & 11 deletions cipherstash-proxy/docker-compose.yaml

This file was deleted.

File renamed without changes.
241 changes: 241 additions & 0 deletions docs/reference/INDEX.md
@@ -0,0 +1,241 @@
# EQL index configuration

The following functions allow you to configure indexes for encrypted columns.
All of these functions modify the `cs_configuration_v1` table in your database, which is added during the EQL installation.

> **IMPORTANT:** When you modify or add an index, you must re-encrypt data that's already been stored in the database.
> The CipherStash encryption solution will encrypt the data based on the current state of the configuration.
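
To see what is currently configured, you can query that table directly. A minimal sketch (the table's exact column layout is not documented here):

```sql
-- Inspect the active EQL index configuration.
-- Only the table name comes from this document; the column layout
-- may vary between EQL versions.
SELECT * FROM cs_configuration_v1;
```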

### Adding an index (`cs_add_index`)

Add an index to an encrypted column.

```sql
SELECT cs_add_index_v1(
  'table_name',  -- Name of the table
  'column_name', -- Name of the column
  'index_name',  -- Index kind ('unique', 'match', 'ore', 'ste_vec')
  'cast_as',     -- PostgreSQL type to cast decrypted data ('text', 'int', etc.)
  'opts'         -- Index options as JSONB (optional)
);
```

| Parameter | Description | Notes |
| ------------- | -------------------------------------------------- | ------------------------------------------------------------------------ |
| `table_name` | Name of target table | Required |
| `column_name` | Name of target column | Required |
| `index_name`  | The index kind                                      | Required                                                                   |
| `cast_as` | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text` |
| `opts` | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) |
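
For example, a minimal sketch that adds a `match` index to a hypothetical `users.email` column (the table and column names are illustrative):

```sql
-- Add a full text search (match) index on users.email, casting
-- decrypted values to text. 'users' and 'email' are hypothetical;
-- 'match' and 'text' come from the parameter table above.
SELECT cs_add_index_v1('users', 'email', 'match', 'text');
```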

#### Option (`cast_as`)

Supported types:

- `text`
- `int`
- `small_int`
- `big_int`
- `boolean`
- `date`
- `jsonb`
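
For instance, a sketch adding a `unique` index on a hypothetical integer column, casting decrypted values to `int`:

```sql
-- Add a unique index on a hypothetical users.account_number column.
-- Names are illustrative; 'int' is one of the supported types above.
SELECT cs_add_index_v1('users', 'account_number', 'unique', 'int');
```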

#### Options for match indexes (`opts`)

A match index enables full text search across one or more text fields in queries.

The default Match index options are:

```json
{
  "k": 6,
  "m": 2048,
  "include_original": true,
  "tokenizer": {
    "kind": "ngram",
    "token_length": 3
  },
  "token_filters": [
    {
      "kind": "downcase"
    }
  ]
}
```

- `token_filters`: a list of filters to apply to normalize tokens before indexing.
- `tokenizer`: determines how input text is split into tokens.
- `m`: the size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
- `k`: the maximum number of bits set in the bloom filter per term. Defaults to `6`.

**Token filters**

There are currently only two token filters available: `downcase` and `upcase`. These are used to normalize the text before indexing, and are also applied to query terms. An empty array can be passed to `token_filters` if no normalization of terms is required.

**Tokenizer**

There are two `tokenizer`s provided: `standard` and `ngram`.
`standard` simply splits text into tokens using this regular expression: `/[ ,;:!]/`.
`ngram` splits the text into n-grams and accepts a configuration object that allows you to specify the `token_length`.

**m** and **k**

`k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search.

`m` is the size of the bloom filter in bits. It must be a power of 2 between `32` and `65536`, and defaults to `2048`.

`k` is the number of hash functions to use per term.
This determines the maximum number of bits that will be set in the bloom filter per term.
`k` must be an integer from `3` to `16` and defaults to `6`.
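
As a sketch, custom `m`, `k`, and tokenizer settings can be passed through `opts`. The table and column names here are hypothetical; the `opts` keys mirror the default options shown above:

```sql
-- Add a match index with a larger bloom filter (m = 4096) and more
-- hash functions per term (k = 8). 'users' and 'notes' are
-- hypothetical names.
SELECT cs_add_index_v1(
  'users',
  'notes',
  'match',
  'text',
  '{"k": 8, "m": 4096, "tokenizer": {"kind": "ngram", "token_length": 3}}'
);
```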

**Caveats around n-gram tokenization**

While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach.
Specifically, searching for strings _shorter_ than the `token_length` parameter will not _generally_ work.

If you're using `ngram` as a tokenizer, then a token that is already shorter than the `token_length` parameter will be kept as-is when indexed, and so a search for that short token will match that record.
However, if that same short string only appears as part of a larger token, then it will not match that record.
In general, therefore, you should try to ensure that the string you search for is at least as long as the `token_length` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token.

#### Options for ste_vec indexes (`opts`)

An ste_vec index on an encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE).

An ste_vec index requires one piece of configuration: the `context` (a string), which is passed as an info string to a MAC (Message Authentication Code).
This ensures that all of the encrypted values are unique to that context.
It is generally recommended to use the table and column name as the context (e.g. `users/name`).

Within a dataset, encrypted columns indexed using an `ste_vec` that use different contexts cannot be compared.
Containment queries that manage to mix index terms from multiple columns will never return a positive result.
This is by design.
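
A minimal sketch of adding an `ste_vec` index follows. Note that the `context` key in `opts` is an assumption based on the description above (check your EQL release for the exact key), and the table and column names are hypothetical:

```sql
-- Add an ste_vec index on a hypothetical users.encrypted_account
-- JSONB column. The "context" opts key is an assumption; the
-- table/column context value follows the recommendation above.
SELECT cs_add_index_v1(
  'users',
  'encrypted_account',
  'ste_vec',
  'jsonb',
  '{"context": "users/encrypted_account"}'
);
```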

The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node.

The indexer supports the complete set of JSON types listed below; null values are ignored:

- Object `{ ... }`
- Array `[ ... ]`
- String `"abc"`
- Boolean `true`
- Number `123.45`

For a document like this:

```json
{
  "account": {
    "email": "alice@example.com",
    "name": {
      "first_name": "Alice",
      "last_name": "McCrypto"
    },
    "roles": ["admin", "owner"]
  }
}
```

Hashes would be produced from the following list of entries:

```js
[
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("first_name"), String("Alice")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("last_name"), String("McCrypto")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("owner")],
];
```

Using the first entry to illustrate how an entry is converted to hashes:

```js
[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")];
```

The hashes would be generated for all prefixes of the full path to the leaf node.

```js
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  // (remaining leaf nodes omitted)
];
```

Query terms are processed in the same manner as the input document.

A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example:

```json
{
  "account": {
    "email": "alice@example.com",
    "roles": "admin"
  }
}
```

The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com".

When reduced to a prefix list, it would look like this:

```js
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  [Obj, Key("account"), Obj, Key("roles")],
  [Obj, Key("account"), Obj, Key("roles"), Array],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
];
```

This is then turned into an ste_vec of hashes, which can be queried directly against the index.
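
Putting it together, a sketch of a full containment query. The table and column names are hypothetical, and `$query` stands for the encrypted query document:

```sql
-- Find records whose encrypted_account column contains the query
-- subset. 'users' and 'encrypted_account' are illustrative names;
-- the expression itself comes from the text above.
SELECT *
FROM users
WHERE cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query);
```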

### Modifying an index (`cs_modify_index`)

Modifies an existing index configuration.
Accepts the same parameters as `cs_add_index`.

```sql
SELECT cs_modify_index_v1(
  table_name text,
  column_name text,
  index_name text,
  cast_as text,
  opts jsonb
);
```
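
For example, a sketch that switches the hypothetical `users.email` match index from earlier to the `standard` tokenizer:

```sql
-- Change the match index on users.email to use the standard
-- tokenizer. Names and opts are illustrative.
SELECT cs_modify_index_v1(
  'users',
  'email',
  'match',
  'text',
  '{"tokenizer": {"kind": "standard"}}'
);
```

As noted at the top of this document, modifying an index means data that's already stored must be re-encrypted.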

### Removing an index (`cs_remove_index`)

Removes an index configuration from the column.

```sql
SELECT cs_remove_index_v1(
  table_name text,
  column_name text,
  index_name text
);
```
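
And a matching sketch that removes the hypothetical index:

```sql
-- Remove the match index from users.email. Names are illustrative.
SELECT cs_remove_index_v1('users', 'email', 'match');
```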
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
5 changes: 5 additions & 0 deletions playground/.envrc.example
@@ -0,0 +1,5 @@
export CS_WORKSPACE_ID=1234
export CS_CLIENT_ACCESS_KEY=1234
export CS_ENCRYPTION__CLIENT_ID=1234
export CS_ENCRYPTION__CLIENT_KEY=1234
export CS_DATASET_ID=1234
File renamed without changes.
8 changes: 8 additions & 0 deletions playground/db/Dockerfile
@@ -0,0 +1,8 @@
FROM curlimages/curl:7.85.0 AS fetch-eql
WORKDIR /out
RUN curl -sLo /out/cipherstash-encrypt.sql https://github.com/cipherstash/encrypt-query-language/releases/download/eql-0.4.2/cipherstash-encrypt.sql

FROM postgres:16.2-bookworm AS db
WORKDIR /app
COPY init.sh /docker-entrypoint-initdb.d
COPY --from=fetch-eql /out/cipherstash-encrypt.sql /app/scripts/db/cipherstash-encrypt.sql
3 changes: 3 additions & 0 deletions playground/db/init.sh
@@ -0,0 +1,3 @@
#!/bin/bash

psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -a -f /app/scripts/db/cipherstash-encrypt.sql
41 changes: 41 additions & 0 deletions playground/docker-compose.yml
@@ -0,0 +1,41 @@
services:
  postgres:
    container_name: eql-playground-pg
    build:
      context: ./db
    command: [ "postgres", "-c", "log_statement=all" ]
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - ${PGPORT:-5432}:5432
    networks:
      - eql-playground-nw
  proxy:
    container_name: postgres_proxy
    image: cipherstash/cipherstash-proxy:cipherstash-proxy-v0.3.4
    depends_on:
      - postgres
    ports:
      - ${CS_PORT:-6432}:${CS_PORT:-6432}
    environment:
      CS_WORKSPACE_ID: $CS_WORKSPACE_ID
      CS_CLIENT_ACCESS_KEY: $CS_CLIENT_ACCESS_KEY
      CS_ENCRYPTION__CLIENT_ID: $CS_ENCRYPTION__CLIENT_ID
      CS_ENCRYPTION__CLIENT_KEY: $CS_ENCRYPTION__CLIENT_KEY
      CS_ENCRYPTION__DATASET_ID: $CS_DATASET_ID
      CS_TEST_ON_CHECKOUT: "true"
      CS_AUDIT__ENABLED: "false"
      CS_DATABASE__PORT: 5432
      CS_DATABASE__USERNAME: postgres
      CS_DATABASE__PASSWORD: postgres
      CS_DATABASE__NAME: postgres
      CS_DATABASE__HOST: eql-playground-pg
      CS_UNSAFE_LOGGING: "true"
    networks:
      - eql-playground-nw

networks:
  eql-playground-nw:
    driver: bridge