docs: add docs v1 and improve readme #64

Merged 2 commits on Nov 7, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -185,3 +185,7 @@ cipherstash-proxy.toml
release/

.mise.*

# jupyter notebook
.ipynb_checkpoints
__pycache__
687 changes: 265 additions & 422 deletions README.md

Large diffs are not rendered by default.

25 changes: 0 additions & 25 deletions cipherstash-proxy/cipherstash-proxy.toml.example

This file was deleted.

11 changes: 0 additions & 11 deletions cipherstash-proxy/docker-compose.yaml

This file was deleted.

File renamed without changes.
241 changes: 241 additions & 0 deletions docs/reference/INDEX.md
@@ -0,0 +1,241 @@
# EQL index configuration

The following functions allow you to configure indexes for encrypted columns.
All of these functions modify the `cs_configuration_v1` table in your database, which is added during the EQL installation.

> **IMPORTANT:** When you modify or add an index, you must re-encrypt data that's already been stored in the database.
> The CipherStash encryption solution will encrypt the data based on the current state of the configuration.
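
To see what is currently configured, you can query that table directly. A minimal sketch (the table's exact column layout is not documented here):

```sql
-- Inspect the active EQL index configuration.
-- Only the table name comes from this document; the column layout
-- may vary between EQL versions.
SELECT * FROM cs_configuration_v1;
```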

### Adding an index (`cs_add_index`)

Add an index to an encrypted column.

```sql
SELECT cs_add_index_v1(
  'table_name',  -- Name of the table
  'column_name', -- Name of the column
  'index_name',  -- Index kind ('unique', 'match', 'ore', 'ste_vec')
  'cast_as',     -- PostgreSQL type to cast decrypted data ('text', 'int', etc.)
  'opts'         -- Index options as JSONB (optional)
);
```

| Parameter | Description | Notes |
| ------------- | -------------------------------------------------- | ------------------------------------------------------------------------ |
| `table_name` | Name of target table | Required |
| `column_name` | Name of target column | Required |
| `index_name`  | The index kind                                      | Required                                                                   |
| `cast_as` | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text` |
| `opts` | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) |
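
For example, a minimal sketch that adds a `match` index to a hypothetical `users.email` column (the table and column names are illustrative):

```sql
-- Add a full text search (match) index on users.email, casting
-- decrypted values to text. 'users' and 'email' are hypothetical;
-- 'match' and 'text' come from the parameter table above.
SELECT cs_add_index_v1('users', 'email', 'match', 'text');
```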

#### Option (`cast_as`)

Supported types:

- `text`
- `int`
- `small_int`
- `big_int`
- `boolean`
- `date`
- `jsonb`
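
For instance, a sketch adding a `unique` index on a hypothetical integer column, casting decrypted values to `int`:

```sql
-- Add a unique index on a hypothetical users.account_number column.
-- Names are illustrative; 'int' is one of the supported types above.
SELECT cs_add_index_v1('users', 'account_number', 'unique', 'int');
```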

#### Options for match indexes (`opts`)

A match index enables full text search across one or more text fields in queries.

The default Match index options are:

```json
{
  "k": 6,
  "m": 2048,
  "include_original": true,
  "tokenizer": {
    "kind": "ngram",
    "token_length": 3
  },
  "token_filters": [
    {
      "kind": "downcase"
    }
  ]
}
```

- `token_filters`: a list of filters to apply to normalize tokens before indexing.
- `tokenizer`: determines how input text is split into tokens.
- `m`: the size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
- `k`: the maximum number of bits set in the bloom filter per term. Defaults to `6`.

**Token filters**

There are currently only two token filters available: `downcase` and `upcase`. These are used to normalize the text before indexing, and are also applied to query terms. An empty array can be passed to `token_filters` if no normalization of terms is required.

**Tokenizer**

There are two `tokenizer`s provided: `standard` and `ngram`.
`standard` simply splits text into tokens using this regular expression: `/[ ,;:!]/`.
`ngram` splits the text into n-grams and accepts a configuration object that allows you to specify the `token_length`.

**m** and **k**

`k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search.

`m` is the size of the bloom filter in bits. It must be a power of 2 between `32` and `65536`, and defaults to `2048`.

`k` is the number of hash functions to use per term.
This determines the maximum number of bits that will be set in the bloom filter per term.
`k` must be an integer from `3` to `16` and defaults to `6`.
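
As a sketch, custom `m`, `k`, and tokenizer settings can be passed through `opts`. The table and column names here are hypothetical; the `opts` keys mirror the default options shown above:

```sql
-- Add a match index with a larger bloom filter (m = 4096) and more
-- hash functions per term (k = 8). 'users' and 'notes' are
-- hypothetical names.
SELECT cs_add_index_v1(
  'users',
  'notes',
  'match',
  'text',
  '{"k": 8, "m": 4096, "tokenizer": {"kind": "ngram", "token_length": 3}}'
);
```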

**Caveats around n-gram tokenization**

While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach.
Specifically, searching for strings _shorter_ than the `token_length` parameter will not _generally_ work.

If you're using `ngram` as a tokenizer, then a token that is already shorter than the `token_length` parameter will be kept as-is when indexed, and so a search for that short token will match that record.
However, if that same short string only appears as part of a larger token, then it will not match that record.
In general, therefore, you should try to ensure that the string you search for is at least as long as the `token_length` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token.

#### Options for ste_vec indexes (`opts`)

An ste_vec index on an encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE).

An ste_vec index requires one piece of configuration: the `context` (a string), which is passed as an info string to a MAC (Message Authentication Code).
This ensures that all of the encrypted values are unique to that context.
It is generally recommended to use the table and column name as the context (e.g. `users/name`).

Within a dataset, encrypted columns indexed using an `ste_vec` that use different contexts cannot be compared.
Containment queries that manage to mix index terms from multiple columns will never return a positive result.
This is by design.
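
A minimal sketch of adding an `ste_vec` index follows. Note that the `context` key in `opts` is an assumption based on the description above (check your EQL release for the exact key), and the table and column names are hypothetical:

```sql
-- Add an ste_vec index on a hypothetical users.encrypted_account
-- JSONB column. The "context" opts key is an assumption; the
-- table/column context value follows the recommendation above.
SELECT cs_add_index_v1(
  'users',
  'encrypted_account',
  'ste_vec',
  'jsonb',
  '{"context": "users/encrypted_account"}'
);
```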

The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node.

The indexer supports the complete set of JSON types listed below; null values are ignored:

- Object `{ ... }`
- Array `[ ... ]`
- String `"abc"`
- Boolean `true`
- Number `123.45`

For a document like this:

```json
{
  "account": {
    "email": "alice@example.com",
    "name": {
      "first_name": "Alice",
      "last_name": "McCrypto"
    },
    "roles": ["admin", "owner"]
  }
}
```

Hashes would be produced from the following list of entries:

```js
[
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("first_name"), String("Alice")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("last_name"), String("McCrypto")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("owner")],
];
```

Using the first entry to illustrate how an entry is converted to hashes:

```js
[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")];
```

The hashes would be generated for all prefixes of the full path to the leaf node.

```js
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  // (remaining leaf nodes omitted)
];
```

Query terms are processed in the same manner as the input document.

A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example:

```json
{
  "account": {
    "email": "alice@example.com",
    "roles": "admin"
  }
}
```

The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com".

When reduced to a prefix list, it would look like this:

```js
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")],
  [Obj, Key("account"), Obj, Key("roles")],
  [Obj, Key("account"), Obj, Key("roles"), Array],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
];
```

This is then turned into an ste_vec of hashes, which can be queried directly against the index.
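
Putting it together, a sketch of a full containment query. The table and column names are hypothetical, and `$query` stands for the encrypted query document:

```sql
-- Find records whose encrypted_account column contains the query
-- subset. 'users' and 'encrypted_account' are illustrative names;
-- the expression itself comes from the text above.
SELECT *
FROM users
WHERE cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query);
```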

### Modifying an index (`cs_modify_index`)

Modifies an existing index configuration.
Accepts the same parameters as `cs_add_index`.

```sql
SELECT cs_modify_index_v1(
  table_name text,
  column_name text,
  index_name text,
  cast_as text,
  opts jsonb
);
```
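
For example, a sketch that switches the hypothetical `users.email` match index from earlier to the `standard` tokenizer:

```sql
-- Change the match index on users.email to use the standard
-- tokenizer. Names and opts are illustrative.
SELECT cs_modify_index_v1(
  'users',
  'email',
  'match',
  'text',
  '{"tokenizer": {"kind": "standard"}}'
);
```

As noted at the top of this document, modifying an index means data that's already stored must be re-encrypted.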

### Removing an index (`cs_remove_index`)

Removes an index configuration from the column.

```sql
SELECT cs_remove_index_v1(
  table_name text,
  column_name text,
  index_name text
);
```
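
And a matching sketch that removes the hypothetical index:

```sql
-- Remove the match index from users.email. Names are illustrative.
SELECT cs_remove_index_v1('users', 'email', 'match');
```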
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
5 changes: 5 additions & 0 deletions playground/.envrc.example
@@ -0,0 +1,5 @@
export CS_WORKSPACE_ID=1234
export CS_CLIENT_ACCESS_KEY=1234
export CS_ENCRYPTION__CLIENT_ID=1234
export CS_ENCRYPTION__CLIENT_KEY=1234
export CS_DATASET_ID=1234
File renamed without changes.
8 changes: 8 additions & 0 deletions playground/db/Dockerfile
@@ -0,0 +1,8 @@
FROM curlimages/curl:7.85.0 AS fetch-eql
WORKDIR /out
RUN curl -sLo /out/cipherstash-encrypt.sql https://github.com/cipherstash/encrypt-query-language/releases/download/eql-0.4.2/cipherstash-encrypt.sql

FROM postgres:16.2-bookworm AS db
WORKDIR /app
COPY init.sh /docker-entrypoint-initdb.d
COPY --from=fetch-eql /out/cipherstash-encrypt.sql /app/scripts/db/cipherstash-encrypt.sql
3 changes: 3 additions & 0 deletions playground/db/init.sh
@@ -0,0 +1,3 @@
#!/bin/bash

psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -a -f /app/scripts/db/cipherstash-encrypt.sql
41 changes: 41 additions & 0 deletions playground/docker-compose.yml
@@ -0,0 +1,41 @@
services:
  postgres:
    container_name: eql-playground-pg
    build:
      context: ./db
    command: [ "postgres", "-c", "log_statement=all" ]
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - ${PGPORT:-5432}:5432
    networks:
      - eql-playground-nw
  proxy:
    container_name: postgres_proxy
    image: cipherstash/cipherstash-proxy:cipherstash-proxy-v0.3.4
    depends_on:
      - postgres
    ports:
      - ${CS_PORT:-6432}:${CS_PORT:-6432}
    environment:
      CS_WORKSPACE_ID: $CS_WORKSPACE_ID
      CS_CLIENT_ACCESS_KEY: $CS_CLIENT_ACCESS_KEY
      CS_ENCRYPTION__CLIENT_ID: $CS_ENCRYPTION__CLIENT_ID
      CS_ENCRYPTION__CLIENT_KEY: $CS_ENCRYPTION__CLIENT_KEY
      CS_ENCRYPTION__DATASET_ID: $CS_DATASET_ID
      CS_TEST_ON_CHECKOUT: "true"
      CS_AUDIT__ENABLED: "false"
      CS_DATABASE__PORT: 5432
      CS_DATABASE__USERNAME: postgres
      CS_DATABASE__PASSWORD: postgres
      CS_DATABASE__NAME: postgres
      CS_DATABASE__HOST: eql-playground-pg
      CS_UNSAFE_LOGGING: "true"
    networks:
      - eql-playground-nw

networks:
  eql-playground-nw:
    driver: bridge