
hive as a persisted operations/documents store #659

Closed

n1ru4l opened this issue Nov 16, 2022 · 7 comments

Labels: enhancement (New feature or request that adds new things or value to Hive)

n1ru4l commented Nov 16, 2022

Having GraphQL Hive act as a persisted operations store would be great.

Critical user flow

  • publish an app deployment (app name + app version + set of persisted documents) via the Hive CLI
  • use the Hive CLI to mark an app deployment as active (as soon as it is active, its persisted documents should be available via the CDN)
  • the Hive SDK should opt into using the persisted document store from the Hive CDN with a single option, no additional configuration needed (see the sketch after this list)
    • @graphql-hive/envelope
    • @graphql-hive/yoga
    • @graphql-hive/apollo
    • Apollo Router
  • users configure urql, Apollo, and Relay to use persisted documents with Hive, guided by our documentation (which also references the frameworks/libraries for further information)
  • once an app deployment is no longer used → retire the app via the CLI or the Hive app UI (its persisted documents are no longer available via the persisted documents CDN)
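A minimal sketch of what that single opt-in option could look like for @graphql-hive/yoga; the persistedDocuments option name and shape are illustrative assumptions for this proposal, not a shipped API:

import { createYoga } from "graphql-yoga";
import { useHive } from "@graphql-hive/yoga"; // package from the list above; the option shape below is assumed
import { schema } from "./schema";

const yoga = createYoga({
  schema,
  plugins: [
    useHive({
      token: process.env.HIVE_TOKEN!,
      // hypothetical single option: resolve incoming document hashes via the Hive CDN
      persistedDocuments: {
        cdn: {
          endpoint: "https://cdn.graphql-hive.com/artifacts/v1/<target-id>", // placeholder URL
          accessToken: process.env.HIVE_CDN_TOKEN!,
        },
      },
    }),
  ],
});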

Technical Challenges

  • How do we “store”/insert all the operations?
    • We don’t want to send one POST request to the API that includes 1000 GraphQL documents 🤔
    • We should only POST batches (see the sketch after this list)
  • API: We need an async task runner for writing operations from “storage” to “CDN”, and also for cleaning them up when an app deployment gets retired (we cannot send 1000 requests to S3 from within a single HTTP call, can we? :)) - We decided that it makes most sense to write to S3 immediately in the incoming request; we can check whether the deployment is active via another S3 key lookup
  • How do we extract operations with Hive? Do we rely on people providing a JSON file?
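A minimal sketch of the batching idea, assuming a manifest in the common hash → document JSON shape (as emitted by e.g. Relay or GraphQL Code Generator) and a hypothetical upload endpoint:

const BATCH_SIZE = 100; // assumed batch size

async function publishDocuments(
  manifest: Record<string, string>, // operation hash -> GraphQL document
  endpoint: string, // hypothetical Hive API endpoint for an app deployment
  token: string,
): Promise<void> {
  const entries = Object.entries(manifest);
  for (let i = 0; i < entries.length; i += BATCH_SIZE) {
    const batch = Object.fromEntries(entries.slice(i, i + BATCH_SIZE));
    const response = await fetch(endpoint, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({ documents: batch }),
    });
    if (!response.ok) {
      throw new Error(`Batch upload failed with status ${response.status}`);
    }
  }
}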

Nice to have (stretch goals; follow-up tasks)

  • Use persisted documents store for web app
  • Use persisted documents store for Hive CLI
  • HTTP persisted operations store specification (for the ecosystem ™️; not just us)
  • Opt into breaking changes based on active app deployments (usage) instead of operation usage (if one operation of an app deployment is used → removing any schema coordinate within that app deployment is a breaking change, as the app is still active)
  • Because we now know which client and client version executes which operations, we can notice when incoming usage no longer references the operations of a persisted operations set, and thus notify Hive users when a specific version of their product is no longer used / low in usage


Allow users to decide whether to do breaking change detection based on app deployments or usage data.

Delete persisted operation deployment flow

  1. Drop the persisted operation deployment via the UI or CLI
  2. Schedule an async task for actually deleting the persisted operation documents from S3 (see the sketch after this list)
  3. (optional) Release the deployment's schema coordinates for conditional breaking change detection
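A minimal sketch of step 2, assuming the S3 key structure described under "Details" below so that one deployment maps to one key prefix (the AWS SDK v3 calls are real; the prefix layout is this proposal's assumption):

import {
  S3Client,
  ListObjectsV2Command,
  DeleteObjectsCommand,
} from "@aws-sdk/client-s3";

// Async task: remove every persisted document under a retired deployment's prefix.
async function deleteDeploymentDocuments(
  s3: S3Client,
  bucket: string,
  prefix: string,
): Promise<void> {
  let continuationToken: string | undefined;
  do {
    const page = await s3.send(
      new ListObjectsV2Command({
        Bucket: bucket,
        Prefix: prefix,
        ContinuationToken: continuationToken,
      }),
    );
    const objects = (page.Contents ?? []).flatMap((obj) =>
      obj.Key ? [{ Key: obj.Key }] : [],
    );
    if (objects.length > 0) {
      // DeleteObjects accepts up to 1000 keys, matching the ListObjectsV2 page size
      await s3.send(
        new DeleteObjectsCommand({ Bucket: bucket, Delete: { Objects: objects } }),
      );
    }
    continuationToken = page.NextContinuationToken;
  } while (continuationToken);
}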

We must figure out how to incorporate the persisted operation schema coordinates into the hive check/breaking change detection flow (usage data). As long as a persisted operation deployment is active, its fields count as in use, even if there is no data in the retention period. After a persisted operation deployment has been deleted/marked as retired/inactive, the removal of the deployment's schema coordinates is no longer blocked by it.

Based on the usage data, we can notify users when a client version seems unused (e.g. old mobile client).

Documentation

  • We should mainly advertise this as a security and "performance" improvement feature
    • Don't execute arbitrary queries
    • Reduce client -> origin upstream traffic (heavy GraphQL documents being sent over the wire!)

Details

Some ideas on how to store stuff...

S3 Key Structure

Here we write the GraphQL documents as long as the deployment is active - we need to ensure they are removed from S3 once the deployment becomes inactive. Thus, a transactional background job seems inevitable. A write sketch follows the key pattern below.

persisted/{orgId}/{project}/{target}/{client}/{clientVersion}/{operationHash}
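A minimal sketch of writing one document under that key pattern (AWS SDK v3; the bucket name and key layout are assumptions of this proposal):

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Key pattern from above: persisted/{orgId}/{project}/{target}/{client}/{clientVersion}/{operationHash}
function documentKey(
  orgId: string,
  project: string,
  target: string,
  client: string,
  clientVersion: string,
  operationHash: string,
): string {
  return `persisted/${orgId}/${project}/${target}/${client}/${clientVersion}/${operationHash}`;
}

async function writePersistedDocument(key: string, document: string): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: "hive-persisted-documents", // assumed bucket name
      Key: key,
      Body: document,
      ContentType: "application/graphql",
    }),
  );
}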

SQL

CREATE TABLE "persisted_document_deployments" (
  "target_id" uuid NOT NULL,
  "client_name" text NOT NULL,
  "client_version" text NOT NULL,
  "is_active" boolean -- if it is active you should not be able to add new operations to it
);

CREATE TABLE "persisted_documents" (
  "id" uuid,
  "persisted_document_deployment_id" uuid REFERENCES "persisted_document_deployment"("id")
  "hash" text NOT NULL,
  "operation_document" text,
  "document_s3_location" text NOT NULL, -- we should store a reference (in case we at some point have to change the key structure/pattern
  "schema_coordinates" text[], -- see notes
  "created_at" TIMESTAMPTZ NOT NULL DEFAULT NOW() -- this column is most likely unnecessary
);

-- Everything below is not necessarily required for the initial version - but could help with breaking change detection...

CREATE INDEX "persisted_documents_pagination" on "persisted_documents" USING GIN ("schema_coordinates");

-- get list of all operations that are related to a set of schema coordinates
SELECT
  "persisted_documents"."hash"
FROM
  "persisted_documents"
  INNER JOIN
    "persisted_document_deployments"
      ON "persisted_document_deployments"."id" = "persisted_documents"."persisted_document_deployment_id"
WHERE
  "persisted_document_deployments"."is_active" = TRUE
  AND "persisted_documents"."schema_coordinates" && '{A.foo,B.ff}';

When a deployment has been created and "frozen", we could generate the schema coordinate → hash mapping for quick lookups of which operations a schema coordinate impacts (a sketch follows below). 🤔
Alternatively, we can execute the SQL live for each active deployment.
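A minimal sketch of that frozen mapping, assuming each document's hash and schema coordinates are already known at publish time:

// Build a lookup from schema coordinate to the hashes of the persisted
// documents that use it, once a deployment is frozen.
function buildCoordinateIndex(
  documents: Array<{ hash: string; schemaCoordinates: string[] }>,
): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const doc of documents) {
    for (const coordinate of doc.schemaCoordinates) {
      const hashes = index.get(coordinate) ?? [];
      hashes.push(doc.hash);
      index.set(coordinate, hashes);
    }
  }
  return index;
}

// Example: which persisted operations does removing "User.email" affect?
// buildCoordinateIndex(documents).get("User.email") ?? []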

Unsure whether we should store all the schema coordinates used within a document alongside the document. 🤔

Pros:

  • When we introduce usage reporting that only sends the hash of the operation (instead of all the schema coordinates), we don't need to process the whole document with a GraphQL visitor to write data to ClickHouse (a visitor sketch follows below)
  • Could be indexed as a lookup table to find which app deployments are affected by a breaking change...

Cons:

  • We store a lot more data
  • Inserts might become slower
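For reference, a minimal sketch of the visitor-based extraction we would otherwise run for every report, using the graphql package's TypeInfo utilities (this only collects field coordinates, not arguments or enum values):

import { GraphQLSchema, TypeInfo, parse, visit, visitWithTypeInfo } from "graphql";

// Collect "Type.field" schema coordinates from a GraphQL operation document.
function collectSchemaCoordinates(schema: GraphQLSchema, operation: string): string[] {
  const typeInfo = new TypeInfo(schema);
  const coordinates = new Set<string>();
  visit(
    parse(operation),
    visitWithTypeInfo(typeInfo, {
      Field() {
        const parentType = typeInfo.getParentType();
        const fieldDef = typeInfo.getFieldDef();
        if (parentType && fieldDef) {
          coordinates.add(`${parentType.name}.${fieldDef.name}`);
        }
      },
    }),
  );
  return [...coordinates];
}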

Links:

n1ru4l added the enhancement label Nov 16, 2022
n1ru4l changed the title from “Hive as a Persisted Operations/Documents Store” to “hive as a persisted operations/documents store” Nov 16, 2022
kamilkisiela commented
It could also show a complexity score next to each document.

kamilkisiela commented
Could also reject documents with complexity higher than X

n1ru4l commented Nov 28, 2022

S3 could be used as a schema registry

kamilkisiela commented
Yes and Hive should control it all

n1ru4l commented Nov 28, 2022

There is some analytics we could do here as well.

e.g. display how many bytes were saved on client <-> server requests over time by using persisted operations (rough sketch below)
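A rough sketch of the per-request estimate behind such a chart (illustrative only):

import { Buffer } from "node:buffer";

// Upstream bytes saved per request when the client sends a hash instead of
// the full document body; aggregate over usage data to chart savings over time.
function bytesSavedPerRequest(documentText: string, hash: string): number {
  return Buffer.byteLength(documentText, "utf8") - Buffer.byteLength(hash, "utf8");
}

// Example: a 4 KB query replaced by a 64-character SHA-256 hex hash saves roughly 4 KB per request.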

kamilkisiela commented Nov 28, 2022

Plus, some of the data processing (related to the usage reporting pipeline) could be done ahead of time, and the structure of the usage report could be very different and much smaller (and more performant on the user side: no processing of documents involved).

n1ru4l commented Aug 15, 2024

n1ru4l closed this as completed Aug 15, 2024