
hive as a persisted operations/documents store #659

Closed

n1ru4l opened this issue Nov 16, 2022 · 7 comments

Labels: enhancement (New feature or request that adds new things or value to Hive)

n1ru4l commented Nov 16, 2022

Having GraphQL Hive act as a persisted operations store would be great.

Critical user flow

  • publish an app deployment (app name + app version + set of persisted documents) via the Hive CLI
  • use the Hive CLI to mark an app deployment as active (as soon as it is active, its persisted documents should be available via the CDN)
  • the Hive SDK should opt into using the persisted document store from the Hive CDN with a single option, no additional configuration needed (see the sketch after this list)
    • @graphql-hive/envelope
    • @graphql-hive/yoga
    • @graphql-hive/apollo
    • Apollo Router
  • users configure urql, Apollo, and Relay to use persisted documents with Hive, guided by our documentation (which also references the frameworks/libraries for further information)
  • once an app deployment is no longer used → retire the app via the CLI or the Hive app UI (its persisted documents are no longer available via the persisted documents CDN)
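A minimal sketch of what that single opt-in option could look like for @graphql-hive/yoga; the persistedDocuments option name and shape are illustrative assumptions for this proposal, not a shipped API:

import { createYoga } from "graphql-yoga";
import { useHive } from "@graphql-hive/yoga"; // package from the list above; the option shape below is assumed
import { schema } from "./schema";

const yoga = createYoga({
  schema,
  plugins: [
    useHive({
      token: process.env.HIVE_TOKEN!,
      // hypothetical single option: resolve incoming document hashes via the Hive CDN
      persistedDocuments: {
        cdn: {
          endpoint: "https://cdn.graphql-hive.com/artifacts/v1/<target-id>", // placeholder URL
          accessToken: process.env.HIVE_CDN_TOKEN!,
        },
      },
    }),
  ],
});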

Technical Challenges

  • How do we “store”/insert all the operations?
    • We don’t want to send one POST request to the API that includes 1000 GraphQL documents 🤔
    • We should only POST batches (see the sketch after this list)
  • API: We need an async task runner for writing operations from “storage” to “CDN”, and also for cleaning them up when an app deployment gets retired (we cannot send 1000 requests to S3 from within a single HTTP call, can we? :)) - We decided that it makes most sense to write to S3 immediately in the incoming request; we can check whether the deployment is active via another S3 key lookup
  • How do we extract operations with Hive? Do we rely on people providing a JSON file?
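A minimal sketch of the batching idea, assuming a manifest in the common hash → document JSON shape (as emitted by e.g. Relay or GraphQL Code Generator) and a hypothetical upload endpoint:

const BATCH_SIZE = 100; // assumed batch size

async function publishDocuments(
  manifest: Record<string, string>, // operation hash -> GraphQL document
  endpoint: string, // hypothetical Hive API endpoint for an app deployment
  token: string,
): Promise<void> {
  const entries = Object.entries(manifest);
  for (let i = 0; i < entries.length; i += BATCH_SIZE) {
    const batch = Object.fromEntries(entries.slice(i, i + BATCH_SIZE));
    const response = await fetch(endpoint, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({ documents: batch }),
    });
    if (!response.ok) {
      throw new Error(`Batch upload failed with status ${response.status}`);
    }
  }
}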

Nice to have (stretch goals; follow-up tasks)

  • Use persisted documents store for web app
  • Use persisted documents store for Hive CLI
  • HTTP persisted operations store specification (for the ecosystem ™️; not just us)
  • Opt into breaking changes based on active app deployments (usage) instead of operation usage (if one operation of an app deployment is used → removing any schema coordinate within that app deployment is a breaking change, as the app is still active)
  • Because we now know which client and client version executes which operations, we can notice when incoming usage no longer references the operations of a persisted operations set, and thus notify Hive users when a specific version of their product is no longer used / low in usage


Allow users to decide whether to do breaking change detection based on app deployments or usage data.

Delete persisted operation deployment flow

  1. Drop the persisted operation deployment via the UI or CLI
  2. Schedule an async task for actually deleting the persisted operation documents from S3 (see the sketch after this list)
  3. (optional) Release the deployment's schema coordinates for conditional breaking change detection
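A minimal sketch of step 2, assuming the S3 key structure described under "Details" below so that one deployment maps to one key prefix (the AWS SDK v3 calls are real; the prefix layout is this proposal's assumption):

import {
  S3Client,
  ListObjectsV2Command,
  DeleteObjectsCommand,
} from "@aws-sdk/client-s3";

// Async task: remove every persisted document under a retired deployment's prefix.
async function deleteDeploymentDocuments(
  s3: S3Client,
  bucket: string,
  prefix: string,
): Promise<void> {
  let continuationToken: string | undefined;
  do {
    const page = await s3.send(
      new ListObjectsV2Command({
        Bucket: bucket,
        Prefix: prefix,
        ContinuationToken: continuationToken,
      }),
    );
    const objects = (page.Contents ?? []).flatMap((obj) =>
      obj.Key ? [{ Key: obj.Key }] : [],
    );
    if (objects.length > 0) {
      // DeleteObjects accepts up to 1000 keys, matching the ListObjectsV2 page size
      await s3.send(
        new DeleteObjectsCommand({ Bucket: bucket, Delete: { Objects: objects } }),
      );
    }
    continuationToken = page.NextContinuationToken;
  } while (continuationToken);
}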

We must figure out how to incorporate the persisted operation schema coordinates into the hive check/breaking change detection flow (usage data). As long as a persisted operation deployment is active, its fields count as in use, even if there is no data in the retention period. After a persisted operation deployment has been deleted/marked as retired/inactive, the removal of the deployment's schema coordinates is no longer blocked by it.

Based on the usage data, we can notify users when a client version seems unused (e.g. old mobile client).

Documentation

  • We should mainly advertise this as a security and "performance" improvement feature
    • Don't execute arbitrary queries
    • Reduce client -> origin upstream traffic (heavy GraphQL documents being sent over the wire!)

Details

Some ideas on how to store stuff...

S3 Key Structure

Here we write the GraphQL documents as long as the deployment is active - we need to ensure they are removed from S3 once the deployment becomes inactive. Thus, a transactional background job seems inevitable. A write sketch follows the key pattern below.

persisted/{orgId}/{project}/{target}/{client}/{clientVersion}/{operationHash}
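A minimal sketch of writing one document under that key pattern (AWS SDK v3; the bucket name and key layout are assumptions of this proposal):

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Key pattern from above: persisted/{orgId}/{project}/{target}/{client}/{clientVersion}/{operationHash}
function documentKey(
  orgId: string,
  project: string,
  target: string,
  client: string,
  clientVersion: string,
  operationHash: string,
): string {
  return `persisted/${orgId}/${project}/${target}/${client}/${clientVersion}/${operationHash}`;
}

async function writePersistedDocument(key: string, document: string): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: "hive-persisted-documents", // assumed bucket name
      Key: key,
      Body: document,
      ContentType: "application/graphql",
    }),
  );
}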

SQL

CREATE TABLE "persisted_document_deployments" (
  "target_id" uuid NOT NULL,
  "client_name" text NOT NULL,
  "client_version" text NOT NULL,
  "is_active" boolean -- if it is active you should not be able to add new operations to it
);

CREATE TABLE "persisted_documents" (
  "id" uuid,
  "persisted_document_deployment_id" uuid REFERENCES "persisted_document_deployment"("id")
  "hash" text NOT NULL,
  "operation_document" text,
  "document_s3_location" text NOT NULL, -- we should store a reference (in case we at some point have to change the key structure/pattern
  "schema_coordinates" text[], -- see notes
  "created_at" TIMESTAMPTZ NOT NULL DEFAULT NOW() -- this column is most likely unnecessary
);

-- Everything below is not necessarily required for the initial version - but could help with breaking change detection...

CREATE INDEX "persisted_documents_pagination" on "persisted_documents" USING GIN ("schema_coordinates");

-- get list of all operations that are related to a set of schema coordinates
SELECT
  "persisted_documents"."hash"
FROM
  "persisted_documents"
  INNER JOIN
    "persisted_document_deployments"
      ON "persisted_document_deployments"."id" = "persisted_documents"."persisted_document_deployment_id"
WHERE
  "persisted_document_deployments"."is_active" = TRUE
  AND "persisted_documents"."schema_coordinates" && '{A.foo,B.ff}';

When a deployment has been created and "frozen", we could generate the schema coordinate → hash mapping for quick lookups of which operations a schema coordinate impacts (a sketch follows below). 🤔
Alternatively, we can execute the SQL live for each active deployment.
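A minimal sketch of that frozen mapping, assuming each document's hash and schema coordinates are already known at publish time:

// Build a lookup from schema coordinate to the hashes of the persisted
// documents that use it, once a deployment is frozen.
function buildCoordinateIndex(
  documents: Array<{ hash: string; schemaCoordinates: string[] }>,
): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const doc of documents) {
    for (const coordinate of doc.schemaCoordinates) {
      const hashes = index.get(coordinate) ?? [];
      hashes.push(doc.hash);
      index.set(coordinate, hashes);
    }
  }
  return index;
}

// Example: which persisted operations does removing "User.email" affect?
// buildCoordinateIndex(documents).get("User.email") ?? []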

Unsure whether we should store all the schema coordinates used within a document alongside the document. 🤔

Pros:

  • When we introduce usage reporting that only sends the hash of the operation (instead of all the schema coordinates), we don't need to process the whole document with a GraphQL visitor to write data to ClickHouse (a visitor sketch follows below)
  • Could be indexed as a lookup table to find which app deployments are affected by a breaking change...

Cons:

  • We store a lot more data
  • Inserts might become slower
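For reference, a minimal sketch of the visitor-based extraction we would otherwise run for every report, using the graphql package's TypeInfo utilities (this only collects field coordinates, not arguments or enum values):

import { GraphQLSchema, TypeInfo, parse, visit, visitWithTypeInfo } from "graphql";

// Collect "Type.field" schema coordinates from a GraphQL operation document.
function collectSchemaCoordinates(schema: GraphQLSchema, operation: string): string[] {
  const typeInfo = new TypeInfo(schema);
  const coordinates = new Set<string>();
  visit(
    parse(operation),
    visitWithTypeInfo(typeInfo, {
      Field() {
        const parentType = typeInfo.getParentType();
        const fieldDef = typeInfo.getFieldDef();
        if (parentType && fieldDef) {
          coordinates.add(`${parentType.name}.${fieldDef.name}`);
        }
      },
    }),
  );
  return [...coordinates];
}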

Links:

n1ru4l added the enhancement label Nov 16, 2022
n1ru4l changed the title from “Hive as a Persisted Operations/Documents Store” to “hive as a persisted operations/documents store” Nov 16, 2022
kamilkisiela commented
It could also show a complexity score next to each document.

kamilkisiela commented
Could also reject documents with complexity higher than X

n1ru4l commented Nov 28, 2022

S3 could be used as a schema registry

kamilkisiela commented
Yes and Hive should control it all

n1ru4l commented Nov 28, 2022

There is some analytics we could do here as well.

e.g. display how many bytes were saved on client <-> server requests over time by using persisted operations (rough sketch below)
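A rough sketch of the per-request estimate behind such a chart (illustrative only):

import { Buffer } from "node:buffer";

// Upstream bytes saved per request when the client sends a hash instead of
// the full document body; aggregate over usage data to chart savings over time.
function bytesSavedPerRequest(documentText: string, hash: string): number {
  return Buffer.byteLength(documentText, "utf8") - Buffer.byteLength(hash, "utf8");
}

// Example: a 4 KB query replaced by a 64-character SHA-256 hex hash saves roughly 4 KB per request.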

kamilkisiela commented Nov 28, 2022

Plus, some of the data processing (related to the usage reporting pipeline) could be done ahead of time, and the structure of the usage report could be very different and much smaller (and more performant on the user side: no processing of documents involved).

n1ru4l commented Aug 15, 2024

n1ru4l closed this as completed Aug 15, 2024