Implement basic Storage Indexer #814
Conversation
```go
}
logger.Info("Downloaded new search-index-all index", zap.String("index.packages.size", fmt.Sprintf("%d", len(anIndex.Packages))))

refreshedList, err := transformSearchIndexAllToPackages(*anIndex)
```
@jsoriano I was wondering if you could take a look in the middle of the implementation to challenge/comment on the idea.
The Storage Indexer uses a goroutine to regularly hit the cloud bucket for cursor changes. When the cursor changes, the indexer pulls a new index from the bucket and transforms it into packages.Packages.
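For illustration, a minimal sketch of such a polling loop, assuming the cursor lives in a small object in the bucket (object names, field names, and the interval are made up, not the PR's actual code):

```go
// Sketch: periodically read a "cursor" object from the bucket and trigger a
// re-index when its value changes. Names and interval are illustrative only.
package indexer

import (
	"context"
	"io"
	"log"
	"time"

	"cloud.google.com/go/storage"
)

func watchCursor(ctx context.Context, client *storage.Client, bucket, cursorPath string, onChange func(cursor string)) {
	var lastCursor string
	ticker := time.NewTicker(time.Minute) // polling interval is an assumption
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			r, err := client.Bucket(bucket).Object(cursorPath).NewReader(ctx)
			if err != nil {
				log.Printf("reading cursor failed: %v", err)
				continue
			}
			data, err := io.ReadAll(r)
			r.Close()
			if err != nil {
				log.Printf("reading cursor content failed: %v", err)
				continue
			}
			if cursor := string(data); cursor != lastCursor {
				lastCursor = cursor
				onChange(cursor) // here the indexer would pull search-index-all and rebuild packages.Packages
			}
		}
	}
}
```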
I wanted to skip writing yet another model class and depend on json.RawMessage. I will try to apply the same approach to the Package Storage Infra. With the manifests present in the index, the indexer will create a ProxyPackageFileSystem and use packages.NewPackage to build the Package structure (a hackish method, but no need to rewrite the model).
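A rough sketch of the json.RawMessage idea; the envelope and field names are assumptions, not the actual search-index-all schema:

```go
// Sketch: decode only the index envelope and keep each package manifest as raw JSON,
// so no full model of the index has to be written. Field names are assumptions.
package indexer

import "encoding/json"

type searchIndexAll struct {
	Packages []packageIndexEntry `json:"packages"`
}

type packageIndexEntry struct {
	// The raw manifest could later be exposed through a ProxyPackageFileSystem-like
	// shim and fed to packages.NewPackage, instead of redefining the whole model.
	PackageManifest json.RawMessage `json:"package_manifest"`
}

func parseSearchIndexAll(data []byte) (*searchIndexAll, error) {
	var index searchIndexAll
	if err := json.Unmarshal(data, &index); err != nil {
		return nil, err
	}
	return &index, nil
}
```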
I'm not sure what to do about streaming other files. I plan for the ProxyPackageFileSystem to contain the streaming logic so that it can pull content from the public bucket. So far, the hardest part is wiring the spaghetti through packages.NewPackage and ServeFile.
Let me know what you think and please share other ideas.
> The Storage Indexer uses a goroutine to regularly hit the cloud bucket for cursor changes. When the cursor changes, the indexer pulls a new index from the bucket and transforms it into packages.Packages.
Oh, after #813 (comment) I understood that we were going to use some kind of watch API 🙂 If we are going to be polling the cursor, I would think again about using authenticated HTTP instead of the storage-specific client.
> I wanted to skip writing yet another model class and depend on json.RawMessage. I will try to apply the same approach to the Package Storage Infra. With the manifests present in the index, the indexer will create a ProxyPackageFileSystem and use packages.NewPackage to build the Package structure (a hackish method, but no need to rewrite the model).
Umm, the current model we have for packages is quite similar to the model in the index. Have you tried to unmarshal directly to it? Maybe it works. I think I would prefer that to faking a file system over the index JSON.
Even if it doesn't work as-is now, we could still evaluate the changes needed to adapt the current model, or the index, so the index works with the current model in the registry.
If the model in the index cannot be aligned with the model here in the registry, I think I would still prefer to keep both models, unmarshal to the index one, and then convert it to the one in the registry. It may be cumbersome at first, but easier to understand and maintain in the future.
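To make the suggestion concrete, a minimal sketch of what the conversion could look like, building on the raw-index sketch above; the import path and whether packages.Packages holds pointers are assumptions:

```go
// Sketch: decode each manifest from the index straight into the registry's existing
// package model; if the schemas drift apart, fall back to a dedicated index model
// plus an explicit conversion step.
package indexer

import (
	"encoding/json"

	"github.com/elastic/package-registry/packages" // import path is an assumption
)

func transformSearchIndexAllToPackages(index searchIndexAll) (packages.Packages, error) {
	result := make(packages.Packages, 0, len(index.Packages))
	for _, entry := range index.Packages {
		var p packages.Package
		if err := json.Unmarshal(entry.PackageManifest, &p); err != nil {
			return nil, err
		}
		result = append(result, &p) // assumes packages.Packages is a slice of *packages.Package
	}
	return result, nil
}
```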
> I'm not sure what to do about streaming other files. I plan for the ProxyPackageFileSystem to contain the streaming logic so that it can pull content from the public bucket. So far, the hardest part is wiring the spaghetti through packages.NewPackage and ServeFile.
For files in the storage, ideally the registry could answer with a redirect to the resource in GCP, so it doesn't serve files directly. We would need to check, though, whether the HTTP client in current versions of Kibana would follow these redirects.
We may need to add something to the current model to differentiate local from "external" packages; "external" packages wouldn't have a fs builder, but would have some way of translating the file name to URLs and doing the redirect.
If the HTTP client in Kibana follows redirects, I would try to follow this approach: use the current model, but adapt it for "external" packages to serve files with redirects instead of from fake filesystems. This change to serve files with redirects could be done in a separate PR.
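A hedged sketch of how the local vs. "external" split and the redirect could look; the type and field names here are illustrative, not the registry's actual model:

```go
// Sketch: "external" packages carry a URL resolver instead of a filesystem builder,
// so the handler redirects to the public bucket instead of streaming the file itself.
package proxy

import (
	"io"
	"net/http"
)

type servedPackage struct {
	fsBuilder func() (http.FileSystem, error) // set for local packages
	remoteURL func(fileName string) string    // set for "external" packages in the bucket
}

func (p *servedPackage) serveFile(w http.ResponseWriter, r *http.Request, fileName string) {
	if p.fsBuilder == nil {
		// External package: redirect, assuming Kibana's HTTP client follows redirects.
		http.Redirect(w, r, p.remoteURL(fileName), http.StatusSeeOther)
		return
	}
	fs, err := p.fsBuilder()
	if err != nil {
		http.Error(w, "cannot open package", http.StatusInternalServerError)
		return
	}
	f, err := fs.Open("/" + fileName)
	if err != nil {
		http.NotFound(w, r)
		return
	}
	defer f.Close()
	io.Copy(w, f) // local package: stream the file content directly
}
```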
Let me share an update on this:
- I used unmarshalers to load the index fetched from the storage bucket.
- I will focus on streaming package data in a follow-up. Otherwise, the size of this PR would grow too much.
- Re watch API:
  > Should you use Object change notification?
  > Generally, you should not use Object change notification. The recommended tool for generating notifications that track changes to objects in your Cloud Storage buckets is Pub/Sub notifications for Cloud Storage, because it's faster, more flexible, easier to set up, and more cost-effective. For a step-by-step guide to setting up Pub/Sub notifications for Cloud Storage, see Configuring Pub/Sub notifications for Cloud Storage.
I understand that it's recommended to use a Pub/Sub mechanism, but I'm not sure if we want to use another cloud service (stronger coupling with GCP). Do you think that we should rather use a basic HTTP client with OAuth? I admit that using the GCP Go client is really convenient, and we may introduce potential bugs if we go with signed HTTP requests. Is it an attempt to make it universal/storage-independent?
Let me know what you think.
> I admit that using the GCP Go client is really convenient, and we may introduce potential bugs if we go with signed HTTP requests. Is it an attempt to make it universal/storage-independent?
Using more generic HTTP would allow us to use more generic HTTP tooling and libraries, reducing dependencies and possibly simplifying development and testing. We could still use the library to initiate the HTTP client, keep its tokens updated and so on, but the operations could be generic.
A consequence of this would effectively be that the indexer becomes more storage-independent, yes. I wouldn't consider this a priority, or a motivation for it, but it would be a nice to have 🙂
But I agree that if this complicates things, and the GCP library doesn't expose an authenticated client we can reuse, the effort is probably not worth it.
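For illustration, a minimal sketch of that approach, assuming the Go OAuth libraries are used only to build an authenticated *http.Client while the actual requests stay generic (bucket, object, and scope are placeholders):

```go
// Sketch: obtain an authenticated, token-refreshing HTTP client from the Google auth
// libraries, then read an object through plain HTTP against the Cloud Storage JSON API.
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"golang.org/x/oauth2/google"
)

func main() {
	ctx := context.Background()
	httpClient, err := google.DefaultClient(ctx, "https://www.googleapis.com/auth/devstorage.read_only")
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder bucket and object names.
	url := "https://storage.googleapis.com/storage/v1/b/example-bucket/o/cursor?alt=media"
	resp, err := httpClient.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cursor: %s\n", body)
}
```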
Regarding Pub/Sub, we can leave this for a future optimization, if ever needed.
Thanks for the ideas!
Looks good, let's go on with this if the GCP client doesn't expose an authenticated generic HTTP client.
Issue: #767
This PR implements a storage indexer which periodically polls Cloud Storage for cursor updates. When the cursor changes, the indexer fetches the latest search-index-all index and updates the in-memory data. The marshaled index weighs ~21 MB.
Follow-ups:
- main_test.go
- ServeFile to stream file content (HTTP 200, HTTP 303, HTTP 503); see the sketch below
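A possible shape for that ServeFile follow-up, covering the three response codes; names and conditions are illustrative assumptions rather than the planned implementation:

```go
// Sketch: stream local content (200), redirect "external" content to the public
// bucket (303), or report the storage indexer as not ready yet (503).
package proxy

import "net/http"

func serveFileStatus(w http.ResponseWriter, r *http.Request, localPath, remoteURL string, indexReady bool) {
	switch {
	case !indexReady:
		http.Error(w, "storage indexer not ready", http.StatusServiceUnavailable) // HTTP 503
	case localPath != "":
		http.ServeFile(w, r, localPath) // HTTP 200, content streamed from disk
	default:
		http.Redirect(w, r, remoteURL, http.StatusSeeOther) // HTTP 303 to the public bucket
	}
}
```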