
[RFC] OpenSearch.org search functionality #1219

Closed
abbashus opened this issue Sep 6, 2021 · 8 comments
Labels: discuss, enhancement


abbashus commented Sep 6, 2021

Tracking Issue: #696

Overview

We plan to introduce search capabilities for end-user documentation, javadocs, etc. on opensearch.org. The current search functionality is based on a client-side lookup against a local index file. We want to power this search with an OpenSearch cluster (well, why not!) and provide a reference implementation so that other users can reuse it to power their own app search using OpenSearch.
We are looking for feedback and want to discuss this potential solution.

Tenets

  1. Open source first. Users should be able to fork, customize, play with, and contribute to the project.
  2. Easy to set up and manage.

Requirements

  1. Able to search using a simple search query.
  2. Extensible: Should support searching over different assets like blogs, webinars, javadocs, etc. besides documentation.
  3. Granularity: Users should have the flexibility to search across multiple components (OpenSearch, OpenSearch Dashboards, plugins) or just a specific component, e.g., search only the Alerting plugin docs.
  4. Fast: Should return results with low latency.
  5. Availability: No downtime when upgrading the infra or changing configurations. Resilience against outages.
  6. Consistent experience across geographies: The search experience should stay consistent across different geographies. Users from different geographic regions should experience the same latency and results.
  7. Security:
    a. The search endpoint should be read-only
    b. Prevent users from running broad queries that can overload the cluster
    c. Protection against denial-of-service attacks: legitimate user queries should go through, while those from robots should be throttled and/or blocked
  8. Monitoring: We need a mechanism to monitor the health and functionality of the OpenSearch cluster, create alarms, and take action based on those alarms to prevent downtime and degradation
  9. Deployment: Easy to scale the backend and upgrade to new OpenSearch versions
  10. Analytics: Collect metrics on what the popular search queries are and which features users are most interested in
  11. Other good-to-have requirements [optional]:
    a. Auto-suggestion: Should support suggestions in real time as the user is typing
    b. Keyword highlight in results

How would it work?

  1. Search indexes will be built at website build time in the CI pipeline. This can be done for every commit or periodically at a given time.
  2. The built indexes will be ingested into the OpenSearch cluster.
  3. The JavaScript code executing in the browser will interact with a frontend that translates the search query into DSL, queries OpenSearch with this DSL, and fetches the results. The JavaScript then renders the results for the user on the webpage (a minimal sketch of this flow is shown below).
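
For illustration, here is a minimal sketch of the browser-side call in TypeScript. The endpoint URL, query parameters, and response shape are assumptions for this sketch, not the final contract of the reference implementation.

```typescript
// Hedged sketch: the /search endpoint and response shape are illustrative.
interface SearchHit {
  title: string;
  url: string;
  excerpt: string;
}

export async function searchDocs(
  query: string,
  component?: string
): Promise<SearchHit[]> {
  const params = new URLSearchParams({ q: query });
  if (component) params.set("component", component); // e.g. "alerting"

  // Hypothetical read-only endpoint fronted by API Gateway + CloudFront.
  const resp = await fetch(`https://search.example.org/search?${params}`);
  if (!resp.ok) throw new Error(`search failed with HTTP ${resp.status}`);
  return (await resp.json()) as SearchHit[];
}
```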

Proposed Architecture

A common way to create a search application with OpenSearch is to use web forms to send user queries to a server. You can then authorize the server to call the OpenSearch APIs directly and have the server send requests to the OpenSearch cluster.

We could write client-side code that doesn't rely on a server; however, we would have to compensate for the security and performance risks. Allowing unsigned, public access to the OpenSearch APIs is inadvisable. Users might access unsecured endpoints or impact cluster performance through overly broad queries (or too many queries). For example, see #687 and #1078.

NOTE: We will be using AWS for infrastructure, but users are free to use other cloud providers that provide similar functionality or use self-hosted solutions.

Based on the above requirements, we have come up with a minimal set of components to achieve the search functionality.

[architecture diagram]

Components

1. OpenSearch cluster

The OpenSearch cluster will comprise a set of data nodes and master nodes, each inside their own auto-scaling group. The nodes will be hosted on EC2. Auto-scaling groups (ASGs) will protect against node drops by replacing a failed node with a new one.
To protect against availability zone (AZ) outages, all nodes within an ASG will be spread across at least 3 AZs. This configuration requires at least 3 data nodes and 3 master nodes.
A Network Load Balancer (NLB) will balance incoming requests across the data nodes and provide a single endpoint to access the cluster. All the resources will be created in a private subnet, and thus not reachable over the internet.

2. AWS API Gateway

  • Allows for exposing different endpoints:
    • A read-only API for search with CORS enabled, which obfuscates the OpenSearch endpoint (7.a)
    • A read/write API for indexing and performing cluster operations
  • Support for throttling (7.c)
  • Mitigation of DDoS by using a CloudFront distribution with WAF instead of an edge-optimized endpoint
  • Provides useful API metrics: API call count, rate, latency, etc. (a hedged CDK sketch of this gateway setup follows)
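
A hedged AWS CDK (TypeScript) sketch of this part of the setup; construct names, throttling limits, and the `searchFn` Lambda reference are illustrative, not the published artifact.

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import * as apigateway from "aws-cdk-lib/aws-apigateway";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

interface SearchApiProps extends StackProps {
  searchFn: lambda.IFunction; // the search Lambda, defined elsewhere
}

export class SearchApiStack extends Stack {
  constructor(scope: Construct, id: string, props: SearchApiProps) {
    super(scope, id, props);

    // Regional endpoint so a CloudFront distribution (with WAF) can sit in front.
    const api = new apigateway.RestApi(this, "SearchApi", {
      endpointConfiguration: { types: [apigateway.EndpointType.REGIONAL] },
      deployOptions: {
        // Stage-level throttling; the exact limits here are placeholders.
        throttlingRateLimit: 50,
        throttlingBurstLimit: 100,
      },
      defaultCorsPreflightOptions: {
        allowOrigins: ["https://opensearch.org"],
        allowMethods: ["GET", "OPTIONS"],
      },
    });

    // Read-only search resource backed by the search Lambda.
    const search = api.root.addResource("search");
    search.addMethod("GET", new apigateway.LambdaIntegration(props.searchFn));
  }
}
```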

3. AWS Lambda (for search)

  • Since we would be exposing a custom search API, we introduce a search AWS Lambda function that will act as middleware, translating the user's search query into OpenSearch query DSL (see the sketch after this list).
  • We can't expose credentials in the open on website pages via persistent web storage such as cookies, yet we need a way to inject scoped credentials for making search queries to OpenSearch. We will pass them as environment variables to the Lambda function.
  • Lambda has a cold-start problem; to keep search latency low, we plan to use Provisioned Concurrency to reduce initialization latency and make sure the function is always warm.
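
A minimal sketch of such a middleware handler, assuming Node.js 18+ (global fetch), the aws-lambda typings, and illustrative environment variable names (OPENSEARCH_ENDPOINT, SEARCH_USER, SEARCH_PASSWORD) and index naming.

```typescript
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const q = event.queryStringParameters?.q ?? "";
  const component = event.queryStringParameters?.component;

  // Translate the user's plain query into OpenSearch query DSL, capping the
  // result size so overly broad queries cannot overload the cluster.
  const dsl = {
    size: 10,
    query: { multi_match: { query: q, fields: ["title^2", "content"] } },
  };

  // Search a component-specific index when requested, otherwise all docs.
  const index = component ? `docs-${component}` : "docs-*";

  // Scoped read-only credentials injected via environment variables.
  const auth = Buffer.from(
    `${process.env.SEARCH_USER}:${process.env.SEARCH_PASSWORD}`
  ).toString("base64");

  const resp = await fetch(
    `${process.env.OPENSEARCH_ENDPOINT}/${index}/_search`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Basic ${auth}`,
      },
      body: JSON.stringify(dsl),
    }
  );

  const body: any = await resp.json();
  return {
    statusCode: resp.status,
    headers: { "Access-Control-Allow-Origin": "https://opensearch.org" },
    body: JSON.stringify(body.hits?.hits ?? []),
  };
};
```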

4. AWS Lambda (for monitoring)
A monitoring Lambda function will allow us to continuously monitor the OpenSearch cluster for problems like red cluster health, master issues, missing search indices, and failing search sanity checks. This function will be triggered periodically by CloudWatch Events (typically every minute) and emit CloudWatch metrics, which will then be used to create CloudWatch alarms. The user can take appropriate action based on those alarms.
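
A hedged sketch of what such a health-check function could look like, assuming AWS SDK v3, Node.js 18+ (global fetch), and illustrative environment variable names and metric namespace.

```typescript
import {
  CloudWatchClient,
  PutMetricDataCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({});

export const handler = async (): Promise<void> => {
  const auth = Buffer.from(
    `${process.env.MONITOR_USER}:${process.env.MONITOR_PASSWORD}`
  ).toString("base64");

  // Probe cluster health; anything other than "green" counts as unhealthy.
  let healthy = 0;
  try {
    const resp = await fetch(
      `${process.env.OPENSEARCH_ENDPOINT}/_cluster/health`,
      { headers: { Authorization: `Basic ${auth}` } }
    );
    const health: any = await resp.json();
    healthy = resp.ok && health.status === "green" ? 1 : 0;
  } catch {
    healthy = 0;
  }

  // Emit a 0/1 metric that a CloudWatch alarm can watch.
  await cloudwatch.send(
    new PutMetricDataCommand({
      Namespace: "SearchService", // illustrative namespace
      MetricData: [{ MetricName: "ClusterHealthy", Value: healthy, Unit: "Count" }],
    })
  );
};
```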

5. VPCLink
This feature allows us to map API Gateway endpoints to the NLB and acts as a proxy. We use this for the ingestion/operations path. With HTTP proxy integration, API Gateway passes the entire request and response between the frontend and the backend.
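
A hedged CDK sketch of the VPCLink wiring for the operations path; the RestApi, the NLB, and the resource names are assumed to be created elsewhere in the CDK app and are illustrative.

```typescript
import * as apigateway from "aws-cdk-lib/aws-apigateway";
import * as elbv2 from "aws-cdk-lib/aws-elasticloadbalancingv2";
import { Construct } from "constructs";

// Wires the read/write operations path through a VPC link to the internal NLB.
export function addOperationsProxy(
  scope: Construct,
  api: apigateway.RestApi,
  nlb: elbv2.INetworkLoadBalancer
): void {
  const link = new apigateway.VpcLink(scope, "ClusterVpcLink", {
    targets: [nlb],
  });

  // HTTP proxy integration: API Gateway forwards the whole request/response.
  const integration = new apigateway.Integration({
    type: apigateway.IntegrationType.HTTP_PROXY,
    integrationHttpMethod: "ANY",
    options: {
      connectionType: apigateway.ConnectionType.VPC_LINK,
      vpcLink: link,
    },
    // Placeholder URI; in practice this points at the NLB listener.
    uri: `http://${nlb.loadBalancerDnsName}`,
  });

  const ops = api.root.addResource("ops");
  ops.addMethod("ANY", integration);
}
```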

6. Security, Authentication and Authorization
Security is the topmost priority. We will use the OpenSearch security plugin to manage controlled access to the cluster.
For the initial implementation we will use basic HTTP auth (username, password) based authentication. We will create individual users based on the use cases below, scoped to a limited set of OpenSearch APIs using roles (a sketch of the role setup follows the list).
Since all the Lambda functions, the NLB, and the nodes are in a private subnet, we will be disabling HTTPS on the security plugin.

  1. Search (Read): For the search request/response flow, no auth credentials will be passed from the browser. To restrict which indices can be searched, basic HTTP auth credentials will be injected into the Lambda function via environment variables. This user will be mapped to backend roles that are read-only and limited to certain indices (AuthC). Rotating credentials for search would require updating them both on the Lambda function and on the backend, which would disrupt live traffic and not meet our availability goals. These credentials will be seldom rotated or not rotated at all.
  2. Operations/Ingestion (Read/Write): A dedicated user for ingestion/operations. The credentials for this user will be rotated on a periodic basis.
  3. Monitoring (Read): A user to probe the cluster for health checks. Credentials for this user will be seldom rotated or not rotated at all.
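
For illustration, a hedged sketch of how the read-only search role and its user mapping could be bootstrapped via the security plugin's REST API; the role name, user name, and index pattern are placeholders, and the admin credentials are assumed to come from Secrets Manager.

```typescript
// Creates a read-only role limited to the docs indices and maps the dedicated
// search user to it. Names and patterns below are illustrative.
export async function bootstrapSearchRole(
  endpoint: string,
  adminAuth: string // base64-encoded "admin:password"
): Promise<void> {
  const headers = {
    "Content-Type": "application/json",
    Authorization: `Basic ${adminAuth}`,
  };

  // Role limited to read actions on the docs-* indices.
  await fetch(`${endpoint}/_plugins/_security/api/roles/docs_search_read`, {
    method: "PUT",
    headers,
    body: JSON.stringify({
      cluster_permissions: [],
      index_permissions: [
        { index_patterns: ["docs-*"], allowed_actions: ["read"] },
      ],
    }),
  });

  // Map the dedicated search user to that role.
  await fetch(
    `${endpoint}/_plugins/_security/api/rolesmapping/docs_search_read`,
    {
      method: "PUT",
      headers,
      body: JSON.stringify({ users: ["search_user"] }),
    }
  );
}
```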

7. AWS Secrets Manager
We will use AWS Secrets Manager to create and rotate credentials. The credentials will be encrypted at rest using an AWS KMS key. Only certain users, with restricted IAM policies attached, can fetch those secrets.
Apart from the security plugin credentials, we will store the private keys used to SSH into the data nodes and master nodes.
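
For example, a minimal sketch of reading a rotated credential at runtime with AWS SDK v3; the secret name and its JSON shape are hypothetical.

```typescript
import {
  SecretsManagerClient,
  GetSecretValueCommand,
} from "@aws-sdk/client-secrets-manager";

const secrets = new SecretsManagerClient({});

// Fetches the ingestion user's credentials; the secret is assumed to store a
// JSON object with "username" and "password" fields.
export async function getIngestionCredentials(): Promise<{
  username: string;
  password: string;
}> {
  const out = await secrets.send(
    new GetSecretValueCommand({ SecretId: "search-service/ingestion-user" })
  );
  return JSON.parse(out.SecretString ?? "{}");
}
```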

8. Logging
In order for the service team to properly monitor the service for security and operational issues, and for security teams to investigate issues, logs must be recorded and stored appropriately. Relevant logs should be built intentionally with security use cases (monitoring and investigations) in mind. Logs are both consumed by automation and read by humans, and should address the needs of both.

We will enable logging in the following places:

From Where                                                   To Where
API Gateway (execution and access logs)                      CloudWatch Logs
AWS Lambda (search and monitoring)                           CloudWatch Logs
NLB logs                                                     S3 bucket
OpenSearch logs                                              CloudWatch Logs
Audit logs on the OpenSearch cluster (security plugin)       CloudWatch Logs
Security logs from bastion hosts (commands that are run)     CloudWatch Logs

9. Bastion Hosts
For operations, users may require access to the nodes. Since all the cluster nodes are in a private subnet, we need bastion hosts that limit access to the SSH port to restricted IP ranges (typically a corporate VPN). This will be done by setting appropriate rules via Security Groups (SGs).
We will create one bastion host in each of at least three AZs to safeguard against AZ outages.

Artifacts

As a part of this implementation we will provide users with AWS CDK code (infra as code) to spin up their own infra, along with other auxiliary tools to set up and manage that infra. These artifacts will be hosted in a separate public GitHub repository. (Don't forget the tenets 😉)

Future Enhancements

  1. Client-certificate-based authentication instead of basic HTTP auth. The key problem to solve here is the PKI infra needed to create, sign, and rotate certificates.
  2. The OpenSearch cluster could be cross-region replicated using the CCR plugin, offering a CDN-like experience for faster searches based on geography. Alternatively, if not using CCR, we can choose to index into multiple clusters during the website build.

stockholmux commented Sep 7, 2021

Shouldn't this RFC be in opensearch-project/project-website?


adityaj1107 commented Sep 7, 2021

This design doc is a good example of a deep dive. I have a couple of high-level comments/questions:

  • Can we weigh in on a comparison with AWS CodeDeploy?
  • Can we see if we want to deploy a container-based OS cluster instead of deploying it on vanilla EC2 instances? Containers require less memory management, as they consume fewer system resources and don't include OS images.
  • To support supervising the OS process on each node (container?), we might require a database which persists the status of the health check performed for each process. Also, do you think we need a supervisor to do that kind of health check on each node (or for each application deployed in a container)?


CEHENKLE commented Sep 7, 2021

Shouldn't this RFC be in opensearch-project/project-website?

We generally get more eyeballs on this repo, and since I wanted this to be a reference implementation for OpenSearch, I wanted eyes that would be here on it :) But happy to move it over in a bit, because you're right, that's where the rubber will hit the road.

Thanks!
/C

@stockholmux

@CEHENKLE It's also being tracked on project-website #229. I understand the 'higher profile' but, to a degree, this sits between several repos.

From the perspective of someone trying to find or comment on it, I think it belongs in project-website or documentation-website.

@stockholmux

Architecturally, this is pretty Amazon centric. For someone to recreate this service they would pretty much have to be on AWS. This seems to defeat tenet #1.

I think there are some very interesting things that are in neither the diagram nor the description. How do you get search queries to this search service? How do you take web pages and extract content from them for OpenSearch, etc.?

Also, am I correct that the only real code here lies in Lambda and the rest is configuration?


adityaj1107 commented Sep 7, 2021

Architecturally, this is pretty Amazon centric. For someone to recreate this service they would pretty much have to be on AWS. This seems to defeat tenet #1.

This repository provides a lot of resources which use AWS to create and manage different products. If we add investment to this proposal and get it added to AWS, it will be another feather in the cap of OpenSearch.


abbashus commented Sep 7, 2021

Architecturally, this is pretty Amazon centric. For someone to recreate this service they would pretty much have to be on AWS. This seems to defeat tenet #1.

Open source in #1 implies that all the CDK code and other code artifacts will be public. Users are free to use other cloud providers that provide similar functionality or to use self-hosted solutions.

How do you get search queries to this search service?

Clients (browsers) use the exposed search endpoint (provided by API Gateway) and render results based on the response from the search service.

How do you take web pages and extract content from them for OpenSearch, etc.?

It is left to the user how to extract content from their website and ingest it into the cluster (general idea). They can either crawl the web pages and extract content, or directly use the content if they have the source.
We will be using the latter approach.
The project website is currently built via AWS CodePipeline. We have a Jekyll plugin that outputs a JSON file with the docs. We plan to add another step in the pipeline which uses a Lambda function that takes this JSON and bulk-ingests it into the search service (a rough sketch is below). This will be done on every commit on the repo branch we are tracking.
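
For illustration, a rough sketch of that ingestion step in TypeScript, assuming the Jekyll plugin emits a JSON array of documents with url/title/content fields; the field names, environment variables, and index handling here are assumptions.

```typescript
interface Doc {
  url: string;
  title: string;
  content: string;
}

// Posts the documents to the cluster through the read/write path using the
// _bulk API, with the ingestion user's credentials from env vars.
export async function bulkIngest(docs: Doc[], index: string): Promise<void> {
  const auth = Buffer.from(
    `${process.env.INGEST_USER}:${process.env.INGEST_PASSWORD}`
  ).toString("base64");

  // Newline-delimited action/document pairs, as the _bulk API expects.
  const body =
    docs
      .map((d) =>
        [
          JSON.stringify({ index: { _index: index, _id: d.url } }),
          JSON.stringify(d),
        ].join("\n")
      )
      .join("\n") + "\n";

  const resp = await fetch(`${process.env.OPENSEARCH_ENDPOINT}/_bulk`, {
    method: "POST",
    headers: {
      "Content-Type": "application/x-ndjson",
      Authorization: `Basic ${auth}`,
    },
    body,
  });
  if (!resp.ok) throw new Error(`bulk ingest failed with HTTP ${resp.status}`);
}
```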

Also, am I correct that the only real code here lies in Lambda and the rest is configuration?

Mostly true.


abbashus commented Sep 7, 2021

I think there is some very interesting things that are not in the diagram nor description. How do you get search queries to this search service? How do you take web pages and extract content from them for OpenSearch, etc.?

Thanks @stockholmux. I will add more details/diagrams to showcase those parts.
