Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation Plan: Staging Elasticsearch Reindex DAGs #2358

Merged
merged 19 commits into from
Jul 14, 2023

Conversation

krysal
Copy link
Member

@krysal krysal commented Jun 8, 2023

Fixes

Resolves #1987 by @AetherUnbound

Assigned reviewers

Current round

Note
This discussion is following the Openverse decision-making process. Information about this process can be found
on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

The discussion is currently in the Revision round.

The deadline for review of this round is 2023-06-27.

@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs skip-changelog 🧱 stack: documentation Related to Sphinx documentation 🧭 project: implementation plan An implementation plan for a project labels Jun 8, 2023
@AetherUnbound AetherUnbound mentioned this pull request Jun 9, 2023
16 tasks
@krysal krysal self-assigned this Jun 9, 2023
@krysal krysal force-pushed the rfc/staging_es_reindex_dags branch from 720ae1e to a65144a Compare June 20, 2023 18:51
@krysal krysal marked this pull request as ready for review June 20, 2023 19:43
@krysal krysal requested review from a team as code owners June 20, 2023 19:43
@krysal krysal requested a review from stacimc June 20, 2023 19:43
@krysal krysal force-pushed the rfc/staging_es_reindex_dags branch 2 times, most recently from 54e14d4 to d6cf421 Compare June 20, 2023 20:01
@krysal krysal changed the title Implementation Plan: Staging Elasticsearch Reindexing DAGs Implementation Plan: Staging Elasticsearch Reindex DAGs Jun 20, 2023
Copy link
Collaborator

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far, I left some clarifying questions, some suggestions, and some potential alternatives. But this looks great so far 🙂

Comment on lines 128 to 164
"query": {
"term": {
"source.keyword": "stocksnap"
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried this locally to see how the resulting query orders the documents? For these proportional indexes, I wonder if using the random score function would help us get a statistically significant sample 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try it locally but didn't pay attention to the order, I didn't think it was important. How would you define a "statistically significant sample" for this case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

statistically significant sample

This isn't the term I should have used. More accurately would be "more diverse sample". If an individual provider's document's scores are consistent, then we'd be testing the same 10% (or whatever proportion we use) each time the reindex happens. Using a random score would make it more likely to have a diverse set of document configurations (file types, sizes, etc). It may not be an issue if the default scores are already sufficiently random and distribute attribute variation proportionally. Random score would theoretically make it more likely for the subset of documents to be representative of the whole if the default scores are uniform (which I don't know).

Comment on lines 183 to 246
Elasticsearch does not impose any limit on the amount of indices one can create
but naturally they come with a cost. We don't have policies for creating or
deleting indices for the time being so we should monitor if we reach a point
where having many indexes impact the cluster performance.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a disk usage monitor for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more about a scenario where having many indices degrades the overall cluster response time. I read something around these lines in the Elastic forum. But that is a good suggestion as well. I believe something similar is included in the ECS alarms proposal, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EC2 instances need the same treatment that the catalogue got to add the CloudWatch agent to get disk space reporting: https://github.com/WordPress/openverse-infrastructure/pull/455

The bigger question, I suppose, is how we would measure/notice degraded staging performance, or if we would need to at all. The worst case is: we notice performance degrades through regular usage and then delete some indexes. Do we need any special monitoring at all, in that case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume not, given that "regular usage" for staging is primarily our own interactions with it rather than direct user interactions.

@AetherUnbound
Copy link
Collaborator

I will be looking at this IP this week! Sorry for the delay due to my AFK.

@stacimc
Copy link
Collaborator

stacimc commented Jun 27, 2023

@krysal -- I'm happy to review, but just clarifying whether I'm a required reviewer on this, as I'm not named in the PR?

@krysal
Copy link
Member Author

krysal commented Jun 29, 2023

I'll address @sarayourfriend's comments today! Sorry for the delay, other urgent matters took precedence in my calendar these days.

@AetherUnbound No worries at all!

@stacimc Yours was an automated assignation (probably because Sara isn't in the @WordPress/openverse-catalog group) but you're welcome to comment as well if you want. I am sure your ideas will improve the proposed plan :)

@krysal krysal force-pushed the rfc/staging_es_reindex_dags branch from d6cf421 to b7900a6 Compare June 29, 2023 17:14
Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for drafting this Krystle, and apologies it took me so long to review! I'm in agreement with Sara on a few points of clarification, along with some additional explicit steps being added.


```json
{
"max_docs": num_items,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, this rocks!

@krysal krysal force-pushed the rfc/staging_es_reindex_dags branch 2 times, most recently from 0894743 to 37b07f1 Compare June 30, 2023 01:54
@github-actions
Copy link

Full-stack documentation: https://docs.openverse.org/_preview/2358

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@AetherUnbound
Copy link
Collaborator

Ah, that's a really important point you've raised @sarayourfriend. I don't think any of our current plans cover the filtered indices in staging actually 😮 That might be because both this project and the filtering of results on the API were being proposed alongside each other, so the latter hadn't been finalized while we were planning out the former. As it stands, I don't believe we have an explicit mechanism defined for creating the filtered index on staging. Given that this is not production though, perhaps it would make sense to filter the results while building the <media_type>-full index from the database? That would mean all downstream results based on that index are always filtered, and we don't have to worry about managing both the unfiltered & filtered indices on staging. What do you think @krysal?

We've had a lot of discussion about this IP, I wonder if it might make sense to merge it as-is given it's quite close, with a fast-follow clarification about how we'd manage the filtering (given we won't be actually working on the implementation of this for a little bit since the project is on hold). Would you prefer that Krystle?

And as for the automatic deletion, those are great points. I don't think we need to incorporate that aspect of it in the proposal here - we can leave it as an MSR task or something similar!

@krysal
Copy link
Member Author

krysal commented Jul 13, 2023

Given that this is not production though, perhaps it would make sense to filter the results while building the <media_type>-full index from the database? That would mean all downstream results based on that index are always filtered, and we don't have to worry about managing both the unfiltered & filtered indices on staging. What do you think @krysal?

@AetherUnbound The current way of building the filtered index is using the ES's Reindex API, it's not exactly a direct process form from the database. That would require creating a new process but I think we can just trigger the DAG for creating the filtered index as an additional step for building the <media_type>-full index. I'd prefer to have both indexes, the <media_type>-full and <media_type>-filtered, to provide more flexibility and work with the same indexes we have for production. What I am not sure about is if the DAG to create the filtered index is enabled for both environments or if it only applies to production. In the latter case, it would require some modifications. But it would be a really nice addition and actually closer to the part of the data refresh process that we're extracting here 😄

We've had a lot of discussion about this IP, I wonder if it might make sense to merge it as-is given it's quite close, with a fast-follow clarification about how we'd manage the filtering (given we won't be actually working on the implementation of this for a little bit since the project is on hold). Would you prefer that Krystle?

Agree on merging this, as the filtered index is a requirement that was not initially included. I'll change the default source to the filtered index for the subset-by-provider DAG, as it's an easy change. Thank you both for the excellent suggestions.

@AetherUnbound
Copy link
Collaborator

Sounds good! Yes unfortunately the filtered index DAG is only enabled for production, so we'd need to either have a similar DAG factory for it or set the environment as a DAG parameter (though I think our team preference is for the former to prevent accidental production operations).

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks for all our hard work on this!

@krysal krysal force-pushed the rfc/staging_es_reindex_dags branch from 6159b34 to f2224a5 Compare July 13, 2023 23:39
@krysal krysal force-pushed the rfc/staging_es_reindex_dags branch from f2224a5 to 75fb276 Compare July 13, 2023 23:41
@openverse-bot
Copy link
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 6 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)2.

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@krysal
Copy link
Member Author

krysal commented Jul 14, 2023

Thank you all!

@krysal krysal merged commit 3a34d33 into main Jul 14, 2023
@krysal krysal deleted the rfc/staging_es_reindex_dags branch July 14, 2023 00:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project skip-changelog 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation
Projects
Status: Accepted
Archived in project
Development

Successfully merging this pull request may close these issues.

Implementation Plan: Staging Elasticsearch reindex DAGs for both potential index types
5 participants