Implementation Plan: Staging Elasticsearch Reindex DAGs #2358

krysal · 2023-06-08T23:16:51Z

Fixes

Resolves #1987 by @AetherUnbound

Assigned reviewers

@AetherUnbound (Project lead)
@sarayourfriend (for Elasticsearch expertise and familiarity with the proposed implementation)

Current round

Note
This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

The discussion is currently in the Revision round.

The deadline for review of this round is 2023-06-27.

sarayourfriend

LGTM so far, I left some clarifying questions, some suggestions, and some potential alternatives. But this looks great so far 🙂

sarayourfriend · 2023-06-20T23:02:39Z

.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md

+    "query": {
+      "term": {
+        "source.keyword": "stocksnap"
+      }
+    }


Have you tried this locally to see how the resulting query orders the documents? For these proportional indexes, I wonder if using the random score function would help us get a statistically significant sample 🤔

I tried it locally but didn't pay attention to the order; I didn't think it was important. How would you define a "statistically significant sample" for this case?

statistically significant sample

This isn't the term I should have used. A more accurate term would be "more diverse sample". If an individual provider's documents' scores are consistent, then we'd be testing the same 10% (or whatever proportion we use) each time the reindex happens. Using a random score would make it more likely to have a diverse set of document configurations (file types, sizes, etc.). It may not be an issue if the default scores are already sufficiently random and distribute attribute variation proportionally. Random score would theoretically make it more likely for the subset of documents to be representative of the whole if the default scores are uniform (which I don't know).
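A minimal sketch of the random-score idea, assuming the request body shape from the plan (`max_docs` plus a `term` query on `source.keyword`). The function name, the index names, and the seed are hypothetical. Whether the Reindex API actually honours score order when it stops at `max_docs` is an assumption that would need to be verified locally, since ordering in reindex is not guaranteed.

```python
def build_random_sample_reindex_body(source_index, dest_index, provider, num_items, seed):
    """Build a Reindex API request body that samples one provider's documents.

    Hypothetical helper: wraps the provider `term` query in a `function_score`
    query so each matching document receives a pseudo-random score.
    """
    return {
        # Stop after this many documents have been copied.
        "max_docs": num_items,
        "source": {
            "index": source_index,
            "query": {
                "function_score": {
                    # Same provider filter as in the plan.
                    "query": {"term": {"source.keyword": provider}},
                    # A fixed seed and field make the random scores reproducible.
                    "random_score": {"seed": seed, "field": "_seq_no"},
                    # Use only the random score, ignoring the term query's score.
                    "boost_mode": "replace",
                }
            },
        },
        "dest": {"index": dest_index},
    }
```

Changing the seed between runs would be one way to get a different subset on each reindex, if that turns out to be desirable.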

sarayourfriend · 2023-06-20T23:11:38Z

.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md

+Elasticsearch does not impose any limit on the amount of indices one can create
+but naturally they come with a cost. We don't have policies for creating or
+deleting indices for the time being so we should monitor if we reach a point
+where having many indexes impact the cluster performance.


We could add a disk usage monitor for this.

I was thinking more about a scenario where having many indices degrades the overall cluster response time. I read something along these lines in the Elastic forum. But that is a good suggestion as well. I believe something similar is included in the ECS alarms proposal, right?

EC2 instances need the same treatment that the catalogue got to add the CloudWatch agent to get disk space reporting: https://github.com/WordPress/openverse-infrastructure/pull/455

The bigger question, I suppose, is how we would measure/notice degraded staging performance, or if we would need to at all. The worst case is: we notice performance degrades through regular usage and then delete some indexes. Do we need any special monitoring at all, in that case?

I would assume not, given that "regular usage" for staging is primarily our own interactions with it rather than direct user interactions.

AetherUnbound · 2023-06-27T03:31:02Z

I will be looking at this IP this week! Sorry for the delay due to my AFK.

stacimc · 2023-06-27T17:15:18Z

@krysal -- I'm happy to review, but just clarifying whether I'm a required reviewer on this, as I'm not named in the PR?

krysal · 2023-06-29T14:21:32Z

I'll address @sarayourfriend's comments today! Sorry for the delay, other urgent matters took precedence in my calendar these days.

@AetherUnbound No worries at all!

@stacimc Yours was an automatic assignment (probably because Sara isn't in the @WordPress/openverse-catalog group) but you're welcome to comment as well if you want. I am sure your ideas will improve the proposed plan :)

AetherUnbound

Thanks for drafting this Krystle, and apologies it took me so long to review! I'm in agreement with Sara on a few points of clarification, along with some additional explicit steps being added.

AetherUnbound · 2023-06-29T19:34:11Z

.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md

+
+```json
+{
+  "max_docs": num_items,


Definitely, this rocks!
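For context, a sketch of how a per-provider `num_items` value could be derived for the proportional index, assuming document counts per provider are already known. The function name and the idea of passing the proportion as a whole-number percentage are assumptions for illustration, not something the plan specifies.

```python
def proportional_max_docs(provider_counts, percentage):
    """Return the number of documents to reindex for each provider.

    Hypothetical helper: `provider_counts` maps provider name to its document
    count in the source index, and `percentage` is a whole number from 1 to 100.
    """
    if not 1 <= percentage <= 100:
        raise ValueError("percentage must be between 1 and 100")
    return {
        # Integer ceiling division, so small providers keep at least one document.
        provider: (count * percentage + 99) // 100
        for provider, count in provider_counts.items()
        if count > 0
    }
```

Each resulting value would be passed as `max_docs` in that provider's reindex request.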

github-actions · 2023-06-30T02:09:38Z

Full-stack documentation: https://docs.openverse.org/_preview/2358

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

AetherUnbound · 2023-07-13T00:10:47Z

Ah, that's a really important point you've raised @sarayourfriend. I don't think any of our current plans cover the filtered indices in staging actually 😮 That might be because both this project and the filtering of results on the API were being proposed alongside each other, so the latter hadn't been finalized while we were planning out the former. As it stands, I don't believe we have an explicit mechanism defined for creating the filtered index on staging. Given that this is not production though, perhaps it would make sense to filter the results while building the <media_type>-full index from the database? That would mean all downstream results based on that index are always filtered, and we don't have to worry about managing both the unfiltered & filtered indices on staging. What do you think @krysal?

We've had a lot of discussion about this IP, I wonder if it might make sense to merge it as-is given it's quite close, with a fast-follow clarification about how we'd manage the filtering (given we won't be actually working on the implementation of this for a little bit since the project is on hold). Would you prefer that Krystle?

And as for the automatic deletion, those are great points. I don't think we need to incorporate that aspect of it in the proposal here - we can leave it as an MSR task or something similar!

krysal · 2023-07-13T18:22:51Z

Given that this is not production though, perhaps it would make sense to filter the results while building the <media_type>-full index from the database? That would mean all downstream results based on that index are always filtered, and we don't have to worry about managing both the unfiltered & filtered indices on staging. What do you think @krysal?

@AetherUnbound The filtered index is currently built using ES's Reindex API, so it's not exactly a direct process from the database. Filtering while building from the database would require creating a new process, but I think we can just trigger the DAG for creating the filtered index as an additional step when building the <media_type>-full index. I'd prefer to have both indexes, the <media_type>-full and <media_type>-filtered, to provide more flexibility and work with the same indexes we have for production. What I am not sure about is whether the DAG to create the filtered index is enabled for both environments or only applies to production. In the latter case, it would require some modifications. But it would be a really nice addition and actually closer to the part of the data refresh process that we're extracting here 😄

We've had a lot of discussion about this IP, I wonder if it might make sense to merge it as-is given it's quite close, with a fast-follow clarification about how we'd manage the filtering (given we won't be actually working on the implementation of this for a little bit since the project is on hold). Would you prefer that Krystle?

Agree on merging this, as the filtered index is a requirement that was not initially included. I'll change the default source to the filtered index for the subset-by-provider DAG, as it's an easy change. Thank you both for the excellent suggestions.

AetherUnbound · 2023-07-13T20:03:40Z

Sounds good! Yes, unfortunately the filtered index DAG is only enabled for production, so we'd need to either have a similar DAG factory for it or set the environment as a DAG parameter (though I think our team preference is for the former to prevent accidental production operations).
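A rough sketch of the factory approach, in plain Python rather than Airflow so it stays self-contained. All names here (the spec class, the DAG ID pattern, the media type list) are hypothetical; a real implementation would build Airflow DAG objects instead of these placeholder specs. The point it illustrates is that the environment is fixed when the DAG is generated, so there is no runtime parameter that could be set to production by mistake.

```python
from dataclasses import dataclass

# Assumed values for illustration only.
ALLOWED_ENVIRONMENTS = ("staging", "production")
MEDIA_TYPES = ("image", "audio")


@dataclass(frozen=True)
class FilteredIndexDagSpec:
    """Placeholder for a generated DAG; frozen so the environment cannot change."""

    dag_id: str
    environment: str
    media_type: str


def create_filtered_index_dag_specs(environment):
    """Generate one spec per media type, bound to a single environment."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {environment}")
    return [
        FilteredIndexDagSpec(
            dag_id=f"{environment}_create_filtered_{media_type}_index",
            environment=environment,
            media_type=media_type,
        )
        for media_type in MEDIA_TYPES
    ]
```

With a DAG parameter instead, a single DAG would accept the environment at trigger time, which is the case the comment above wants to avoid.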

AetherUnbound

Amazing, thanks for all your hard work on this!

openverse-bot · 2023-07-14T00:00:21Z

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 6 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)².

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

¹ Specifically, Saturday and Sunday.
² For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

krysal · 2023-07-14T00:23:38Z

Thank you all!

AetherUnbound mentioned this pull request Jun 9, 2023

Search relevancy sandbox #392

Closed

16 tasks

krysal self-assigned this Jun 9, 2023

krysal force-pushed the rfc/staging_es_reindex_dags branch from 720ae1e to a65144a Compare June 20, 2023 18:51

krysal requested review from AetherUnbound and sarayourfriend June 20, 2023 19:43

krysal marked this pull request as ready for review June 20, 2023 19:43

krysal requested review from a team as code owners June 20, 2023 19:43

krysal requested a review from stacimc June 20, 2023 19:43

krysal force-pushed the rfc/staging_es_reindex_dags branch 2 times, most recently from 54e14d4 to d6cf421 Compare June 20, 2023 20:01

krysal changed the title ~~Implementation Plan: Staging Elasticsearch Reindexing DAGs~~ Implementation Plan: Staging Elasticsearch Reindex DAGs Jun 20, 2023

sarayourfriend reviewed Jun 20, 2023

sarayourfriend mentioned this pull request Jun 26, 2023

Unify indexing concurrency checks in Airflow DAGs #2480

Open

krysal force-pushed the rfc/staging_es_reindex_dags branch from d6cf421 to b7900a6 Compare June 29, 2023 17:14

AetherUnbound reviewed Jun 29, 2023

krysal force-pushed the rfc/staging_es_reindex_dags branch 2 times, most recently from 0894743 to 37b07f1 Compare June 30, 2023 01:54

AetherUnbound approved these changes Jul 13, 2023

krysal force-pushed the rfc/staging_es_reindex_dags branch from 6159b34 to f2224a5 Compare July 13, 2023 23:39

krysal and others added 19 commits July 13, 2023 19:41

Fix broken link

ec33ee7

Create implementation plan draft from template

6152d79

Update link

159732b

Describe first DAG: recreate_full_<media>_index

187ad53

Describe second DAG: create_proportional_by_provider_<media>_index

bd0bd7a

Fill the Alternatives section

ab35e89

Use absolute paths and fix links

52aafa9

Generate DAG docs

6cb14e3

Add staging to the name of DAGs and fix typo

7e986e0

Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>

Add steps for <media_type>-full alias creation

557c983

Change the approach to be DAG factories

e9ea24c

Rewrite 2nd DAG factory

a6412e4

Fix typos

6cbd28c

Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>

Add reference to indices.update_aliases documentation

245ce95

Add link to Airflow's Dynamic Task Mapping docs

82c1b62

Add note of optionality of new aliases

3d3aa78

Add note on combining DAGs

98d9fe4

Change default source_index

5df190c

Add note on updating the filtered index and reviewers approval

75fb276

krysal force-pushed the rfc/staging_es_reindex_dags branch from f2224a5 to 75fb276 Compare July 13, 2023 23:41

krysal merged commit 3a34d33 into main Jul 14, 2023

krysal deleted the rfc/staging_es_reindex_dags branch July 14, 2023 00:24

sarayourfriend mentioned this pull request Oct 30, 2023

Add DAG for creating staging indices #3232

Merged

8 tasks
