This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Add iNaturalist.org metadata #549
Add iNaturalist.org metadata #549
Changes from 85 commits
0dc32e3
e44b2c8
23419ba
2c98dd1
0c2f67c
7f357c0
3cd9db3
c7ebeba
faf6716
e78f9a3
1a29f00
b145238
fc2dc54
d4b2a56
15d2d65
59ece62
0c1f65c
0c0fd9e
067a16c
1053d3b
b1cc628
697b72e
89ce375
ab9f549
3cab54c
900de4f
f72c251
b54c4c5
59cf0b8
4449811
76111cc
e4b61c4
7fd1102
bd1bd5f
2509db2
ea335ba
10cc426
dcba976
115c7ae
93d576f
064e223
464ffe8
b97b8c7
afb9cb3
0b38630
ca7564b
9581c12
cff5c5d
2b5c53e
bfe7b31
b15d989
bf452b4
7b0854a
fe3d0c2
5b0356c
52f3b13
f515bda
20b6218
e2c1c9b
1a1d940
b2e1177
42d58e9
95ac706
e73c5ea
2ac5dac
216614e
337c1d7
212f91d
4894f72
43fe04f
87515e0
a263f31
dd690d7
788f69e
989ce98
5f25ca9
3bb7f65
772eb0b
1db92b0
a32f1d6
6c78765
1bbcb6e
b71fbdf
4e20e32
e21fecb
0100a7e
9eed1f9
dce23fb
2a4074e
a6ff6df
ce2df38
dc6887c
ec53e94
c039754
f21af29
6282c17
27506d8
44418e9
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this have an issue created for it, if it doesn't already? If so can we add the issue linked here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess so, yes. Is that something I can do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You sure can: https://github.com/wordpress/openverse-catalog/issues/new/choose
Once the issue is created we typically just put the issue URL where the TODO is in the code so that the task is recorded and can be prioritized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do! https://github.com/WordPress/openverse-catalog/issues/682
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: how was 10k decided on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gut instinct from @AetherUnbound, if memory serves. I haven't been able to run the whole thing on my local machine, because of thermal throttling, but I did get through several hundred thousand (maybe a couple of million?) records locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. Is there a plan to monitor the resource consumption of this DAG when we first run it in production @AetherUnbound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, our normal limit is 100 but since there's so much data here it seemed prudent to increase the batch volume on both ends. We will indeed be monitoring performance, and we'll be sure to kick this off for the first time when nothing else intensive is running on Airflow so we have the smallest likelyhood of taking something down (CC @stacimc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances does this happen? Is it a problem on our end or with the upstream data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a great question! I actually don't think it does happen because the foreign identifier we're using is the photo id, and inaturalist (almost certainly) has pk constraints on that in their database. But, I think I got the impression that it was something I should test for here just in case. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you're wrong to be cautious with assumptions. I wonder if it should be an error condition, or at least something we keep track of when it happens. Especially given how rare we assume it to be, it could be a bad thing if it does happen 🙂 I'd be fine if a follow up issue was created to handle this in an auditable way rather than attempting to solve it now in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, and actually, 🤦 there definitely are not pk constraints on the photo table, though I do still think that there are probably not null constraints. Anyway, I'll make an issue for this. I also wonder though, how it relates to WordPress/openverse#292, and whether this might be best as part of a more general cross-provider check in the catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding a couple of data quality related follow-ups, actually: https://github.com/WordPress/openverse-catalog/issues/684 and https://github.com/WordPress/openverse-catalog/issues/685 and I will integrate them in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this describing a special case or just a helpful note? This was my understanding of how templates worked generally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, yes. I was more trying to be clear about where
prev_start_date_success
comes from, but I see how that's maybe obvious and/or not adding enough additional info to be helpful. How about I drop it?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still a concern? How does this manifest in the final TSV?