Conversation
Resolved (outdated) review threads on openverse_catalog/dags/providers/provider_csv_load_scripts/inaturalist/01_photos.sql and openverse_catalog/dags/providers/provider_csv_load_scripts/inaturalist/04_observers.sql.
This is great! I think it's an awesome start and it feels like the right direction to me 💯
I was able to get this running locally, after a few small tweaks. I've committed those, hope that's okay!
I think our best bet right now is to set this up so we end up pulling the data from `inaturalist.tsv` into a python script, and using the `ImageStore` class to write the records to a local TSV. That can then be picked up by the rest of the workflow and finish as the others do. It does seem a bit pedantic to go from csv -> postgres -> TSV -> postgres, but I think it's a good middle ground for us. The `MediaStore` class has a lot of validation and cleaning steps internally. We could eventually have replicas of that in postgres directly so we could just go from `inaturalist.tsv` -> `openledger.image`, but for now I think it's best to leverage the tools we have. To that end, if there are steps (like the license munging) that would be easier to do in python, we can perform that in the pull script too before adding a record to the image store!
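To make that a bit more concrete, here's a rough sketch of the kind of pull script I'm imagining — just a sketch: the column names, provider string, and import paths below are illustrative placeholders, not the real iNaturalist schema or final code.

```python
# Sketch only: column names, provider string, and import paths are placeholders.
import csv

from common.licenses import get_license_info
from common.storage.image import ImageStore

image_store = ImageStore(provider="inaturalist")

with open("inaturalist.tsv") as tsv_file:
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Any munging that's easier in python (e.g. normalizing the license
        # fields) happens here, before the record is handed to the store.
        license_info = get_license_info(
            license_=row["license"].lower(),
            license_version=row["license_version"],
        )
        image_store.add_item(
            foreign_identifier=row["photo_id"],
            foreign_landing_url=row["landing_url"],
            image_url=row["image_url"],
            license_info=license_info,
            creator=row.get("creator"),
        )

# Flush whatever is left in the buffer to the local TSV.
image_store.commit()
```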
It looks like you got the docker/minio stuff figured out, sorry that was such a pain! Your suggested approach here looks great in that regard.
I think we could even still use the `create_provider_api_workflow` function too! We would just need to grab the DAG and add the additional `.gz`-to-postgres step before it. Maybe something like this, using `DAG::add_task` and `DAG::set_dependency`:
DAG_ID = "inaturalist_workflow"
dag = create_provider_api_workflow(
DAG_ID,
inaturalist.main,
start_date=datetime(1970, 1, 1),
max_active_tasks=1,
schedule_string='@monthly',
dated=False,
)
PythonOperator("download_gz_and_upload_to_postgres", python_callable=..., dag=dag)
dag.set_dependency("download_gz_and_upload_to_postgres", "pull_image_data")
What do you think?
Use exclude so only need to list the one provider subfolder
Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>
No worries! But I tried adding
Maybe it's just that #635 hasn't been merged yet?
Yep! Should have mentioned that, but it'll hopefully be merged before/when we deploy this DAG 😄
This reverts commit 9eed1f9.
Weird, looks like the error was a total fluke! CI is ✅ now
Reviewing this again now, @rwidom.
Okay, I've tested the DAG locally and it runs great 🙂 I had to `just down -v` twice for some reason, or it could have been that minio was still getting hydrated. In any case, it did work the second time after I made sure minio was ready before enabling the DAG. Going to review the code now.
Exciting! I don't have any blockers, just a handful of questions. Great work, Rebecca 🤠
# adjustments to buffer limits. TO DO: try to integrate this with the dev
# environment logic in the base class, rather than just over-writing it.
self.media_stores["image"].buffer_length = 10_000
Should this have an issue created for it, if it doesn't already? If so, can we link the issue here?
I guess so, yes. Is that something I can do?
You sure can: https://github.com/wordpress/openverse-catalog/issues/new/choose
Once the issue is created we typically just put the issue URL where the TODO is in the code so that the task is recorded and can be prioritized.
# adjustments to buffer limits. TO DO: try to integrate this with the dev
# environment logic in the base class, rather than just over-writing it.
self.media_stores["image"].buffer_length = 10_000
self.batch_limit = 10_000
Out of curiosity: how was 10k decided on?
Gut instinct from @AetherUnbound, if memory serves. I haven't been able to run the whole thing on my local machine, because of thermal throttling, but I did get through several hundred thousand (maybe a couple of million?) records locally.
Gotcha. Is there a plan to monitor the resource consumption of this DAG when we first run it in production @AetherUnbound?
That's right, our normal limit is 100, but since there's so much data here it seemed prudent to increase the batch volume on both ends. We will indeed be monitoring performance, and we'll be sure to kick this off for the first time when nothing else intensive is running on Airflow, so we have the smallest likelihood of taking something down (CC @stacimc)
if data.get("foreign_identifier") is None: | ||
return None |
Under what circumstances does this happen? Is it a problem on our end or with the upstream data?
What a great question! I actually don't think it does happen because the foreign identifier we're using is the photo id, and inaturalist (almost certainly) has pk constraints on that in their database. But, I think I got the impression that it was something I should test for here just in case. What do you think?
I don't think you're wrong to be cautious with assumptions. I wonder if it should be an error condition, or at least something we keep track of when it happens. Especially given how rare we assume it to be, it could be a bad thing if it does happen 🙂 I'd be fine if a follow up issue was created to handle this in an auditable way rather than attempting to solve it now in this PR.
Yeah, and actually, 🤦 there definitely are not pk constraints on the photo table, though I do still think that there are probably not null constraints. Anyway, I'll make an issue for this. I also wonder though, how it relates to WordPress/openverse#292, and whether this might be best as part of a more general cross-provider check in the catalog.
I'm adding a couple of data quality related follow-ups, actually: https://github.com/WordPress/openverse-catalog/issues/684 and https://github.com/WordPress/openverse-catalog/issues/685 and I will integrate them in the comments.
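In the meantime, one lightweight way to keep those skips auditable rather than silent would be to log them. A minimal sketch, with a standalone function standing in for the real ingester method (names here are illustrative):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def get_record_data(data: dict) -> Optional[dict]:
    if data.get("foreign_identifier") is None:
        # Log the skip so missing IDs show up in the Airflow task logs instead
        # of disappearing silently; a counter or metric would also work here.
        logger.warning("Skipping record with no foreign_identifier: %s", data)
        return None
    return data
```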
task_id="check_for_file_updates", | ||
python_callable=INaturalistDataIngester.compare_update_dates, | ||
op_kwargs={ | ||
# With the templated values ({{ x }}) airflow will fill it in |
Is this describing a special case or just a helpful note? This was my understanding of how templates worked generally.
Right, yes. I was more trying to be clear about where `prev_start_date_success` comes from, but I see how that's maybe obvious and/or not adding enough additional info to be helpful. How about I drop it?
Up to you 👍
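(For anyone following along, the pattern under discussion looks roughly like the sketch below — the `last_success` kwarg name is a placeholder, and `INaturalistDataIngester` is assumed to be imported from the inaturalist provider module. Airflow renders the Jinja template in `op_kwargs` before calling the function.)

```python
from airflow.operators.python import PythonOperator

check_for_file_updates = PythonOperator(
    task_id="check_for_file_updates",
    python_callable=INaturalistDataIngester.compare_update_dates,
    op_kwargs={
        # Airflow substitutes the start date of the last successful DAG run
        # here before calling compare_update_dates.
        "last_success": "{{ prev_start_date_success }}",
    },
)
```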
Resolved review threads on openverse_catalog/dags/providers/provider_csv_load_scripts/inaturalist/create_schema.sql and openverse_catalog/dags/providers/provider_api_scripts/inaturalist.py (outdated).
I've tested this PR locally, and it works really well for the sample data. The only thing I noticed is it doesn't add `category`, because the changes to the `ImageStore` were merged into main after this PR was opened.
Would it be possible to rebase this PR onto main so that we could make sure that the category is added, and there are no other problems?
I didn't update my local branch initially, sorry.
Congratulations on creating a new ingestion process, @rwidom!
Co-authored-by: Olga Bulat <obulat@gmail.com>
Fixes
Fixes WordPress/openverse#1608 by @sarayourfriend
Description
The iNaturalist API is not intended for data scraping, but there is a full dump intended for sharing on S3. Those files cover over 120 million images, with information about the person who shared the photo and the species in the photo represented across four normalized tables.
This PR loads all of the iNaturalist metadata at once. The source files do not include any date or time fields to support differentiating among duplicate IDs or performing incremental updates. However, image files live in the same S3 bucket, prefixed by the photo_id that appears in the metadata files. So, future DAG runs could use `S3Hook.list_keys` to pull the photo ID prefixes whose files have been updated since the `prev_start_date_success`, and process only those photo IDs each month.
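A hedged sketch of what that future incremental step might look like (this goes through `S3Hook.get_bucket` so it can filter on last-modified; the bucket name, prefix, and key layout are assumptions, not confirmed against the real open data bucket):

```python
from datetime import datetime

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def get_updated_photo_ids(prev_start_date_success: datetime) -> set[str]:
    """Return photo id prefixes whose files changed since the last successful run.

    prev_start_date_success must be timezone-aware to compare with S3 timestamps.
    """
    bucket = S3Hook(aws_conn_id="aws_default").get_bucket("inaturalist-open-data")
    updated = set()
    # Assumed key layout: photos/<photo_id>/original.jpg
    for obj in bucket.objects.filter(Prefix="photos/"):
        if obj.last_modified >= prev_start_date_success:
            updated.add(obj.key.split("/")[1])
    return updated
```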
Testing Instructions
Check out the branch. This PR includes changes to the docker dev setup, so it's a good idea to use `just down -v` and `just recreate` when you're testing. `just test` will cover basic tests of the provider data ingestion class functions, but not the DAG steps that make the metadata available.
There are very small subsets of the real files in /tests/s3-data/inaturalist-open-data that will be loaded to the dev environment equivalent of S3, so you can kick off the inaturalist_workflow DAG in Airflow and inspect the logs, the resulting TSV file, and the `image` table in the db. The Airflow log will give you the exact file name, and you can see it by running `just shell` and then `more /var/workflow_output/inaturalist_image_v001_20220709220913.tsv` (or whatever the file is called). Run the DAG a second time to confirm that the skips work when there is no new data.
I tried putting the real full-size iNaturalist files in /tests/s3-data/inaturalist-open-data, running `just down -v` and `just recreate`, and waiting a little bit for minio to finish loading the files. I've not been able to process the full dataset in my dev environment, but for now I am assuming that is due to limitations of the environment.