This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add iNaturalist.org metadata #549

Merged
rwidom merged 98 commits into main from issue/439-add-provider-iNaturalist on Aug 19, 2022

Conversation

@rwidom (Collaborator) commented Jun 3, 2022

Fixes

Fixes WordPress/openverse#1608 by @sarayourfriend

Description

The iNaturalist API is not intended for data scraping, but a full dump intended for sharing is available on S3. Those files cover over 120 million images, with information about the person who shared each photo and the species in it, represented across four normalized tables.

This PR loads all of the inaturalist metadata at once. The source files do not include any date or time fields that would support differentiating among duplicate IDs or performing incremental updates. However, the image files are in the same S3 bucket, prefixed by the photo_id from the metadata files. So, future DAG runs could use S3Hook.list_keys to pull photo ID prefixes with files that have been updated since the prev_start_date_success and process only those photo IDs each month.
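
For illustration, here is a rough sketch of that incremental approach (not part of this PR). It uses S3Hook.get_file_metadata to get modification times, since list_keys alone does not return them; the bucket layout and names are assumptions based on the description above:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def updated_photo_ids(prev_start_date_success):
    # Hypothetical helper: return photo IDs whose files changed since the
    # last successful DAG run. Assumes photo files live under
    # photos/<photo_id>/... in the inaturalist-open-data bucket.
    s3 = S3Hook(aws_conn_id="aws_default")
    metadata = s3.get_file_metadata(
        prefix="photos/", bucket_name="inaturalist-open-data"
    )
    return {
        obj["Key"].split("/")[1]  # the photo_id segment of the key
        for obj in metadata
        if obj["LastModified"] > prev_start_date_success
    }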

Testing Instructions

Check out the branch. This PR includes changes to the docker dev setup, so it's a good idea to run just down -v and just recreate when you're testing.

just test will cover basic tests of the provider data ingestion class functions, but not the DAG steps that make the metadata available.

There are very small subsets of the real files in /tests/s3-data/inaturalist-open-data that will be loaded to the dev environment equivalent of S3, so you can kick off the inaturalist_workflow DAG in Airflow and inspect the logs, the resulting TSV file, and the image table in the db. The Airflow log will give you the exact file name, and you can view it by running just shell and then more /var/workflow_output/inaturalist_image_v001_20220709220913.tsv (or whatever the file is called). Run the DAG a second time to confirm that the skips work when there is no new data.

I tried putting the real full-size inaturalist files in /tests/s3-data/inaturalist-open-data, running just down -v and just recreate, and waiting a bit for minio to finish loading the files. I haven't been able to process the full dataset in my dev environment, but for now I'm assuming that's due to limitations of the environment.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot added the ✨ goal: improvement, 💻 aspect: code, and 🟩 priority: low labels on Jun 3, 2022
@sarayourfriend: This comment was marked as outdated.

@rwidom: This comment was marked as resolved.

@sarayourfriend: This comment was marked as resolved.

@rwidom: This comment was marked as resolved.

@AetherUnbound (Contributor) left a comment:

This is great! I think it's an awesome start and it feels like the right direction to me 💯

I was able to get this running locally, after a few small tweaks. I've committed those, hope that's okay!

I think our best bet right now is to set this up so we end up pulling the data from inaturalist.tsv into a python script, and using the ImageStore class to write the records to a local TSV. That can then be picked up by the rest of the workflow and finish as the others do. It does seem a bit roundabout to go from csv -> postgres -> TSV -> postgres, but I think it's a good middle ground for us. The MediaStore class has a lot of validation and cleaning steps internally. We could eventually have replicas of that in postgres directly so we could just go from inaturalist.tsv -> openledger.image, but for now I think it's best to leverage the tools we have. To that end, if there are steps (like the license munging) that would be easier to do in python, we can perform those in the pull script too, before adding a record to the image store!
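
For reference, a minimal sketch of what that pull script could look like (fetch_inaturalist_rows and the column names here are hypothetical placeholders, not the final implementation):

from common.licenses import get_license_info
from common.storage.image import ImageStore

image_store = ImageStore(provider="inaturalist")

# fetch_inaturalist_rows is a hypothetical helper that reads the records
# previously loaded into postgres from the source files.
for row in fetch_inaturalist_rows():
    # add_item validates and cleans each record before buffering it.
    image_store.add_item(
        foreign_identifier=row["photo_id"],
        foreign_landing_url=row["landing_url"],
        image_url=row["image_url"],
        license_info=get_license_info(license_url=row["license_url"]),
        creator=row["creator"],
        title=row["title"],
    )

# Flush any remaining buffered records to the local TSV.
image_store.commit()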

It looks like you got the docker/minio stuff figured out, sorry that was such a pain! Your suggested approach here looks great in that regard.

I think we could even still use the create_provider_api_workflow function too! We would just need to grab the DAG and add the additional .gz-to-postgres step before it. Maybe something like this using DAG::add_task and DAG::set_dependency:

DAG_ID = "inaturalist_workflow"

dag = create_provider_api_workflow(
    DAG_ID,
    inaturalist.main,
    start_date=datetime(1970, 1, 1),
    max_active_tasks=1,
    schedule_string='@monthly',
    dated=False,
)

PythonOperator("download_gz_and_upload_to_postgres", python_callable=..., dag=dag)

dag.set_dependency("download_gz_and_upload_to_postgres", "pull_image_data")

What do you think?

Use exclude so only need to list the one provider subfolder

Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>
@rwidom (Collaborator, Author) commented Aug 2, 2022

Gah, I keep thinking of things 😅 I think we can comfortably set the default category for iNaturalist to PHOTOGRAPH (see #635), what do you think?

No worries!

But I tried adding "inaturalist": ImageCategory.PHOTOGRAPH.value, to DEFAULT_IMAGE_CATEGORY in provider_details, ran the DAG successfully, and then looked in db-shell as follows:

openledger> select provider, category, count(*) from image group by provider, category;
+-------------+----------+-------+
| provider    | category | count |
|-------------+----------+-------|
| inaturalist | <null>   | 24    |
+-------------+----------+-------+
SELECT 1
Time: 0.007s

Maybe it's just that #635 hasn't been merged yet?

@AetherUnbound (Contributor)

Maybe it's just that #635 hasn't been merged yet?

Yep! Should have mentioned that, but it'll hopefully be merged before/when we deploy this DAG 😄

@AetherUnbound (Contributor)

Weird, looks like the error was a total fluke! CI is ✅ now

@AetherUnbound (Contributor)

@obulat @stacimc This should be ready for a final review, do you mind taking a look at it when you get a chance?

@sarayourfriend (Contributor)

Reviewing this again now, @rwidom.

@sarayourfriend (Contributor)

Okay, I've tested the DAG locally and it runs great 🙂 I had to just down -v twice for some reason, or it could have been that minio was still getting hydrated. In any case, it did work the second time after I made sure minio was ready before enabling the DAG. Going to review the code now.

@sarayourfriend (Contributor) left a comment:

Exciting! I don't have any blockers, just a handful of questions. Great work, Rebecca 🤠

Comment on lines 53 to 55
# adjustments to buffer limits. TO DO: try to integrate this with the dev
# environment logic in the base class, rather than just over-writing it.
self.media_stores["image"].buffer_length = 10_000
Contributor:

Should this have an issue created for it, if it doesn't already? If so can we add the issue linked here?

Collaborator Author:

I guess so, yes. Is that something I can do?

@sarayourfriend (Contributor) commented Aug 18, 2022:

You sure can: https://github.com/wordpress/openverse-catalog/issues/new/choose

Once the issue is created we typically just put the issue URL where the TODO is in the code so that the task is recorded and can be prioritized.

# adjustments to buffer limits. TO DO: try to integrate this with the dev
# environment logic in the base class, rather than just over-writing it.
self.media_stores["image"].buffer_length = 10_000
self.batch_limit = 10_000
Contributor:

Out of curiosity: how was 10k decided on?

Collaborator Author:

Gut instinct from @AetherUnbound, if memory serves. I haven't been able to run the whole thing on my local machine, because of thermal throttling, but I did get through several hundred thousand (maybe a couple of million?) records locally.

Contributor:

Gotcha. Is there a plan to monitor the resource consumption of this DAG when we first run it in production @AetherUnbound?

Contributor:

That's right, our normal limit is 100, but since there's so much data here it seemed prudent to increase the batch volume on both ends. We will indeed be monitoring performance, and we'll be sure to kick this off for the first time when nothing else intensive is running on Airflow, so we have the smallest likelihood of taking something down (CC @stacimc)

Comment on lines 83 to 84
if data.get("foreign_identifier") is None:
return None
Contributor:

Under what circumstances does this happen? Is it a problem on our end or with the upstream data?

Collaborator Author:

What a great question! I actually don't think it does happen, because the foreign identifier we're using is the photo id, and inaturalist (almost certainly) has PK constraints on that in their database. But I got the impression that it was something I should test for here just in case. What do you think?

@sarayourfriend (Contributor) commented Aug 18, 2022:

I don't think you're wrong to be cautious with assumptions. I wonder if it should be an error condition, or at least something we keep track of when it happens. Especially given how rare we assume it to be, it could be a bad thing if it does happen 🙂 I'd be fine if a follow-up issue was created to handle this in an auditable way rather than attempting to solve it now in this PR.

Collaborator Author:

Yeah, and actually, 🤦 there definitely are not PK constraints on the photo table, though I do still think there are probably NOT NULL constraints. Anyway, I'll make an issue for this. I also wonder, though, how it relates to WordPress/openverse#292, and whether this might be best as part of a more general cross-provider check in the catalog.

Collaborator Author:

I'm adding a couple of data-quality-related follow-ups, actually: https://github.com/WordPress/openverse-catalog/issues/684 and https://github.com/WordPress/openverse-catalog/issues/685, and I will reference them in the code comments.

task_id="check_for_file_updates",
python_callable=INaturalistDataIngester.compare_update_dates,
op_kwargs={
# With the templated values ({{ x }}) airflow will fill it in
Contributor:

Is this describing a special case or just a helpful note? This was my understanding of how templates worked generally.

Collaborator Author:

Right, yes. I was more trying to be clear about where prev_start_date_success comes from, but I see how that's maybe obvious and/or not adding enough additional info to be helpful. How about I drop it?
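
(For context, roughly what that templating looks like in the DAG definition; the last_success kwarg name here is illustrative, and Airflow renders the template at runtime with the timestamp of the last successful run:)

check_for_file_updates = PythonOperator(
    task_id="check_for_file_updates",
    python_callable=INaturalistDataIngester.compare_update_dates,
    op_kwargs={
        # Rendered by Airflow from the DAG run context at runtime.
        "last_success": "{{ prev_start_date_success }}",
    },
)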

Contributor:

Up to you 👍

@obulat (Contributor) left a comment:

I've tested this PR locally, and it works really well for the sample data. The only thing I noticed is that it doesn't add the category, because the changes to the ImageStore were merged into main after this PR was opened.

Would it be possible to rebase this PR onto main so that we can make sure the category is added and there are no other problems?
I didn't update my local branch initially, sorry.
Congratulations on creating a new ingestion process, @rwidom!

Co-authored-by: Olga Bulat <obulat@gmail.com>
@rwidom (Collaborator, Author) commented Aug 19, 2022

Congratulations on creating a new ingestion process, @rwidom!

Thanks so much for all of the support along the way @obulat !

@rwidom rwidom merged commit ee474f2 into main Aug 19, 2022
@rwidom rwidom deleted the issue/439-add-provider-iNaturalist branch August 19, 2022 16:42
@krysal krysal mentioned this pull request Jan 3, 2023