This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add iNaturalist.org metadata #549

Merged
rwidom merged 98 commits into main from issue/439-add-provider-iNaturalist on Aug 19, 2022

Conversation

@rwidom (Collaborator) commented Jun 3, 2022

Fixes

Fixes WordPress/openverse#1608 by @sarayourfriend

Description

The iNaturalist API is not intended for data scraping, but a full dump intended for sharing is available on S3. Those files cover over 120 million images, with information about the person who shared each photo and the species in it, represented across four normalized tables.

This PR loads all of the inaturalist metadata at once. The source files do not include any date or time fields that would support differentiating among duplicate IDs or performing incremental updates. However, the image files are in the same S3 bucket, prefixed by the photo_id from the metadata files. So, future DAG runs could use S3Hook.list_keys to pull photo ID prefixes with files that have been updated since the prev_start_date_success and process only those photo IDs each month.
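
For illustration, here is a rough sketch of that incremental approach (not part of this PR). It uses S3Hook.get_file_metadata to get modification times, since list_keys alone does not return them; the bucket layout and names are assumptions based on the description above:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def updated_photo_ids(prev_start_date_success):
    # Hypothetical helper: return photo IDs whose files changed since the
    # last successful DAG run. Assumes photo files live under
    # photos/<photo_id>/... in the inaturalist-open-data bucket.
    s3 = S3Hook(aws_conn_id="aws_default")
    metadata = s3.get_file_metadata(
        prefix="photos/", bucket_name="inaturalist-open-data"
    )
    return {
        obj["Key"].split("/")[1]  # the photo_id segment of the key
        for obj in metadata
        if obj["LastModified"] > prev_start_date_success
    }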

Testing Instructions

Check out the branch. This PR includes changes to the docker dev setup, so it's a good idea to run just down -v and just recreate when you're testing.

just test will cover basic tests of the provider data ingestion class functions, but not the DAG steps that make the metadata available.

There are very small subsets of the real files in /tests/s3-data/inaturalist-open-data that will be loaded to the dev environment equivalent of S3, so you can kick off the inaturalist_workflow DAG in Airflow and inspect the logs, the resulting TSV file, and the image table in the db. The Airflow log will give you the exact file name, and you can view it by running just shell and then more /var/workflow_output/inaturalist_image_v001_20220709220913.tsv (or whatever the file is called). Run the DAG a second time to confirm that the skips work when there is no new data.

I tried putting the real full-size inaturalist files in /tests/s3-data/inaturalist-open-data, running just down -v and just recreate, and waiting a bit for minio to finish loading the files. I haven't been able to process the full dataset in my dev environment, but for now I'm assuming that's due to limitations of the environment.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot added the ✨ goal: improvement, 💻 aspect: code, and 🟩 priority: low labels on Jun 3, 2022
@sarayourfriend: This comment was marked as outdated.

@rwidom: This comment was marked as resolved.

@sarayourfriend: This comment was marked as resolved.

@rwidom: This comment was marked as resolved.

@AetherUnbound (Contributor) left a comment:

This is great! I think it's an awesome start and it feels like the right direction to me 💯

I was able to get this running locally, after a few small tweaks. I've committed those, hope that's okay!

I think our best bet right now is to set this up so we end up pulling the data from inaturalist.tsv into a python script, and using the ImageStore class to write the records to a local TSV. That can then be picked up by the rest of the workflow and finish as the others do. It does seem a bit roundabout to go from csv -> postgres -> TSV -> postgres, but I think it's a good middle ground for us. The MediaStore class has a lot of validation and cleaning steps internally. We could eventually have replicas of that in postgres directly so we could just go from inaturalist.tsv -> openledger.image, but for now I think it's best to leverage the tools we have. To that end, if there are steps (like the license munging) that would be easier to do in python, we can perform those in the pull script too, before adding a record to the image store!
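
For reference, a minimal sketch of what that pull script could look like (fetch_inaturalist_rows and the column names here are hypothetical placeholders, not the final implementation):

from common.licenses import get_license_info
from common.storage.image import ImageStore

image_store = ImageStore(provider="inaturalist")

# fetch_inaturalist_rows is a hypothetical helper that reads the records
# previously loaded into postgres from the source files.
for row in fetch_inaturalist_rows():
    # add_item validates and cleans each record before buffering it.
    image_store.add_item(
        foreign_identifier=row["photo_id"],
        foreign_landing_url=row["landing_url"],
        image_url=row["image_url"],
        license_info=get_license_info(license_url=row["license_url"]),
        creator=row["creator"],
        title=row["title"],
    )

# Flush any remaining buffered records to the local TSV.
image_store.commit()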

It looks like you got the docker/minio stuff figured out, sorry that was such a pain! Your suggested approach here looks great in that regard.

I think we could even still use the create_provider_api_workflow function too! We would just need to grab the DAG and add the additional .gz-to-postgres step before it. Maybe something like this using DAG::add_task and DAG::set_dependency:

DAG_ID = "inaturalist_workflow"

dag = create_provider_api_workflow(
    DAG_ID,
    inaturalist.main,
    start_date=datetime(1970, 1, 1),
    max_active_tasks=1,
    schedule_string='@monthly',
    dated=False,
)

PythonOperator("download_gz_and_upload_to_postgres", python_callable=..., dag=dag)

dag.set_dependency("download_gz_and_upload_to_postgres", "pull_image_data")

What do you think?

Use exclude so only need to list the one provider subfolder

Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>
@rwidom (Collaborator, Author) commented Aug 2, 2022

Gah, I keep thinking of things 😅 I think we can comfortably set the default category for iNaturalist to PHOTOGRAPH (see #635), what do you think?

No worries!

But I tried adding "inaturalist": ImageCategory.PHOTOGRAPH.value, to DEFAULT_IMAGE_CATEGORY in provider_details, ran the DAG successfully, and then looked in db-shell as follows:

openledger> select provider, category, count(*) from image group by provider, category;
+-------------+----------+-------+
| provider    | category | count |
|-------------+----------+-------|
| inaturalist | <null>   | 24    |
+-------------+----------+-------+
SELECT 1
Time: 0.007s

Maybe it's just that #635 hasn't been merged yet?

@AetherUnbound (Contributor)

Maybe it's just that #635 hasn't been merged yet?

Yep! Should have mentioned that, but it'll hopefully be merged before/when we deploy this DAG 😄

@AetherUnbound (Contributor)

Weird, looks like the error was a total fluke! CI is ✅ now

@AetherUnbound (Contributor)

@obulat @stacimc This should be ready for a final review, do you mind taking a look at it when you get a chance?

@sarayourfriend (Contributor)

Reviewing this again now, @rwidom.

@sarayourfriend (Contributor)

Okay, I've tested the DAG locally and it runs great 🙂 I had to just down -v twice for some reason, or it could have been that minio was still getting hydrated. In any case, it did work the second time after I made sure minio was ready before enabling the DAG. Going to review the code now.

@sarayourfriend (Contributor) left a comment:

Exciting! I don't have any blockers, just a handful of questions. Great work, Rebecca 🤠

Comment on lines 53 to 55
# adjustments to buffer limits. TO DO: try to integrate this with the dev
# environment logic in the base class, rather than just over-writing it.
self.media_stores["image"].buffer_length = 10_000
Contributor:

Should this have an issue created for it, if it doesn't already? If so can we add the issue linked here?

Collaborator Author:

I guess so, yes. Is that something I can do?

@sarayourfriend (Contributor) commented Aug 18, 2022:

You sure can: https://github.com/wordpress/openverse-catalog/issues/new/choose

Once the issue is created we typically just put the issue URL where the TODO is in the code so that the task is recorded and can be prioritized.

# adjustments to buffer limits. TO DO: try to integrate this with the dev
# environment logic in the base class, rather than just over-writing it.
self.media_stores["image"].buffer_length = 10_000
self.batch_limit = 10_000
Contributor:

Out of curiosity: how was 10k decided on?

Collaborator Author:

Gut instinct from @AetherUnbound, if memory serves. I haven't been able to run the whole thing on my local machine, because of thermal throttling, but I did get through several hundred thousand (maybe a couple of million?) records locally.

Contributor:

Gotcha. Is there a plan to monitor the resource consumption of this DAG when we first run it in production @AetherUnbound?

Contributor:

That's right, our normal limit is 100, but since there's so much data here it seemed prudent to increase the batch volume on both ends. We will indeed be monitoring performance, and we'll be sure to kick this off for the first time when nothing else intensive is running on Airflow, so we have the smallest likelihood of taking something down (CC @stacimc)

Comment on lines 83 to 84
if data.get("foreign_identifier") is None:
return None
Contributor:

Under what circumstances does this happen? Is it a problem on our end or with the upstream data?

Collaborator Author:

What a great question! I actually don't think it does happen, because the foreign identifier we're using is the photo id, and inaturalist (almost certainly) has PK constraints on that in their database. But I got the impression that it was something I should test for here just in case. What do you think?

@sarayourfriend (Contributor) commented Aug 18, 2022:

I don't think you're wrong to be cautious with assumptions. I wonder if it should be an error condition, or at least something we keep track of when it happens. Especially given how rare we assume it to be, it could be a bad thing if it does happen 🙂 I'd be fine if a follow-up issue was created to handle this in an auditable way rather than attempting to solve it now in this PR.

Collaborator Author:

Yeah, and actually, 🤦 there definitely are not PK constraints on the photo table, though I do still think there are probably NOT NULL constraints. Anyway, I'll make an issue for this. I also wonder, though, how it relates to WordPress/openverse#292, and whether this might be best as part of a more general cross-provider check in the catalog.

Collaborator Author:

I'm adding a couple of data-quality-related follow-ups, actually: https://github.com/WordPress/openverse-catalog/issues/684 and https://github.com/WordPress/openverse-catalog/issues/685, and I will reference them in the code comments.

task_id="check_for_file_updates",
python_callable=INaturalistDataIngester.compare_update_dates,
op_kwargs={
# With the templated values ({{ x }}) airflow will fill it in
Contributor:

Is this describing a special case or just a helpful note? This was my understanding of how templates worked generally.

Collaborator Author:

Right, yes. I was more trying to be clear about where prev_start_date_success comes from, but I see how that's maybe obvious and/or not adding enough additional info to be helpful. How about I drop it?
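
(For context, roughly what that templating looks like in the DAG definition; the last_success kwarg name here is illustrative, and Airflow renders the template at runtime with the timestamp of the last successful run:)

check_for_file_updates = PythonOperator(
    task_id="check_for_file_updates",
    python_callable=INaturalistDataIngester.compare_update_dates,
    op_kwargs={
        # Rendered by Airflow from the DAG run context at runtime.
        "last_success": "{{ prev_start_date_success }}",
    },
)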

Contributor:

Up to you 👍

@obulat (Contributor) left a comment:

I've tested this PR locally, and it works really well for the sample data. The only thing I noticed is that it doesn't add the category, because the changes to the ImageStore were merged into main after this PR was opened.

Would it be possible to rebase this PR onto main so that we can make sure the category is added and there are no other problems?
I didn't update my local branch initially, sorry.
Congratulations on creating a new ingestion process, @rwidom!

Co-authored-by: Olga Bulat <obulat@gmail.com>
@rwidom (Collaborator, Author) commented Aug 19, 2022

Congratulations on creating a new ingestion process, @rwidom!

Thanks so much for all of the support along the way @obulat !

@rwidom rwidom merged commit ee474f2 into main Aug 19, 2022
@rwidom rwidom deleted the issue/439-add-provider-iNaturalist branch August 19, 2022 16:42
@krysal krysal mentioned this pull request Jan 3, 2023