inaturalist data quality: handling duplicate photo ids #1460

Open
1 task
rwidom opened this issue Aug 18, 2022 · 0 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@rwidom
Collaborator

rwidom commented Aug 18, 2022

Current Situation

The initial PR for inaturalist left a couple of open questions related to data quality and structure. One of them: if a photo id (the foreign identifier) appears more than once in the source table, it is loaded only once, but there is no logic to optimize data quality in that case.

It's possible that photo ids appear multiple times when the same photo has two species in it and is therefore part of two observations. Here's one example. If that's the only reason for duplicates, it would make sense to combine/aggregate titles and tags across the records, rather than choosing one to load. Approximately 0.1% of photo ids appear more than once in the photos source table.
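As a rough way to re-check the 0.1% figure on a sample, here is a minimal sketch that computes the share of distinct photo ids occurring more than once. The function name and the in-memory list are illustrative assumptions. The real check would more likely run as SQL against the photos source table.

```python
from collections import Counter


def duplicate_share(photo_ids):
    """Return the fraction of distinct photo ids that occur more than once.

    `photo_ids` is any iterable of hashable ids (hypothetical sample data,
    not the real iNaturalist table).
    """
    counts = Counter(photo_ids)
    if not counts:
        return 0.0
    duplicated = sum(1 for n in counts.values() if n > 1)
    return duplicated / len(counts)


# Four distinct ids, one of which (2) appears twice.
share = duplicate_share([1, 2, 2, 3, 4])
```

Note that this counts distinct ids, not rows. If the 0.1% figure was measured per row, the denominator would need to change.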

Suggested Improvement

Analyze duplicate photo ids to see whether we could reliably glean higher-quality data in those cases and, if so, implement that logic. If observation_uuid is the only difference across all photo_id duplicates, we could group by the other attributes and aggregate the taxa accordingly in export_to_json.template.sql.
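The analysis could be prototyped on a sample before touching the SQL template. The sketch below is only an illustration: the row dictionaries, the `license` column, and both function names are assumptions, not the actual iNaturalist schema or catalog code. It first reports which columns vary within groups of rows sharing a photo_id, then merges each group while collecting the varying values into a list.

```python
from collections import defaultdict


def differing_columns(rows, key="photo_id"):
    """Return the columns whose values vary within any group sharing `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    varying = set()
    for members in groups.values():
        if len(members) < 2:
            continue  # not a duplicate group
        for column in members[0]:
            if len({m[column] for m in members}) > 1:
                varying.add(column)
    return varying


def merge_duplicates(rows, key="photo_id", collect=("observation_uuid",)):
    """Keep one record per `key`, gathering the `collect` columns into lists."""
    merged = {}
    for row in rows:
        target = merged.get(row[key])
        if target is None:
            # First time this key is seen: copy the shared attributes.
            target = {k: v for k, v in row.items() if k not in collect}
            for column in collect:
                target[column] = []
            merged[row[key]] = target
        for column in collect:
            if row[column] not in target[column]:
                target[column].append(row[column])
    return list(merged.values())


# Hypothetical sample rows: photo 1 belongs to two observations.
rows = [
    {"photo_id": 1, "observation_uuid": "a", "license": "CC-BY"},
    {"photo_id": 1, "observation_uuid": "b", "license": "CC-BY"},
    {"photo_id": 2, "observation_uuid": "c", "license": "CC0"},
]
varying = differing_columns(rows)
merged = merge_duplicates(rows)
```

If `differing_columns` returns anything beyond observation_uuid on real data, a plain group-by on the remaining attributes would still yield several rows per photo id, so the aggregation would need a rule for those columns as well.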

Benefit

Possibly more complete search results for species that often appear with others in inaturalist photos.

Implementation

  • 🙋 I would be interested in implementing this feature.
@rwidom rwidom added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Aug 18, 2022
@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Aug 19, 2022
@AetherUnbound AetherUnbound added 🟩 priority: low Low priority and doesn't need to be rushed and removed 🟨 priority: medium Not blocking but should be addressed soon labels Oct 19, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
Projects
Status: 📋 Backlog