inaturalist data quality: handling duplicate photo ids #1460
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟩 priority: low
Low priority and doesn't need to be rushed
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Current Situation
The initial PR for iNaturalist left a couple of open questions related to data quality and structure. One of them: if a photo id (the foreign identifier) appears more than once in the source table, we load it only once, but there is no logic to pick or build the highest-quality record in that case.
It's possible that photo ids appear multiple times when the same photo has two species in it and is therefore part of two observations. Here's one example. If that's the only reason for duplicates, it would make sense to combine/aggregate titles and tags across the records, rather than choosing one to load. Approximately 0.1% of photo ids appear more than once in the photos source table.
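As a rough illustration of how the duplicate rate could be measured, here is a minimal Python sketch. The sample rows and the helper name `duplicate_share` are hypothetical, not taken from the catalog code; a real analysis would run against the photos source table.

```python
from collections import Counter

# Hypothetical sample of (photo_id, observation_uuid) rows; real data would
# come from the photos source table.
rows = [
    (101, "obs-a"),
    (102, "obs-b"),
    (102, "obs-c"),  # same photo attached to a second observation
    (103, "obs-d"),
]


def duplicate_share(rows):
    """Return the fraction of distinct photo ids that appear more than once."""
    counts = Counter(photo_id for photo_id, _ in rows)
    if not counts:
        return 0.0
    duplicated = sum(1 for n in counts.values() if n > 1)
    return duplicated / len(counts)


print(f"{duplicate_share(rows):.1%} of photo ids appear more than once")
```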
Suggested Improvement
Analyze duplicate photo ids to see whether we could reliably glean higher quality data in those cases, and if so, implement that. If `observation_uuid` is the only difference across all `photo_id` duplicates, we could group by the other attributes and aggregate the taxa accordingly in `export_to_json.template.sql`.
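The check and the aggregation could be prototyped outside SQL first. The sketch below is only an illustration under stated assumptions: the `license` and `taxon` fields, the `VARYING` set, and the function `aggregate_by_photo` are hypothetical, and it assumes the taxon is derived from the observation and so varies with `observation_uuid`. Photo ids whose other attributes disagree are reported rather than merged.

```python
from collections import defaultdict

# Hypothetical records; only photo_id and observation_uuid are named in the
# issue, the other fields are assumptions for illustration.
records = [
    {"photo_id": 102, "observation_uuid": "obs-b", "license": "CC-BY", "taxon": "Apis mellifera"},
    {"photo_id": 102, "observation_uuid": "obs-c", "license": "CC-BY", "taxon": "Trifolium repens"},
    {"photo_id": 103, "observation_uuid": "obs-d", "license": "CC0", "taxon": "Bombus terrestris"},
]

# Fields expected to differ between duplicates of the same photo (assumption).
VARYING = {"observation_uuid", "taxon"}


def aggregate_by_photo(records):
    """Merge duplicates per photo_id; return (merged rows, conflicting photo ids)."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["photo_id"]].append(rec)

    merged, conflicts = [], []
    for photo_id, recs in grouped.items():
        # Compare everything except the fields that are allowed to vary.
        stable = {
            tuple(sorted((k, v) for k, v in r.items() if k not in VARYING))
            for r in recs
        }
        if len(stable) > 1:
            # Something other than the observation differs: do not merge.
            conflicts.append(photo_id)
            continue
        row = dict(stable.pop())
        row["taxa"] = sorted({r["taxon"] for r in recs})
        merged.append(row)
    return merged, conflicts


merged, conflicts = aggregate_by_photo(records)
```

If `conflicts` stays empty on real data, that would support doing the same grouping in the export query; a non-empty list would point at duplicates that need a closer look.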
Benefit
Possibly more complete search results for species that often appear alongside others in iNaturalist photos.
Implementation