Create a DAG to generate and insert the new Rekognition tags

## Context

The Rekognition dataset we have available is a [JSON lines](https://jsonlines.org/) file where each line is a JSON object with (roughly) the following shape:

```json
{
  "image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
  "response": {
    "Labels": [
      {
        "Name": "Human",
        "Confidence": 99.82632446289062,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Person",
        "Confidence": 99.82632446289062,
        "Instances": [
          {
            "BoundingBox": {
              "Width": 0.219997838139534,
              "Height": 0.46728312969207764,
              "Left": 0.6179072856903076,
              "Top": 0.39997851848602295
            },
            "Confidence": 99.82632446289062
          },
          ...
        ],
        "Parents": []
      },
      {
        "Name": "Crowd",
        "Confidence": 93.41161346435547,
        "Instances": [],
        "Parents": [
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "People",
        "Confidence": 86.95382690429688,
        "Instances": [],
        "Parents": [
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "Game",
        "Confidence": 68.61305236816406,
        "Instances": [],
        "Parents": [
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "Chess",
        "Confidence": 68.61305236816406,
        "Instances": [
          {
            "BoundingBox": {
              "Width": 0.8339029550552368,
              "Height": 0.7898563742637634,
              "Left": 0.08363451808691025,
              "Top": 0.1719469130039215
            },
            "Confidence": 68.61305236816406
          }
        ],
        "Parents": [
          {
            "Name": "Game"
          },
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "Coat",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": [
          {
            "Name": "Clothing"
          }
        ]
      },
      {
        "Name": "Suit",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": [
          {
            "Name": "Overcoat"
          },
          {
            "Name": "Coat"
          },
          {
            "Name": "Clothing"
          }
        ]
      },
      {
        "Name": "Apparel",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Clothing",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Overcoat",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": [
          {
            "Name": "Coat"
          },
          {
            "Name": "Clothing"
          }
        ]
      },
      {
        "Name": "Meal",
        "Confidence": 62.59776306152344,
        "Instances": [],
        "Parents": [
          {
            "Name": "Food"
          }
        ]
      },
      {
        "Name": "Food",
        "Confidence": 62.59776306152344,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Furniture",
        "Confidence": 58.1875,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Tablecloth",
        "Confidence": 57.604129791259766,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Party",
        "Confidence": 57.07652282714844,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Dinner",
        "Confidence": 56.07081985473633,
        "Instances": [],
        "Parents": [
          {
            "Name": "Food"
          }
        ]
      },
      {
        "Name": "Supper",
        "Confidence": 56.07081985473633,
        "Instances": [],
        "Parents": [
          {
            "Name": "Food"
          }
        ]
      }
    ],
    "LabelModelVersion": "2.0",
    "ResponseMetadata": {
      "RequestId": "60c4b6f5-3b73-466e-8fa5-e40037661253",
      "HTTPStatusCode": 200,
      "HTTPHeaders": {
        "content-type": "application/x-amz-json-1.1",
        "date": "Thu, 29 Oct 2020 19:46:02 GMT",
        "x-amzn-requestid": "60c4b6f5-3b73-466e-8fa5-e40037661253",
        "content-length": "3526",
        "connection": "keep-alive"
      },
      "RetryAttempts": 0
    }
  }
}
```

The file is about 200 GB in total. For more information about the data, see [Analysis Explanation](https://docs.openverse.org/projects/proposals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.html#analysis-explanation).
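As a rough illustration of the shape above, the following sketch parses a single line and pulls out the label names and confidence values. The function name `extract_labels` is hypothetical, and it assumes that "top-level labels" means each entry of `response.Labels` (ignoring `Instances` and `Parents`) and that a record without labels has a missing or empty `Labels` list.

```python
import json


def extract_labels(line: str) -> tuple[str, list[tuple[str, float]]]:
    """Parse one JSON line into (image_uuid, [(label name, confidence), ...])."""
    record = json.loads(line)
    # Assumption: records without labels may lack "response" or "Labels" entirely.
    response = record.get("response") or {}
    labels = response.get("Labels") or []
    # Only the name and confidence of each label are kept;
    # bounding boxes ("Instances") and "Parents" are ignored.
    return record["image_uuid"], [
        (label["Name"], label["Confidence"]) for label in labels
    ]


# A trimmed-down record with the same structure as the example above.
sample_line = json.dumps(
    {
        "image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
        "response": {
            "Labels": [
                {"Name": "Human", "Confidence": 99.82632446289062,
                 "Instances": [], "Parents": []},
                {"Name": "Crowd", "Confidence": 93.41161346435547,
                 "Instances": [], "Parents": [{"Name": "Person"}]},
            ]
        },
    }
)
image_uuid, labels = extract_labels(sample_line)
```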

## Description

> [!IMPORTANT]
> A snapshot of the catalog database should be created prior to running this step in production.

We will create a DAG (`add_rekognition_labels`) which will perform the following steps:

1.  Create a temporary table in the catalog for storing the tag data. This table will have two columns: `identifier` and `tags` (with data types matching the existing catalog columns).

2.  Iterate over the large Rekognition dataset in a chunked manner using [`smart_open`](https://github.com/piskvorky/smart_open). `smart_open` provides [options for tuning buffer size](https://github.com/piskvorky/smart_open?tab=readme-ov-file#transport-specific-options) so larger chunks can be read into memory.

    1.  For each line, read in the JSON object and pull out the top-level labels & confidence values. **Note**: some records may not have any labels.

    2.  Construct a `tags` JSON object similar to the existing tags data for that image, including accuracy and provider. Ensure that the casing of the labels is preserved and that the confidence value is between 0.0 and 1.0 (e.g. `[{"name": "Cat", "accuracy": 0.9983, "provider": "rekognition"}, ...]`).

    3.  At regular intervals, insert batches of constructed `identifier`/`tags` pairs into the temporary table.

3.  Launch a [batched update run](https://docs.openverse.org/catalog/reference/DAGs.html#batched-update-dag) which merges the existing tags and the new tags from the temporary table for each identifier[[8]](https://docs.openverse.org/projects/proposals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.html#batch-tag-example). **Note**: the batched update DAG may need to be augmented in order to reference data from an existing table, similar to [#3415](https://github.com/WordPress/openverse/issues/3415 "Use the `batched_update` DAG with stored CSVs to update Catalog URLs").

4.  Delete the temporary table.

For local testing, a small sample of the Rekognition data could be made available in the local S3 server [similar to the iNaturalist sample data](https://github.com/WordPress/openverse/blob/82282a00abdaed21e8381052a874d8ab9a4f7e0a/catalog/compose.yml#L98-L101).
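One possible way to produce such a sample is to copy the first few lines of the full dataset into a smaller file. The sketch below works on any pair of text file objects; the function name `write_sample` and the choice of taking the first lines (rather than a random selection) are assumptions for illustration only.

```python
import io
from itertools import islice
from typing import Iterable, TextIO


def write_sample(source: Iterable[str], dest: TextIO, n_lines: int) -> int:
    """Copy the first `n_lines` lines of `source` into `dest`; return the count."""
    count = 0
    for line in islice(source, n_lines):
        # Make sure every record ends with a newline so the output stays valid JSON lines.
        dest.write(line if line.endswith("\n") else line + "\n")
        count += 1
    return count


# Usage with in-memory files; real usage would open the dataset and a local file instead.
full_data = io.StringIO('{"image_uuid": "a"}\n{"image_uuid": "b"}\n{"image_uuid": "c"}\n')
sample = io.StringIO()
written = write_sample(full_data, sample, n_lines=2)
```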




## Additional context



See [this section](https://docs.openverse.org/projects/proposals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.html#insert-new-rekognition-tags) of the implementation plan (IP).

Issue #4645, opened by AetherUnbound on Jul 22, 2024

Context

Description

Additional context

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Create a DAG to generate and insert the new Rekognition tags #4645

Description

AetherUnboundopenedon Jul 22, 2024

Context

Description

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

AetherUnbound
openedon Jul 22, 2024