Create a DAG to generate and insert the new Rekognition tags #4645

Closed

Description

Context

The Rekognition dataset we have available is a JSON lines file where each line is a JSON object with (roughly) the following shape:

{
  "image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
  "response": {
    "Labels": [
      {
        "Name": "Human",
        "Confidence": 99.82632446289062,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Person",
        "Confidence": 99.82632446289062,
        "Instances": [
          {
            "BoundingBox": {
              "Width": 0.219997838139534,
              "Height": 0.46728312969207764,
              "Left": 0.6179072856903076,
              "Top": 0.39997851848602295
            },
            "Confidence": 99.82632446289062
          },
          ...
        ],
        "Parents": []
      },
      {
        "Name": "Crowd",
        "Confidence": 93.41161346435547,
        "Instances": [],
        "Parents": [
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "People",
        "Confidence": 86.95382690429688,
        "Instances": [],
        "Parents": [
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "Game",
        "Confidence": 68.61305236816406,
        "Instances": [],
        "Parents": [
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "Chess",
        "Confidence": 68.61305236816406,
        "Instances": [
          {
            "BoundingBox": {
              "Width": 0.8339029550552368,
              "Height": 0.7898563742637634,
              "Left": 0.08363451808691025,
              "Top": 0.1719469130039215
            },
            "Confidence": 68.61305236816406
          }
        ],
        "Parents": [
          {
            "Name": "Game"
          },
          {
            "Name": "Person"
          }
        ]
      },
      {
        "Name": "Coat",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": [
          {
            "Name": "Clothing"
          }
        ]
      },
      {
        "Name": "Suit",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": [
          {
            "Name": "Overcoat"
          },
          {
            "Name": "Coat"
          },
          {
            "Name": "Clothing"
          }
        ]
      },
      {
        "Name": "Apparel",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Clothing",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Overcoat",
        "Confidence": 68.09342193603516,
        "Instances": [],
        "Parents": [
          {
            "Name": "Coat"
          },
          {
            "Name": "Clothing"
          }
        ]
      },
      {
        "Name": "Meal",
        "Confidence": 62.59776306152344,
        "Instances": [],
        "Parents": [
          {
            "Name": "Food"
          }
        ]
      },
      {
        "Name": "Food",
        "Confidence": 62.59776306152344,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Furniture",
        "Confidence": 58.1875,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Tablecloth",
        "Confidence": 57.604129791259766,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Party",
        "Confidence": 57.07652282714844,
        "Instances": [],
        "Parents": []
      },
      {
        "Name": "Dinner",
        "Confidence": 56.07081985473633,
        "Instances": [],
        "Parents": [
          {
            "Name": "Food"
          }
        ]
      },
      {
        "Name": "Supper",
        "Confidence": 56.07081985473633,
        "Instances": [],
        "Parents": [
          {
            "Name": "Food"
          }
        ]
      }
    ],
    "LabelModelVersion": "2.0",
    "ResponseMetadata": {
      "RequestId": "60c4b6f5-3b73-466e-8fa5-e40037661253",
      "HTTPStatusCode": 200,
      "HTTPHeaders": {
        "content-type": "application/x-amz-json-1.1",
        "date": "Thu, 29 Oct 2020 19:46:02 GMT",
        "x-amzn-requestid": "60c4b6f5-3b73-466e-8fa5-e40037661253",
        "content-length": "3526",
        "connection": "keep-alive"
      },
      "RetryAttempts": 0
    }
  }
}

This file is about 200GB in total. For more information about the data, see Analysis Explanation.
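
As an illustration of the record shape above, here is a minimal sketch of pulling the identifier and the name/confidence pairs out of a single line. The function name `parse_record` is hypothetical, and the sketch assumes that "labels" means the entries directly under `response.Labels`, ignoring `Instances` and `Parents`.

```python
import json


def parse_record(line: str) -> tuple[str, list[tuple[str, float]]]:
    """Return (image_uuid, [(label name, confidence on the 0-100 scale), ...])."""
    record = json.loads(line)
    # Some records may have no labels; treat a missing or null list as empty.
    labels = (record.get("response") or {}).get("Labels") or []
    pairs = [(label["Name"], label["Confidence"]) for label in labels]
    return record["image_uuid"], pairs


# Usage with a trimmed-down record in the same shape as the example above.
sample_line = json.dumps(
    {
        "image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
        "response": {
            "Labels": [
                {"Name": "Human", "Confidence": 99.82632446289062, "Instances": [], "Parents": []},
                {"Name": "Crowd", "Confidence": 93.41161346435547, "Instances": [], "Parents": [{"Name": "Person"}]},
            ]
        },
    }
)
identifier, labels = parse_record(sample_line)
```

Because each line is parsed independently, the 200GB file never has to be held in memory as a whole.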

Description

Important

A snapshot of the catalog database should be created prior to running this step in production.

We will create a DAG (add_rekognition_labels) which will perform the following steps:

  1. Create a temporary table in the catalog for storing the tag data. This table will have two columns: identifier and tags (with data types matching the existing catalog columns).

  2. Iterate over the large Rekognition dataset in a chunked manner using smart_open. smart_open provides options for tuning buffer size so larger chunks can be read into memory.

    1. For each line, read in the JSON object and pull out the top-level labels & confidence values. Note: some records may not have any labels.

    2. Construct a tags JSON object similar to the existing tags data for that image, including accuracy and provider. Ensure that the casing of the labels is preserved and that the confidence value is between 0.0 and 1.0 (e.g. [{"name": "Cat", "accuracy": 0.9983, "provider": "rekognition"}, ...]).

    3. At regular intervals, insert batches of constructed identifier/tags pairs into the temporary table.

  3. Launch a batched update run which merges the existing tags and the new tags from the temporary table for each identifier. Note: the batched update DAG may need to be augmented in order to reference data from an existing table, similar to #3415.

  4. Delete the temporary table.
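
The tag construction in step 2.2, the batching in step 2.3, and the merge in step 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`to_tags`, `batched`, `merge_tags`) are hypothetical, rounding the accuracy to four decimal places is inferred from the `0.9983` example rather than specified, and the merge rule (a union that skips tags whose name and provider already exist) is an assumption. The real merge would most likely be expressed in SQL inside the batched update.

```python
from itertools import islice
from typing import Iterable, Iterator

PROVIDER = "rekognition"


def to_tags(labels: Iterable[tuple[str, float]]) -> list[dict]:
    """Convert (name, confidence 0-100) pairs into catalog-style tag objects."""
    tags = []
    for name, confidence in labels:
        # Rekognition reports a percentage; scale to 0.0-1.0 and clamp.
        accuracy = min(max(confidence / 100, 0.0), 1.0)
        # The label casing is left untouched. Rounding is an assumption.
        tags.append({"name": name, "accuracy": round(accuracy, 4), "provider": PROVIDER})
    return tags


def batched(rows: Iterable, size: int) -> Iterator[list]:
    """Yield lists of at most `size` rows, one list per insert into the temporary table."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def merge_tags(existing: list[dict], new: list[dict]) -> list[dict]:
    """Append new tags unless one with the same (name, provider) already exists (assumed rule)."""
    seen = {(tag.get("name"), tag.get("provider")) for tag in existing}
    merged = list(existing)
    for tag in new:
        key = (tag.get("name"), tag.get("provider"))
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


# Usage: build identifier/tags rows and group them for insertion.
rows = [
    ("960b59e6-63f7-4beb-9cd0-6e3a275991a8", to_tags([("Human", 99.82632446289062)])),
    ("00000000-0000-0000-0000-000000000000", to_tags([])),  # record without labels
]
batches = list(batched(rows, size=1))
```

Keying the merge on both name and provider keeps an existing lowercase tag such as `cat` from another provider alongside a Rekognition `Cat`, which matches the requirement that label casing be preserved.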

For local testing, a small sample of the Rekognition data could be made available in the local S3 server similar to the iNaturalist sample data.
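
One way such a sample could be produced is to copy the first few lines of the dataset into a smaller JSON lines file. The sketch below assumes the source is available as any iterable of lines; `write_sample` and its `limit` parameter are hypothetical names, and uploading the result to the local S3 server is left out.

```python
import io
from itertools import islice
from typing import Iterable, TextIO


def write_sample(lines: Iterable[str], out: TextIO, limit: int = 100) -> int:
    """Copy at most `limit` lines to `out`, one JSON object per line; return the count."""
    count = 0
    for line in islice(lines, limit):
        # Normalize line endings so the sample stays valid JSON lines.
        out.write(line.rstrip("\n") + "\n")
        count += 1
    return count


# Usage with in-memory data; a real run would pass open file handles instead.
source = iter(['{"image_uuid": "a"}\n', '{"image_uuid": "b"}\n', '{"image_uuid": "c"}\n'])
sample = io.StringIO()
written = write_sample(source, sample, limit=2)
```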

Additional context

See this section of the implementation plan (IP).
