Description
Context
The Rekognition dataset we have available is a JSON Lines file, where each line is a JSON object with (roughly) the following shape:
{
"image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
"response": {
"Labels": [
{
"Name": "Human",
"Confidence": 99.82632446289062,
"Instances": [],
"Parents": []
},
{
"Name": "Person",
"Confidence": 99.82632446289062,
"Instances": [
{
"BoundingBox": {
"Width": 0.219997838139534,
"Height": 0.46728312969207764,
"Left": 0.6179072856903076,
"Top": 0.39997851848602295
},
"Confidence": 99.82632446289062
},
...
],
"Parents": []
},
{
"Name": "Crowd",
"Confidence": 93.41161346435547,
"Instances": [],
"Parents": [
{
"Name": "Person"
}
]
},
{
"Name": "People",
"Confidence": 86.95382690429688,
"Instances": [],
"Parents": [
{
"Name": "Person"
}
]
},
{
"Name": "Game",
"Confidence": 68.61305236816406,
"Instances": [],
"Parents": [
{
"Name": "Person"
}
]
},
{
"Name": "Chess",
"Confidence": 68.61305236816406,
"Instances": [
{
"BoundingBox": {
"Width": 0.8339029550552368,
"Height": 0.7898563742637634,
"Left": 0.08363451808691025,
"Top": 0.1719469130039215
},
"Confidence": 68.61305236816406
}
],
"Parents": [
{
"Name": "Game"
},
{
"Name": "Person"
}
]
},
{
"Name": "Coat",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": [
{
"Name": "Clothing"
}
]
},
{
"Name": "Suit",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": [
{
"Name": "Overcoat"
},
{
"Name": "Coat"
},
{
"Name": "Clothing"
}
]
},
{
"Name": "Apparel",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": []
},
{
"Name": "Clothing",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": []
},
{
"Name": "Overcoat",
"Confidence": 68.09342193603516,
"Instances": [],
"Parents": [
{
"Name": "Coat"
},
{
"Name": "Clothing"
}
]
},
{
"Name": "Meal",
"Confidence": 62.59776306152344,
"Instances": [],
"Parents": [
{
"Name": "Food"
}
]
},
{
"Name": "Food",
"Confidence": 62.59776306152344,
"Instances": [],
"Parents": []
},
{
"Name": "Furniture",
"Confidence": 58.1875,
"Instances": [],
"Parents": []
},
{
"Name": "Tablecloth",
"Confidence": 57.604129791259766,
"Instances": [],
"Parents": []
},
{
"Name": "Party",
"Confidence": 57.07652282714844,
"Instances": [],
"Parents": []
},
{
"Name": "Dinner",
"Confidence": 56.07081985473633,
"Instances": [],
"Parents": [
{
"Name": "Food"
}
]
},
{
"Name": "Supper",
"Confidence": 56.07081985473633,
"Instances": [],
"Parents": [
{
"Name": "Food"
}
]
}
],
"LabelModelVersion": "2.0",
"ResponseMetadata": {
"RequestId": "60c4b6f5-3b73-466e-8fa5-e40037661253",
"HTTPStatusCode": 200,
"HTTPHeaders": {
"content-type": "application/x-amz-json-1.1",
"date": "Thu, 29 Oct 2020 19:46:02 GMT",
"x-amzn-requestid": "60c4b6f5-3b73-466e-8fa5-e40037661253",
"content-length": "3526",
"connection": "keep-alive"
},
"RetryAttempts": 0
}
}
}
This file is about 200GB in total. For more information about the data, see Analysis Explanation.
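As a rough illustration of the shape above, the following sketch pulls the top-level label names and confidence values out of a single line. The helper name `extract_labels` is hypothetical, and treating a missing or empty `Labels` list as "no labels" is an assumption about how such records appear in the data.

```python
import json


def extract_labels(line: str) -> tuple[str, list[tuple[str, float]]]:
    """Return (image_uuid, [(label name, confidence in percent), ...]) for one line.

    Hypothetical helper: only the top-level Name and Confidence of each label
    are kept; Instances and Parents are ignored.
    """
    record = json.loads(line)
    # Assumption: a record without labels omits "Labels" or has an empty list.
    labels = (record.get("response") or {}).get("Labels") or []
    return record["image_uuid"], [(label["Name"], label["Confidence"]) for label in labels]


# A trimmed-down record in the same shape as the sample above.
sample = json.dumps(
    {
        "image_uuid": "960b59e6-63f7-4beb-9cd0-6e3a275991a8",
        "response": {
            "Labels": [
                {"Name": "Human", "Confidence": 99.82632446289062, "Instances": [], "Parents": []},
                {"Name": "Crowd", "Confidence": 93.41161346435547, "Instances": [], "Parents": [{"Name": "Person"}]},
            ]
        },
    }
)
uuid, labels = extract_labels(sample)
```

Note that the confidence values are still percentages at this point; they are not yet on the 0.0 to 1.0 scale used by the catalog's tags.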
Description
Important
A snapshot of the catalog database should be created prior to running this step in production.
We will create a DAG (add_rekognition_labels) which will perform the following steps:
- Create a temporary table in the catalog for storing the tag data. This table will have two columns: identifier and tags (with data types matching the existing catalog columns).
- Iterate over the large Rekognition dataset in a chunked manner using smart_open. smart_open provides options for tuning buffer size so larger chunks can be read into memory.
  - For each line, read in the JSON object and pull out the top-level labels & confidence values. Note: some records may not have any labels.
  - Construct a tags JSON object similar to the existing tags data for that image, including accuracy and provider. Ensure that the casing of the labels is preserved and that the confidence value is between 0.0 and 1.0 (e.g. [{"name": "Cat", "accuracy": 0.9983, "provider": "rekognition"}, ...]).
  - At regular intervals, insert batches of constructed identifier/tags pairs into the temporary table.
- Launch a batched update run which merges the existing tags and the new tags from the temporary table for each identifier[8]. Note: the batched update DAG may need to be augmented in order to reference data from an existing table, similar to #3415.
- Delete the temporary table.
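The per-line steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the DAG's implementation: `build_tags`, `iter_batches`, `PROVIDER`, and the default batch size are hypothetical names and values, and skipping records that have no labels is an assumption. The input is any iterable of JSON lines, so the sketch runs without S3 access.

```python
import json
from typing import Iterable, Iterator

PROVIDER = "rekognition"  # provider value taken from the example tag above


def build_tags(labels: list[dict]) -> list[dict]:
    """Convert Rekognition labels into tag objects, keeping label casing as-is."""
    return [
        {
            "name": label["Name"],
            # Rekognition reports confidence as a percentage; scale it to 0.0-1.0.
            "accuracy": label["Confidence"] / 100,
            "provider": PROVIDER,
        }
        for label in labels
    ]


def iter_batches(
    lines: Iterable[str], batch_size: int = 1000
) -> Iterator[list[tuple[str, str]]]:
    """Yield lists of (identifier, tags JSON) pairs, at most batch_size each."""
    batch: list[tuple[str, str]] = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        labels = (record.get("response") or {}).get("Labels") or []
        if not labels:
            # Assumption: records without labels are skipped entirely.
            continue
        batch.append((record["image_uuid"], json.dumps(build_tags(labels))))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


# Tiny in-memory stand-in for the real dataset.
lines = [
    json.dumps({"image_uuid": "a", "response": {"Labels": [{"Name": "Cat", "Confidence": 50.0}]}}),
    json.dumps({"image_uuid": "b", "response": {"Labels": []}}),
    json.dumps({"image_uuid": "c", "response": {"Labels": [{"Name": "Dog", "Confidence": 25.0}]}}),
]
batches = list(iter_batches(lines, batch_size=1))
```

In the DAG, `lines` would presumably be the file object returned by `smart_open.open` for the dataset, and each yielded batch would be written to the temporary table in a single insert. Keeping the reader and the batching separate like this also makes it straightforward to test against a small local sample.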
For local testing, a small sample of the Rekognition data could be made available in the local S3 server similar to the iNaturalist sample data.
Additional context
See this section of the IP.