Confusion in dataset's data_captions.json

I downloaded the RSTPReid dataset using the provided Google Drive link. However, in my opinion, _data_captions.json_ contains many mistakes.
To specify (since all data are stored as one line, I used _img_path_ for orientation in the data file):

**minor issues** (can be manually filtered)
- Several Unicode codes for characters used in Chinese, Japanese, and Korean (0001_c5_0025.jpg, 0008_c14_0022.jpg, 0165_c7_0009.jpg, ...)
- Fragments of Python(?) code (1064_c7_0005.jpg, 2329_c13_0010.jpg, ...)
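
For anyone who wants to locate the first kind of minor issue automatically, here is a minimal sketch. It assumes the file is a JSON list of records that each have an _img_path_ field and a list of captions under a key named `captions`; the key name and the file path are assumptions, so adjust them to the actual file. It flags any caption containing non-ASCII characters, which will also catch harmless things like curly quotes, so the output still needs a manual look.

```python
import json


def find_non_ascii_captions(records):
    """Return (img_path, caption) pairs whose caption has non-ASCII characters."""
    flagged = []
    for record in records:
        # "captions" is an assumed key name for the list of descriptions.
        for caption in record.get("captions", []):
            if not caption.isascii():
                flagged.append((record["img_path"], caption))
    return flagged


# Usage (the path is an assumption):
# with open("data_captions.json", encoding="utf-8") as f:
#     records = json.load(f)
# for img_path, caption in find_non_ascii_captions(records):
#     print(img_path, caption)
```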

**major issues** (can't be manually filtered)
- Some descriptions are identical for multiple images with different _id_, even though it's clear the text does not match the image. In these cases, the other captions of the images are mostly correct. There is a significant number of these cases.
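
A rough way to list the affected captions is to group identical caption strings and keep those that appear under more than one _id_. The sketch below assumes each record has _id_, _img_path_, and a list of captions under a key named `captions` (the key name is an assumption). It only finds exact duplicates after trimming whitespace, and it cannot tell which of the images, if any, the caption really belongs to.

```python
from collections import defaultdict


def find_shared_captions(records):
    """Map each caption used under more than one id to the image paths using it."""
    ids_by_caption = defaultdict(set)
    paths_by_caption = defaultdict(list)
    for record in records:
        # "captions" is an assumed key name for the list of descriptions.
        for caption in record.get("captions", []):
            key = caption.strip()
            ids_by_caption[key].add(record["id"])
            paths_by_caption[key].append(record["img_path"])
    # Keep only captions that occur for at least two different ids.
    return {
        caption: paths_by_caption[caption]
        for caption, ids in ids_by_caption.items()
        if len(ids) > 1
    }
```

Captions repeated across several images of the same _id_ are not reported, since those may legitimately describe the same person.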

**examples**:

- Text description: _A man with black hair, wearing glasses, a gray and black shirt, black pants and black shoes, carrying a black backpack, is walking, a hand in the pocket_
  - Used for images: 0000_c5_0022.jpg, 0001_c14_0033.jpg, 0001_c1_0014.jpg, 0001_c7_0018.jpg, 0002_c14_0024.jpg, 0002_c1_0003.jpg, 0003_c14_0027.jpg, 0003_c1_0000.jpg, 0005_c14_0030.jpg, i.e. 5 different persons
  - In the images _0002*.jpg_, there is clearly a long-haired woman in blue jeans with a crossbody bag

- Text description: _A woman in a white blouse, black pants and white shoes, hands in pockets, was walking, leading the bag quicklyt_
  - Used for images: 0031_c14_0014.jpg, 0031_c1_0003.jpg, 0031_c5_0009.jpg, 0036_c14_0018.jpg, 0036_c1_0002.jpg, 0036_c5_0016.jpg, 0036_c7_0011.jpg
  - In images _0031*.jpg_, there is a lady fully dressed in black with a colored scarf (for example, the second caption: _A woman with short hair was wearing sunglasses and a colored scarf, a black coat, black trousers and black flat shoes. She walked down the street with her bag in her right hand._)
  - In images _0036*.jpg_, there is a lady fully dressed in black with a blue backpack and a red bag (for example, the second caption: _A woman with curly hair is wearing a black coat, black trousers and black shoes. She is walking in the street with a blue backpack on her back and a red bag in her left hand._)

Has anyone else encountered similar issues with the dataset, particularly captions shared between images with different _id_? If so, it would be great to know how you handled these inconsistencies. Could there be a mistake in how the data was annotated, or is this a known issue?
