Missing diagnosis labels in episode*.csv generated by `extract_episode_from_subjects` #101

mistycheney · 2020-09-03T21:42:52Z

This bug can be seen in the two episode*.csv files generated for patient 49037. In both files, no diagnosis column has the label 1, which is clearly wrong.

The cause is in `preprocessing.py`. In the function `extract_diagnosis_labels`, the `ICD9_CODE` column of the input dataframe `diagnoses` has a numerical dtype, so the columns of `labels` are numerical as well. However, the match condition on line 82 compares against the hardcoded list `diagnosis_labels`, which contains strings. As a result, the condition on line 82 is never true, and no diagnosis value is ever set to 1.

This bug affects all episodes that have only numerical diagnosis ICD codes (i.e. no alphanumeric codes like V28492). In these cases pandas automatically infers the dtype as int64 rather than object/str, which causes the bug.
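The dtype inference described above can be reproduced in isolation. This is a minimal sketch, not code from the repository: the two-row CSV contents and the variable names are made up for illustration, and only the `ICUSTAY_ID` and `ICD9_CODE` column names follow the issue.

```python
import io

import pandas as pd

# Hypothetical sample data: one stay with purely numerical codes,
# one stay that also has an alphanumeric code.
numeric_only_csv = io.StringIO("ICUSTAY_ID,ICD9_CODE\n1,4019\n1,25000\n")
mixed_csv = io.StringIO("ICUSTAY_ID,ICD9_CODE\n2,4019\n2,V1582\n")

numeric_only = pd.read_csv(numeric_only_csv)
mixed = pd.read_csv(mixed_csv)

# With only numerical codes, pandas infers an integer dtype for the column.
numeric_is_int = pd.api.types.is_integer_dtype(numeric_only["ICD9_CODE"])
mixed_is_int = pd.api.types.is_integer_dtype(mixed["ICD9_CODE"])
print(numeric_is_int, mixed_is_int)

# A string label never equals an integer code, so the lookup fails silently.
found_in_numeric = "4019" in set(numeric_only["ICD9_CODE"])
found_in_mixed = "4019" in set(mixed["ICD9_CODE"])
print(found_in_numeric, found_in_mixed)
```

The same code `4019` is found in the mixed file but not in the purely numerical one, which matches the observation that only some episodes lose their labels.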

This bug, however, does not seem to affect the labels in the task-specific datasets, which still look correct.

A fix is to add the line `diagnoses['ICD9_CODE'] = diagnoses['ICD9_CODE'].astype(str)` before `diagnoses['VALUE'] = 1`.
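The effect of the cast can be checked with a simplified stand-in for the labelling step. This is a sketch under assumptions, not the repository's implementation: `build_labels`, the three-entry `diagnosis_labels` list, and the sample frame are hypothetical, and the real function may build its label columns differently.

```python
import pandas as pd

# Hypothetical stand-ins: a short label list of strings, and a diagnoses
# frame whose ICD9_CODE column has been inferred as int64.
diagnosis_labels = ["4019", "25000", "V1582"]
diagnoses = pd.DataFrame({"ICUSTAY_ID": [1, 1], "ICD9_CODE": [4019, 25000]})


def build_labels(diagnoses, cast_to_str):
    """Pivot diagnoses into one 0/1 column per entry of diagnosis_labels."""
    diagnoses = diagnoses.copy()
    if cast_to_str:
        # The proposed fix: make the codes strings before pivoting.
        diagnoses["ICD9_CODE"] = diagnoses["ICD9_CODE"].astype(str)
    diagnoses["VALUE"] = 1
    labels = (
        diagnoses[["ICUSTAY_ID", "ICD9_CODE", "VALUE"]]
        .drop_duplicates()
        .pivot(index="ICUSTAY_ID", columns="ICD9_CODE", values="VALUE")
        .fillna(0)
        .astype(int)
    )
    # Keep only the known labels; labels with no matching column become 0.
    return labels.reindex(columns=diagnosis_labels, fill_value=0)


without_fix = build_labels(diagnoses, cast_to_str=False)
with_fix = build_labels(diagnoses, cast_to_str=True)
print(without_fix.loc[1].tolist())
print(with_fix.loc[1].tolist())
```

One caveat with casting after the fact: once a code has been parsed as an integer, any leading zero is already gone, so `astype(str)` cannot restore it. Passing `dtype={'ICD9_CODE': str}` to `pd.read_csv` wherever the file is loaded would avoid that, assuming the loading code can be changed.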


KimballCai · 2021-06-24T09:43:39Z

I have run into this problem too, and it occurs in many episodes.
