Description
After combining the `drug_descriptor` DataFrames from every data set currently available (the code I used is below), there are 155 different `improve_drug_id` entries that resolve to conflicting mordred values, i.e. for each of these IDs at least one descriptor value differs between rows.

    from pathlib import Path

    import coderdata as cd
    import pandas as pd

    # importing all datasets into a dict
    local_path = Path('/tmp/coderdata/data_in_tmp')
    data_sets_info = cd.list_datasets(raw=True)
    # data_sets_info = {'beataml':0, 'ccle':1 , 'fimm': 2}
    data_sets = {}
    for data_set in data_sets_info.keys():
        data_sets[data_set] = cd.load(name=data_set, local_path=local_path)

    # getting all formatted drug_descriptor tables and adding them into a dict
    dfs_to_merge = {}
    for data_set in data_sets:
        if (data_sets[data_set].experiments is not None
                and data_sets[data_set].drug_descriptors is not None
        ):
            df_tmp = data_sets[data_set].format(data_type='drug_descriptor', shape='wide')
            df_tmp = df_tmp.drop(columns=['morgan fingerprint']).add_prefix('mordred.')
            df_tmp['data_set_origin'] = data_set
            dfs_to_merge[data_set] = df_tmp

    # concatenating the individual drug_descriptor tables and dropping duplicates
    # based on all columns but `data_set_origin`
    concat_drugs = pd.concat(dfs_to_merge.values())
    concat_drugs = concat_drugs.drop_duplicates(subset=concat_drugs.columns.difference(['data_set_origin']))

    # getting the improve_drug_id[s] that have conflicting mordred values
    counts = concat_drugs.index.value_counts()
    print(counts[counts > 1])

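For anyone who wants to check the detection logic without downloading the data, here is a minimal sketch of the same concat / `drop_duplicates` / `value_counts` steps on synthetic tables. The IDs, column names and values (`SMI_1`, `mordred.x`, `set_a`, ...) are made up for illustration and do not come from coderdata.

```python
import pandas as pd

# Synthetic stand-ins for two formatted drug_descriptor tables (hypothetical values).
ids = pd.Index(["SMI_1", "SMI_2"], name="improve_drug_id")
df_a = pd.DataFrame({"mordred.x": [1.0, 2.0], "mordred.y": [3.0, 4.0]}, index=ids)
df_b = pd.DataFrame({"mordred.x": [1.0, 2.5], "mordred.y": [3.0, 4.0]}, index=ids)
df_a["data_set_origin"] = "set_a"
df_b["data_set_origin"] = "set_b"

# Same steps as in the issue: concatenate, drop rows whose descriptor values
# are identical, then count how often each ID is still present.
toy = pd.concat([df_a, df_b])
value_cols = list(toy.columns.difference(["data_set_origin"]))
toy = toy.drop_duplicates(subset=value_cols)
toy_counts = toy.index.value_counts()

# IDs that survive more than once have at least one differing value.
conflicting_ids = toy_counts[toy_counts > 1].index.tolist()
print(conflicting_ids)
```

Here only `SMI_2` is reported, because its `mordred.x` value differs between the two toy tables. One caveat: `drop_duplicates` ignores the index, so two different IDs with identical descriptor rows are collapsed as well. If that matters, calling `reset_index()` first and including `improve_drug_id` in `subset` may be the safer variant.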
An example of a discrepancy for the `improve_drug_id` `SMI_39390` looks like this (compare, for example, row 2 with rows 1, 3 and 4 in column 1; every column has at least one value that is different):
mordred.AATS0Z | mordred.AATS0are | mordred.AATS0d | mordred.AATS0dv | mordred.AATS0i | mordred.AATS0m | mordred.AATS0p | mordred.AATS0pe | mordred.AATS0s | mordred.AATS0se | ... | mordred.piPC10 | mordred.piPC2 | mordred.piPC3 | mordred.piPC4 | mordred.piPC5 | mordred.piPC6 | mordred.piPC7 | mordred.piPC8 | mordred.piPC9 | data_set_origin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | beataml |
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | ctrpv2 |
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnst |
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnstpdx |
Attached are the full list of the 155 drug IDs where discrepancies exist and the full subsetted DataFrame containing only the `improve_drug_id` entries that are causing this behaviour. The example table above can be generated by subsetting the attached DataFrame to a single drug ID and running the result through the code snippet below:
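
    def cols_having_unique(df):
        # keep only the columns that contain more than one distinct value
        # (NaN counts as a value because of dropna=False)
        my_cols = []
        for col in df.columns:
            if df[col].nunique(dropna=False) > 1:
                my_cols.append(col)
        return df[my_cols].copy()
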
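Since the differing values can look identical at display precision, it may help to check how large the differences actually are. The following is only a sketch, not part of the original analysis: the helper name `summarize_spread`, the tolerance `rtol=1e-9` and the toy data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def summarize_spread(df, rtol=1e-9):
    """Per numeric column: max - min, and whether all values agree within rtol."""
    num = df.select_dtypes(include="number")
    spread = num.max() - num.min()
    # A column "agrees" when every value is close to the column's first value.
    within_tol = num.apply(
        lambda col: bool(
            np.isclose(col, col.iloc[0], rtol=rtol, atol=0.0, equal_nan=True).all()
        )
    )
    return pd.DataFrame({"spread": spread, "within_tol": within_tol})

# Hypothetical rows for a single drug ID coming from three data sets.
one_drug = pd.DataFrame({
    "mordred.a": [1.0, 1.0 + 1e-15, 1.0],   # differs only by floating-point noise
    "mordred.b": [2.0, 2.5, 2.0],           # a genuine discrepancy
    "data_set_origin": ["s1", "s2", "s3"],
})
summary = summarize_spread(one_drug)
print(summary)
```

Columns flagged `within_tol == True` would point towards rounding or serialization differences, while the others would indicate that the descriptors were really computed differently. The appropriate tolerance depends on the data and is not something the issue specifies.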