discrepancies in mordred values for the same improve_drug_id across different datasets #321

Open
@ymahlich

Description

After combining the drug_descriptor DataFrames from every currently available dataset (example code below), 155 distinct improve_drug_id entries resolve to conflicting mordred values, i.e. for the same ID at least one descriptor value differs between datasets.

from pathlib import Path

import coderdata as cd
import pandas as pd

# importing all datasets into a dict
local_path = Path('/tmp/coderdata/data_in_tmp')
data_sets_info = cd.list_datasets(raw=True)
# data_sets_info = {'beataml':0, 'ccle':1 , 'fimm': 2}
data_sets = {}
for data_set in data_sets_info.keys():
    data_sets[data_set] = cd.load(name=data_set, local_path=local_path)

# getting all formatted drug_descriptor tables and adding them into a dict
dfs_to_merge = {}
for data_set in data_sets:
    if (data_sets[data_set].experiments is not None 
        and data_sets[data_set].drug_descriptors is not None
    ):
        df_tmp = data_sets[data_set].format(data_type='drug_descriptor', shape='wide')
        df_tmp = df_tmp.drop(columns=['morgan fingerprint']).add_prefix('mordred.')
        df_tmp['data_set_origin'] = data_set
        dfs_to_merge[data_set] = df_tmp

# concatenating the individual drug_descriptor tables and dropping duplicates based on all columns but `data_set_origin`
concat_drugs = pd.concat(dfs_to_merge.values())
concat_drugs = concat_drugs.drop_duplicates(subset=concat_drugs.columns.difference(['data_set_origin']))

# getting the improve_drug_id[s] that have conflicting mordred values 
counts = concat_drugs.index.value_counts()
print(counts[counts > 1])
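
The detection logic above can be tried without downloading any data. The following is a minimal sketch on made-up values: the dataset names `set_a`/`set_b`, the IDs `SMI_1`/`SMI_2`, and the columns `mordred.x`/`mordred.y` are all hypothetical and only illustrate how concatenating, dropping duplicates on the value columns, and counting index occurrences surfaces IDs with conflicting rows.

```python
import pandas as pd

# Hypothetical toy data: two made-up datasets sharing the same two drug IDs.
ids = pd.Index(["SMI_1", "SMI_2"], name="improve_drug_id")
df_a = pd.DataFrame({"mordred.x": [1.0, 2.0], "mordred.y": [0.5, 0.7]}, index=ids)
df_b = pd.DataFrame({"mordred.x": [1.0, 2.5], "mordred.y": [0.5, 0.7]}, index=ids)
df_a["data_set_origin"] = "set_a"
df_b["data_set_origin"] = "set_b"

# Same steps as in the issue: concatenate, then drop rows whose descriptor
# values are identical (ignoring the origin column).
concat = pd.concat([df_a, df_b])
value_cols = concat.columns.difference(["data_set_origin"])
deduped = concat.drop_duplicates(subset=value_cols)

# Any ID still appearing more than once has at least two distinct value rows.
counts = deduped.index.value_counts()
conflicting_ids = sorted(counts[counts > 1].index)
print(conflicting_ids)
```

In this toy case only `SMI_2` is reported, because its `mordred.x` value differs between the two made-up datasets while `SMI_1` is identical in both.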

An example of a discrepancy for the improve_drug_id SMI_39390 looks like this (for example, compare row 2 with rows 1, 3 and 4 in column 1; every column has at least one value that differs):

mordred.AATS0Z mordred.AATS0are mordred.AATS0d mordred.AATS0dv mordred.AATS0i mordred.AATS0m mordred.AATS0p mordred.AATS0pe mordred.AATS0s mordred.AATS0se ... mordred.piPC10 mordred.piPC2 mordred.piPC3 mordred.piPC4 mordred.piPC5 mordred.piPC6 mordred.piPC7 mordred.piPC8 mordred.piPC9 data_set_origin
24.654545454545456 6.410538181818179 3.618181818181818 8.945454545454545 163.65524747645867 98.16974362090495 1.5316283238459452 6.46704727272727 5.9595959595959584 7.861068145454542 ... 7.897340761810271 4.499809670330265 5.0891383555842 5.641907070938114 6.191595324113119 6.442838688784959 6.84587987526405 7.22501511587432 7.667313615851853 beataml
24.654545454545456 6.410538181818179 3.618181818181818 8.945454545454545 163.65524747645867 98.16974362090495 1.5316283238459452 6.46704727272727 5.9595959595959584 7.861068145454542 ... 7.897340761810271 4.499809670330265 5.0891383555842 5.641907070938114 6.191595324113119 6.442838688784959 6.84587987526405 7.22501511587432 7.667313615851853 ctrpv2
24.654545454545456 6.410538181818179 3.618181818181818 8.945454545454545 163.65524747645867 98.16974362090495 1.5316283238459452 6.46704727272727 5.9595959595959584 7.861068145454542 ... 7.897340761810271 4.499809670330265 5.0891383555842 5.641907070938114 6.191595324113119 6.442838688784959 6.84587987526405 7.22501511587432 7.667313615851853 mpnst
24.654545454545456 6.410538181818179 3.618181818181818 8.945454545454545 163.65524747645867 98.16974362090495 1.5316283238459452 6.46704727272727 5.9595959595959584 7.861068145454542 ... 7.897340761810271 4.499809670330265 5.0891383555842 5.641907070938114 6.191595324113119 6.442838688784959 6.84587987526405 7.22501511587432 7.667313615851853 mpnstpdx

Attached are the full list of the 155 drug IDs for which discrepancies exist and the full subsetted DataFrame containing only the improve_drug_id entries that cause this behaviour. The example table above can be generated by subsetting the attached DataFrame to a single drug ID and running the result through the code snippet below:

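def cols_having_unique(df):
    """Return a copy of df restricted to columns with more than one distinct value."""
    my_cols = []
    for col in df.columns:
        # dropna=False counts NaN as a value of its own
        if df[col].nunique(dropna=False) > 1:
            my_cols.append(col)
    return df[my_cols].copy()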
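
As a rough usage sketch, the helper can be applied to a single drug ID as described above. The DataFrame below is a hypothetical stand-in for the attached CSV (made-up IDs, columns, and values); with the real attachment one would presumably load the file with `pd.read_csv` and the appropriate index column instead.

```python
import pandas as pd


def cols_having_unique(df):
    # Same helper as in the issue: keep columns with more than one distinct value.
    my_cols = []
    for col in df.columns:
        if df[col].nunique(dropna=False) > 1:
            my_cols.append(col)
    return df[my_cols].copy()


# Hypothetical stand-in for the attached discrepancy table.
full = pd.DataFrame(
    {
        "mordred.x": [1.0, 1.0, 3.0, 3.5],
        "mordred.y": [0.5, 0.6, 0.7, 0.7],
        "data_set_origin": ["set_a", "set_b", "set_a", "set_b"],
    },
    index=pd.Index(["SMI_1", "SMI_1", "SMI_2", "SMI_2"], name="improve_drug_id"),
)

# Subset to one drug ID (list indexing keeps all rows with that ID),
# then keep only the columns that differ between its rows.
one_drug = full.loc[["SMI_1"]]
differing = cols_having_unique(one_drug)
print(list(differing.columns))
```

Because `data_set_origin` differs by construction, it always survives the filter, which is convenient for seeing which dataset each conflicting row came from.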

morderd-discrepancy-full-list-incl-data-set-origin.csv

morderd-discrepancy-improve_drug_id-list.csv

Labels: invalid
