Discrepancies in mordred values for the same improve_drug_id across different datasets #321

After combining the `drug_descriptor` DataFrames from all currently available data sets (example code below), 155 different `improve_drug_id` entries resolve to conflicting mordred values, i.e. for the same drug at least one descriptor value differs between data sets.

```python

from pathlib import Path

import coderdata as cd
import pandas as pd

# importing all datasets into a dict
local_path = Path('/tmp/coderdata/data_in_tmp')
data_sets_info = cd.list_datasets(raw=True)
# data_sets_info = {'beataml':0, 'ccle':1 , 'fimm': 2}
data_sets = {}
for data_set in data_sets_info.keys():
    data_sets[data_set] = cd.load(name=data_set, local_path=local_path)

# getting all formatted drug_descriptor tables and adding them into a dict
dfs_to_merge = {}
for data_set in data_sets:
    if (data_sets[data_set].experiments is not None 
        and data_sets[data_set].drug_descriptors is not None
    ):
        df_tmp = data_sets[data_set].format(data_type='drug_descriptor', shape='wide')
        df_tmp = df_tmp.drop(columns=['morgan fingerprint']).add_prefix('mordred.')
        df_tmp['data_set_origin'] = data_set
        dfs_to_merge[data_set] = df_tmp

# concatenating the individual drug_descriptor tables and dropping duplicates based on all columns but `data_set_origin`
concat_drugs = pd.concat(dfs_to_merge.values())
concat_drugs = concat_drugs.drop_duplicates(subset=concat_drugs.columns.difference(['data_set_origin']))

# getting the improve_drug_id[s] that have conflicting mordred values 
counts = concat_drugs.index.value_counts()
print(counts[counts > 1])
```
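To make clear what the last two steps flag, here is a minimal sketch on made-up data that needs no `coderdata` download. The drug IDs, data set names and values are hypothetical. One deliberate variation from the snippet above, which is my own assumption about what is wanted: the index is reset first so that `improve_drug_id` takes part in the duplicate check, because `drop_duplicates` only compares columns and ignores the index.

```python
import pandas as pd

# Hypothetical toy tables standing in for two formatted drug_descriptor
# tables; the index plays the role of improve_drug_id.
ids = pd.Index(["SMI_1", "SMI_2"], name="improve_drug_id")
set_a = pd.DataFrame(
    {"mordred.X": [1.0, 2.0], "data_set_origin": "set_a"}, index=ids
)
set_b = pd.DataFrame(
    {"mordred.X": [1.0, 2.5], "data_set_origin": "set_b"}, index=ids
)

# Move the ID into a column so it is part of the comparison.
toy = pd.concat([set_a, set_b]).reset_index()

# Drop rows that agree on the ID and on every descriptor value.
value_cols = list(toy.columns.difference(["data_set_origin"]))
toy = toy.drop_duplicates(subset=value_cols)

# Any ID that still appears more than once has conflicting values.
counts = toy["improve_drug_id"].value_counts()
conflicting = counts[counts > 1]
print(conflicting)
```

In this toy case only `SMI_2` is reported, since `SMI_1` has the same value in both tables.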

An example of a discrepancy for the `improve_drug_id` `SMI_39390` is shown below (compare, for instance, row 2 with rows 1, 3 and 4 in column 1; every column shown contains at least one differing value):


mordred.AATS0Z | mordred.AATS0are | mordred.AATS0d | mordred.AATS0dv | mordred.AATS0i | mordred.AATS0m | mordred.AATS0p | mordred.AATS0pe | mordred.AATS0s | mordred.AATS0se | ... | mordred.piPC10 | mordred.piPC2 | mordred.piPC3 | mordred.piPC4 | mordred.piPC5 | mordred.piPC6 | mordred.piPC7 | mordred.piPC8 | mordred.piPC9 | data_set_origin
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | beataml
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | ctrpv2
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnst
24.654545454545456 | 6.410538181818179 | 3.618181818181818 | 8.945454545454545 | 163.65524747645867 | 98.16974362090495 | 1.5316283238459452 | 6.46704727272727 | 5.9595959595959584 | 7.861068145454542 | ... | 7.897340761810271 | 4.499809670330265 | 5.0891383555842 | 5.641907070938114 | 6.191595324113119 | 6.442838688784959 | 6.84587987526405 | 7.22501511587432 | 7.667313615851853 | mpnstpdx

Attached are the full list of the 155 drug IDs where discrepancies exist and the full subsetted DataFrame containing only the `improve_drug_id` entries that cause this behaviour. The example table above can be generated by subsetting the attached DataFrame to a single drug ID and running the result through the code snippet below:

```python
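def cols_having_unique(df):
    """Return a copy of df restricted to columns with more than one distinct value."""
    my_cols = []
    for col in df.columns:
        # dropna=False counts NaN as its own value, so NaN vs. number is a difference
        if df[col].nunique(dropna=False) > 1:
            my_cols.append(col)
    return df[my_cols].copy()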
```
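As a usage sketch, the helper can be applied to a single-drug subset like this. The frame below is made-up data, not taken from the attached CSV, and the max-minus-min spread is my own addition for judging how large a disagreement is; it is not part of the original analysis.

```python
import pandas as pd


def cols_having_unique(df):
    # Same logic as the helper above, written as a list comprehension.
    keep = [col for col in df.columns if df[col].nunique(dropna=False) > 1]
    return df[keep].copy()


# Hypothetical rows for one improve_drug_id across three data sets.
one_drug = pd.DataFrame(
    {
        "mordred.A": [1.5, 1.5, 1.5],    # identical everywhere, gets dropped
        "mordred.B": [2.0, 2.25, 2.0],   # differs in one data set, is kept
        "data_set_origin": ["set_a", "set_b", "set_c"],
    },
    index=["SMI_0"] * 3,
)

differing = cols_having_unique(one_drug)
print(differing)

# Size of the disagreement per numeric column (max minus min).
numeric = differing.select_dtypes(include="number")
spread = numeric.max() - numeric.min()
print(spread)
```

A very small spread would point towards floating-point or version differences in how the descriptors were computed, while a large one would suggest that the same ID maps to different input structures; which of these applies to the real data is not something this sketch can tell.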

[morderd-discrepancy-full-list-incl-data-set-origin.csv](https://github.com/user-attachments/files/18649849/morderd-discrepancy-full-list-incl-data-set-origin.csv)

[morderd-discrepancy-improve_drug_id-list.csv](https://github.com/user-attachments/files/18649740/morderd-discrepancy-improve_drug_id-list.csv)
