Skip to content

Commit

Permalink
Merge pull request #37 from MoTrPAC/fix/metab-flat
Browse files Browse the repository at this point in the history
fix shuffled sample IDs
  • Loading branch information
nicolerg authored Dec 14, 2022
2 parents 3976694 + 428d2c7 commit 2b98720
Show file tree
Hide file tree
Showing 9 changed files with 100 additions and 57 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ Type: Package
Package: MotrpacRatTraining6moData
Title: Data for analysis of the MoTrPAC endurance exercise training study
in 6-month-old rats
Version: 1.6.0
Version: 1.6.1
Authors@R: c(
person("Nicole", "Gay", , "nicole.r.gay@gmail.com", role = c("cre", "aut")),
person("Pierre", "Jean Beltran", , "pjeanbeltran@gmail.com", role = "aut"),
Expand Down
7 changes: 7 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,10 @@
# MotrpacRatTraining6moData 1.6.1 (2022-12-13)

* Fix shuffled PIDs in `METAB_NORM_DATA_FLAT` and `IMMUNO_NORM_DATA_FLAT` introduced by
not reordering the list returned by `MotrpacRatTraining6mo::viallabel_to_pid()`.
`METAB_NORM_DATA_NESTED` and `IMMUNO_NORM_DATA_NESTED` are unchanged.
* Regenerate `TRAINING_REGULATED_NORM_DATA` and `TRAINING_REGULATED_NORM_DATA_NO_OUTLIERS` to account for changes above.

# MotrpacRatTraining6moData 1.6.0 (2022-11-08)

* Add `TRAINING_REGULATED_NORM_DATA` and `TRAINING_REGULATED_NORM_DATA_NO_OUTLIERS`
Expand Down
13 changes: 11 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -146,13 +146,22 @@ head(TRNSCRPT_LIVER_DA)
Due to file size, only normalized sample-level data and differential analysis results
corresponding to **training-regulated features** (5% IHW FDR) are contained in this package
for chromatin accessibility (ATAC) and DNA methylation (METHYL). The full sets of epigenetic results
may be downloaded through the following public URLs.
may be downloaded either with R functions in the [MotrpacRatTraining6mo package](https://motrpac.github.io/MotrpacRatTraining6mo/)
or through the following public URLs.

**Note that clicking on a [Link]() automatically starts a download.**
To instead copy the URL for an object, right-click the [Link]() and select `Copy Link Address`.

To download and load epigenetic data within R, use one of the following functions in the
[MotrpacRatTraining6mo package](https://motrpac.github.io/MotrpacRatTraining6mo/):

* [load_sample_data()](https://motrpac.github.io/MotrpacRatTraining6mo/reference/load_sample_data.html): For sample-level data from a single tissue and ome.
* [combine_normalized_data()](https://motrpac.github.io/MotrpacRatTraining6mo/reference/combine_normalized_data.html): For sample-level data from multiple tissues or omes. Use `include_epigen = TRUE`.
* [combine_da_results()](https://motrpac.github.io/MotrpacRatTraining6mo/reference/combine_da_results.html): For differential analysis results from multiple tissues or omes. Use `include_epigen = TRUE`.
* Several other functions specifically for loading epigenetic data are documented [here](https://motrpac.github.io/MotrpacRatTraining6mo/reference/index.html#load-epigenetic-data).

Note that the size in this table is the compressed size. Each object occupies several times more
memory when loaded into R. The total compressed size for all of these objects is 8.68 GiB (~9.32 GB).
memory when loaded into R. The total compressed size for all of these objects is 8.68 GiB (~9.32 GB).

Type|Assay|Tissue|Object|Size (MiB)|Click to download|
---|---|---|---|---|---|
Expand Down
Binary file modified data/IMMUNO_NORM_DATA_FLAT.rda
Binary file not shown.
Binary file modified data/METAB_NORM_DATA_FLAT.rda
Binary file not shown.
Binary file modified data/TRAINING_REGULATED_NORM_DATA.rda
Binary file not shown.
Binary file modified data/TRAINING_REGULATED_NORM_DATA_NO_OUTLIERS.rda
Binary file not shown.
51 changes: 46 additions & 5 deletions inst/scripts/save_sample_level_data.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,8 @@ gitdir = "nicolerg/src/MOTRPAC/"
scratch = 'nicolerg/tmp'
datadir = 'motrpac/internal_releases/motrpac-data-freeze-pass/v1.1/extracted_sample_level_data'
gsutil = 'gsutil'
source(sprintf("%s/motrpac-mawg/pass1b-06/tools/get_fx.R", gitdir))
source(sprintf("%s/motrpac-mawg/pass1b-06/integrative/clustering/cluster_viz_fx.R", gitdir))
#source(sprintf("%s/motrpac-mawg/pass1b-06/tools/get_fx.R", gitdir))
#source(sprintf("%s/motrpac-mawg/pass1b-06/integrative/clustering/cluster_viz_fx.R", gitdir))
#load_all("../../R")
```

Expand Down Expand Up @@ -195,6 +195,8 @@ combine_normalized_data = function(gsutil_path,
return(norm)
}
# note this is now a function in MotrpacRatTraining6mo
viallabel_to_pid = function(viallabels){
pheno = as.data.table(MotrpacRatTraining6moData::PHENO)
pheno = unique(pheno[,.(viallabel, pid)])
Expand Down Expand Up @@ -374,9 +376,43 @@ usethis::use_data(IMMUNO_NORM_DATA_NESTED, overwrite=T)

#### Single normalized data table

This next chunk was added at a much later date (12/13/22) to troubleshoot an issue.
Note that this code no longer reveals any issues because the RData have been updated.
See: https://github.com/MoTrPAC/MotrpacRatTraining6moData/issues/36
```{r check for error, eval=F}
# update as of 12/13/22
# it is likely that viallabels got shuffled given the use of viallabel_to_pid()
# spot-check a single dataset
load("../../data/IMMUNO_NORM_DATA_NESTED.rda")
load("../../data/IMMUNO_NORM_DATA_FLAT.rda")
# format nested data
data_nested = IMMUNO_NORM_DATA_NESTED$`rat-mag27plex`$LIVER
rownames(data_nested) = data_nested$viallabel
data_nested$viallabel = NULL
data_nested = as.data.frame(t(data_nested))
# format flat data
data_flat = IMMUNO_NORM_DATA_FLAT[IMMUNO_NORM_DATA_FLAT$dataset == "rat-mag27plex" & IMMUNO_NORM_DATA_FLAT$tissue == "LIVER",]
rownames(data_flat) = data_flat$feature_ID
data_flat[,c("feature","tissue","assay","dataset","feature_ID")] = NULL
# remove all-NA cols
for(c in colnames(data_flat)){
if(all(is.na(data_flat[,c]))){
data_flat[,c] = NULL
}
}
dim(data_flat) == dim(data_nested)
table(data_flat == data_nested)
# so values are the same, but do samples match?
expected = MotrpacRatTraining6mo::viallabel_to_pid(colnames(data_nested))
expected = expected[colnames(data_nested)]
table(as.character(expected) == colnames(data_flat))
# as expected, they do not. so we need to fix the code and regenerate IMMUNO_NORM_DATA_FLAT
```

Because there are multiple viallabels per animal per tissue for IMMUNO data,
make a single normalized data table where column names are PID instead of viallabel.

```{r format IMMUNO for viz}
load("../../data/IMMUNO_NORM_DATA_NESTED.rda")
load("../../data/REPEATED_FEATURES.rda")
Expand All @@ -399,8 +435,11 @@ for(panel in names(IMMUNO_NORM_DATA_NESTED)){
new_features = sprintf("IMMUNO;%s;%s:%s", tissue, panel, feature_ID)
features[new_features %in% rep[,new_feature]] = new_features[new_features %in% rep[,new_feature]]
}
# replace colname with pid
colnames(df_t) = viallabel_to_pid(colnames(df_t))
# replace colnames with pid
vl_to_pid = MotrpacRatTraining6mo::viallabel_to_pid(colnames(df_t))
vl_to_pid = vl_to_pid[colnames(df_t)]
stopifnot(all(colnames(df_t) == names(vl_to_pid)))
colnames(df_t) = unname(vl_to_pid)
# make data.table
dt = as.data.table(cbind(feature=features,
feature_ID=feature_ID,
Expand All @@ -423,6 +462,8 @@ immuno_data[!feature %in% differential_features, feature := NA]
# save
IMMUNO_NORM_DATA_FLAT = as.data.frame(immuno_data)
usethis::use_data(IMMUNO_NORM_DATA_FLAT, overwrite=T)
# recompress
tools::resaveRdaFiles(paths="../../data/IMMUNO_NORM_DATA_FLAT.rda")
```

## TRNSCRPT raw counts
Expand Down
84 changes: 35 additions & 49 deletions inst/scripts/save_sample_level_data_metabolomics.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ output: html_document
library(data.table)
library(MotrpacBicQC)
library(MotrpacRatTraining6moData)
library(MotrpacRatTraining6mo)
library(dplyr)
library(testit)
library(usethis)
Expand Down Expand Up @@ -372,22 +373,13 @@ mutate(feature_metadata = pmap(list(tissue,assay_code,data_for_dea), function(ti
```{r}
METAB_SAMPLE_DATA <- metab_data %>%
rename(sample_data = data_for_dea)
# usethis::use_data(METAB_SAMPLE_DATA,overwrite=T)
#
# # rename
# load("../../data/METAB_SAMPLE_DATA.rda")
METAB_SAMPLE_DATA_NESTED = METAB_SAMPLE_DATA
# usethis::use_data(METAB_SAMPLE_DATA_NESTED,overwrite=T)
# convert to nested list
# load("../../data/METAB_SAMPLE_DATA_NESTED.rda")
# change format to nested list: data[[platform]][[tissue]]
data_list = list()
for(i in 1:nrow(METAB_SAMPLE_DATA_NESTED)){
tissue_code = METAB_SAMPLE_DATA_NESTED$tissue[i]
for(i in 1:nrow(METAB_SAMPLE_DATA)){
tissue_code = METAB_SAMPLE_DATA$tissue[i]
tissue = TISSUE_CODE_TO_ABBREV[[tissue_code]]
platform = METAB_SAMPLE_DATA_NESTED$assay_code[i]
data = as.data.frame(METAB_SAMPLE_DATA_NESTED$sample_data[i][[1]])
platform = METAB_SAMPLE_DATA$assay_code[i]
data = as.data.frame(METAB_SAMPLE_DATA$sample_data[i][[1]])
data_list[[platform]][[tissue]] = data
}
METAB_NORM_DATA_NESTED = data_list
Expand All @@ -397,8 +389,6 @@ usethis::use_data(METAB_NORM_DATA_NESTED,overwrite=T)
# 4. Make metabolite-to-RefMet map from DEA
```{r}
load("../../data/METAB_NORM_DATA_NESTED.rda")
#load_all("../../../MotrpacRatTraining6mo") # for viallabel_to_pid, list_available_data
library(MotrpacRatTraining6mo)
# load all DEA tables
obj_list = list_available_data(package="MotrpacRatTraining6moData")
metab_dea = obj_list[grepl("^METAB_",obj_list) & grepl("_DA$",obj_list)]
Expand All @@ -418,13 +408,16 @@ differential_features = unique(TRAINING_REGULATED_FEATURES$feature)
# 5. Make "flat" version of normalized data and metabolite metadata
```{r}
data_list = list()
feature_meta_list = list()
for(assay_code in names(METAB_NORM_DATA_NESTED)){
for(tissue in names(METAB_NORM_DATA_NESTED[[assay_code]])){
sample_data = as.data.frame(METAB_NORM_DATA_NESTED[[assay_code]][[tissue]])
# replace colnames with pid
colnames(sample_data) = viallabel_to_pid(colnames(sample_data))
vl = viallabel_to_pid(colnames(sample_data))
# order viallabels
vl = vl[colnames(sample_data)]
stopifnot(all(names(vl) == colnames(sample_data)))
colnames(sample_data) = unname(vl)
feature_IDs = rownames(sample_data)
# make original and "fixed" feature for ALL feature_ID
Expand All @@ -450,7 +443,6 @@ for(assay_code in names(METAB_NORM_DATA_NESTED)){
}
}
metab_data = rbindlist(data_list, fill=T)
#feature_meta = rbindlist(feature_meta_list)
# fix duplicate feature names for differential metabolites
rep_feat = data.table(REPEATED_FEATURES)
metab_data[,feature := ifelse(new_feature %in% rep_feat[,new_feature], new_feature, feature)]
Expand Down Expand Up @@ -490,14 +482,6 @@ metareg = rbindlist(dea_list)
metareg = unique(metareg[,.(feature, tissue, feature_ID, dataset, metabolite_refmet, metabolite)])
# is refmet NA for any?
table(is.na(metareg[,metabolite_refmet]))
## this is now fixed in an upstream script
# metareg[is.na(metabolite_refmet)]
# METAB_FEATURE_ID_MAP[feature_ID_da=="TG(36:1)>TG(4:0_16:0_16:1)_and_TG(2:0_16:0_18:1)_feature4"]
# # can we get the refmet from the feature metadata?
# head(feature_meta)
# feature_meta[metabolite_name == "TG(36:1)>TG(4:0_16:0_16:1)_and_TG(2:0_16:0_18:1)_feature4"] # no
# # how about from MotrpacBicQC?
# head(metabolomics_data_dictionary)
mdd = data.table(metabolomics_data_dictionary)
# mdd[metabolite_name == "TG(36:1)>TG(4:0_16:0_16:1)_and_TG(2:0_16:0_18:1)_feature4 "]
# # yes!
Expand Down Expand Up @@ -575,36 +559,38 @@ METAB_FEATURE_ID_MAP = rbindlist(list(METAB_FEATURE_ID_MAP, standard, metareg1),
backup = copy(METAB_FEATURE_ID_MAP)
```

```{r save flat data}
```{r save feature to ID map}
METAB_FEATURE_ID_MAP = as.data.frame(METAB_FEATURE_ID_MAP[,.(metabolite_name, metabolite_refmet, tissue, dataset,
feature_ID_sample_data, feature_ID_da, feature_ID_metareg,
dataset_metareg, feature)])
# merge with feature meta
head(feature_meta)
feature_meta[,tissue := TISSUE_CODE_TO_ABBREV[tissue]]
feature_meta[,refmet_name := NULL]
table(feature_meta[,metabolite_name] %in% METAB_FEATURE_ID_MAP$feature_ID_sample_data)
table(METAB_FEATURE_ID_MAP$feature_ID_sample_data %in% feature_meta[,metabolite_name])
METAB_FEATURE_ID_MAP$feature_ID_sample_data[! METAB_FEATURE_ID_MAP$feature_ID_sample_data %in% feature_meta[,metabolite_name]]
# does mdd have meta for this feature?
head(mdd)
mdd[metabolite_name == "TG(36:1)>TG(4:0_16:0_16:1)_and_TG(2:0_16:0_18:1)_feature4 "]
missing = data.table(tissue = "WAT-SC",
assay_code = "metab-u-lrppos",
metabolite_name = "TG(36:1)>TG(4:0_16:0_16:1)_and_TG(2:0_16:0_18:1)_feature4",
formula = "C39H72O6")
feature_meta = rbindlist(list(feature_meta, missing), fill=T)
METAB_FEATURE_ID_MAP = merge(METAB_FEATURE_ID_MAP, feature_meta,
by.x=c("tissue","dataset","metabolite_name"),
by.y=c("tissue","assay_code","metabolite_name"),
all.x=TRUE)
METAB_FEATURE_ID_MAP[is.na(METAB_FEATURE_ID_MAP$rt),]
# woohoo!
use_data(METAB_FEATURE_ID_MAP, overwrite = TRUE)
table(metab_data[,feature_ID] %in% METAB_FEATURE_ID_MAP$feature_ID_sample_data)
# save METAB_FEATURE_ID_MAP
use_data(METAB_FEATURE_ID_MAP, overwrite = TRUE)
sinew::makeOxygen(METAB_FEATURE_ID_MAP)
```

```{r save flat data}
# sanity check
feature = melt(metab_data[feature_ID == "TG(36:2)>TG(2:0_16:0_18:2)_feature4"],
measure.vars = colnames(metab_data)[grepl("^1", colnames(metab_data))],
id.vars = c("feature_ID"),
variable.name = "pid")
# remove NA
feature = feature[complete.cases(feature)]
# merge with pheno
PHENO$pid = as.character(PHENO$pid)
feature[,pid := as.character(pid)]
feature = merge(feature, unique(PHENO[,c("pid","sex","group")]), by="pid")
# plot
ggplot(feature, aes(x=group, y=value)) +
geom_point() +
facet_wrap(~ sex) +
labs(title="WAT TG(36:2)>TG(2:0_16:0_18:2)_feature4")
# save METAB_NORM_DATA_FLAT
METAB_NORM_DATA_FLAT = as.data.frame(metab_data)
use_data(METAB_NORM_DATA_FLAT, overwrite = TRUE)
sinew::makeOxygen(METAB_FEATURE_ID_MAP)
sinew::makeOxygen(METAB_NORM_DATA_FLAT)
```

Expand Down

0 comments on commit 2b98720

Please sign in to comment.