
Commit 3695abb

Remove old Cerebro files from example data sets; switch to .h5 file format in pbmc_10k_v3 example data set; update Dockerfile.
romanhaa committed Sep 25, 2019
1 parent 89d083b commit 3695abb
Showing 12 changed files with 29 additions and 80 deletions.
2 changes: 1 addition & 1 deletion Docker/Dockerfile
@@ -48,7 +48,7 @@ RUN Rscript -e 'BiocManager::install("satijalab/seurat@develop")'
RUN Rscript -e 'BiocManager::install("monocle")'

# cerebroApp
RUN Rscript -e 'devtools::install_github("romanhaa/cerebroApp@merge_packages")'
RUN Rscript -e 'devtools::install_github("romanhaa/cerebroApp@develop")'

# Seurat v2.3.4
RUN mkdir /other_R_packages
3 changes: 2 additions & 1 deletion examples/README.md
@@ -1,5 +1,6 @@
# Example data sets

Examples of the Cerebro workflow are vailable for the following public data sets:
Examples of the Cerebro workflow are available for the following public data sets:

* [`pbmc_10k_v3`](pbmc_10k_v3)
* [`GSE108041`](GSE108041)
3 changes: 1 addition & 2 deletions examples/pbmc_10k_v3/README.md
@@ -34,7 +34,6 @@ Lastly, from the Seurat object we export a Cerebro file (`.crb` extension) that

## How to reproduce

The example data sets were generated in a container using [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The container was built with Docker and can be downloaded from the [Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro-example).
The example data sets were generated using the official Cerebro Docker image which was built in Docker ([Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro)) and imported into [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The workflows for Seurat v2 and Seurat v3 are conceptually identical with some differences due to changes in the Seurat package.
Details and descriptions for all workflows can be found in the respective directories [Seurat v2](Seurat_v2), [Seurat v3](Seurat_v3), and [scanpy](scanpy).
48 changes: 17 additions & 31 deletions examples/pbmc_10k_v3/Seurat_v2/README.md
@@ -11,8 +11,8 @@ Then, we pull the Docker image from the Docker Hub, convert it to Singularity, a
git clone https://github.com/romanhaa/Cerebro
cd Cerebro/examples/pbmc_10k_v3
# download GMT file (if you want) and place it inside this folder
singularity build <path_to>/cerebro-example_2019-09-20.simg docker://romanhaa/cerebro-example:2019-09-20
singularity exec --bind ./:/data <path_to>/cerebro-example_2019-09-20.simg R
singularity build <path_to>/cerebro_v1.1.simg docker://romanhaa/cerebro:v1.1
singularity exec --bind ./:/data <path_to>/cerebro_v1.1.simg R
```

Then, we set the console width to `100`, change the working directory, and set the seed.
@@ -34,41 +34,27 @@ library('cerebroApp')

## Load transcript counts

First, we load the raw data and add gene names / cell barcodes.
Unfortunately, the `Read10X_h5()` function of Seurat v2 has problems with the `.h5` file downloaded from the 10x Genomics website, so instead we load it manually and convert it to a sparse matrix.

```r
path_to_data <- './raw_data'

feature_matrix <- Matrix::readMM(paste0(path_to_data, '/matrix.mtx.gz'))
feature_matrix <- as.matrix(feature_matrix)
feature_matrix <- as.data.frame(feature_matrix)

colnames(feature_matrix) <- readr::read_tsv(paste0(path_to_data, '/barcodes.tsv.gz'), col_names = FALSE) %>%
  dplyr::select(1) %>%
  t() %>%
  as.vector()

gene_names <- readr::read_tsv(paste0(path_to_data, '/features.tsv.gz'), col_names = FALSE) %>%
  dplyr::select(2) %>%
  t() %>%
  as.vector()

feature_matrix <- feature_matrix %>%
  dplyr::mutate(gene = gene_names) %>%
  dplyr::select('gene', everything()) %>%
  dplyr::group_by(gene) %>%
  dplyr::summarise_all(sum)

genes <- feature_matrix$gene

feature_matrix <- dplyr::select(feature_matrix, -c('gene'))
feature_matrix <- as.data.frame(feature_matrix)
rownames(feature_matrix) <- genes
h5_data <- hdf5r::H5File$new('raw_data/filtered_feature_bc_matrix.h5', mode = 'r')

feature_matrix <- Matrix::sparseMatrix(
  i = h5_data[['matrix/indices']][],
  p = h5_data[['matrix/indptr']][],
  x = h5_data[['matrix/data']][],
  dimnames = list(
    h5_data[['matrix/features/name']][],
    h5_data[['matrix/barcodes']][]
  ),
  dims = h5_data[['matrix/shape']][],
  index1 = FALSE
)
```
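The four arrays read from the `.h5` file (`data`, `indices`, `indptr`, `shape`) are a standard compressed-sparse-column (CSC) layout, which is what the `Matrix::sparseMatrix()` call above consumes (`index1 = FALSE` because the stored indices are 0-based). As an illustration of how those arrays define the matrix, here is a sketch in Python with made-up toy values, not the actual 10x data:

```python
# Toy illustration of the CSC layout stored in the 10x .h5 file.
# The arrays below are invented for a 2-gene x 2-cell example.
import numpy as np
from scipy.sparse import csc_matrix

data = np.array([1, 2, 3])     # non-zero counts, stored column by column
indices = np.array([0, 1, 0])  # row (gene) index of each non-zero count
indptr = np.array([0, 2, 3])   # where each column (cell) starts in data/indices
shape = (2, 2)                 # genes x cells

counts = csc_matrix((data, indices, indptr), shape=shape)
print(counts.toarray())  # [[1 3]
                         #  [2 0]]
```

Column 0 holds the entries between `indptr[0]` and `indptr[1]` (rows 0 and 1, counts 1 and 2); column 1 holds the remaining entry (row 0, count 3).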

## Pre-processing with Seurat

With the transcript count loaded, we create a Seurat object, remove cells with less than `100` transcripts or fewer than `50` expressed genes.
With the transcript count loaded, we create a Seurat object and remove cells with less than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identification of highly variable genes, scaling of expression values (regressing out the number of transcripts per cell), principal component analysis (PCA), and finding neighbors and clusters.
Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.
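The filtering criterion above (drop cells with fewer than `100` transcripts or fewer than `50` detected genes) can be sketched as plain matrix operations; the thresholds come from the text, while the count matrix below is randomly generated purely for illustration:

```python
# Sketch of the cell-filtering step described above, applied to a random
# genes-x-cells count matrix. Thresholds (100 transcripts, 50 genes) are
# the ones stated in the text; the data itself is made up.
import numpy as np

rng = np.random.default_rng(42)
counts = rng.poisson(1.0, size=(200, 5))  # 200 genes x 5 cells

transcripts_per_cell = counts.sum(axis=0)
genes_per_cell = (counts > 0).sum(axis=0)  # genes with at least one count

keep = (transcripts_per_cell >= 100) & (genes_per_cell >= 50)
filtered = counts[:, keep]  # cells failing either threshold are dropped
```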

Expand Down