
Commit 3695abb

Remove old Cerebro files from example data sets; switch to .h5 file format in pbmc_10k_v3 example data set; update Dockerfile.
romanhaa committed Sep 25, 2019
1 parent 89d083b commit 3695abb
Showing 12 changed files with 29 additions and 80 deletions.
2 changes: 1 addition & 1 deletion Docker/Dockerfile
@@ -48,7 +48,7 @@ RUN Rscript -e 'BiocManager::install("satijalab/seurat@develop")'
RUN Rscript -e 'BiocManager::install("monocle")'

# cerebroApp
RUN Rscript -e 'devtools::install_github("romanhaa/cerebroApp@merge_packages")'
RUN Rscript -e 'devtools::install_github("romanhaa/cerebroApp@develop")'

# Seurat v2.3.4
RUN mkdir /other_R_packages
3 changes: 2 additions & 1 deletion examples/README.md
@@ -1,5 +1,6 @@
# Example data sets

Examples of the Cerebro workflow are vailable for the following public data sets:
Examples of the Cerebro workflow are available for the following public data sets:

* [`pbmc_10k_v3`](pbmc_10k_v3)
* [`GSE108041`](GSE108041)
3 changes: 1 addition & 2 deletions examples/pbmc_10k_v3/README.md
@@ -34,7 +34,6 @@ Lastly, from the Seurat object we export a Cerebro file (`.crb` extension) that

## How to reproduce

The example data sets were generated in a container using [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The container was built with Docker and can be downloaded from the [Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro-example).
The example data sets were generated using the official Cerebro Docker image which was built in Docker ([Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro)) and imported into [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The workflows for Seurat v2 and Seurat v3 are conceptually identical with some differences due to changes in the Seurat package.
Details and descriptions for all workflows can be found in the respective directories [Seurat v2](Seurat_v2), [Seurat v3](Seurat_v3), and [scanpy](scanpy).
48 changes: 17 additions & 31 deletions examples/pbmc_10k_v3/Seurat_v2/README.md
@@ -11,8 +11,8 @@ Then, we pull the Docker image from the Docker Hub, convert it to Singularity, a
git clone https://github.com/romanhaa/Cerebro
cd Cerebro/examples/pbmc_10k_v3
# download GMT file (if you want) and place it inside this folder
singularity build <path_to>/cerebro-example_2019-09-20.simg docker://romanhaa/cerebro-example:2019-09-20
singularity exec --bind ./:/data <path_to>/cerebro-example_2019-09-20.simg R
singularity build <path_to>/cerebro_v1.1.simg docker://romanhaa/cerebro:v1.1
singularity exec --bind ./:/data <path_to>/cerebro_v1.1.simg R
```

Then, we set the console width to `100`, change the working directory, and set the seed.
@@ -34,41 +34,27 @@ library('cerebroApp')

## Load transcript counts

First, we load the raw data and add gene names / cell barcodes.
Unfortunately, the `Read10X_h5()` function of Seurat v2 has problems with the `.h5` file downloaded from the 10x Genomics website, so instead we load it manually and convert it to a sparse matrix.

```r
path_to_data <- './raw_data'

feature_matrix <- Matrix::readMM(paste0(path_to_data, '/matrix.mtx.gz'))
feature_matrix <- as.matrix(feature_matrix)
feature_matrix <- as.data.frame(feature_matrix)

colnames(feature_matrix) <- readr::read_tsv(paste0(path_to_data, '/barcodes.tsv.gz'), col_names = FALSE) %>%
  dplyr::select(1) %>%
  t() %>%
  as.vector()

gene_names <- readr::read_tsv(paste0(path_to_data, '/features.tsv.gz'), col_names = FALSE) %>%
  dplyr::select(2) %>%
  t() %>%
  as.vector()

feature_matrix <- feature_matrix %>%
  dplyr::mutate(gene = gene_names) %>%
  dplyr::select('gene', everything()) %>%
  dplyr::group_by(gene) %>%
  dplyr::summarise_all(sum)

genes <- feature_matrix$gene

feature_matrix <- dplyr::select(feature_matrix, -c('gene'))
feature_matrix <- as.data.frame(feature_matrix)
rownames(feature_matrix) <- genes
h5_data <- hdf5r::H5File$new('raw_data/filtered_feature_bc_matrix.h5', mode = 'r')

feature_matrix <- Matrix::sparseMatrix(
  i = h5_data[['matrix/indices']][],
  p = h5_data[['matrix/indptr']][],
  x = h5_data[['matrix/data']][],
  dimnames = list(
    h5_data[['matrix/features/name']][],
    h5_data[['matrix/barcodes']][]
  ),
  dims = h5_data[['matrix/shape']][],
  index1 = FALSE
)
```
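The four arrays read from the `.h5` file (`data`, `indices`, `indptr`, `shape`) are a standard compressed-sparse-column (CSC) layout, which is what the `Matrix::sparseMatrix()` call above consumes (`index1 = FALSE` because the stored indices are 0-based). As an illustration of how those arrays define the matrix, here is a sketch in Python with made-up toy values, not the actual 10x data:

```python
# Toy illustration of the CSC layout stored in the 10x .h5 file.
# The arrays below are invented for a 2-gene x 2-cell example.
import numpy as np
from scipy.sparse import csc_matrix

data = np.array([1, 2, 3])     # non-zero counts, stored column by column
indices = np.array([0, 1, 0])  # row (gene) index of each non-zero count
indptr = np.array([0, 2, 3])   # where each column (cell) starts in data/indices
shape = (2, 2)                 # genes x cells

counts = csc_matrix((data, indices, indptr), shape=shape)
print(counts.toarray())  # [[1 3]
                         #  [2 0]]
```

Column 0 holds the entries between `indptr[0]` and `indptr[1]` (rows 0 and 1, counts 1 and 2); column 1 holds the remaining entry (row 0, count 3).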

## Pre-processing with Seurat

With the transcript count loaded, we create a Seurat object, remove cells with less than `100` transcripts or fewer than `50` expressed genes.
With the transcript count loaded, we create a Seurat object and remove cells with less than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identification of highly variable genes, scaling of expression values (regressing out the number of transcripts per cell), principal component analysis (PCA), and finding neighbors and clusters.
Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.
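The filtering criterion above (drop cells with fewer than `100` transcripts or fewer than `50` detected genes) can be sketched as plain matrix operations; the thresholds come from the text, while the count matrix below is randomly generated purely for illustration:

```python
# Sketch of the cell-filtering step described above, applied to a random
# genes-x-cells count matrix. Thresholds (100 transcripts, 50 genes) are
# the ones stated in the text; the data itself is made up.
import numpy as np

rng = np.random.default_rng(42)
counts = rng.poisson(1.0, size=(200, 5))  # 200 genes x 5 cells

transcripts_per_cell = counts.sum(axis=0)
genes_per_cell = (counts > 0).sum(axis=0)  # genes with at least one count

keep = (transcripts_per_cell >= 100) & (genes_per_cell >= 50)
filtered = counts[:, keep]  # cells failing either threshold are dropped
```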

Expand Down