Description
Hello,
I have a feature request.
The plate-based, one-cell-per-well format is still very useful (e.g. SMART-seq2/3). I know that for this type of method we can demultiplex each cell into its own fastq files based on the index reads and provide a manifest to STARsolo.
However, with the help of automation and robotics, even plate-based methods can reach thousands of cells per run. It is awkward to split the data into thousands of fastq files (sometimes more than 10k) at the bcl2fastq stage. Therefore, what we normally do now is dump all reads into four fastq files without demultiplexing:
R1.fastq.gz
R2.fastq.gz
I1.fastq.gz
I2.fastq.gz
In this case, each cell can be identified by stitching I1 and I2. Can STARsolo support this format?
Once we have the data, I'm thinking of something like the following:
# prepare the cell barcodes by combining I1 and I2
paste <(zcat I1.fastq.gz) <(zcat I2.fastq.gz) | \
awk -F '\t' '{ if(NR%4==1||NR%4==3) {print $1} else {print $1 $2} }' | \
gzip > cell_barcodes.fastq.gz
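A quick sanity check on the result could look like the sketch below. It tallies the barcode lengths, which should all equal the I1 length plus the I2 length. The file name cell_barcodes_toy.fastq.gz and its reads are made up for illustration (8 nt + 8 nt indices are assumed); on real data the same awk line would be pointed at cell_barcodes.fastq.gz.

```shell
# Toy stand-in for the real cell_barcodes.fastq.gz (made-up reads, 8 nt I1 + 8 nt I2).
printf '@r1\nACGTACGTGGCCGGCC\n+\nFFFFFFFFFFFFFFFF\n@r2\nTTAATTAACATGCATG\n+\nFFFFFFFFFFFFFFFF\n' |
  gzip > cell_barcodes_toy.fastq.gz

# Tally sequence lengths (line 2 of every 4-line record).
# Output format: "<length> <number of reads>"; a single length is expected.
gzip -dc cell_barcodes_toy.fastq.gz |
  awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }'
```

More than one length in the output would suggest that the index reads were trimmed or that I1 and I2 are out of sync.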
Then, I can feed R1.fastq.gz, R2.fastq.gz and cell_barcodes.fastq.gz to STARsolo. The "whitelist" can be generated very easily from the index primers used in the experiment.
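As a rough sketch of the whitelist step, assuming the two index sets are available as plain text files with one sequence per line (i7_indices.txt and i5_indices.txt are hypothetical names, and the toy sequences are made up), every I1+I2 combination can be written out like this:

```shell
# Hypothetical inputs: one index sequence per line.
printf 'AAAA\nCCCC\n' > i7_indices.txt   # toy I1 (i7) indices
printf 'GGGG\nTTTT\n' > i5_indices.txt   # toy I2 (i5) indices

# Load the i5 list first, then print every i7+i5 combination,
# with I1 first to match the order used when stitching the reads.
awk 'NR == FNR { i5[n++] = $1; next }
     { for (k = 0; k < n; k++) print $1 i5[k] }' \
    i5_indices.txt i7_indices.txt > whitelist_toy.txt

cat whitelist_toy.txt
```

Depending on the sequencer, the I2 read may be the reverse complement of the i5 sequence in the sample sheet, so the i5 list may need to be reverse-complemented first. If only certain index pairs were actually used, the whitelist should list those pairs instead of the full cross product.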
I actually realised that the public SMART-seq3 data is already in this un-demultiplexed format: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8735/samples/
So I guess people do need this option, especially when thousands of cells are processed per run.
Thank you very much!
Regards,
Xi