combining and deduplicating (i.e. generating the union of) entrez_ids from nci60 transcriptomics, proteomics, mutations & cnvs generates a set of gene ids which is not fully contained in genes.tsv (the entrez_ids extracted from genes.csv.gz).
specifically:
- unique entrez_ids in the union of transcriptomics, proteomics, mutations & cnvs: 19,387
- unique entrez_ids in genes.csv.gz: 19,811
- intersection of the two: 19,145
steps to reproduce:
$ zcat < nci60_transcriptomics.csv.gz | cut -d, -f1 | tail -n+2 > nci60_genes.tsv
$ zcat < nci60_proteomics.csv.gz | cut -d, -f1 | tail -n+2 >> nci60_genes.tsv
$ zcat < nci60_mutations.csv.gz | cut -d, -f1 | tail -n+2 >> nci60_genes.tsv
$ zcat < nci60_copy_number.csv.gz | cut -d, -f1 | tail -n+2 >> nci60_genes.tsv
$ zcat < genes.csv.gz | cut -d, -f1 | tail -n+2 | tr -d '"' | sort -u > genes.tsv
$ wc -l genes.tsv
19811 genes.tsv
$ sort -u nci60_genes.tsv > nci60_genes_sorted.tsv
$ wc -l nci60_genes_sorted.tsv
19387 nci60_genes_sorted.tsv
$ grep -Fxf nci60_genes_sorted.tsv genes.tsv | wc -l
19145
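To see which ids are actually missing, the set difference can be listed directly. This is a minimal sketch, not part of the original report: `missing_ids` and the output file `nci60_missing_from_genes.tsv` are hypothetical names, and it assumes both input files were produced by `sort -u` in the same locale, as in the steps above. If the counts above hold, it should report 19,387 - 19,145 = 242 ids.

```shell
# Hypothetical helper: print lines that are in the first file but not in
# the second. comm requires both inputs to be sorted the same way, which
# the `sort -u` calls in the steps above already provide.
missing_ids() {
    comm -23 "$1" "$2"
}

# Run only when the files from the reproduction steps are present.
if [ -f nci60_genes_sorted.tsv ] && [ -f genes.tsv ]; then
    missing_ids nci60_genes_sorted.tsv genes.tsv > nci60_missing_from_genes.tsv
    wc -l < nci60_missing_from_genes.tsv
fi
```

Swapping the two arguments lists the ids that are in genes.tsv but in none of the nci60 files.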