NCI60 transcriptomics, proteomics, mutations, cnvs contain genes not in genes.csv #405

Open
@ymahlich

combining the entrez_ids from the nci60 transcriptomics, proteomics, mutations & cnvs files (i.e. taking their union) yields a set of gene ids that is not fully contained in genes.csv.gz.

specifically:

  • unique entrez_ids in the union of transcriptomics, proteomics, mutations & cnvs: 19,387
  • unique entrez_ids in genes.csv.gz: 19,811
  • intersection of the two: 19,145
  • entrez_ids in the union that are missing from genes.csv.gz: 242 (19,387 - 19,145)

steps to reproduce:

$ zcat < nci60_transcriptomics.csv.gz | cut -d, -f1 | tail -n+2 > nci60_genes.tsv

$ zcat < nci60_proteomics.csv.gz | cut -d, -f1 | tail -n+2 >> nci60_genes.tsv

$ zcat < nci60_mutations.csv.gz | cut -d, -f1 | tail -n+2 >> nci60_genes.tsv

$ zcat < nci60_copy_number.csv.gz | cut -d, -f1 | tail -n+2 >> nci60_genes.tsv

$ zcat < genes.csv.gz | cut -d, -f1 | tail -n+2 | tr -d '"' | sort -u > genes.tsv

$ wc -l genes.tsv
     19811 genes.tsv

$ sort -u nci60_genes.tsv > nci60_genes_sorted.tsv

$ wc -l nci60_genes_sorted.tsv
     19387 nci60_genes_sorted.tsv

$ grep -Fxf nci60_genes_sorted.tsv genes.tsv | wc -l 
     19145
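
To see which ids are missing rather than only how many, the set difference can be taken from the two files built above. This is a minimal sketch, assuming both files hold one entrez_id per line; the helper name `missing_ids` and the output file `nci60_missing_from_genes.tsv` are hypothetical and not part of the repository. Given the counts above (19,387 - 19,145), the first step should list 242 ids. The loop then gives a rough per-file breakdown of where the missing ids come from.

```shell
# Print every line of file $1 that does not occur as a whole line in file $2.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from file.
missing_ids() {
    grep -Fxv -f "$2" "$1"
}

# Run against the files from the steps above, if they are present.
if [ -f nci60_genes_sorted.tsv ] && [ -f genes.tsv ]; then
    # ids found in the nci60 union but absent from genes.tsv
    missing_ids nci60_genes_sorted.tsv genes.tsv > nci60_missing_from_genes.tsv
    wc -l nci60_missing_from_genes.tsv

    # per-file count of unique ids that are absent from genes.tsv
    for f in nci60_transcriptomics nci60_proteomics nci60_mutations nci60_copy_number; do
        [ -f "$f.csv.gz" ] || continue
        n=$(zcat < "$f.csv.gz" | cut -d, -f1 | tail -n +2 | sort -u | grep -Fxvc -f genes.tsv)
        echo "$f: $n"
    done
fi
```

`grep -Fxv` is used here instead of `comm -23` so the result does not depend on both files being sorted under the same locale.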
