Schema

Document Status: Approved

Version: 2.0.0

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.

Background

cellxgene aims to support the publication, sharing, and exploration of single-cell datasets. Building on those published datasets, cellxgene seeks to create references of the phenotypes and composition of cells that make up human tissues.

Creating references from multiple datasets requires some harmonization of metadata and features in the cellxgene Data Portal. But if that harmonization is too onerous, it will burden the goal of rapid data sharing. cellxgene balances publishing and reference creation needs by requiring datasets hosted in the cellxgene Data Portal to include a small set of metadata readily available from data submitters.

This document describes the schema, a type of contract, that cellxgene requires all datasets to adhere to so that it can enable searching, filtering, and integration of datasets it hosts.

Note that the requirements in the schema are just the minimum required information. Datasets often have additional metadata, which is preserved in datasets submitted to the cellxgene Data Portal.

Overview

This schema supports multiple assay types. Each assay takes the form of one or more two-dimensional matrices whose values are quantitative measures of the phenotypes of cells.

The schema additionally describes how the dataset, genes, and cells are annotated to describe the biological and technical characteristics of the data.

This document is organized by:

General requirements
X (Matrix layers), which describe the data required for different assays
obs (Cell metadata), which describe each cell in the dataset
var and raw.var (Gene metadata), which describe each gene in the dataset
obsm (Embeddings), which describe each embedding in the dataset
uns (Dataset metadata), which describe the dataset as a whole

General Requirements

AnnData - The canonical data format for the cellxgene Data Portal is HDF5-backed AnnData as written by version 0.7 of the anndata library. Part of the rationale for selecting this format is to allow cellxgene to access both the data and metadata within a single file. The schema requirements and definitions for the AnnData X, obs, var, raw.var, obsm, and uns attributes are described below.

All data submitted to the cellxgene Data Portal is automatically converted to a Seurat V3 object that can be loaded by the R package Seurat. See the Seurat encoding for further information.
Organisms. Data MUST be from a Metazoan organism or SARS-CoV-2 and defined in the NCBI organismal classification. For data that is neither Human, Mouse, nor SARS-CoV-2, features MUST be translated into orthologous genes from the pinned Human and Mouse gene annotations.
Reserved Names. The names of the metadata keys specified by the schema are reserved and MUST be unique. For example, duplicate "feature_biotype" keys in AnnData var are not allowed.
Redundant Metadata. It is STRONGLY RECOMMENDED to avoid multiple metadata fields containing identical or similar information.
No PII. Curators agree to this requirement as part of the data submission policy. However, it is not strictly enforced in our validation tooling because it is difficult for software to predict what is and is not PII. It is up to the submitter to ensure that no metadata can be personally identifiable: no names, dates of birth, specific locations, etc. See this list for guidance.

Note on types

The types below are python3 types. Note that a python3 str is a sequence of Unicode code points, which is stored null-terminated and UTF-8-encoded by anndata.
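As a small illustration of the distinction between code points and encoded bytes, the sketch below compares the length of a `str` with the length of its UTF-8 encoding. The label value is purely illustrative and does not come from the schema.

```python
# A str counts Unicode code points; its UTF-8 encoding may need more bytes.
label = "na\u00efve B cell"  # illustrative value containing one non-ASCII character

code_points = len(label)            # number of Unicode code points
utf8_bytes = label.encode("utf-8")  # the UTF-8 byte sequence for the same text

print(code_points, len(utf8_bytes))  # prints: 12 13
assert utf8_bytes.decode("utf-8") == label  # the encoding round-trips losslessly
```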

`X` (Matrix Layers)

The data stored in the X data matrix is the data that is viewable in cellxgene Explorer. cellxgene does not impose any additional constraints on the X data matrix.

In any layer, if a matrix has 50% or more values that are zeros, it is STRONGLY RECOMMENDED that the matrix be encoded as a scipy.sparse.csr_matrix.
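One way to apply this recommendation is sketched below, assuming the layer starts as a dense `numpy` array. The helper name `encode_layer` and its `zero_threshold` parameter are hypothetical and are not part of the schema or of anndata.

```python
import numpy as np
from scipy import sparse


def encode_layer(matrix: np.ndarray, zero_threshold: float = 0.5):
    """Return a CSR matrix when at least `zero_threshold` of the values are zero."""
    if matrix.size == 0:
        return matrix
    zero_fraction = 1.0 - np.count_nonzero(matrix) / matrix.size
    if zero_fraction >= zero_threshold:
        return sparse.csr_matrix(matrix)
    return matrix


# 4 of the 6 values are zero, so this layer is converted to CSR.
dense = np.array([[0, 0, 3], [0, 1, 0]], dtype=np.float32)
encoded = encode_layer(dense)
```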

cellxgene's matrix layer requirements are tailored to optimize data reuse. Because each assay has different characteristics, the requirements differ by assay type. In general, cellxgene requires submission of "raw" data suitable for computational reuse when a standard raw matrix format exists for an assay and strongly recommends that a "final" matrix suitable for visualization in cellxgene Explorer be included. So that cellxgene's data can be provided in download formats suitable for both R and Python, the schema imposes the following requirements:

All matrix layers MUST have the same shape, and have the same cell labels and gene labels.
Because it is impractical to retain all barcodes in raw and final matrices, any cell filtering MUST be applied to both. By contrast, those wishing to reuse datasets require access to raw gene expression values, so genes SHOULD NOT be filtered from either dataset. Summarizing, any cell barcodes that are removed from the data MUST be filtered from both raw and final matrices and genes SHOULD NOT be filtered from the raw matrix.
Any genes that publishers wish to filter from the final matrix MAY have their values replaced by zeros and MUST be flagged in the column feature_is_filtered of var, which will mask them from exploration.
Additional layers provided at author discretion MAY be stored using author-selected keys, but MUST have the same cells and genes as other layers. It is STRONGLY RECOMMENDED that these layers have names that accurately summarize what the numbers in the layer represent (e.g. "counts_per_million", "SCTransform_normalized", or "RNA_velocity_unspliced").
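The shape requirement in the list above could be checked with a sketch like the following. It assumes the layers are available as a plain dictionary of matrices keyed by layer name. The helper `mismatched_layers` is hypothetical, and it checks shapes only, not cell or gene labels.

```python
import numpy as np


def mismatched_layers(layers: dict) -> list:
    """Return the names of layers whose shape differs from the first layer's shape."""
    shapes = {name: matrix.shape for name, matrix in layers.items()}
    if not shapes:
        return []
    reference = next(iter(shapes.values()))
    return [name for name, shape in shapes.items() if shape != reference]


layers = {
    "raw": np.zeros((3, 4)),
    "final": np.zeros((3, 4)),
    "counts_per_million": np.zeros((3, 3)),  # one gene missing, so this layer is flagged
}
bad = mismatched_layers(layers)
```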

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

Assay	"raw" required?	"raw" location	"final" required?	"final" location
scRNA-seq (UMI, e.g. 10x v3)	REQUIRED. Values MUST be de-duplicated molecule counts.	`AnnData.raw.X` unless no "final" is provided, then `AnnData.X`	STRONGLY RECOMMENDED	`AnnData.X`
scRNA-seq (non-UMI, e.g. SS2)	REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM).	`AnnData.raw.X` unless no "final" is provided, then `AnnData.X`	STRONGLY RECOMMENDED	`AnnData.X`
Accessibility (e.g. ATAC-seq, mC-seq)	NOT REQUIRED		REQUIRED	`AnnData.X`

Integration Metadata

cellxgene requires ontology terms to enable search, comparison, and integration of data. Ontology terms for cell metadata MUST use OBO-format identifiers, meaning a CURIE (prefixed identifier) of the form Ontology:Identifier. For example, EFO:0000001 is a term in the Experimental Factor Ontology (EFO).
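A minimal sketch of splitting such an identifier into its two parts is shown below. The regular expression is an assumption that fits the ontologies pinned in this document (alphabetic prefix, numeric identifier). It is not a general CURIE parser, and `split_curie` is a hypothetical helper.

```python
import re

# Assumed shape: alphabetic OBO prefix, a colon, then a numeric identifier.
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z]+):(?P<identifier>\d+)$")


def split_curie(term: str) -> tuple:
    """Split 'Ontology:Identifier' into (prefix, identifier) or raise ValueError."""
    match = CURIE_PATTERN.match(term)
    if match is None:
        raise ValueError(f"not an OBO-style identifier: {term!r}")
    return match["prefix"], match["identifier"]


prefix, identifier = split_curie("EFO:0000001")
```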

The most accurate ontology term MUST always be used. This is true even in cases where there may not be an exact or approximate ontology term.

For example, if cell_type_ontology_term_id describes a relay interneuron, but the most accurate available term in the CL ontology is CL:0000099 for interneuron, then the interneuron term fulfills this requirement and ensures that users searching for "neuron" are able to find these data. If no appropriate high-level term can be found or the cell type is unknown, then the most accurate term is CL:0000003 for native cell.

Users will still be able to access more specific cell type annotations that have been submitted with the dataset (but aren't required by the schema).

Terms documented as obsolete in an ontology MUST NOT be used. For example, EFO:0009310 for obsolete_10x v2 was marked as obsolete in EFO version 3.31.0 and replaced by EFO:0009899 for 10x 3' v2.

Required Ontologies

The following ontology dependencies are pinned for this version of the schema.

Ontology	OBO Prefix	Required version
Cell Ontology	CL	cl.owl : 2021-08-10
Experimental Factor Ontology	EFO	efo.owl : 2021-08-16 EFO 3.33.0
Human Ancestry Ontology	HANCESTRO	hancestro.owl : 2021-01-04 (2.5)
Human Developmental Stages	HsapDv	hsapdv.owl : 2020-03-10
Mondo Disease Ontology	MONDO	mondo.owl : 2021-08-11
Mouse Developmental Stages	MmusDv	mmusdv.owl : 2020-03-10
NCBI organismal classification	NCBITaxon	ncbitaxon.owl : 2021-06-10
Phenotype And Trait Ontology	PATO	pato.owl : 2021-08-06
Uberon multi-species anatomy ontology	UBERON	uberon.owl : 2021-07-27

Required Gene Annotations

cellxgene requires ENSEMBL identifiers for genes and External RNA Controls Consortium (ERCC) identifiers for RNA Spike-In Control Mixes to ensure that all datasets it stores measure the same features and can therefore be integrated.

The following gene annotation dependencies are pinned for this version of the schema. For multi-organism experiments, cells from any Metazoan organism are allowed as long as orthologs from the following organism annotations are used.

Source	Required version	Download
GENCODE (Human)	Human reference GRCh38 (GENCODE v38/Ensembl 104)	gencode.v38.primary_assembly.annotation.gtf
GENCODE (Mouse)	Mouse reference GRCm39 (GENCODE vM27/Ensembl 104)	gencode.vM27.primary_assembly.annotation.gtf
ENSEMBL (COVID-19)	SARS-CoV-2 reference (ENSEMBL assembly: ASM985889v3)	Sars_cov_2.ASM985889v3.101.gtf
ThermoFisher ERCC Spike-Ins	ThermoFisher ERCC RNA Spike-In Control Mixes (Cat # 4456740, 4456739)	cms_095047.txt

`obs` (Cell Metadata)

obs is a pandas.DataFrame.

Curators MUST annotate the following columns in the obs dataframe:

assay_ontology_term_id

Key	assay_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. This MUST be an EFO term and either:

"EFO:0002772" for assay by molecule or preferably its most accurate child
"EFO:0010183" for single cell library construction or preferably its most accurate child

An assay based on 10X Genomics products SHOULD either be "EFO:0008995" for 10x technology or preferably its most accurate child. An assay based on SMART (Switching Mechanism at the 5' end of the RNA Template) or SMARTer technology SHOULD either be "EFO:0010184" for Smart-like or preferably its most accurate child.

If there is not an exact match for the assay, clarifying text MAY be enclosed in parentheses and appended to the most accurate term. For example, the sci-plex assay could be curated as "EFO:0010183 (sci-plex)".

Recommended values for specific assays:

For	Use
10x 3' v2	`"EFO:0009899"`
10x 3' v3	`"EFO:0009922"`
10x 5' v1	`"EFO:0011025"`
10x 5' v2	`"EFO:0009900"`
Smart-seq	`"EFO:0008930"`
Smart-seq2	`"EFO:0008931"`
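The optional clarifying text described above can be separated from the term with a sketch like this one. The pattern assumes EFO identifiers with seven digits, as in the examples in this section, and `parse_assay_term` is a hypothetical helper rather than part of any cellxgene tooling.

```python
import re

# Assumed shape: an EFO term, optionally followed by " (clarifying text)".
ASSAY_PATTERN = re.compile(r"^(EFO:\d{7})(?: \((.+)\))?$")


def parse_assay_term(value: str) -> tuple:
    """Return (term, clarifying_text); clarifying_text is None when absent."""
    match = ASSAY_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unexpected assay_ontology_term_id: {value!r}")
    return match.group(1), match.group(2)


term, note = parse_assay_term("EFO:0010183 (sci-plex)")
```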

cell_type_ontology_term_id

Key	cell_type_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. This MUST be a CL term.
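As an illustration of "categorical with `str` categories", the sketch below builds such a column with pandas and checks its dtype. The cell names and the helper `is_str_categorical` are hypothetical, and the check does not verify that the values are valid CL terms.

```python
import pandas as pd

# Three illustrative cells annotated with CL terms mentioned in this document.
obs = pd.DataFrame(index=["cell_1", "cell_2", "cell_3"])
obs["cell_type_ontology_term_id"] = pd.Categorical(
    ["CL:0000099", "CL:0000003", "CL:0000099"]
)


def is_str_categorical(column: pd.Series) -> bool:
    """True when the column is categorical and every category is a str."""
    if not isinstance(column.dtype, pd.CategoricalDtype):
        return False
    return all(isinstance(category, str) for category in column.cat.categories)


ok = is_str_categorical(obs["cell_type_ontology_term_id"])
```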

development_stage_ontology_term_id

Key	development_stage_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. If unavailable, this MUST be `"unknown"`.

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this MUST be the most accurate HsapDv term with the following STRONGLY RECOMMENDED:

For	Use
Embryonic stage	A term from the set of Carnegie stages 1-23 (up to 8 weeks after conception; e.g. HsapDv:0000003)
Fetal development	A term from the set of 9 to 38 week post-fertilization human stages (9 weeks after conception and before birth; e.g. HsapDv:0000046)
After birth for the first 12 months	A term from the set of 1 to 12 month-old human stages (e.g. HsapDv:0000174)
After the first 12 months post-birth	A term from the set of year-old human stages (e.g. HsapDv:0000246)

If organism_ontology_term_id is "NCBITaxon:10090" for Mus musculus, this MUST be the most accurate MmusDv term with the following STRONGLY RECOMMENDED:

For	Use
From the time of conception to 1 month after birth	A term from the set of Theiler stages (e.g. MmusDv:0000003)
From 2 months after birth	A term from the set of month-old stages (e.g. MmusDv:0000062)

Otherwise, for all other organisms this MUST be the most accurate child of UBERON:0000105 for life cycle stage, excluding UBERON:0000071 for death stage.
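The organism-dependent choice of ontology for this field can be sketched as a prefix check. This is only a partial check under stated assumptions: it does not confirm that a term is the most accurate one or that it is a child of UBERON:0000105, and both helper names are hypothetical.

```python
def expected_stage_prefix(organism_term: str) -> str:
    """Ontology prefix expected for development_stage_ontology_term_id."""
    by_organism = {"NCBITaxon:9606": "HsapDv", "NCBITaxon:10090": "MmusDv"}
    return by_organism.get(organism_term, "UBERON")


def stage_prefix_ok(organism_term: str, stage_term: str) -> bool:
    """Prefix-level check only; ontology membership is not verified here."""
    if stage_term == "unknown":
        return True
    if stage_term == "UBERON:0000071":  # death stage is explicitly excluded
        return False
    return stage_term.split(":")[0] == expected_stage_prefix(organism_term)


human_ok = stage_prefix_ok("NCBITaxon:9606", "HsapDv:0000003")
```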

disease_ontology_term_id

Key	disease_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. This MUST be a MONDO term or `"PATO:0000461"` for normal or healthy.

ethnicity_ontology_term_id

Key	ethnicity_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. If `organism_ontology_term_id` is `"NCBITaxon:9606"` for Homo sapiens, this MUST be either a HANCESTRO term or `"unknown"` if unavailable. Otherwise, for all other organisms this MUST be `"na"`.
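The rule for this field depends on the organism and can be sketched as below. The helper `ethnicity_ok` is hypothetical and only checks the prefix of a HANCESTRO term, not whether the term exists in the pinned ontology. The identifier used in the example is illustrative.

```python
def ethnicity_ok(organism_term: str, ethnicity_term: str) -> bool:
    """Check ethnicity_ontology_term_id against the organism-dependent rule."""
    if organism_term == "NCBITaxon:9606":
        # Human data: a HANCESTRO term, or "unknown" when unavailable.
        return ethnicity_term == "unknown" or ethnicity_term.startswith("HANCESTRO:")
    # All other organisms: the value must be "na".
    return ethnicity_term == "na"


human_unknown = ethnicity_ok("NCBITaxon:9606", "unknown")
```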

is_primary_data

Key	is_primary_data
Annotator	Curator
Value	`bool`. This MUST be `True` if this is the canonical instance of this cellular observation and `False` if not. This is commonly `False` for meta-analyses reusing data or for secondary views of data.

organism_ontology_term_id

Key	organism_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. This MUST be a child of NCBITaxon:33208 for Metazoa.

sex_ontology_term_id

Key	sex_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. This MUST be a child of PATO:0001894 for phenotypic sex or `"unknown"` if unavailable.

tissue_ontology_term_id

Key	tissue_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. This MUST be the UBERON or CL term that best describes the tissue that this cell was derived from, depending on the type of biological sample:

For	Use
Tissue	STRONGLY RECOMMENDED to be an UBERON term (e.g. `"UBERON:0008930"` for a somatosensory cortex tissue sample)
Cell Culture	MUST be a CL term appended with `" (cell culture)"` (e.g. `"CL:0000057 (cell culture)"` for the WTC-11 cell line)
Organoid	MUST be an UBERON term appended with `" (organoid)"` (e.g. `"UBERON:0000955 (organoid)"` for a brain organoid)
Enriched, Sorted, or Isolated Cells from a Tissue	MUST be an UBERON or CL term and SHOULD NOT use terms that do not capture the tissue of origin (e.g. in the case of CD3+ kidney cells, use `"UBERON:0002113"` for kidney instead of `"CL:0000084"` for T cell. However, in the case of EPCAM+ cervical cells, use `"CL:0000066"` for epithelial cell of the cervix.)

When a dataset is uploaded, the cellxgene Data Portal MUST automatically add the matching human-readable name for the corresponding ontology term to the obs dataframe. Curators MUST NOT annotate the following columns.

assay

Key	assay
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `assay_ontology_term_id`. Any clarifying text enclosed in parentheses and appended to `assay_ontology_term_id` MUST be appended to `assay`. For example, if the sci-plex assay was curated as `"EFO:0010183 (sci-plex)"`, then the value would be `"single cell library construction (sci-plex)"`.

cell_type

Key	cell_type
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `cell_type_ontology_term_id`.

development_stage

Key	development_stage
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be `"unknown"` if set in `development_stage_ontology_term_id`; otherwise, this MUST be the human-readable name assigned to the value of `development_stage_ontology_term_id`.

disease

Key	disease
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `disease_ontology_term_id`.

ethnicity

Key	ethnicity
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be `"na"` or `"unknown"` if set in `ethnicity_ontology_term_id`; otherwise, this MUST be the human-readable name assigned to the value of `ethnicity_ontology_term_id`.

organism

Key	organism
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `organism_ontology_term_id`.

sex

Key	sex
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be `"unknown"` if set in `sex_ontology_term_id`; otherwise, this MUST be the human-readable name assigned to the value of `sex_ontology_term_id`.

tissue

Key	tissue
Annotator	Data Portal
Value	categorical with `str` categories. This MUST be the human-readable name assigned to the value of `tissue_ontology_term_id`. `" (cell culture)"` or `" (organoid)"` MUST be appended if present in `tissue_ontology_term_id`. For example, if the `tissue_ontology_term_id` was curated as `"CL:0000057 (cell culture)"`, then the value would be `"fibroblast (cell culture)"`.
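The way the suffix carries over from the term to the label can be sketched as follows. The `names` dictionary is a hypothetical stand-in for a lookup against the pinned ontologies, and `tissue_label` is not the Data Portal's actual implementation.

```python
SUFFIXES = (" (cell culture)", " (organoid)")


def tissue_label(term_value: str, names: dict) -> str:
    """Build the human-readable tissue label, keeping any recognized suffix."""
    for suffix in SUFFIXES:
        if term_value.endswith(suffix):
            base_term = term_value[: -len(suffix)]
            return names[base_term] + suffix
    return names[term_value]


# Hypothetical lookup table; real names come from the pinned ontologies.
names = {"CL:0000057": "fibroblast", "UBERON:0000955": "brain"}
label = tissue_label("CL:0000057 (cell culture)", names)
```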

`var` and `raw.var` (Gene Metadata)

var and raw.var are both of type pandas.DataFrame.

Curators MUST annotate the following columns in the var dataframe and if present, the raw.var dataframe.

feature_biotype

Key	feature_biotype
Annotator	Curator
Value	This MUST be `"gene"` or `"spike-in"`.

index of pandas.DataFrame

Key	index of `pandas.DataFrame`
Annotator	Curator
Value	`str`. If the `feature_biotype` is `"gene"` then this MUST be an ENSEMBL term. If the `feature_biotype` is `"spike-in"` then this MUST be an ERCC Spike-In identifier. The index of the `pandas.DataFrame` MUST contain unique identifiers for features. If present, the index of `raw.var` MUST be identical to the index of `var`.
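A rough check of the index rules above is sketched below. The identifier patterns are assumptions about how ENSEMBL gene identifiers and ERCC identifiers are usually written. The schema itself does not define them, `check_var_index` is a hypothetical helper, and the example identifiers are illustrative.

```python
import re

# Assumed identifier shapes; the schema only names ENSEMBL and ERCC.
PATTERNS = {
    "gene": re.compile(r"^ENS[A-Z]*G\d{11}$"),
    "spike-in": re.compile(r"^ERCC-\d{5}$"),
}


def check_var_index(index: list, biotypes: list) -> list:
    """Return a list of problems found in the feature identifiers."""
    problems = []
    if len(set(index)) != len(index):
        problems.append("index contains duplicate identifiers")
    for feature_id, biotype in zip(index, biotypes):
        pattern = PATTERNS.get(biotype)
        if pattern is None:
            problems.append(f"{feature_id}: unexpected feature_biotype {biotype!r}")
        elif not pattern.match(feature_id):
            problems.append(f"{feature_id}: not a valid identifier for {biotype!r}")
    return problems


index = ["ENSG00000141510", "ENSMUSG00000059552", "ERCC-00002"]
biotypes = ["gene", "gene", "spike-in"]
problems = check_var_index(index, biotypes)
```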

Curators MUST annotate the following column only in the var dataframe. This column MUST NOT be present in raw.var:

feature_is_filtered

Key	feature_is_filtered
Annotator	Curator
Value	`bool`. This MUST be `True` if the feature was filtered out in the final matrix (`X`) but is present in the raw matrix (`raw.X`). The value for all cells of the given feature in the final matrix MUST be `0`. Otherwise, this MUST be `False`.
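The zero-value requirement for filtered features could be verified with a sketch like this, assuming `X` is a dense `numpy` array with cells as rows and features as columns. The helper name is hypothetical.

```python
import numpy as np


def filtered_features_are_zero(X: np.ndarray, feature_is_filtered) -> bool:
    """True when every feature flagged as filtered is 0 for all cells in X."""
    mask = np.asarray(feature_is_filtered, dtype=bool)
    return not np.any(X[:, mask])


X = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 0.0]])
consistent = filtered_features_are_zero(X, [False, True, False])
```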

When a dataset is uploaded, cellxgene Data Portal MUST automatically add the matching human-readable name for the corresponding feature identifier and the inferred NCBITaxon term for the reference organism to the var and raw.var dataframes. Curators MUST NOT annotate the following columns:

feature_name

Key	feature_name
Annotator	Data Portal
Value	`str`. If the `feature_biotype` is `"gene"` then this MUST be the human-readable ENSEMBL gene name assigned to the feature identifier in `var.index`. If the `feature_biotype` is `"spike-in"` then this MUST be the ERCC Spike-In identifier appended with `" (spike-in control)"`.

feature_reference

Key	feature_reference
Annotator	Data Portal
Value	`str`. This MUST be the reference organism for a feature:

Reference Organism	MUST Use
Homo sapiens	`"NCBITaxon:9606"`
Mus musculus	`"NCBITaxon:10090"`
SARS-CoV-2	`"NCBITaxon:2697049"`
ERCC Spike-Ins	`"NCBITaxon:32630"`
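One way the reference organism might be inferred from a feature identifier is sketched below. The identifier prefixes are an assumption about ENSEMBL and ERCC naming and are not stated by the schema. The NCBITaxon values come from the table above, and `infer_feature_reference` is a hypothetical helper.

```python
# Assumed identifier prefixes mapped to the NCBITaxon terms listed above.
REFERENCE_BY_PREFIX = {
    "ENSG": "NCBITaxon:9606",         # Homo sapiens
    "ENSMUSG": "NCBITaxon:10090",     # Mus musculus
    "ENSSASG": "NCBITaxon:2697049",   # SARS-CoV-2
    "ERCC-": "NCBITaxon:32630",       # ERCC Spike-Ins
}


def infer_feature_reference(feature_id: str) -> str:
    """Return the NCBITaxon term for the reference a feature identifier belongs to."""
    for prefix, taxon in REFERENCE_BY_PREFIX.items():
        if feature_id.startswith(prefix):
            return taxon
    raise ValueError(f"cannot infer reference organism for {feature_id!r}")


reference = infer_feature_reference("ENSG00000141510")
```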

`obsm` (Embeddings)

For each str key, obsm stores a numpy.ndarray of shape (n_obs, m), where n_obs is the number of rows in X and m >= 1.

To display a dataset in cellxgene Explorer, Curators MUST annotate one or more two-dimensional (m >= 2) embeddings (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm. The keys for these embeddings MUST be prefixed with "X_". The text that follows this prefix is presented to users in the Embedding Choice selector in cellxgene Explorer.
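The embedding rules above can be sketched as a filter over an `obsm`-like dictionary. The helper `displayable_embeddings` is hypothetical. It returns the text after the "X_" prefix for every entry that has the required shape.

```python
import numpy as np


def displayable_embeddings(obsm: dict, n_obs: int) -> list:
    """Names (without the 'X_' prefix) of embeddings that meet the shape rules."""
    names = []
    for key, value in obsm.items():
        if not key.startswith("X_"):
            continue
        if (
            isinstance(value, np.ndarray)
            and value.ndim == 2
            and value.shape[0] == n_obs
            and value.shape[1] >= 2
        ):
            names.append(key[len("X_"):])
    return names


obsm = {
    "X_tSNE": np.zeros((5, 2)),
    "X_Compartment_tSNE": np.zeros((5, 2)),
    "X_score": np.zeros((5, 1)),  # only one column, so it cannot be displayed
}
choices = displayable_embeddings(obsm, n_obs=5)
```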

To illustrate, the Krasnow Lab Human Lung Cell Atlas, 10X dataset in the A molecular cell atlas of the human lung from single cell RNA sequencing collection defines two embeddings in obsm:

"X_Compartment_tSNE"
"X_tSNE"

Users can then choose which of these embeddings is visualized in cellxgene Explorer.

`uns` (Dataset Metadata)

uns is an ordered dictionary with str keys. Curators MUST annotate the following keys and values in uns:

schema_version

Key	schema_version
Annotator	Curator
Value	This MUST be `"2.0.0"`.

title

Key	title
Annotator	Curator
Value	`str`. This text describes and differentiates the dataset from other datasets in the same collection. It is displayed on a page in the cellxgene Data Portal that also has the collection name. To illustrate, the first dataset name in the Cells of the adult human heart collection is "All — Cells of the adult human heart". It is STRONGLY RECOMMENDED that each dataset `title` in a collection is unique and does not depend on other metadata such as a different `assay` to disambiguate it from other datasets in the collection.

X_normalization

Key	X_normalization
Annotator	Curator
Value	`str`. This SHOULD describe the method used to normalize the data stored in AnnData `X`. If data in `X` are raw, this SHOULD be `"none"`.

Curators MAY also annotate the following optional keys and values in uns. If the key is present, then its value MUST NOT be empty.

batch_condition

Key	batch_condition
Annotator	Curator
Value	`list[str]`. `str` values MUST refer to cell metadata keys in `obs`. Together, these keys define the batches that a normalization or integration algorithm should be aware of. For example, if `"patient"` and `"seqBatch"` are keys of vectors of cell metadata, then `["patient"]`, `["seqBatch"]`, and `["patient", "seqBatch"]` are all valid values.
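A check of this rule might look like the sketch below, where `obs_columns` stands for the column names of `obs`. The helper `check_batch_condition` is hypothetical.

```python
def check_batch_condition(batch_condition, obs_columns) -> bool:
    """True when batch_condition is a non-empty list of str keys present in obs."""
    if not isinstance(batch_condition, list) or not batch_condition:
        return False
    return all(isinstance(key, str) and key in obs_columns for key in batch_condition)


obs_columns = {"patient", "seqBatch", "cell_type_ontology_term_id"}
valid = check_batch_condition(["patient", "seqBatch"], obs_columns)
```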

default_embedding

Key	default_embedding
Annotator	Curator
Value	`str`. The value MUST match a key to an embedding in `obsm` for the embedding to display by default in cellxgene Explorer.

X_approximate_distribution

Key	X_approximate_distribution
Annotator	Curator
Value	`str`. cellxgene runs a heuristic to detect the approximate distribution of the data in X so that it can accurately calculate statistical properties of the data. This field enables the curator to override this heuristic and specify the data distribution explicitly. The value MUST be `"count"` (for data whose distributions are best approximated by counting distributions like Poisson, Binomial, or Negative Binomial) or `"normal"` (for data whose distributions are best approximated by the Gaussian distribution).
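The required and optional `uns` keys described in this section can be checked together, as in the sketch below. It treats `uns` as a plain dictionary and `obsm_keys` as the set of embedding keys. The helper `check_uns` is hypothetical and covers only the rules that can be tested without further context.

```python
ALLOWED_DISTRIBUTIONS = {"count", "normal"}
OPTIONAL_KEYS = ("batch_condition", "default_embedding", "X_approximate_distribution")


def check_uns(uns: dict, obsm_keys) -> list:
    """Return a list of problems found in the dataset metadata."""
    problems = []
    if uns.get("schema_version") != "2.0.0":
        problems.append("schema_version must be '2.0.0'")
    for key in ("title", "X_normalization"):
        if not isinstance(uns.get(key), str):
            problems.append(f"{key} must be a str")
    for key in OPTIONAL_KEYS:
        if key in uns and not uns[key]:
            problems.append(f"{key} is present but empty")
    distribution = uns.get("X_approximate_distribution")
    if distribution and distribution not in ALLOWED_DISTRIBUTIONS:
        problems.append("X_approximate_distribution must be 'count' or 'normal'")
    embedding = uns.get("default_embedding")
    if embedding and embedding not in obsm_keys:
        problems.append("default_embedding does not match a key in obsm")
    return problems


uns = {
    "schema_version": "2.0.0",
    "title": "All cells",  # illustrative title
    "X_normalization": "none",
    "default_embedding": "X_tSNE",
    "X_approximate_distribution": "count",
}
problems = check_uns(uns, obsm_keys={"X_tSNE"})
```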

Appendix A. Changelog

schema v2.0.0 substantially remodeled schema v1.1.0:

"must", "should", and select other words have a defined, standard meaning.
Curators are responsible for annotating ontology and gene identifiers. The cellxgene Data Portal adds the assigned human-readable names for all identifiers.
Documented and pinned the required versions of ontologies and gene annotations used in schema validation.
General Requirements
- AnnData is now the canonical data format. The schema outline and descriptions are AnnData-centric.
- Metazoan multi-organism data is accepted by the cellxgene Data Portal. For data that is neither Human, Mouse, nor SARS-CoV-2, features MUST be translated into orthologous genes from the Human and Mouse gene annotations.
- Policies for reserved names and redundant metadata are documented.
- #45 Updated reference to new PII content
X (matrix layers)
- Added guidance for sparse matrices
- Clarified matrix requirements by assay
obs (cell metadata)
- Empty ontology fields are no longer permitted.
- Moved organism from uns to obs
- Clarified requirements and added detailed guidance for assays, tissue, and development stages
- Added ontology for mouse development stages
- Added ontology for sex
- Added is_primary_data
var
- Replaced HGNC gene symbols as var.index with ENSEMBL or ERCC spike-in identifiers
- Added feature_name, index, and feature_reference
- Added feature_is_filtered
- Added requirements for raw.var which must be identical to var
uns
- Added batch_condition
- Added X_approximate_distribution
- Replaced layer_descriptions with X_normalization
- Replaced version which included corpora_schema_version and corpora_encoding_version with schema_version
- Deprecated tags and default_field presentation metadata
- Removed obs_column_colors
