Add 'equal_frequency' option to highly_variable_genes #572

davidhbrann · 2019-03-30T02:05:18Z

This fixes #415 by allowing one to find variable genes using the equal_frequency option. It also adds an option to change the number of bins for the cell ranger flavor.

I originally tried to copy the implementation in Seurat, which would allow a test similar to what's already present for the equal_width implementation. However, the Seurat code has an error:

else if (binning.method=="equal_frequency") {
        data_x_bin <- cut(x = gene.mean, breaks = c(-1,quantile(gene.mean[gene.mean>0],probs=seq(0,1,length.out=num.bin))))
}

The -1 in the breaks means that the first bin, which runs from -1 to the minimum value, always contains exactly one gene. I'm not sure why they have this, but it gives different answers, because the Scanpy code in highly_variable_genes always marks genes in single-gene bins as significant (to correct the other Seurat error that normally excludes these bins/genes, which often contain some highly expressed genes). Additionally, the cut function in R sometimes returns bin edges with different rounding, since Seurat does not modify the default dig.lab = 3. In contrast, I believe pandas uses the actual cutoffs in the data.
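To make the single-gene first bin concrete, here is a minimal Python sketch that mirrors the R breaks above with NumPy and pandas. The data are synthetic and the names `gene_mean`, `num_bin`, and `breaks` are illustrative only; this is not the Seurat or Scanpy code itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical per-gene means: strictly positive and distinct.
gene_mean = pd.Series(rng.lognormal(size=1000))

num_bin = 20
# Mirror of the R breaks: c(-1, quantile(gene.mean[gene.mean > 0], probs = seq(0, 1, length.out = num.bin)))
quantiles = np.quantile(gene_mean[gene_mean > 0], np.linspace(0, 1, num_bin))
breaks = np.r_[-1, quantiles]

# pd.cut uses right-closed intervals by default, like R's cut().
binned = pd.cut(gene_mean, bins=breaks)
counts = np.bincount(binned.cat.codes.to_numpy(), minlength=len(breaks) - 1)

# The first interval is (-1, min], so it holds only the gene with the smallest mean.
print(counts[0], counts[1:])
```

Because the lowest quantile equals the minimum and the intervals are closed on the right, only the smallest gene falls into the first bin, while the remaining genes are spread over the other 19 bins.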

Also add option to change number of bins for cell ranger flavor.

I'm adding drop duplicates to avoid raising an error that the bin edges are not unique. In practice this could make the first bin have twice as many genes. You could also do the Seurat thing with `pd.cut(df['mean'], bins=np.r_[-np.inf, np.percentile(df['mean'], q=np.linspace(0, 100, bins + 1))])`, but I don't know if that's a better approach.
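A small sketch of the duplicate-edge situation, assuming the drop is done through the `duplicates="drop"` argument of `pd.cut` (the PR may do it differently). The data are made up: many genes share the same mean of zero, so several percentile edges coincide.

```python
import numpy as np
import pandas as pd

# Hypothetical gene means: 300 exact zeros followed by 700 distinct positive values.
means = pd.Series(np.r_[np.zeros(300), np.linspace(0.1, 5.0, 700)])
n_bins = 20

edges = np.percentile(means, np.linspace(0, 100, n_bins + 1))

# Repeated edges make pd.cut raise unless duplicates are dropped.
try:
    pd.cut(means, bins=edges, include_lowest=True)
    raised = False
except ValueError:
    raised = True

# Dropping the repeated edges merges the affected bins into one larger first bin.
binned = pd.cut(means, bins=edges, include_lowest=True, duplicates="drop")
n_actual_bins = len(binned.cat.categories)
counts = np.bincount(binned.cat.codes.to_numpy(), minlength=n_actual_bins)

print(raised, n_actual_bins, counts[0])
```

In this toy case all zero-mean genes end up together in the first bin, which is why that bin can be much larger than the others and why fewer than `n_bins` bins remain.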

falexwolf · 2019-03-31T19:40:33Z

Great, thank you!

It will be merged very soon. The change in the cell ranger option might be a backwards-incompatible change, so I added it here: #453.

Alex

gokceneraslan · 2019-05-04T17:58:06Z

Hey @falexwolf, if starting the percentile bins from 10 is not intentional (as mentioned in #624), I can resolve the conflicts and merge this along with the fix in #624.
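For readers unfamiliar with the percentile remark, here is a rough sketch of its effect. It assumes, based only on the comment above, that the cell ranger flavor builds bin edges from percentiles 10, 15, ..., 100; the helper `bin_sizes` and the synthetic data are hypothetical and not taken from the Scanpy source.

```python
import numpy as np
import pandas as pd

# Hypothetical per-gene means with distinct values.
means = pd.Series(np.random.default_rng(0).lognormal(size=2000))

# Assumed edges: percentiles starting at 10 versus starting at 5, both in steps of 5.
edges_from_10 = np.r_[-np.inf, np.percentile(means, np.arange(10, 105, 5)), np.inf]
edges_from_5 = np.r_[-np.inf, np.percentile(means, np.arange(5, 105, 5)), np.inf]


def bin_sizes(values, edges):
    """Count how many values fall into each right-closed bin."""
    codes = pd.cut(values, bins=edges).cat.codes.to_numpy()
    return np.bincount(codes, minlength=len(edges) - 1)


sizes_from_10 = bin_sizes(means, edges_from_10)
sizes_from_5 = bin_sizes(means, edges_from_5)

print(sizes_from_10[:3], sizes_from_5[:3])
```

Under that assumption, starting at the 10th percentile gives the lowest bin about twice as many genes as each following bin, whereas starting at the 5th percentile makes the bins equally populated.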

falexwolf · 2019-05-06T08:39:28Z

Sounds good, @gokceneraslan!

davidhbrann added 2 commits March 29, 2019 21:53

Add 'equal_frequency' option to highly_variable_genes

4e676e2

Also add option to change number of bins for cell ranger flavor.

falexwolf mentioned this pull request Mar 31, 2019

TODO: Backwards-compat breaking changes #453

Open

15 tasks

davidhbrann mentioned this pull request Oct 30, 2019

n_bins not respected in highly_variable_genes(..., flavour='cell_ranger') #888

Open

cyrus303 approved these changes Dec 19, 2019

falexwolf force-pushed the master branch from aa3acd7 to fd4bc99 Compare December 30, 2019 00:53

davidhbrann closed this Mar 24, 2021
