Add 'equal_frequency' option to highly_variable_genes #572

Closed
wants to merge 2 commits into from

davidhbrann
Contributor

@davidhbrann davidhbrann commented Mar 30, 2019

This fixes #415 by allowing one to find variable genes using the equal_frequency option. It also adds an option to change the number of bins for the cell ranger flavor.

I originally tried to copy the implementation in Seurat, which would allow a test similar to what's already present for the equal_width implementation. However the Seurat code has an error:

else if (binning.method=="equal_frequency") {
        data_x_bin <- cut(x = gene.mean, breaks = c(-1,quantile(gene.mean[gene.mean>0],probs=seq(0,1,length.out=num.bin))))
}

The -1 in the code means that the first bin, which runs from -1 to the minimum value, always contains only one value. I'm not sure why they have this, but it gives different answers, since the Scanpy code in highly_variable_genes always treats bins that contain only one gene as significant (to correct the other error from Seurat, which normally excludes these bins/genes even though they often contain some highly expressed genes). Additionally, the cut function in R sometimes returns bin edges with different rounding than the Seurat implementation, since Seurat does not modify the default dig.lab = 3. In contrast, I believe pandas uses the actual cutoffs in the data.
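As a rough illustration of the first-bin behaviour described above, here is a minimal pandas sketch that mirrors the R `cut`/`quantile` call. The data (`gene_mean`) and the variable names are made up for the example and are not taken from Seurat or Scanpy; it only assumes strictly positive, distinct per-gene means.

```python
import numpy as np
import pandas as pd

# Hypothetical per-gene means: strictly positive and (almost surely) distinct.
rng = np.random.default_rng(0)
gene_mean = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

num_bin = 20
# Quantile edges at seq(0, 1, length.out = num_bin), with -1 prepended,
# mirroring the breaks passed to cut() in the R snippet.
quantile_edges = np.quantile(gene_mean[gene_mean > 0], np.linspace(0, 1, num_bin))
edges = np.concatenate(([-1.0], quantile_edges))

# pd.cut uses right-closed intervals by default, like R's cut().
binned = pd.cut(gene_mean, bins=edges)
codes = binned.cat.codes.to_numpy()
counts = np.bincount(codes[codes >= 0], minlength=len(edges) - 1)

# The first bin is (-1, min], so it holds only the gene with the smallest mean.
print(counts[0], counts[1:].min(), counts[1:].max())
```

With data like this, the first bin holds a single gene while the remaining bins hold roughly equal numbers of genes, which is why a rule that flags single-gene bins interacts badly with the prepended -1.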

Also add an option to change the number of bins for the cell ranger flavor.
Add drop duplicates to avoid raising an error that the bins don't have unique edges. In practice this could make the first bin have twice as many genes. You could also do the Seurat thing with `pd.cut(df['mean'], bins=np.r_[-np.inf, np.percentile(df['mean'], q=np.linspace(0, 100, bins + 1))])`, but I don't know if that's a better approach.
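To make the duplicate-edge issue concrete, the following is a small sketch of equal-frequency binning with duplicate percentile edges dropped. It is not the code from this PR: the helper name `equal_frequency_bins`, the use of `np.unique`, and the synthetic data (100 genes with mean 0 out of 1000) are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def equal_frequency_bins(means: pd.Series, n_bins: int = 20) -> pd.Series:
    """Bin values at evenly spaced percentiles (hypothetical helper)."""
    edges = np.percentile(means, np.linspace(0, 100, n_bins + 1))
    # Repeated values (e.g. many zeros) produce repeated edges, which
    # pd.cut rejects, so keep only the unique edges.
    edges = np.unique(edges)
    return pd.cut(means, bins=edges, include_lowest=True)

# Synthetic means: 10% exact zeros, the rest positive and distinct.
rng = np.random.default_rng(0)
means = pd.Series(np.concatenate([np.zeros(100), rng.lognormal(size=900)]))

binned = equal_frequency_bins(means, n_bins=20)
bin_counts = np.bincount(binned.cat.codes.to_numpy(),
                         minlength=len(binned.cat.categories))

# One duplicate edge was dropped, so there are 19 bins instead of 20,
# and the first bin absorbs all the zero-mean genes.
print(len(binned.cat.categories), bin_counts[0], bin_counts[1])
```

In this setup the 0th and 5th percentiles are both 0, so one edge is dropped and the first bin ends up with about twice as many genes as the others, which matches the caveat in the commit message.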
@falexwolf
Member

Great, thank you!

It will be merged very soon. The change to the cell ranger option might be a backwards-incompatible change, hence I added this here: #453.

Alex

@gokceneraslan
Collaborator

gokceneraslan commented May 4, 2019

Hey @falexwolf, if starting the percentile bins from 10 is not intentional (as mentioned in #624), I can resolve the conflicts and merge this along with the fix in #624.

@falexwolf
Member

Sounds good, @gokceneraslan!

Successfully merging this pull request may close these issues.

equal_frequency bins in highly_variable_genes