Hoeffding's test for dependence was proposed by Wassily Hoeffding (1948) as a test of independence for two variables with continuous distribution functions. Hoeffding's D is a nonparametric measure of the distance between the joint distribution F(x, y) and the product of the marginal distributions F1(x)F2(y). The advantage of this statistic is that it has more power to detect non-monotonic dependency structures than the more common measures (Pearson, Kendall, Spearman).
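For reference, the population quantity behind the statistic (stated here up to the usual scaling convention, following the Hmisc documentation linked at the end) is

D = 30 ∫ (F(x, y) − F1(x)·F2(y))² dF(x, y)

which is 0 under independence and, for continuous distributions, lies between −0.5 and 1.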
The following graph illustrates what we can expect to discover by using more advanced correlation coefficients.
No proper Python implementation has existed so far. The original algorithm has higher complexity (O(n²)) than popular correlation coefficients like Pearson or Spearman, and it was designed for the statistics of the second half of the 20th century (meaning sample sizes on the order of ~100 points). Nowadays, in the "big data" era, it is common to search for relations between time series with millions of points. One could argue that this is not always relevant (e.g. without stationarity), but the move toward automated analytical tools favours such a development, and the remark applies to other correlation coefficients anyway. This coefficient is available in some premium analytics software (SAS, JMP; in an efficient implementation), in R (which is historically more dedicated to statistics), and in Matlab (for small data), but it has been unavailable in Python. So we propose here an efficient implementation in Python 3, callable as a method by the pandas DataFrame corr() function.
The development steps were the following:
- Starting algorithm in Matlab
- Rough adaptation to Python
- Code optimisation, keeping O(n²) complexity (33x faster on 1000 points -> 2.40 ms; a naive reference sketch of the statistic follows this list)
- Constrained algorithm -> binning of the inputs (50 bins) and resampling if n > 100000 (an acceptable approximation)
- Support for DataFrame input
- Compatibility with the pandas corr() function (from v0.24)
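As a point of reference, here is a minimal, unoptimised O(n²) sketch of the D statistic following the formula in the SAS documentation linked below (ranks R and S, bivariate ranks Q with the usual tie weights). The actual XtendedCorrel code is vectorised and adds the binning/resampling constraints listed above, so treat this only as an illustration of the computation, not as the shipped implementation.

import numpy as np
from scipy.stats import rankdata

def hoeffding_d_sketch(x, y):
    # Naive O(n²) Hoeffding's D per the SAS formula (illustration only)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)  # the formula requires n >= 5
    R, S = rankdata(x), rankdata(y)  # midranks of each variable
    # Bivariate rank Q_i: 1 + number of points strictly below (x_i, y_i)
    # on both coordinates; ties weighted 1/2 (one coordinate) or 1/4 (both)
    Q = np.empty(n)
    for i in range(n):
        Q[i] = (1
                + np.sum((x < x[i]) & (y < y[i]))
                + 0.5 * np.sum((x == x[i]) & (y < y[i]))
                + 0.5 * np.sum((x < x[i]) & (y == y[i]))
                + 0.25 * (np.sum((x == x[i]) & (y == y[i])) - 1))
    D1 = np.sum((Q - 1) * (Q - 2))
    D2 = np.sum((R - 1) * (R - 2) * (S - 1) * (S - 2))
    D3 = np.sum((R - 2) * (S - 2) * (Q - 1))
    return (30.0 * ((n - 2) * (n - 3) * D1 + D2 - 2 * (n - 2) * D3)
            / (n * (n - 1) * (n - 2) * (n - 3) * (n - 4)))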
The graph below gives the measured coefficients for the tested relations. The interesting point is the discriminative behaviour between a complex non-monotonic relation and pure independence.
-> Download & Import XtendedCorrel.py
import numpy as np
import pandas as pd
from XtendedCorrel import hoeffding

x = np.random.randint(70, size=200000)
y = np.random.randint(70, size=200000) * x
df = pd.DataFrame({'x': x, 'y': y})

# Directly on NumPy arrays
hoeffding(x, y)
# out: 0.19021676396645915

# Pandas DataFrame as input
hoeffding(df)
# out:
#           x         y
# x  0.944652  0.188405
# y  0.188405  0.950448

# Note: hoeffding(x, x) is not equal to 1.
# From the SAS documentation: "When ties occur, the statistic might result
# in a smaller value. That is, for a pair of variables with identical
# values, the Hoeffding's D statistic might be less than 1."
# With pandas version >= 0.24, hoeffding can be passed as a callable method
# to the pandas corr() function, similar to df.corr("pearson") or df.corr("spearman")
df.corr(method=hoeffding)
# out:
#           x         y
# x  1.000000  0.188405
# y  0.188405  1.000000
# Here hoeffding(x, x) equals 1: the diagonal is forced to 1 by the corr() function
https://stackoverflow.com/questions/9270496/ideas-for-gpu-implementation-of-hoeffdings-d-dependence-coefficient/9322657#9322657 -> code in Matlab
http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_corr_details07.htm -> formula proposed by SAS; Hoeffding's D is offered as a nonparametric correlation method in the SAS and JMP software
pandas-dev/pandas#22086 -> a feature request for Hoeffding's D in pandas
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539721/ -> more explanations on the need for non-Pearson correlation methods
http://math.furman.edu/~dcs/courses/math47/R/library/Hmisc/html/hoeffd.html -> statistical explanation (R Hmisc hoeffd documentation)
https://stats.stackexchange.com/questions/20011/can-the-mic-algorithm-for-detecting-non-linear-correlations-be-explained-intuiti -> more discussion and comparison with the MIC algorithm (Maximal Information Coefficient)
https://journal.r-project.org/archive/2018/RJ-2018-057/RJ-2018-057.pdf
https://www4.stat.ncsu.edu/~boos/library/mimeo.archive/ISMS_1994_2310.pdf -> essay on the contribution of Wassily Hoeffding to statistics and probability