Forget about Pearson

Together with Student’s t-test, Pearson’s ‘r’ correlation coefficient has made its way into every statistics textbook. While it seems straightforward to apply, things are never as easy as they might seem.

The Wikipedia entry for Pearson’s correlation coefficient is rather explicit and defines it as

“a measure of the linear correlation (dependence) between two variables X and Y”

One important part of this definition is the term “linear”: if variables X and Y depend on each other, but in a non-linear manner, the test will fail. In addition, assessing the significance of the correlation requires the two variables to be normally distributed.
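As a quick aside, here is a toy illustration (simulated data, not the example analysed below) of how badly Pearson’s r can behave when these assumptions are violated: two independent, heavy-tailed variables sharing a single extreme value are enough to make Pearson’s test “highly significant”, while Kendall’s rank-based test is barely affected.

# Toy simulation: two independent heavy-tailed variables plus one extreme point
set.seed(42)
x <- c(rexp(100), 50)
y <- c(rexp(100), 50)
cor.test(x, y)                      # Pearson: large r, very small p-value
cor.test(x, y, method = "kendall")  # Kendall: tau close to 0, non-significant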

These assumptions (normality of both variables plus linearity) are unlikely to be met with genomic data. I show here, with a real-life research example, how using Pearson’s coefficient can be misleading. First, let’s have a look at the data:

[Figure: example data to be tested for correlation]

From this plot, it is obvious that the two variables ‘x’ and ‘y’ are not linked in a linear manner. They are not normally distributed either:

[Figure: distributions of the ‘x’ and ‘y’ variables]
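For the record, plots like the two above can be obtained with base R; the snippet below is a sketch assuming the data sit in a data frame called ‘dat’ with columns ‘x’ and ‘y’ (the layout suggested by the test output further down).

# Scatter plot of the two variables, then their marginal distributions
plot(dat$x, dat$y, xlab = "x", ylab = "y")
par(mfrow = c(1, 2))
hist(dat$x, main = "x")
hist(dat$y, main = "y")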

And a Pearson correlation test between the two gives the following results:

	Pearson's product-moment correlation

data:  dat$x and dat$y
t = -8.2883, df = 3153, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1800034 -0.1116973
sample estimates:
       cor 
-0.1460244 

that is, a highly significant negative correlation! By contrast, Kendall’s tau, a rank-based correlation test which assumes neither linearity nor normality, does not report any significant association:

	Kendall's rank correlation tau

data:  dat$x and dat$y
z = 0.255, p-value = 0.7987
alternative hypothesis: true tau is not equal to 0
sample estimates:
        tau 
0.003095408
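Both outputs above are printed by R’s cor.test(); presumably they come from calls along these lines:

cor.test(dat$x, dat$y)                      # Pearson (default method)
cor.test(dat$x, dat$y, method = "kendall")  # Kendall's rank correlation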

We can see a similar effect with a linear regression (y ~ x):

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.984e-01  9.839e-03  20.164   <2e-16 ***
dat$x       -1.466e-05  1.768e-06  -8.288   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
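That coefficient table is what summary() prints for a simple linear model; a call like the following should reproduce it:

# Ordinary least-squares regression of y on x
fit <- lm(dat$y ~ dat$x)
summary(fit)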

The ‘x’ variable is found to have a highly significant effect on variable ‘y’. Yet the conditions of the linear model are not met either, the residuals being far from normally distributed. Using a Box-Cox transform improves things, though not to the extent of being satisfactory:
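As a sketch of that step (assuming, as above, a data frame ‘dat’ with a strictly positive ‘y’; whether exactly this transform was used here is an assumption), the Box-Cox profile can be obtained with MASS::boxcox and the response transformed with the best-supported lambda:

library(MASS)

fit <- lm(dat$y ~ dat$x)                # original, poorly-behaved fit
bc <- boxcox(fit, plotit = FALSE)       # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]         # lambda with the highest likelihood

# Box-Cox transform of the response (plain log when lambda is ~0)
y_bc <- if (abs(lambda) < 1e-6) log(dat$y) else (dat$y^lambda - 1) / lambda
fit_bc <- lm(y_bc ~ dat$x)              # refit on the transformed response

# Compare residual QQ-plots before and after the transform
par(mfrow = c(1, 2))
qqnorm(residuals(fit));    qqline(residuals(fit))
qqnorm(residuals(fit_bc)); qqline(residuals(fit_bc))
summary(fit_bc)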

[Figure: QQ-plots of the model residuals]

While the normalization procedure does not do miracles, its effect is visible in the result of the test:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.382e+00  9.917e-02 -34.106  < 2e-16 ***
dat$x        5.223e-05  1.782e-05   2.931  0.00341 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value is still significant, but far less extreme than before.

Conclusion: in order to test the correlation of two variables, it is safer to use non-parametric tests such as Kendall’s tau. Given the typical size of genomic data sets, the loss of power due to the use of non-parametric tests is usually negligible, while the cost of violating the underlying assumptions can be extremely high. In some cases models are unavoidable (for instance to test more than one explanatory factor), but this is another story…
