Together with Student’s t-test, Pearson’s ‘r’ correlation coefficient has made its way into every statistics textbook. While the test seems straightforward to apply, things are never as easy as they might seem.
The Wikipedia entry for Pearson’s correlation coefficient is rather explicit, and defines it as
“a measure of the linear correlation (dependence) between two variables X and Y”
One important part of this definition is the term “linear”: if variables X and Y depend on each other in a nonlinear manner, the coefficient can understate the dependence or miss it entirely. In addition, the standard significance test for the correlation assumes that the two variables are normally distributed.
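To make the “linear” caveat concrete, here is a minimal sketch on made-up, deterministic data (the vectors `x`, `y_quad` and `y_exp` are illustrative and have nothing to do with the genomic example below). In the first case y is fully determined by x, yet Pearson’s r is essentially zero; in the second case the relation is monotonic but curved, so r drops below 1 while Kendall’s tau stays at 1.

```r
# Deterministic toy data: x is symmetric around 0.
x <- seq(-1, 1, length.out = 201)

# Case 1: y depends entirely on x, but the relation is not monotonic.
y_quad <- x^2
r_quad <- cor(x, y_quad, method = "pearson")  # essentially 0 (floating-point error only)

# Case 2: y increases strictly with x, but far from linearly.
y_exp   <- exp(5 * x)
r_exp   <- cor(x, y_exp, method = "pearson")  # clearly below 1
tau_exp <- cor(x, y_exp, method = "kendall")  # 1: every pair of points is concordant

print(c(r_quad = r_quad, r_exp = r_exp, tau_exp = tau_exp))
```

Note that a rank-based coefficient only rescues the second case: for a non-monotonic relation such as the first one, Kendall’s tau is close to zero as well.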
These assumptions (normality of both variables plus linearity) are unlikely to be met with genomic data. I show here, with a real-life research example, how using Pearson’s coefficient can be misleading. First let’s have a look at the data:
From this plot, it is obvious that the two variables ‘x’ and ‘y’ are not linked in a linear manner. Neither of them is normally distributed:
And a Pearson correlation test between the two gives the following results:
        Pearson's product-moment correlation

    data:  dat$x and dat$y
    t = -8.2883, df = 3153, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.1800034 -0.1116973
    sample estimates:
           cor
    -0.1460244
that is, a highly significant negative correlation! By contrast, Kendall’s tau, a rank-based correlation test that assumes neither linearity nor normality, does not report any significant association:
        Kendall's rank correlation tau

    data:  dat$x and dat$y
    z = 0.255, p-value = 0.7987
    alternative hypothesis: true tau is not equal to 0
    sample estimates:
            tau
    0.003095408
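The two outputs above come from `cor.test()` with `method = "pearson"` and `method = "kendall"`. The sketch below runs the same two calls on a small made-up data set (not the data of this post) to show one way such a disagreement can arise: 50 points without any real trend, plus a single extreme point. This is only an illustration of the mechanism, under the assumption that a few extreme values dominate the moments; it is not a claim about what happens in the genomic data.

```r
# Toy data, NOT the data used in this post.
# 50 "bulk" points where y just alternates between 1 and 2,
# plus one extreme point with a very large x and a very low y.
x <- c(1:50, 1000)
y <- c(rep(c(1, 2), times = 25), -100)

# Pearson's test works on the raw values, so the single extreme point
# dominates the covariance.
pearson <- cor.test(x, y, method = "pearson")

# Kendall's test only uses the ordering of the values.
# exact = FALSE: y contains ties, so the normal approximation is used.
kendall <- cor.test(x, y, method = "kendall", exact = FALSE)

print(c(r   = unname(pearson$estimate), p_r   = pearson$p.value))  # strongly negative r
print(c(tau = unname(kendall$estimate), p_tau = kendall$p.value))  # tau close to zero
```

Because the rank of the extreme point is all that matters to Kendall’s tau, how far out it lies has no influence on the result.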
We can see a similar effect with a linear regression (y ~ x):
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  1.984e-01  9.839e-03  20.164   <2e-16 ***
    dat$x       -1.466e-05  1.768e-06  -8.288   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ‘x’ variable is found to have a highly significant effect on variable ‘y’. Yet the conditions of the linear model are not met either, since the residuals are far from normally distributed. Using a Box-Cox transform improves things, though not to a satisfactory extent:
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -3.382e+00  9.917e-02 -34.106  < 2e-16 ***
    dat$x        5.223e-05  1.782e-05   2.931  0.00341 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value remains significant, but much less so than before, and the estimated slope has changed sign (from -1.466e-05 to 5.223e-05).
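For readers who want to try a Box-Cox transform themselves, here is a minimal sketch. It assumes the `boxcox()` function from the MASS package (a recommended package bundled with most R installations) and uses simulated data; the data frame `d`, its columns and the simulation settings are illustrative, and this is not necessarily how the transform above was obtained. Box-Cox requires a strictly positive response.

```r
library(MASS)  # provides boxcox()

# Simulated data (an assumption for illustration): a strictly positive,
# right-skewed response, so the residuals of a plain linear model are skewed.
set.seed(42)
n <- 500
d <- data.frame(x = runif(n, 0, 10))
d$y <- exp(0.2 * d$x + rnorm(n))

# Fit on the raw scale; y = TRUE stores the response in the fitted object.
fit_raw <- lm(y ~ x, data = d, y = TRUE)

# Profile the log-likelihood over the default lambda grid (-2 to 2 by 0.1)
# and keep the grid value with the highest likelihood.
bc <- boxcox(fit_raw, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the transform; lambda = 0 corresponds to the log.
d$y_bc <- if (abs(lambda) < 1e-8) log(d$y) else (d$y^lambda - 1) / lambda
fit_bc <- lm(y_bc ~ x, data = d)

# Compare the normality of the residuals before and after the transform.
p_raw <- shapiro.test(residuals(fit_raw))$p.value
p_bc  <- shapiro.test(residuals(fit_bc))$p.value
print(c(lambda = lambda, p_raw = p_raw, p_bc = p_bc))
```

On real data it is worth checking the residuals of the transformed model as well (for instance with `qqnorm()`), since the transform may only partly fix the problem, as is the case here.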
Conclusion: to test the correlation of two variables, it is safer to use non-parametric tests such as Kendall’s tau. Given the typical size of genomic data sets, the loss of power due to the use of non-parametric tests is usually negligible, while the cost of violated assumptions can be extremely high. In some cases models are unavoidable (for instance to test more than one explanatory factor), but this is another story…
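The power argument can be probed with a small simulation. The sketch below is only a rough check under assumed settings (bivariate normal data, a true correlation of 0.3, samples of size 200, 200 replicates, a 5% threshold), that is, the situation in which Pearson’s test is at its best. All of these values are arbitrary choices made for illustration.

```r
set.seed(1)
n     <- 200   # sample size per replicate (assumed)
rho   <- 0.3   # true correlation (assumed)
n_rep <- 200   # number of simulated data sets (assumed)

# For each replicate, simulate correlated normal data and record
# whether each test rejects the null hypothesis at the 5% level.
reject <- replicate(n_rep, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  c(pearson = cor.test(x, y, method = "pearson")$p.value < 0.05,
    kendall = cor.test(x, y, method = "kendall")$p.value < 0.05)
})

# Estimated power: the fraction of replicates in which each test rejects.
power <- rowMeans(reject)
print(power)
```

Lowering `n` or `rho` shows where the two tests start to differ; with sample sizes in the thousands, as in the example of this post, both are expected to detect even weak associations.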