Not very good at blogging, for sure. But here is a trick I just found, after fighting with the problem for a long time: how to align box plots and violin plots in ggplot2? Nothing really tricky a priori, thanks to the great “geom” system of ggplot2, but problems arise when grouping factors are involved: geom_boxplot will “stack” boxes side by side within a group, while geom_violin will set them apart. I finally found out that this is due to different default “dodging” behaviors, which can be fixed as follows:

p <- ggplot(data = dat, aes(x=group1, y=value))
p <- p + geom_violin(aes(fill=group2), position = position_dodge(width = 0.8))
p <- p + geom_boxplot(notch = FALSE, alpha = 0.5, aes(colour = group2), position = position_dodge(width = 0.8), width = 0.5)
p <- p + theme_bw() + guides(color = "none")
p

Sorry, no picture… but if you have come across that problem, you know what I am talking about. The trick is “position = position_dodge(width = 0.8)”!
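
If you want to try it out, here is a minimal mock data set (made-up group names and random values, just enough to reproduce the layout) on which the code above can be run:

set.seed(42)
dat <- data.frame(
  group1 = rep(c("A", "B"), each = 300),
  group2 = rep(rep(c("t1", "t2", "t3"), each = 100), 2),
  value  = rnorm(600)
)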

Jumping between genomes

When it comes to comparative genomics, a very common task is to look at homologous positions in several genomes. While this may sound like a trivial task, in practice it is not! Two challenges have to be faced when attempting to do so:

  1. The genomes to be compared have to be aligned,
  2. Looking up individual coordinates can become very inefficient when several thousand coordinates are to be searched.

Several tools have been proposed for tackling this issue. The most versatile one is the LiftOver utility from the UCSC genome browser utilities. Using LiftOver requires multiple steps, however, and several (rather complex) protocols are available online. One important note is that LiftOver should in principle be used only for comparing distinct assemblies of the same genome, not for comparing genomes from distinct species, for instance.

Confronted with this task, I recently implemented a simple approach in our MafFilter software (http://biopp.univ-montp2.fr/forge/maffilter). MafFilter can do all sorts of processing of MAF files. With this new tool, one can simply convert coordinates from one species to another based on a pairwise genome alignment of the two species (provided, of course, as a MAF file!). MafFilter takes coordinates from feature files in GFF, GTF or BedGraph format, and outputs a table with the original coordinates and the translated ones. The syntax is rather short:

DATA=pairwise_aln
input.file=$(DATA).maf.gz
input.file.compression=gzip
input.format=Maf
output.log=$(DATA).maffilter-liftover.log
maf.filter= \
    LiftOver( \
        ref_species=species1, \
        target_species=species2, \
        feature.file=features_species1.bed, \
        feature.file.compression=none, \
        feature.format=BedGraph, \
        file=sp1_to_sp2.tln)
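
Assuming the options above are saved in a file named liftover.bpp (a file name I made up), the program is then run following the usual Bio++ convention of passing the option file via the param argument:

maffilter param=liftover.bpp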

One will typically want to combine the translated coordinates with the original features, for instance in R:

features <- read.table("features_species1.bed", col.names = c("Chr", "Start", "Stop", "Strand"))
tln <- read.table("sp1_to_sp2.tln", header = TRUE)
# The columns in by.x and by.y must be listed in matching order:
features2 <- merge(features, tln, by.x = c("Chr", "Start", "Stop", "Strand"), by.y = c("chr.ref", "begin.ref", "end.ref", "strand.ref"), all = TRUE)

The conversion is relatively fast, even with hundreds of thousands of features to convert. The only requirement is a MAF file sorted according to the reference species (for instance using maf_project from the TBA package). Pairwise alignments can be obtained with BlastZ, for instance.

Turn your workstation into a mini-grid (with Slurm)

Most modern desktop workstations come with quite nice configurations: multiple cores, hyperthreading… To get the best out of such configurations, it is very convenient to set them up as a grid. This makes it possible to queue tasks, a most convenient feature to empower your multitasking skills 🙂

There is the good old Sun Grid Engine (SGE) and its ‘qsub’ command. But now that Sun has been bought by Oracle, and Oracle has gotten rid of SGE, the task-scheduling landscape has become a bit confused:

  • The last version of SGE is still around. Packages for Debian and Ubuntu are still gathering dust in the repositories. For Ubuntu at least, some additional fonts were needed to get the thing working. In the latest release (14.04), even this trick does not seem to work any longer. The command line version still works, however.
  • A fork of the latest open-source version of SGE can be found on SourceForge (Open Grid Scheduler). It has to be compiled from source, however, a task that is far from straightforward due to the numerous dependencies.
  • Another fork, named Son of Grid Engine (!), is available from the University of Liverpool. ‘deb’ and ‘rpm’ packages are provided.

As an alternative, I will blog here about Slurm, which stands for Simple Linux Utility for Resource Management. Slurm is famous enough that many posts are already dedicated to it in the blogosphere, so here is a short wrap-up:

Installation:

Packages for Ubuntu and Debian are available:

sudo apt-get install slurm-llnl

You will also need the munge software, available from the same repositories:

sudo apt-get install munge

Generating the configuration file

The installation comes with a couple of HTML pages that help you generate the configuration file. They can be found at:

/usr/share/doc/slurm-llnl/slurm-llnl-configurator.easy.html
/usr/share/doc/slurm-llnl/slurm-llnl-configurator.html

Just open one of them in your web browser, and start filling in the required fields. Default options are provided. To get information about your particular machine, you can run

slurmd -C

This comes in handy for the last part of the configuration file. Then save the resulting file as

/etc/slurm-llnl/slurm.conf
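
On a single 20-core workstation, the node and partition definitions at the end of the file will look something like the following sketch (the host name and memory figure are made up; use the actual values reported by slurmd -C):

ControlMachine=myworkstation
NodeName=myworkstation CPUs=20 RealMemory=64000 State=UNKNOWN
PartitionName=local Nodes=myworkstation Default=YES MaxTime=INFINITE State=UP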

Generating the munge key

This is done with the command

sudo /usr/sbin/create-munge-key

For some reason, a permission change is needed to avoid warnings later on:

sudo chmod g-w /var/log
sudo chmod g-w /var/log/munge

Starting services

With the commands:

/etc/init.d/slurm-llnl start
/etc/init.d/munge start

Multi-core, multi-node and multi-prog

There we are… now comes the main topic of this post. Slurm has a major limitation: as opposed to SGE, it is not meant to be run on a single machine. One machine, be it with several cores, will be considered as a single node, as it runs a single instance of the slurmd daemon. (Note: there seems to exist a mode to circumvent this, which needs to be enabled at compilation time… maybe the topic of a later post.)

Fortunately, Slurm has a “multi-prog” mode, allowing several programs to be launched in one run. I illustrate here how it can be used to mimic the behavior of several nodes on a single machine.

In this example, one would like to run 2000 independent analyses: typically, one program execution on each of 2000 data sets, corresponding to 2000 genes for instance. This is conveniently achieved using a job array (just like with SGE). In Slurm, job arrays are set up using the line:

#SBATCH --array=1-2000

With only one node, the array is going to execute one job after the other, not making use of the multiple cores. If we have 20 cores available, we can decide to run 20 genes simultaneously. This is achieved using the multi-prog option, via a special configuration file listing the 20 programs to run. The trick is then to have the job array generate this file for you:

#!/bin/bash
# file slurm_example.sh
# run with sbatch -o slurm-%A_%a.out slurm_example.sh
#SBATCH --job-name=slurm_example
#SBATCH --output=slurm_example.txt
#SBATCH --array=1-2000:20
#SBATCH --ntasks=20

#Create the multi-prog configuration file for the current batch of 20 genes:
rm -f multi.conf
for ((i=0; i < 20; i++)); do
  #Get the gene name on line SLURM_ARRAY_TASK_ID + i of genes.txt:
  GENE=`sed "$((SLURM_ARRAY_TASK_ID + i))q;d" genes.txt`
  echo $GENE
  #One line per task: the task rank, then the command to run:
  echo "$i myprog $GENE" >> multi.conf
done

srun --multi-prog multi.conf

A few clarifications:

  • The file multi.conf is generated on the fly and contains the 20 current execution lines (see the example below).
  • Note the syntax of the job array, with a step of 20.
  • The --ntasks=20 option is required to say that 20 tasks will be run simultaneously.
  • In this example, I assume that all 2000 gene names are stored, one per line, in a file named “genes.txt”. The nth gene is retrieved using a sed command.
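
For illustration, for the first batch the generated multi.conf would look something like this (the gene names are made up; the first column is the task rank, the rest is the command line for that task):

0 myprog gene0001
1 myprog gene0002
...
19 myprog gene0020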

Conclusion

This small trick will allow you to make good use of your 20 (or more) cores. Yet there are limitations:

  • The next batch of 20 tasks will only be started once the current 20 have finished. This may be a serious limitation in the case of unbalanced tasks, with some taking much more time than others (genes of different sizes, for instance).
  • It is not possible to launch one analysis using 10 cores and, simultaneously, another one using the 10 remaining cores.

Forget about Pearson

Together with Student’s t-test, Pearson’s ‘r’ correlation coefficient has made its way into every statistical textbook. While the test seems straightforward to apply, things are never as easy as they might seem.

The Wikipedia entry for Pearson’s correlation coefficient is rather explicit, and defines it as

“a measure of the linear correlation (dependence) between two variables X and Y”

One important part of this definition is the term “linear”: if variables X and Y are dependent on each other, but in a non-linear manner, the test will fail. In addition, assessing the significance of the correlation requires the two variables to be normally distributed.
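
As a reminder, the coefficient is the covariance of the two variables scaled by the product of their standard deviations, estimated from a sample as

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}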

All these assumptions (normality of both variables plus linearity) are unlikely to be met with genomic data. I show here, with a real-life research example, how using Pearson’s coefficient can be misleading. First, let’s have a look at the data:

[Figure: example data to be tested for correlation]

From this plot, it is obvious that the two variables ‘x’ and ‘y’ are not linked in a linear manner. They are not normally distributed either:

[Figure: distributions of the ‘x’ and ‘y’ variables]

And a Pearson correlation test between the two gives the following results. The call is simply (assuming the data are stored in a data frame named dat, as the output below suggests):
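
cor.test(dat$x, dat$y, method = "pearson")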

	Pearson's product-moment correlation

data:  dat$x and dat$y
t = -8.2883, df = 3153, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1800034 -0.1116973
sample estimates:
       cor 
-0.1460244 

that is, a highly significant negative correlation! On the contrary, Kendall’s tau, a rank-based correlation test which assumes neither linearity nor normality, does not report any significant association. The call is the same, with a different method argument:
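
cor.test(dat$x, dat$y, method = "kendall")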

	Kendall's rank correlation tau

data:  dat$x and dat$y
z = 0.255, p-value = 0.7987
alternative hypothesis: true tau is not equal to 0
sample estimates:
        tau 
0.003095408

We can see a similar effect with a linear regression (y ~ x), fitted along these lines:
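
summary(lm(dat$y ~ dat$x))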

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.984e-01  9.839e-03  20.164   <2e-16 ***
dat$x       -1.466e-05  1.768e-06  -8.288   <2e-16 ***
---
Signif. codes:  0 β€˜***’ 0.001 β€˜**’ 0.01 β€˜*’ 0.05 β€˜.’ 0.1 β€˜ ’ 1

The ‘x’ variable is found to have a highly significant effect on variable ‘y’. Yet the conditions of the linear model are not met either, the residuals being far from normally distributed. Using a Box-Cox transform improves things, yet not to the extent of being satisfying.
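
One way to estimate such a transform (a sketch using the boxcox function from the MASS package; it assumes the response variable is strictly positive, and may not be the exact transformation used here):

library(MASS)
bc <- boxcox(dat$y ~ dat$x, plotit = FALSE)  # profile the log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]              # retain the best-supported lambda
summary(lm((dat$y^lambda - 1) / lambda ~ dat$x))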

[Figure: Q-Q plots of the model residuals, before and after the Box-Cox transform]

While the normalization procedure does not do miracles, its effect is visible on the result of the test:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.382e+00  9.917e-02 -34.106  < 2e-16 ***
dat$x        5.223e-05  1.782e-05   2.931  0.00341 ** 
---
Signif. codes:  0 β€˜***’ 0.001 β€˜**’ 0.01 β€˜*’ 0.05 β€˜.’ 0.1 β€˜ ’ 1

The p-value remains significant, but is much less extreme than before.

Conclusion: in order to test the correlation of two variables, it is safer to use non-parametric tests such as Kendall’s tau. Given the typical size of genomic data sets, the loss of power due to the use of non-parametric tests is usually negligible, while the cost of unmet underlying hypotheses can be extremely high. In some cases models are unavoidable (for instance to test more than one explanatory factor), but this is another story…

Merging PDF files

Here is another trick I found today. I needed to merge several PDF files (supplementary figures) into one big file to ease the submission of an article. For playing with PDF files, the PDF toolkit (pdftk) is a wonderful Swiss Army knife, so one can simply do

pdftk FigureS1.pdf FigureS2.pdf FigureS3.pdf cat output AllSupFigures.pdf

This does the trick. Yet it would be nice to know which page corresponds to which figure, by creating bookmarks. I found out that pdftk also handles this, with a bit more effort:

# Dump the metadata of the merged file:
pdftk AllSupFigures.pdf dump_data > info.txt

# Append one bookmark entry per figure (assuming one page per figure):
for i in {1..3}; do
  echo "BookmarkBegin" >> info.txt
  echo "BookmarkTitle: Figure S$i" >> info.txt
  echo "BookmarkLevel: 1" >> info.txt
  echo "BookmarkPageNumber: $i" >> info.txt
done

# Write a new file with the updated metadata:
pdftk AllSupFigures.pdf update_info info.txt output AllSupFiguresIndexed.pdf
rm info.txt

The resulting PDF file contains one index entry per figure.

Fancier beamer presentations

No need to introduce the beamer package for making presentations with LaTeX. One side effect of its popularity, however, is that the included themes have all been seen again and again. While a presentation should be judged mainly on its content and not on its “packaging”, one cannot deny that a nice theme makes a better impact on your audience. With beamer, the current conclusion is therefore to avoid the built-in themes and to keep the design as simple as possible.

Yet, even without using a theme, there is always room for originality and personality. The tcolorbox package, for instance, is an impressively complete piece of software whose only purpose is… to draw boxes (!). Yet it does so with great flexibility, and should be able to fulfill many of your artistic aspirations.

Another possibility is to (moderately) play with colors. I like, for instance, to switch the color theme when I change parts, introducing a methods section or coming to important conclusions and “take home” messages. This can be achieved using the colourchange package. The current version of this package, however, has a small bug affecting boxes (both the beamer built-in boxes and the tcolorbox boxes): they change colors with a one-slide delay. To cope with this issue, I wrote the simple macro below, which allows you to assign one color per section and make a title slide for each section. It uses some code from the tcolorbox manual to make a fancy box:

\newcommand{\sectiontitlepage}[1]{
  \selectmanualcolour{#1}
  \begin{frame}
    \begin{tcolorbox}[enhanced,
        title=\sc Part \thesection, center title,
        fonttitle=\bfseries,
        coltitle=black,
        colbacktitle=#1!50!white,
        colback=#1,
        colframe=#1!50!black,
        attach boxed title to top center={yshift=-0.25mm-\tcboxedtitleheight/2,yshifttext=2mm-\tcboxedtitleheight/2},
        boxed title style={enhanced,boxrule=0.5mm,frame code={ \path[tcb fill frame] ([xshift=-4mm]frame.west) -- (frame.north west) -- (frame.north east) -- ([xshift=4mm]frame.east)-- (frame.south east) -- (frame.south west) -- cycle; },
        interior code={ \path[tcb fill interior] ([xshift=-2mm]interior.west)-- (interior.north west) -- (interior.north east)-- ([xshift=2mm]interior.east) -- (interior.south east) -- (interior.south west)-- cycle;}}
    ]
      \center\LARGE\textcolor{black}{\secname}
    \end{tcolorbox}
  \end{frame}
}

It can be used by typing

\section{My new section}
\sectiontitlepage{blue}

Finally, there is a third-party tikz library porting the ColorBrewer color schemes to tikz (and therefore to beamer, as both beamer and tikz build on the PGF package): https://github.com/vtraag/tikz-colorbrewer.

[Figure: example section title slide produced with the macro above]

Clang or GCC?

Recently, Mac users have been reporting a lot of new warnings while compiling the Bio++ code: apparently, their default compiler switched from GCC to Clang. Since the time I moved away from Windows and Borland, I had never even considered using another compiler than GCC. But to be honest, as of today, I have to admit that Clang compares impressively well to GCC: the additional warnings make a lot of sense, helped me find hidden bugs in the code, and are often more explicit than their GCC counterparts. So why not give it a try?

Switching compilers in Ubuntu appears to be as easy as

sudo apt-get install clang
sudo update-alternatives --config c++

And there you go! I did not have to change anything in my makefiles to start compiling with Clang instead of GCC. The only exception is that I still cannot compile with static linkage…