Turn your workstation into a mini-grid (with Slurm)

Most modern desktop workstations come with quite nice configurations: multiple cores, hyperthreading… To get the best out of such configurations, it is very convenient to set them up as a grid. This makes it possible to queue tasks, a feature most convenient to empower your multitasking skills 🙂

There is good old Sun Grid Engine (SGE), and its ‘qsub’ command. Now that Sun has been bought by Oracle, and Oracle has got rid of SGE, the task scheduling market has become a bit confusing:

  • The last version of SGE is still around. Packages for Debian and Ubuntu are still gathering dust in the repositories. For Ubuntu at least, some additional fonts were needed to get the thing working. In the latest vintage (14.04), this trick does not even seem to work any longer. The command-line version still works, however.
  • A fork of the latest open-source version of SGE can be found on SourceForge (Open Grid Scheduler). It has to be compiled from source, however, a task that is far from straightforward due to the numerous dependencies.
  • Another fork, named Son of Grid Engine (!), is available from the University of Liverpool. ‘deb’ and ‘rpm’ packages are provided.

As an alternative, I will blog here about Slurm. Slurm stands for Simple Linux Utility for Resource Management. Slurm is famous enough that many posts are already dedicated to it in the blogosphere. Here is a short wrap-up:

Installation:

Packages for Ubuntu and Debian are available:

sudo apt-get install slurm-llnl

You will also need the munge software, also available in the repository:

sudo apt-get install munge

Generating the configuration file

The installation comes with a couple of HTML pages that help generate the configuration file.
They can be found at:

/usr/share/doc/slurm-llnl/slurm-llnl-configurator.easy.html
/usr/share/doc/slurm-llnl/slurm-llnl-configurator.html

Just open one of them in your web browser and start filling in the required fields. Default options are provided. To get information about your particular machine, you can run

slurmd -C

This should come in handy for the last part of the configuration file. Then save the resulting file as

/etc/slurm-llnl/slurm.conf
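As a rough sketch of how the output of slurmd -C can be reused, the function below keeps the line starting with NodeName= and appends State=UNKNOWN, which is the form of node definition the configurator pages produce. The host name and hardware figures in the example are hypothetical, and the exact fields printed by slurmd -C may differ with your Slurm version.

```shell
# Turn the hardware line printed by `slurmd -C` into a node definition
# for slurm.conf. Reads the `slurmd -C` output on standard input.
node_line() {
  grep '^NodeName=' | head -n 1 | sed 's/$/ State=UNKNOWN/'
}

# Typical use on the workstation itself (requires slurmd to be installed):
#   slurmd -C | node_line

# Hypothetical output, used here only to show the transformation:
printf '%s\n' \
  'NodeName=mybox CPUs=20 Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=64000' \
  'UpTime=3-02:11:45' | node_line
```

The resulting line can then be compared with, or pasted over, the node definition written by the configurator.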

Generate munge key

This is done with the command

sudo /usr/sbin/create-munge-key

For some reason, a permission change is needed to avoid some later warnings:

sudo chmod g-w /var/log
sudo chmod g-w /var/log/munge

Starting services

With the commands below (munge first, since Slurm relies on it for authentication):

sudo /etc/init.d/munge start
sudo /etc/init.d/slurm-llnl start

Multi-core, multi-node and multi-prog

There we are… now comes the main topic of this post. Slurm has a major limitation: as opposed to SGE, it is not meant to be run on a single machine. One machine, even with several cores, will be considered as a single node, as it runs one instance of the slurmd daemon (note: there seems to be a mode to circumvent this, which needs to be enabled at compilation time… maybe the topic of a later post).

Fortunately, Slurm has a “multi-prog” mode, which allows launching several programs in one run. Here I illustrate how it can be used to mimic the behavior of several nodes on a single machine.

In this example, one would like to run 2000 independent analyses, typically one program execution on each of 2000 data sets, corresponding to 2000 genes for instance. This is conveniently achieved using a job array (just like with SGE). In Slurm, job arrays are set up using the line:

#SBATCH --array=1-2000

With only one node, the array executes one job after the other, without making use of the multiple cores. If we have 20 cores available, we can decide to run 20 genes simultaneously. This is achieved using the multi-prog option, via a special configuration file listing the 20 programs to run. The trick is then to have the job array generate this file for you:

#!/bin/bash
# file slurm_example.sh
# run with sbatch -o slurm-%A_%a.out slurm_example.sh
#SBATCH --job-name=slurm_example
#SBATCH --output=slurm_example.txt
#SBATCH --array=1-2000:20
#SBATCH --ntasks=20

# Create the multi-prog configuration file:
rm -f multi.conf
for ((i=0; i < 20; i++)); do
  # Get the gene name (line SLURM_ARRAY_TASK_ID + i of genes.txt):
  GENE=$(sed "$((SLURM_ARRAY_TASK_ID + i))q;d" genes.txt)
  echo "$GENE"
  echo "$i myprog $GENE" >> multi.conf
done

srun --multi-prog multi.conf

A few clarifications:

  • The file multi.conf is generated on the fly and contains the 20 current execution lines, each made of a task rank (0 to 19) followed by the command to run.
  • Note the syntax of the job array, with a step of 20: SLURM_ARRAY_TASK_ID takes the values 1, 21, 41, …, 1981.
  • The --ntasks=20 option is required to say that we will run 20 tasks simultaneously.
  • In this example, I assume that all 2000 gene names are stored, one per line, in a file named “genes.txt”. The nth gene is retrieved using a sed command.
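A quick way to check that the generated file looks as expected is to run the generation loop outside Slurm on a toy gene list. The sketch below is only an illustration: the file names genes_demo.txt and multi_demo.conf and the gene names are made up, a batch size of 3 replaces the 20 used in the script above, and the loop over seq plays the role of the job array.

```shell
# Write the multi-prog file for the batch starting at line $1 of the gene list.
# $1: first line number, $2: batch size, $3: gene list, $4: output file.
make_multi_conf() {
  start=$1; size=$2; genes=$3; out=$4
  : > "$out"                                   # start from an empty file
  i=0
  while [ "$i" -lt "$size" ]; do
    gene=$(sed "$((start + i))q;d" "$genes")   # print only line start+i
    echo "$i myprog $gene" >> "$out"
    i=$((i + 1))
  done
}

# Toy input: 6 hypothetical gene names, processed in batches of 3.
printf 'gene%s\n' 1 2 3 4 5 6 > genes_demo.txt
for id in $(seq 1 3 6); do                     # same values as --array=1-6:3
  make_multi_conf "$id" 3 genes_demo.txt multi_demo.conf
  echo "--- SLURM_ARRAY_TASK_ID=$id"
  cat multi_demo.conf
done
```

Each pass prints three lines of the form "rank myprog gene", with ranks restarting at 0 for every batch, which is what srun --multi-prog reads.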

Conclusion

This small trick will allow you to make good use of your 20 (or more) cores. Yet there are limitations:

  • The next batch of 20 tasks will only be started once the current 20 are finished. This may be a serious limitation in case of unbalanced tasks, with some taking much more time than others (genes of different sizes for instance).
  • It is not possible to launch an analysis using 10 cores, and another using 10 other cores simultaneously.
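The first limitation can be made concrete with a small calculation. The sketch below is only a toy model, with made-up durations in arbitrary units and 2 cores instead of 20: since a batch ends when its slowest task ends, the total run time is the sum of the longest duration of each batch.

```shell
# Wall time of a batched run: each batch lasts as long as its slowest task.
# $1: batch size, remaining arguments: task durations (arbitrary units).
batched_time() {
  size=$1; shift
  total=0; longest=0; n=0
  for d in "$@"; do
    if [ "$d" -gt "$longest" ]; then longest=$d; fi
    n=$((n + 1))
    if [ "$n" -eq "$size" ]; then      # batch complete: add its slowest task
      total=$((total + longest)); longest=0; n=0
    fi
  done
  echo $((total + longest))            # last, possibly incomplete, batch
}

batched_time 2 9 1 1 1 1 1    # one long gene among short ones
batched_time 2 2 2 2 2 2 2    # balanced tasks
```

With these numbers, the unbalanced run takes 11 units for 14 units of work, whereas the balanced run takes 6 units for 12 units of work: in the first case, one core sits idle for most of the first batch.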

Merging PDF files

Here is another trick I found today. I needed to merge several PDF files (supplementary figures) into one big file to ease the submission of an article. When it comes to playing with PDF files, the PDF toolkit (pdftk) is a wonderful Swiss Army knife, so one can simply do

pdftk FigureS1.pdf FigureS2.pdf FigureS3.pdf cat output AllSupFigures.pdf

This does the trick. Yet it would be nice to know which page corresponds to which figure, by creating bookmarks. I found out that pdftk also handles this, with a bit more effort:

pdftk AllSupFigures.pdf dump_data > info.txt

for i in {1..3}; do
  echo "BookmarkBegin" >> info.txt
  echo "BookmarkTitle: Figure S$i" >> info.txt
  echo "BookmarkLevel: 1" >> info.txt
  echo "BookmarkPageNumber: $i" >> info.txt
done

pdftk AllSupFigures.pdf update_info info.txt output AllSupFiguresIndexed.pdf
rm info.txt

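The resulting PDF file contains one index entry (bookmark) per figure. Note that the loop above assumes that each figure fits on a single page; otherwise, BookmarkPageNumber has to be set to the first page of each figure in the merged file.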
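When some figures span several pages, the bookmark of each figure has to point to its first page in the merged file. Here is a minimal sketch of how the records could be generated in that case; the function name bookmark_records and the page counts are hypothetical, and the output would be appended to info.txt in place of the loop above.

```shell
# Print pdftk bookmark records for figures merged in order.
# Arguments: the page count of each figure, in merge order.
bookmark_records() {
  page=1; fig=1
  for count in "$@"; do
    echo "BookmarkBegin"
    echo "BookmarkTitle: Figure S$fig"
    echo "BookmarkLevel: 1"
    echo "BookmarkPageNumber: $page"
    page=$((page + count))       # the next figure starts after this one
    fig=$((fig + 1))
  done
}

# Hypothetical page counts: S1 has 1 page, S2 has 3 pages, S3 has 2 pages.
bookmark_records 1 3 2
# With pdftk installed, the counts can be read from each file with
#   pdftk FigureS1.pdf dump_data | grep NumberOfPages
```

In this example the three bookmarks point to pages 1, 2 and 5 of the merged file.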

Fancier beamer presentations

There is no need to introduce the beamer package for making presentations with LaTeX. One side effect of its popularity, however, is that the included themes have all been seen over and over again. While a presentation should be judged mainly on its content and not on its “packaging”, one cannot deny that a nice theme makes a better impression on your audience. As for beamer, my current conclusion is therefore to avoid themes and to keep the design as simple as possible.

Yet, even without using a theme, there is always room for originality and personality. The tcolorbox package, for instance, is an impressively complete piece of software whose only purpose is… to draw boxes (!). Yet it does so with great flexibility and should be able to fulfill many of your artistic aspirations.

Another possibility is to (moderately) play with colors. I like, for instance, to switch the color theme when I move to a new part, such as introducing a method section or coming to important conclusions and “take home” messages. This can be achieved using the colourchange package. The current version of this package, however, has a small bug affecting boxes (both the beamer built-in boxes and the tcolorbox boxes): they change color with a one-slide delay :s To cope with this issue, I wrote this simple macro, which allows you to assign one color per section and to make a title slide for each section. It uses some code from the tcolorbox manual to make a fancy box:

\newcommand{\sectiontitlepage}[1]{
  \selectmanualcolour{#1}
  \begin{frame}
    \begin{tcolorbox}[enhanced,
        title=\sc Part \thesection, center title,
        fonttitle=\bfseries,
        coltitle=black,
        colbacktitle=#1!50!white,
        colback=#1,
        colframe=#1!50!black,
        attach boxed title to top center={yshift=-0.25mm-\tcboxedtitleheight/2,yshifttext=2mm-\tcboxedtitleheight/2},
        boxed title style={enhanced,boxrule=0.5mm,frame code={ \path[tcb fill frame] ([xshift=-4mm]frame.west) -- (frame.north west) -- (frame.north east) -- ([xshift=4mm]frame.east)-- (frame.south east) -- (frame.south west) -- cycle; },
        interior code={ \path[tcb fill interior] ([xshift=-2mm]interior.west)-- (interior.north west) -- (interior.north east)-- ([xshift=2mm]interior.east) -- (interior.south east) -- (interior.south west)-- cycle;}}
    ]
      \center\LARGE\textcolor{black}{\secname}
    \end{tcolorbox}
  \end{frame}
}

It can be used by typing

\section{My new section}
\sectiontitlepage{blue}

Finally, there is a third-party tikz library porting the Brewer set of colors to tikz (and therefore to beamer, as both beamer and tikz use the PGF package): https://github.com/vtraag/tikz-colorbrewer

[Image: Presentation_tcolorbox_example]

Clang or GCC?

Recently, Mac users have been reporting a lot of new warnings while compiling the Bio++ code. Apparently they switched from GCC to Clang as the default compiler. Since the time I moved from Windows and Borland, I had never even considered using a compiler other than GCC, but to be honest, as of today, I have to admit that Clang compares impressively well to GCC: the additional warnings do make a lot of sense and helped me find hidden bugs in the code, and they are often more explicit than their GCC counterparts. So why not give it a try?

Switching compiler in Ubuntu appears to be as easy as

sudo apt-get install clang
sudo update-alternatives --config c++

And there you go! I did not have to change anything in my makefiles to start compiling with Clang instead of GCC. The only exception is that I still cannot compile with static linkage…
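To check which compiler the c++ command actually points to after the switch, one can follow the chain of symbolic links maintained by update-alternatives. This is a small sketch assuming GNU readlink (as found on Ubuntu); the function name resolve_cmd is made up for the example.

```shell
# Print the real binary behind a command name, following symbolic links
# (update-alternatives works through a chain of such links).
resolve_cmd() {
  path=$(command -v "$1") || { echo "$1: not found" >&2; return 1; }
  readlink -f "$path"
}

# Shows a clang++ or g++ binary, depending on the alternative selected;
# prints an error message if no c++ command is installed.
resolve_cmd c++ || true
```

Running c++ --version gives the same information in a more verbose form.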