When it turns to do comparative genomics, a very common task is to look at homologous positions in several genomes. While this may sound as a trivial task, in practice it is not! Two challenges are to be faced when attempting to do so:
- Genomes to be compare have to be aligned,
- Looking up for individual coordinates can end up very inefficient when several thousands coordinates are to be searched.
Several tools have been proposed for tackling that issue. The most versatile one is the LiftOver utility from the UCSC genome browser utilities. Using LiftOver requires however multiple steps, and several (rather complex) protocoles are available online. One important note is that LiftOver should in principle be used only form comparing distinct assemblies of the same genome, not to compare genomes from distinct species for instance.
Confronted to the task, I recently implemented a simple approach in our MafFilter software (http://biopp.univ-montp2.fr/forge/maffilter). MafFilter can do all sort of processing of MAF files. With this new tool, one can simply convert between coordinates of one species to another based on a pairwise genome alignment of the two species (provided, of course, as a MAF file!). MafFilter takes coordinates from feature files in GFF, GTF or BedGraph, and outputs a table with the original coordinates and the translated ones. The syntax is rather short:
DATA=pairwise_aln input.file=$(DATA).maf.gz input.file.compression=gzip input.format=Maf output.log=$(DATA).maffilter-liftover.log maf.filter= \ LiftOver( \ ref_species=species1, \ target_species=species2, \ feature.file=features_species1.bed, \ feature.file.compression=none, \ feature.format=BedGraph, \ file=sp1_to_sp2.tln) \
One will typically want to combine coordinates, for instance in R:
features <- read.table("features_species1.bed") tln <- read.table("sp1_to_sp2.tln") features2 <- merge(features, tln, by.x = c("Chr", "Start", "Stop", "Strand"), by.y = c("chr.ref", "strand.ref", "begin.ref", "end.ref"), all = TRUE)
The conversion is relatively fast for hundreds of thousands of features to convert. The only requirement is a MAF file sorted according to the reference species (for instance using maf_project from the TBA package). Pairwise alignments can be obtained with BlastZ (for instance).