Tag Archives: Bioinformatics

HMMER3 released – with a pre-compiled binary!

As I have recently complained about open source software coming without pre-compiled binaries, I salute the release of HMMER beta 3, which comes with a pre-compiled Intel/Linux tarball. This is exactly the kind of convenience measure I have asked for, and I wanted to state that clearly here. Well done, Sean Eddy and company!

HMMER is bioinformatics software for finding sequence homology. People in bioinformatics may appreciate some of the new features, such as multicore support and better-than-BLAST speeds (unbelievable but true!). For those of you who are interested, the full range of features, as well as the software download, can be found here:

HMMER 3.0b3: the final beta test release


An unusual error?

I just ran into a problem using HMMER: it aborts the building of an HMM profile (using hmmbuild) with this line: “FATAL: illegal state transition B->E in traceback”. Has anyone seen this HMMER error before? Does anyone know what it means and/or how to solve it? Please let me know as soon as possible.

hmmbuild - build a hidden Markov model from an alignment
HMMER 2.3.2 (Oct 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment file: alignment1.aln
File format: Clustal
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: MAP (gapmax hint: 0.50)
Null model used: (default)
Prior used: (default)
Sequence weighting method: G/S/C tree weights
New HMM file: alignment1.hmm
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Alignment: #1
Number of sequences: 5517
Number of columns: 2305

WARNING: Looks like amino acid sequence, hope that's right
Determining effective sequence number ... done. [1]
Weighting sequences heuristically ... [big alignment! doing PB]... done.
Constructing model architecture ...
FATAL: illegal state transition B->E in traceback

Solving problems in seconds

Sometimes the solution to a problem lies much closer at hand than you expect. In my work I usually repeat the same task on between 6 and 50 files. Even though Unix is very efficient in many ways, doing this by hand still takes time. I have thought of various ways around that problem, including using wildcards (*), but was never fully satisfied. This week, I finally came up with the simplest solution so far, and it took only a minute or two to implement. I don’t know why I didn’t think of this a year ago. Maybe I thought that I would only do these repetitive tasks a couple of times. I was wrong, but thanks to Perl I can now be much more efficient (and write this instead of typing Unix commands…). The good thing about my implementation (in my opinion) is that it is so flexible. Here’s my code; please comment if you feel that there are more efficient ways. Every “{}” in the command is replaced by the current number, so one command covers a whole series of numbered files:


#!/usr/bin/perl
use strict;
use warnings;

## LOOP COMMAND
my $versionID = "1.0";
print "LoopCommand\n";
print "Version $versionID\n";
print "Written by Johan Bengtsson, October 2009\n";
print "-----------------------------------------\n";

## GET USER INPUT
## The command may contain "{}", which is replaced by each number in turn
print 'Execute command: ';
chomp(my $command = <STDIN>);
print 'From number: ';
chomp(my $start = <STDIN>);
print 'To number: ';
chomp(my $end = <STDIN>);

## EXECUTE
for (my $i = $start; $i <= $end; $i++) {
    my $exec = $command;
    $exec =~ s/\{\}/$i/g;    # insert the current number into the command
    my $result = `$exec`;    # run the command and capture its output
    print $result;
}
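
One possible refinement, shown here only as a minimal sketch: the “{}” substitution can be pulled out into a small helper, so that the expanded commands can be printed and checked before anything is actually run. The helper name expand_commands is hypothetical (it is not part of the script above), and the hmmbuild line is just an illustrative template using the file naming from the earlier HMMER post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): expand a command
# template into one command per number, replacing every "{}" with that number.
sub expand_commands {
    my ($template, $start, $end) = @_;
    my @commands;
    for my $i ($start .. $end) {
        # Copy the template, then substitute on the copy only
        (my $cmd = $template) =~ s/\{\}/$i/g;
        push @commands, $cmd;
    }
    return @commands;
}

# Dry run: only print the commands that would be executed for files 1 to 3.
# The template is an assumed example; nothing is run here.
print "$_\n" for expand_commands('hmmbuild alignment{}.hmm alignment{}.aln', 1, 3);
```

Once the printed commands look right, each one could be passed to backticks or system() in the same way the original loop does.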

LogoMat-M, or how I started to hate source code and opted for precompiled binaries

LogoMat-M and its uses
I have recently struggled to install a bioinformatics program called LogoMat-M. LogoMat-M is a command-line program that creates visual representations of HMM profiles. An excellent example of the program in action can be viewed at Sanger’s LogoMat-M website. It creates images that look a bit like this:

The resulting images make it easy to interpret how common a given amino acid is at each position of a sequence alignment, where the alignment usually represents a protein family. So far, so good.

The problem is that the web service was not designed to work with large numbers of sequences, and thus returns nothing at all when such sequence alignments are used. To solve this problem, I thought I would try to install the program locally, on my own computer, at least to receive a proper error message. This was a big mistake.

The “install” process
I started by downloading the LogoMat-M package (i.e. the source code – this is open source software, which often means that there are no pre-compiled binaries). However, the build files complained that my computer was missing certain libraries and programs required for LogoMat-M to compile. Well, alright, I went out to find the missing pieces of software. I could fairly quickly track down the two missing components and download them. Once again, these were open source programs – meaning no pre-compiled binaries. I tried to compile the first of them and rapidly got the answer that a component called PDL was required and could be obtained via a service called CPAN.

I started to get a bit frustrated, since I didn’t want to spend the whole day installing software – I wanted to construct images like the one above. However, I did as the instructions said, and text started scrolling down my screen. Suddenly, cpan exited and said “Could not compile. Compiler returned bad status.” Wow. How informative! How do you expect me to know what caused that?! So, now I was stuck. I could not compile LogoMat-M because I was missing another program that was required, and I couldn’t install that program because I lacked a component that, for some unknown reason, would not compile.

Now, the big problem here is that there is no way for me to get around this, because the documentation does not mention this kind of situation. I could, of course, contact the developers, but I was on a tight time schedule, and needed this to work. It was possible, if not likely, that it would take days for the developers (who do not get paid for this software, i.e. there is no official support channel) to sort out my problem.

Again, a mentality problem
Many times, open source software is praised for being open, but what people tend to forget is that a lot of this software is not at all easy to use. Or, in this case, even install. On Windows or Mac OS X, I would have fired up an installer, which would have installed a working pre-compiled binary on my system, with all its required libraries. It would work out-of-the-box. And if it didn’t, there would be someone to call.

Now, I don’t want to call for open source developers to set up call centres for supporting their programs; that would just be ridiculous. But I beg you to please make pre-compiled, working versions, including required libraries, and supply these for at least the most common platforms. Depending on the kind of software, that could be Windows, Mac OS X, Ubuntu and Red Hat Linux, for example. Don’t bother with pre-compiled software for strange and uncommon architectures; people running these things probably know how to compile their software anyway. But please, supply some easy-to-use, pre-compiled program for the rest of us. Because otherwise we will never be able to get our work done using open source alternatives, and that benefits neither our work nor the open source community in general. The situation described above only benefits big corporations selling overpriced software. And that is really, really sad.

Microsoft WORD format is not a sequence format

I found this on a bioinformatics info site related to the EMBOSS package. I find the tone of it rather amusing, especially as people usually refer to Word files simply as “text”:

Sequences

Before reading the rest of this document, please note:
Microsoft WORD format is not a sequence format.

Sequences can be read and written in a variety of formats. These can be very confusing for users, but EMBOSS aims to make life easier by automatically recognising the sequence format on input.

That means that if you are converting from using another sequencing package to EMBOSS and you have your existing sequences in a format that is specific for that package, for example GCG format, you will have no problem reading them in.

If you don’t hold your sequence in a recognised standard format, you will not be able to analyse your sequence easily.

What a sequence format is NOT

When we talk about ‘sequence format’ we are NOT talking about any sort of program-specific format like a word processor format or text formatting language, so we are not talking about things like: ‘NOTEPAD’, ‘WORD’, ‘WORDPAD’, ‘PostScript’, ‘PDF’, ‘RTF’, ‘TeX’, ‘HTML’

If you have somehow managed to type a sequence into a word-processor (!) you should:

  • Save the sequence to a file as ASCII text (try selecting: File, SaveAs, Text)
  • Stop using word-processors to write sequences.
  • Investigate a sequence editor, such as mse
  • Investigate using simple text editors, such as pico, nedit or, at a pinch, wordpad

Now, repeat after me:
Microsoft WORD format is not a sequence format

EMBOSS programs will not read in anything which is held in Microsoft WORD files.

So, remember that Word format is not a sequence format, and be careful with your bioinformatics research! Original text found at: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html