SHORTS
☘☘☘ Letters by Vivek Rai
home science code scribbles books about

Diving Deeper Into BLAST - III

BLAST+ and its inconsistencies

BLAST+ is an incredible tool and does it job in a pretty decent way. However, anyone who has worked with it for some time with some non trivial details has their own horror stories to tell. For the past few months where I have been engaged in development of SequenceServer, we got a lot of opportunities to see the ugly inconsistencies buried within an otherwise beautiful program.

BLAST+ is the newer and improved version of legacy BLAST executables with improved performance and new features. It goes through close to two cycles of release every year and includes performance patches or fixes introduced against community bug reports.

I will talk in brief about few problems that we encountered while developing SequenceServer.

What’s up with the XML format? 

One of my major tasks with SequenceServer was to create a data layer which will be helpful in creating flexible graphical interface. This required me to run BLAST and obtain the results in XML format, and parse and store it accordingly.

One of the curious things to notice in the XML file is the presence of <Iteration_stat> tag within each query. Although if you compare the values in the tag across all queries,you will find that they are all the same. Indeed, the data contained in <Iteration_stat> appears to represent statistical metrics related to the entire user query and not every query. queries are same, and indeed they represent statistical information based on the entire query and not per query. You can examine one sample XML file here.

This to me suggests an erroneous schema design for the XML output.

Problematic start and end coordinates 

If you examined the XML file, you can notice that the sequence alignments, start, and end co-ordinates are all provided very nicely. When using the HTML format output option (provided with all BLAST programs) like prior versions of SequenceServer, we receive a well formatted output where longer sequences are properly broken on multiple lines along with start and end coordinates for the same. But as we were generating the HTML ourselves, we had to manually calculate the start and stop ends for each line.

In general, the start and end coordinates are agnostic of the reading frame and one has to properly infer them from the qframe and hframe values. In other words, you need to see whether you are moving on the positive strand or the negative stand. However, blastn makes an exception to this rule and automatically makes this informed decision. I don't know if that should be the scenario but whatever be the case, it certainly is not consistent. This had to be taken in care of separately in the commit shown below.

index def28ff..cb236fa 100644
--- a/lib/sequenceserver/blast.rb
+++ b/lib/sequenceserver/blast.rb
@@ -211,6 +211,9 @@ module SequenceServer
                  hsp.qstart.to_s.length, hsp.qend.to_s.length].max

        s = ''
+        # blastn results are inconsistent with the other methods as it
+        # automatically reverse the start and end coordinates (based on
+        # frame), while for others it has to be inferred.
        if @program != 'blastn'
            nqseq = hsp.qframe >= 0 ? hsp.qstart : hsp.qend
            nsseq = hsp.sframe >= 0 ? hsp.sstart : hsp.send

Number-only FASTA ids with blastdbcmd causes trouble 

In an issue1 opened by our early contributer @wwood, we found that FASTA ids with only numbers in it didn’t play very well with blastdbcmd, the command we use to format FASTA files and create BLAST+ compatible databases.

The bug, since then, went on without activity until recently when we tried to fix it using a simple hack. As hits can also be queried using accession number and not only ids, we simply started using accession number with a lcl| added to the start. This not only preserved the previous desired behavior but also solved the problem of error with numeric ids SHA: 6d83a0833c42ec3a9e9 .

However, with the release of BLAST+ 2.2.302 we noticed that our hack didn’t play well with the -target-only option of blastdbcmd. Since, our implementation moves according to the latest version of BLAST+, we had to revert the commit.

As of now, the issue is still open and we wait for it to be fixed upstream. Apparently, the BLAST+ team is aware of the issue but it seems to be poorly documented. Here is what they had to say when I reached out to them with the issue:

... but when lcl|integer is used (either directly on the definition line, or when
only an integer is used), the programs assume that the integer is a gi
number. We will investigate the feasibility of applying a check.

Continued issues with funky ids 

Another weird issue popped us when we were playing around with some funky ids. These ids contained some symbols and weird characters like hashes and pipes. I’m not yet sure if this is an undesired behavior but it is certainly absurd. Here is a small demo:

$ cat funky_ids.fa
>abc|def
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
>abcdef#
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG
>abc#def
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAG

$ blastdbcmd -db funky_ids.fa -entry all -outfmt '%a'
abc:def
abcdef#
abc#def

If you noticed the first line of both commands you can see that the pipe character in first case is being weirdly displayed as a colon. As I said, I don’t know yet if it is a desired or intentional behavior but whatever it is, it sure needs some explanation or documentation. I have already emailed the BLAST+ team about this and expecting their reply anytime soon.

I will have some more updates with my recent adventures with Afra (an annotation editor) in coming week.

Stay tuned!


  1. Retrieving blast sequences doesn't work well with numbers #88, SequenceServer, GitHub ↩︎

  2. NCBI BLAST+ Release Notes, October 2nd, 2014. Christiam Camacho, NCBI, http://www.ncbi.nlm.nih.gov/books/NBK131777/ ↩︎