Practical

How to compile NCBI IgBLAST with GCC >= 5.x

If you are trying to compile IgBLAST from source and get this error:

configure: error: Do not know how to build MT-safe with compiler /usr/bin/g++  5.2.1

the fix is very simple. Open the file src/build-system/configure and search for this section:

if test "$GCC" = "yes" ; then
   compiler_ver="`$real_CXX -dumpversion 2>&1`"
   case "$compiler_ver" in
     2.95* | 2.96* | 3.* | 4.* )
       compiler="GCC"
       ncbi_compiler="GCC"
       ncbi_compiler_ver="$compiler_ver"
                 WithFeatures="$WithFeatures${WithFeaturesSep}GCC"; WithFeaturesSep=" "
       ;;
   esac

It checks that GCC is one of the listed versions, which include any version 3 or 4. No version 5, sadly. So just replace the following line:

2.95* | 2.96* | 3.* | 4.* )

with

2.95* | 2.96* | 3.* | 4.* | 5.* )

Now save and go back to the release/ directory, and run the usual ./configure && make && sudo make install command.
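
If you would rather script the change than edit the file by hand, something like the following should work. It is only a sketch: patch_configure is a made-up helper name, it assumes the version line ends exactly as in the snippet above, and it keeps a backup of the original file with a .bak suffix.

```shell
# Hypothetical helper: add "5.*" to the list of accepted GCC versions.
# Assumes the case pattern ends with "| 4.* )" exactly as in the snippet above.
patch_configure() {
    # -i.bak edits the file in place and leaves the original in <file>.bak
    sed -i.bak 's/| 4\.\* )/| 4.* | 5.* )/' "$1"
}

# Usage, from the top of the source tree:
#   patch_configure src/build-system/configure
```

Running it twice is harmless, because after the first run the pattern no longer matches.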

Short C++ reminder on passing objects to functions

Suppose you have an Object class. You instantiate three objects locally:

Object o1, o2, o3;

Then you have a function:

void f(Object o1, Object &o2, Object * o3);

This summarises the three ways you can pass an object to a function: by value, by reference, and by reference via a pointer (for simplicity, let’s call it call by pointer). Say you call the function like this:

f(o1, o2, &o3);

Inside the function, you may access these objects’ fields like this (note o3):

o1.x  += 3;
o2.x  += 3;
o3->x += 3;


These are the take home messages:

  • [Call by value]: o1 is going to be copied inside the function f, and after f returns the copy is going to be destroyed. This means that any changes made to o1 inside f are not going to be visible outside f. In addition to being unsuitable for persisting changes, the copy overhead can be very harmful in certain high-performance applications. However, there is no access overhead.
  • [Call by reference]: o2 is NOT going to be copied, so any changes made to o2 inside f are going to remain even after f returns. This is very efficient: no copy and methods/fields have no access overhead.
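  • [Call by pointer]: you are actually passing a pointer to o3 to f, so to access the object’s fields and methods you need to dereference it every time (using the -> operator). This makes it about as efficient as call by reference (compilers usually implement references as pointers under the hood): no copy, but a clumsier syntax and the possibility of a null pointer.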

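It’s clear that, even when you don’t actually need to modify an object, call by reference should be your default approach (in that case make it a const reference, so the compiler stops you from modifying the object by accident).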

KryoNet: very fast and easy to use network library

I’ve been experimenting a bit with this library and it’s a relief: I don’t have to maintain my own (crappy) one within ZenithMUD.

Easy to use, documented and very efficient, according to the benchmarks I’ve found around. You can use the start() method, which runs the library in a thread and takes care of getting data from the connected clients and running user-defined actions on them, or just call the update() method in your own game loop (if you are more of a control freak :))

A good starting point, besides the README, is this Chat client-server example the author has provided.

Good job!

Update and misc

Added a few things to the About page. I have a job in the UK, where I’m working with Erlang, an amazing language designed and tweaked to deal with extremely concurrent, distributed systems. I love the actor model built into Erlang, and I’m getting used to functional programming (pattern matching is just about the best tool I’ve ever used).

Added some more information about my projects, just explore the menu up there. Nothing too impressive but you may be curious, who knows.

If you want to learn Erlang, start with the book Erlang/OTP in Action. It has a pragmatic approach and helps you write a decent OTP application quickly. If you want to start from the fundamentals, O’Reilly’s Erlang Programming is very thorough and is also nice to read. I recommend Geany as an editor if you are GUI-inclined but like simple and efficient stuff (I’ve sent a pull request with a few more Erlang-oriented configurations; if they don’t accept it, you’ll find the info in my own GitHub account).

IntelliJ IDEA or Sublime Text (which sadly is shareware) are more appropriate if you are into fancy GUIs in addition to functionality. There’s also a plugin for Eclipse called Erlide.

For the old and pure, instead, there are good plugins for Emacs and Vim. For the latter, search on GitHub for “vimerl”, that’s the best.

How to remotely use your workstation with ssh and vnc

I assume that you use a Unix-like operating system, but this applies to any OS.

Let’s start with a problem: you are in your office, waiting for some kind of simulation to end. Everything seems to be going well, so you go home, quite confident that the next morning you’ll have some results to analyze.

Unfortunately, that’s unlikely to happen. Ten seconds after you leave, the program crashes because of a rare bug that you missed.

Luckily, there is a solution for next time. Instead of losing a long night (or even a weekend) of computation, you can check your office workstation from home, from your warm bed, or from wherever you like (with an internet connection, ça va sans dire).

What do you need on your workstation?

You need to install (or ask the sysadmin to install):

  1. the ssh server
  2. a vnc server: x11vnc

Remember to ask the sysadmin for:

  1. the public IP of your workstation
  2. the range of unused open ports, if the workstation is behind a router

What do you have to do from home?

OK, now it’s very simple. You only need to install a VNC viewer, and you probably already have one. Let’s say that your workstation username is goofy, the public IP is 1.0.0.1, the ssh server listens on port 42 (hence the -p 42 below; drop it if yours uses the default port 22), and the free open port on the workstation is 5900. Now we need another port, this time a local one (on your home computer), to create the ssh tunnel. I suggest port 5901:

ssh -p 42 -L 5901:localhost:5900 goofy@1.0.0.1

You’ll have to enter your workstation password, and then you can explore the remote system from the shell. If you want to use it graphically, just run this command in that shell:

x11vnc

Wait for 5-6 seconds, then open another shell without closing the current one, and type the following:

vncviewer localhost::5901

Et voilà! Move your mouse over the new black window, and you’ll see the remote display appear.

How to do a word alignment with GIZA++ or MGIZA++ from a parallel corpus

I assume that you are working with a *nix box, and that you use a bash-like shell.

You need the sentence-aligned europarl corpora for each language you want to train the word alignment on. Please check that the corpora have the same number of lines and that they are correctly aligned.
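
A quick sanity check for the line counts could look like this. It is a rough sketch: same_line_count is a hypothetical helper name, and it only compares the number of lines, so it cannot tell you whether the sentences really correspond to each other.

```shell
# Hypothetical helper: succeed only if the two files have the same number of lines.
same_line_count() {
    # wc -l reads from stdin so that it prints just the number; tr drops any padding
    n_src=$(wc -l < "$1" | tr -d ' ')
    n_trg=$(wc -l < "$2" | tr -d ' ')
    echo "$1: $n_src lines, $2: $n_trg lines"
    [ "$n_src" -eq "$n_trg" ]
}

# Usage, with the file names used in the rest of this post:
#   same_line_count raw_corp.src raw_corp.trg || echo "line counts differ"
```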

If you don’t want to do it yourself, you can use the sentence-aligned europarl corpora built by Els Lefever. They are raw (no XML tags, but with capital letters and words not properly separated), so if you want the word alignment you have to follow all of the next steps. Note that they are compressed in a tar.gz archive, and that there are only six languages: English, Italian, French, Spanish, German and Dutch. If you want to use different languages but don’t know how, please leave a comment on this post.

First of all

You want to do a word alignment between two languages. We call the two languages the source language and the target language. This is important in order to correctly do the word alignment, so decide which language will be the source and which the target.

It helps to know that the word alignment is only one-to-one, NULL-to-one and many-to-one. So if you choose English as the source language and French as the target, you can have an alignment like this:

Word alignment example (image via Wikipedia)

You may want a function like this:

f(english) = french

That is impossible with the alignment above. In this case you have to use French as the source language and English as the target.

In the next sections, I’ll use for each file name this convention: source = .src and target = .trg

So, for example, if you downloaded my raw corpora and you want to do an English (source) to French (target) alignment (like in the image above), you can think of raw_corp.src as raw_corp.en and raw_corp.trg as raw_corp.fr.

Pre-processing

We have to clean up the corpora, put every word in lower case and separate the words from each other (in other words, “tokenize”). We need the tools of the europarl maintainers; you can download them here:

http://www.statmt.org/europarl/v5/tools.tgz

Now enter the tools subdirectory, and take the tokenizer.perl script and the nonbreaking_prefix directory (they should be in the same directory!).

The nonbreaking_prefix directory lets the tokenizer keep words like “Mr.” together. Normally the tokenizer would break it into two tokens, “Mr” and “.”, but we know that the final dot is part of the abbreviation, not real punctuation.

tools.tgz doesn’t contain prefixes for every language, so I made my own. You can use them freely, and if you correct them please contact me.

Now, let’s tokenize!

tokenizer.perl -l src < raw_corp.src > corp.tok.src
tokenizer.perl -l trg < raw_corp.trg > corp.tok.trg

And now you can lowercase every word:

tr '[:upper:]' '[:lower:]' < corp.tok.src > corp.tok.low.src
tr '[:upper:]' '[:lower:]' < corp.tok.trg > corp.tok.low.trg

Making class and cooccurrence

Now you have to choose: MGIZA or GIZA?

They are equivalent, but MGIZA is multi-threaded and GIZA is not. My advice is to choose MGIZA, but if you have to align a lot of languages you can run several GIZA processes at once, one per language, so it’s your choice. I’ll say so explicitly when an option is for MGIZA only.

After you have downloaded, built and installed your favourite tool, we can go forward.

Making classes (needed by the HMM model):

mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes
mkcls -n10 -pcorp.tok.low.trg -Vcorp.tok.low.trg.vcb.classes

Translate the corpora into GIZA format:

plain2snt corp.tok.low.src corp.tok.low.trg

Create the cooccurrence file:

snt2cooc corp.tok.low.src_corp.tok.low.trg.cooc corp.tok.low.src.vcb corp.tok.low.trg.vcb corp.tok.low.src_corp.tok.low.trg.snt

Finally aligning!

Now you only need a configuration file for MGIZA or GIZA. I use this one; you only have to replace “.src” and “.trg” with the correct language strings: “it”, “en”, “fr”, etc.

If you use GIZA, you have to delete the “ncpus” line from this config file. With MGIZA, set it to the number of CPUs/cores that you have. Remember that if you have a CPU with hyper-threading, you can multiply the number of cores by two (I have an Intel i740 quad-core, so I use “ncpus 8”).

Cross your fingers and type:

mgiza configfile

After many hours, you’ll get as many output files as “ncpus”, in this format:

src_trg.dict.A3.final.part0
src_trg.dict.A3.final.part1
src_trg.dict.A3.final.part2
...

You only have to concatenate them, and you have your word alignment!
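
The concatenation can be done by hand with cat, or with a small loop like the one below. This is just a sketch: concat_parts is a hypothetical helper, it assumes the parts are numbered from 0 without gaps as in the listing above, and it appends them in numeric order (a plain part* glob would put part10 before part2).

```shell
# Hypothetical helper: concatenate <prefix>0, <prefix>1, ... into one file.
# Stops at the first missing number, so the parts must be numbered without gaps.
concat_parts() {
    prefix=$1
    out=$2
    : > "$out"                      # start from an empty output file
    i=0
    while [ -f "${prefix}${i}" ]; do
        cat "${prefix}${i}" >> "$out"
        i=$((i + 1))
    done
}

# Usage, with the file names listed above:
#   concat_parts src_trg.dict.A3.final.part src_trg.dict.A3.final
```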

Little script for lazy ones

I wrote a simple script that does everything described above; you only need to adapt it to your languages. Right now it makes five word alignments, from Italian, Dutch, French, German and Spanish to English. You can use it freely if you want.