You want to do a word alignment between two languages, which we'll call the source language and the target language. This choice matters for the result, so decide carefully which language will be the source and which the target.
Keep in mind that the word alignment can only be one-to-one, NULL-to-one or many-to-one. So if you choose English as the source language and French as the target, you can get an alignment like this:
Image via Wikipedia
You may instead want an alignment like this:
which is impossible with the previous choice. In that case you have to use French as the source language and English as the target.
In the next sections I'll use this naming convention for file names: source = .src and target = .trg.
So, for example, if you downloaded my raw corpora and you want to do an English (source) to French (target) alignment (like in the image above), you can think of raw_corp.src as raw_corp.en and raw_corp.trg as raw_corp.fr.
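If you'd rather not rename your files, you can map language-named corpora onto the .src/.trg convention with symlinks. The file names here are hypothetical (an English-to-French setup); adapt them to your own corpus files. They match the tokenizer commands used later:

```shell
# Map language-named corpus files onto the .src/.trg convention.
# -f replaces the links if they already exist.
ln -sf raw_corpus.en raw_corp.src
ln -sf raw_corpus.fr raw_corp.trg
```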
We have to clean up the corpora: put every word in lower case and separate the words from one another (in other words, "tokenize" them). We need the tools from the europarl maintainers, which you can download here:
Now enter the tools subdirectory and take the script tokenizer.perl and the directory nonbreaking_prefix (they have to stay in the same directory!).
The nonbreaking_prefix directory lets the tokenizer keep together tokens like "Mr.". Normally the tokenizer would break it into two tokens, "Mr" and ".", but we know that the final dot is part of the abbreviation, not real punctuation.
The tools.tgz archive doesn't contain prefix files for every language, so I made my own. You can use them freely, and if you correct them, please contact me.
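For reference, the prefix files have a very simple format. This is my understanding of the English file shipped with the tokenizer (check yours before editing): one prefix per line, with an optional marker for prefixes that should only be kept together when a number follows.

```
# Lines starting with "#" are comments.
Mr
Mrs
Dr
# A prefix marked #NUMERIC_ONLY# is only kept together
# when followed by a number, e.g. "No. 5":
No #NUMERIC_ONLY#
```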
Now, let’s tokenize!
tokenizer.perl -l src < raw_corp.src > corp.tok.src
tokenizer.perl -l trg < raw_corp.trg > corp.tok.trg
And now you can lowercase every word:
tr '[:upper:]' '[:lower:]' < corp.tok.src > corp.tok.low.src
tr '[:upper:]' '[:lower:]' < corp.tok.trg > corp.tok.low.trg
Making classes and cooccurrences
Now you have to choose: MGIZA or GIZA?
They produce the same results, but MGIZA is multi-threaded while GIZA is not. My advice is to choose MGIZA; but if you have to align lots of language pairs, you can instead run a separate GIZA instance for each pair, so it's your choice. I'll say explicitly when an option is for MGIZA only.
After you have downloaded, built and installed your favourite tool, we can go forward.
Make the word classes (needed by the HMM alignment model):
mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes
mkcls -n10 -pcorp.tok.low.trg -Vcorp.tok.low.trg.vcb.classes
Convert the corpora into GIZA format:
plain2snt corp.tok.low.src corp.tok.low.trg
Create the cooccurrence file:
snt2cooc corp.tok.low.src_corp.tok.low.trg.cooc corp.tok.low.src.vcb corp.tok.low.trg.vcb corp.tok.low.src_corp.tok.low.trg.snt
Now you only need a configuration file for MGIZA or GIZA. I use this one; you only have to change ".src" and ".trg" to the correct language codes: "it", "en", "fr", etc.
If you use GIZA, delete the "ncpus" line from this config file. With MGIZA, set it to the number of CPUs/cores you have. Remember that if your CPU has hyper-threading, you can double the number of cores (I have an Intel i7 740 quad-core, so I use "ncpus 8").
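In case you've lost my config file, here is a minimal sketch of what it contains. The key names are the ones I've seen in MGIZA-generated .gizacfg files (yes, "coocurrencefile" really is spelled with a missing "c" in GIZA), and "aligned" is a hypothetical output prefix; double-check everything against the sample config shipped with your build:

```
s corp.tok.low.src.vcb
t corp.tok.low.trg.vcb
c corp.tok.low.src_corp.tok.low.trg.snt
coocurrencefile corp.tok.low.src_corp.tok.low.trg.cooc
o aligned
ncpus 8
```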
Cross your fingers and type:
After many hours, you’ll get as many output files as “ncpus”, in this format:
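The part files are in GIZA's A3 format. From memory (the exact numbers here are illustrative), each sentence pair takes three lines: a comment line, the target sentence, and the source words each followed by the target positions they align to:

```
# Sentence pair (1) source length 5 target length 5 alignment score : 1.6e-05
je ne parle pas français
NULL ({ }) i ({ 1 }) do ({ }) not ({ 2 4 }) speak ({ 3 }) french ({ 5 })
```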
You only have to concatenate them, and you have your word alignment!
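Assuming your output prefix was "aligned" (a hypothetical name; substitute your own), the parts can be joined with cat. The shell expands the glob in sorted order, so the parts stay in sequence:

```shell
# MGIZA writes one part per thread, e.g. aligned.A3.final.part000,
# aligned.A3.final.part001, ... — join them into a single file.
cat aligned.A3.final.part* > aligned.A3.final
```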
Little script for lazy ones
I wrote a simple script that does everything described above; you only need to adapt it to your languages. At the moment it produces five word alignments, from Italian, Dutch, French, German and Spanish to English. You can use it freely if you want.