How to do a word alignment with GIZA++ or MGIZA++ from a parallel corpus

I assume that you are working with a *nix box, and that you use a bash-like shell.

You need the sentence-aligned Europarl corpora for each language you want to word-align. Check that the corpora have the same number of lines and that they are correctly aligned.
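The line-count part of that check is easy to script. This is only a minimal sketch: the function name same_line_count and the file names raw_corp.src and raw_corp.trg are illustrative, and equal line counts do not prove that the sentences are correctly paired, so it is still worth reading a few pairs side by side.

```shell
# Succeeds (exit status 0) only if both files have the same number of lines.
same_line_count() {
    # "wc -l < file" prints the count without the file name;
    # the arithmetic expansion strips the padding some wc versions add.
    n1=$(( $(wc -l < "$1") ))
    n2=$(( $(wc -l < "$2") ))
    [ "$n1" -eq "$n2" ]
}

# Example use (hypothetical file names):
#   same_line_count raw_corp.src raw_corp.trg || echo "line counts differ"
#   paste raw_corp.src raw_corp.trg | sed -n '100,105p'   # eyeball a few pairs
```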

If you don’t want to build them yourself, you can use the sentence-aligned Europarl corpora built by Els Lefever. They are raw (no XML tags, but capital letters are kept and words are not cleanly separated), so if you want the word alignment you have to follow all of the next steps. Note that they are compressed in a tar.gz archive and that they cover only six languages: English, Italian, French, Spanish, German and Dutch. If you want to use different languages but you don’t know how to do it, please comment on this post.

First of all

You want to do a word alignment between two languages. We call them the source language and the target language. The direction matters for the alignment, so decide which language will be the source and which the target.

To help you decide: the word alignment can only be one-to-one, NULL-to-one or many-to-one. So if you choose English as the source language and French as the target, you can get an alignment like this:

[Figure: example word alignment between English and French. Image via Wikipedia]

You may want a function like this:

$f(\text{english}) = \text{french}$

which is impossible with the alignment above. In that case you have to use French as the source language and English as the target.

In the next sections I’ll use this convention for every file name: source = .src and target = .trg

So for example, if you downloaded the raw corpora above and you want to do an English (source) to French (target) alignment (like in the image above), you can think of raw_corp.src as raw_corp.en and of raw_corp.trg as raw_corp.fr.

Pre-processing

We have to clean up the corpora: lowercase every word and separate the words from each other (that is, tokenize). We need the tools from the Europarl maintainers, which you can download here:

http://www.statmt.org/europarl/v5/tools.tgz

Now enter the subdirectory tools, and take the script tokenizer.perl and the directory nonbreaking_prefix (they should be in the same directory!).

The nonbreaking_prefix files let the tokenizer keep words like “Mr.” together. Otherwise the tokenizer would break it into two tokens, “Mr” and “.”, even though the final dot is part of the abbreviation and not real punctuation.

tools.tgz doesn’t include prefixes for every language, so I made my own. You can freely use them, and if you correct them please contact me.

Now, let’s tokenize!

tokenizer.perl -l src < raw_corp.src > corp.tok.src
tokenizer.perl -l trg < raw_corp.trg > corp.tok.trg
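If tokenizer.perl aborts with an error like “does not map to Unicode”, the input is probably not UTF-8 (a reader reports exactly this in the comments, and fixed it by changing the file encoding). The sketch below checks a file before tokenizing. The function name is_utf8 is illustrative, and the ISO-8859-1 source encoding in the conversion example is only an assumption; check what your corpus really uses.

```shell
# Succeeds (exit status 0) only if the whole file is valid UTF-8.
is_utf8() {
    # iconv stops with a non-zero status at the first byte sequence
    # that is not valid UTF-8; the converted text itself is discarded.
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example use (hypothetical file names, assumed Latin-1 input):
#   is_utf8 raw_corp.src || iconv -f ISO-8859-1 -t UTF-8 raw_corp.src > raw_corp.utf8.src
```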

And now you can lowercase every word:

tr '[:upper:]' '[:lower:]' < corp.tok.src > corp.tok.low.src
tr '[:upper:]' '[:lower:]' < corp.tok.trg > corp.tok.low.trg

Making classes and cooccurrences

Now you have to choose: MGIZA or GIZA?

They are equivalent, but MGIZA is multi-threaded and GIZA is not. My advice is to choose MGIZA, but if you have to align a lot of language pairs you can instead run several GIZA processes at once, one per pair, so it’s your choice. I’ll say explicitly when an option is for MGIZA only.

After you have downloaded, built and installed your favourite tool, we can go forward.

Making classes (necessary for the HMM algorithm):

mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes
mkcls -n10 -pcorp.tok.low.trg -Vcorp.tok.low.trg.vcb.classes

Translate the corpora into GIZA format:

plain2snt corp.tok.low.src corp.tok.low.trg

Create the cooccurrence file:

snt2cooc corp.tok.low.src_corp.tok.low.trg.cooc corp.tok.low.src.vcb corp.tok.low.trg.vcb corp.tok.low.src_corp.tok.low.trg.snt
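Some builds behave differently here: a reader reports in the comments that their snt2cooc.out accepts only three arguments (vcb1 vcb2 snt12) and answers “ERROR: wrong option” to the command above. In that case the cooccurrences go to standard output and have to be redirected. A minimal sketch for that variant; the wrapper name make_cooc is illustrative, and the binary path is whatever your build produced:

```shell
# Run a three-argument snt2cooc and save its standard output as the .cooc file.
# $1 = path to the snt2cooc binary   $2 = source .vcb   $3 = target .vcb
# $4 = .snt file                     $5 = output .cooc file
make_cooc() {
    "$1" "$2" "$3" "$4" > "$5"
}

# Example use (binary path taken from the reader's report):
#   make_cooc ./GIZA++-v2/snt2cooc.out corp.tok.low.src.vcb corp.tok.low.trg.vcb \
#       corp.tok.low.src_corp.tok.low.trg.snt corp.tok.low.src_corp.tok.low.trg.cooc
```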

Finally aligning!

Now you only need a configuration file for MGIZA or GIZA. I use this one; you only have to replace “.src” and “.trg” with the correct language strings: “it”, “en”, “fr”, etc.

If you use GIZA, you have to delete the “ncpus” line from this config file. With MGIZA, set it to the number of CPUs/cores that you have. Remember that if your CPU has hyper-threading, you can multiply the number of cores by two (I have an Intel i740 quad-core, so I use “ncpus 8”).
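If you are not sure how many logical CPUs the machine has, the shell can tell you. A small sketch, assuming a Linux or macOS box where getconf knows _NPROCESSORS_ONLN; the count it reports already includes hyper-threads, so there is nothing left to multiply:

```shell
# Number of logical CPUs currently online (hyper-threads included).
ncpus=$(getconf _NPROCESSORS_ONLN)

# Print the line in the form used in the config file, e.g. "ncpus 8".
echo "ncpus ${ncpus}"
```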

Cross your fingers and type:

mgiza configfile

After many hours, you’ll get as many output files as “ncpus”, in this format:

src_trg.dict.A3.final.part0
src_trg.dict.A3.final.part1
src_trg.dict.A3.final.part2
...

You only have to concatenate them, and you have your word alignment!
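The concatenation can be sketched as a small loop. It appends the parts in numeric order, because a plain part* glob would sort part10 before part2 once you have more than ten parts. The function name merge_parts is illustrative, and the sketch does nothing more than concatenate: it does not reorder sentence pairs across parts.

```shell
# Concatenate PREFIX.part0 .. PREFIX.part(N-1) into one file, in numeric order.
# $1 = common prefix (e.g. src_trg.dict.A3.final)
# $2 = number of parts (the "ncpus" value)
# $3 = output file
merge_parts() {
    i=0
    : > "$3"                                  # start from an empty output file
    while [ "$i" -lt "$2" ]; do
        cat "$1.part$i" >> "$3" || return 1   # stop if a part is missing
        i=$((i + 1))
    done
}

# Example use:
#   merge_parts src_trg.dict.A3.final 8 src_trg.dict.A3.final.all
```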

A little script for the lazy

I wrote a simple script that does everything described above; you only need to adapt it to your languages. As it stands it makes five word alignments, from Italian, Dutch, French, German and Spanish to English. You can freely use it if you want.

45 comments

alissa says:

January 25, 2012 at 20:51

Great thanks for this step-by-step guide! It’s really useful!

1. Fabio T. says:
  
  January 25, 2012 at 20:59
  
  happy to have been of help :)
  
alissa says:

January 25, 2012 at 21:27

oops! some strange error has occurred:
utf8 “\xF3” does not map to Unicode at ./tokenizer.perl line 45, line 88.
Malformed UTF-8 character (fatal) at ./tokenizer.perl line 64, line 88.

The sentence in line 88 is:
[i]But I would also like to make it very clear that President Prodi made a commitment to this Parliament to introduce a new debate, as Mr Barón Crespo has reminded us, which would be in addition to the annual debate on the Commission ‘s legislative programme, on the broad areas of action for the next five years, that is to say, for this legislature. [/i]

As I understand it’s because of the word Barón (symbol ó).
Has this occurred before? Don’t you have a ready-made answer what to do?
Thanks!

1. alissa says:
  
  January 25, 2012 at 21:37
  
  Heh, I’ve already solved it: I only had to change the encoding of my initial file!
  Sorry for bothering! :)
  
  1. Fabio T. says:
    
    January 25, 2012 at 21:38
    
    ok perfect! no bothering at all :)
    
2. Fabio T. says:
  
  January 25, 2012 at 21:38
  
  Since megaupload is gone, I guess you are not using Els Lefever’s corpora that I linked but the original ones from http://www.statmt.org/europarl/ …is that right? If you tell me exactly which corpora are you using I’ll give a look into it: I remember to have had some problems with character encoding, but it’s been a long time and I can’t remember :)
  
  1. alissa says:
    
    January 25, 2012 at 21:49
    
    How awkward again: it seems I didn’t refresh the page in time!
    
    I use the corpora from http://opus.lingfil.uu.se/ . They have a wide-wide range of different parallel (sentence aligned) corpora including Europarl. Have a look, it’s a great repository of corpora!
    
alissa says:

January 25, 2012 at 21:45

Just one more comment :)
Since MEGAUPLOAD is not available any more, would you please reupload your scripts to some other filesharing host? I can recommend http://www.rapidshare.ru or http://www.rapidshare.de or http://www.ifolder.ru … etc.

1. Fabio T. says:
  
  January 26, 2012 at 02:25
  
  done :) thanks for the link full of corpora, I didn’t know it. Bye!
  
alissa says:

January 26, 2012 at 22:05

Hi! It’s me again :) I’ve just started playing with GIZA++ and word alignment and stuck with a number of questions. I do not expect you to answer them, but may be give me some hint where to find a clue. I suppose, everybody comes up with the same questions…
To resolve them I am looking now through “A Systematic Comparison of Various Stat. Alignment Models” by F.J. Och now, but it’s quite theoretical and kind of raises even more questions. That’s why so far your config file is the best practical guidance :)
How do you choose what values of the alignment parameters to use? Why did you choose exactly those that were in your config file? Are there any recommendations/ works/ publications on the best combination of parameter values? Are they language dependent? I am looking for the English-Spanish pair of languages.

1. Fabio T. says:
  
  February 9, 2012 at 19:24
  
  Hi! Really sorry for the long time to reply..and for the bad news of this answer :)
  
  I worked with this stuff more than one year ago, and I couldn’t find good information online so I tried empirically. The pages of MGIZA that I linked were very good but now the guy seems to have messed up his own site, which was a wiki and now is on wordpress…so the little (but very useful) guide he wrote about the options to MGIZA is gone. Maybe you may find it into google cache or something like that.
  
  Another starting point may be the MOSES project: http://www.statmt.org/moses/?n=FactoredTraining.RunGIZA since it uses GIZA in one of the steps towards building the translation model. It seems a very good project and has quite a lot of pages on how to set up the system (and the other packages it uses, including GIZA). MGIZA (and its parallel brother, PGIZA) use mostly the same options as GIZA.
  
  Probably in the future I will work again on word sense disambiguation and I’ll try to write some tutorials on this kind of things. By the way, recently I’ve given a look to a toolkit, NLTK: http://www.nltk.org/ that I find extremely easy and powerful to manipulate datasets. If the same steps we did here are doable with nltk, I will write a post on how to do it :)
  
  Sorry not to be able to help you more.
  
  Best wishes,
  Fabio
  
Azon says:

March 5, 2012 at 20:13

Hi!
First of all big thanks for your well-explained tutorial.
I am trying to do a one-to-one correspondence translation from Europarl-de-en
I have already aligned the corpus and get an aligned.grow-diag-final-and file. How can I get a one-to-one correspondence using Europarl-de und the aligned.grow-diag-final-and file?
I would be glad if you can help me.
Best regards.

Azon

1. Fabio T. says:
  
  March 5, 2012 at 20:42
  
  Hi Azon! Can you elaborate a bit about what you want to do? If you start with two, sentence-aligned corpora (“de” and “en”) you can get a word alignment, as explained in the tutorial..and you can also have one with the cooccurences, if you want. What are you exactly trying to obtain, and from what kind of corpora?
  
  Hope to be able to help you,
  Fabio
  
Azon says:

March 11, 2012 at 09:15

Hi Fabio,
thanks a lot and sorry for the late reply!
This is what I want to do:
for each german word in “de” I would like to find out if the word was aligned only ONCE in the “aligned.grow-diag-final-and” output file. I have already trained moses with the parallel corpus “de-en” and got the above alignment file. Is there any issue to get this one-to-one correspondence?
Best regards!
Azon

Eleni Teshome says:

April 10, 2012 at 06:34

hi,
i wanted to use the word alignment for Amharic Language. could you please tell me how?

1. Negacy. says:
  
  March 20, 2013 at 18:42
  
  Hi Eleni,
  Are you still working on Amharic?
  
Swapnil Jadhav says:

September 11, 2012 at 06:52

How to “test” the model that has been built (trained) using the procedure given above ?
I am able to do every step given above. Now I want to test the model for unobserved data.
Please explain the steps for that.
I am using giza++
Thanks

1. Negacy. says:
  
  March 21, 2013 at 19:59
  
  That is also my question, how do I run GIZA++ for test data? There is a flag -tc but I have no idea what the format is gonna be? I put one sentence from the source language into testcorpusfile and running the following is giving me empty file *.dict.tst.A3.final
  
  ./GIZA++-v2/GIZA++ configfile -tc testcorpusfile
  
  Thanks.
  
  Neg.
  
  1. Fabio T. says:
    
    March 21, 2013 at 21:11
    
    I have only used GIZA to create the word alignments, and the “Weka” framework to train models and do predictions. I don’t know if GIZA provides anything of the sort, but you can look up the MOSES project that embeds and extends GIZA.
    
Sarah says:

January 29, 2013 at 16:01

You are an angel. Thanks for your effort! It saved me so much time!

1. Fabio T. says:
  
  March 21, 2013 at 21:11
  
  you are welcome, glad to see people still using this old tutorial! :)
  
Negacy. says:

March 20, 2013 at 18:16

Hi,
I am getting error msg when running snt2cooc. Here is what I did:

Admins-MacBook-Pro-2:giza-pp negacy$ ./GIZA++-v2/snt2cooc.out corp.tok.low.src_corp.tok.low.trg.cooc corp.tok.low.src.vcb corp.tok.low.trg.vcb corp.tok.low.src_corp.tok.low.trg.snt
ERROR: wrong option

I believe snt2cooc takes three arguments, as shown below:
Usage: ./GIZA++-v2/snt2cooc.out vcb1 vcb2 snt12

Why is corp.tok.low.src_corp.tok.low.trg.cooc given as an argument for snt2cooc in the tutorial?
Assuming corp.tok.low.src_corp.tok.low.trg.cooc is output of snt2cooc, can I do:
snt2cooc.out corp.tok.low.src.vcb corp.tok.low.trg.vcb corp.tok.low.src_corp.tok.low.trg.snt > corp.tok.low.src_corp.tok.low.trg.cooc

Basically, I am redirecting output of snt2cooc into *.cooc

Thanks.

Negacy.

1. Fabio T. says:
  
  March 21, 2013 at 21:18
  
  The source code might have changed (or the version you are using is different from mine), when I wrote the tutorial snt2cooc took four arguments where the first was the output file. You may certainly pass only three arguments (vcb1, vcb2 and snt12) and redirect the output to a file, sure ;)
  
arrkaa says:

April 24, 2013 at 20:46

Dear Fabio
I need to run GIZA++ for my project. I followed this guide, but after running GIZA++ configFile I get a lot of errors of this form:
ERROR: no word index for “very”
ERROR: no word index for “please”
ERROR: no word index for ….
Do u have any suggestion for fixing this problem?

Thank you

iykeln says:

June 25, 2013 at 14:40

Hi Fabio
Thanks for this piece.
It is really okay!
Pls, can I ask you how to visualize/examine the giza++ alignment apart from the *.dict.A3.final.part.final file generated.
Regards.

1. Fabio T. says:
  
  February 8, 2014 at 13:34
  
  Sorry for the late reply. I’m afraid the visualization or analysis would be worth a new article (or set of articles) and I’m not working on this stuff anymore.
  
  I personally used it with the machine learning suite WeKa, that is freely available and very powerful (also decently documented). It also has many visualisation tools for datasets.
  
  Here it is: http://www.cs.waikato.ac.nz/ml/weka/
  
Hamdi says:

October 31, 2013 at 17:28

“If you want to use different languages but you don’t know how to do, please comment this post.”
I need Arabic Language , could you Please…
Actually, the explanation was very useful, but without Arabic it’s difficult for me to practice it. Thank you a lot.

1. Fabio T. says:
  
  February 8, 2014 at 13:36
  
  If you can find a parallel corpus between Arabic and another language, you can follow the steps above :) I’ve only used europarl corpora, but I’m sure there are arabic datasets out there.
  
jyoti says:

February 8, 2014 at 13:10

Hi
I tried word alignment for English-Hindi with GIZA++ by using steps you have given but at the end of it. when I run GIZA++ configfile, it gives some parameters and in last it said “segmentation fault (core dumped)”. Why am I getting this error. Please reply me its urgent

1. Fabio T. says:
  
  February 8, 2014 at 13:30
  
  That seems likely to be a problem with the GIZA++ version you are using, I’m afraid.
  
  Can you write on pastebin the whole config file you use and the COMPLETE set of steps you take (including output)?
  
  I’ll have a look but it’s been such a long time, I can’t guarantee anything :)
  
2. Fabio T. says:
  
  February 8, 2014 at 13:31
  
  Also, you might want to try MGIZA instead of GIZA, maybe that can help.
  
jyoti says:

February 8, 2014 at 13:38

My problem is solved by using negacy’s suggestion.
did you get any process to test it. if yes then tell me please.

1. jyoti says:
  
  February 8, 2014 at 13:40
  
  Hi Fabio
  my previous problem is solved now. thank you so much for your step wise process to install GIZA. Now I am trying to find out a method to test it.
  Thanks again.
  
  1. Fabio T. says:
    
    February 8, 2014 at 13:45
    
    Oh I see. You want to train a model. I haven’t written any tutorial on the subject..
    
    Here there is the old code I used to create a Word Sense disambiguator using the WeKa framework and datasets created with GIZA:
    
    https://github.com/fabioticconi/cl-wsd
    
    Maybe it can give you some hints.
    
2. Fabio T. says:
  
  February 8, 2014 at 13:42
  
  What do you mean by “a process to test it”?
  
  Were you able to get a word alignment or are you still stuck?
  
Omwoma Vinent says:

April 12, 2014 at 13:05

I am undertaking a word alignment MSc. Comp Science project. Am using Bayesian Word alignment because of its benefits compared to Expectation Maximization model. I would like to know the folowing:
1. How to configure GIZA++ for Bayesian Word alignment
2.How to incorporate Gibbs sampler algorithm into the model.

Thank you in advance

Catherine says:

April 14, 2014 at 07:57

Hi everyone!

I don’t know how easy this would be, given the size of the corpus, but has anyone a word-aligned version of Europarl (I’m interested in fr-en specifically, but any other pair would do) she wants to share? This is taking so long…

Thank you so much in advance!

1. Catherine says:
  
  April 15, 2014 at 16:33
  
  Actually, it finally ended. You can download the en-fr aligned corpus here: http://catherinegasnier.blogspot.ch/2014/04/europarl-corpus-v7-en-fr-word-aligned.html
  
Catherine says:

April 28, 2014 at 15:21

Dear all, I have made a configurable script, from the one provided by our blogger, where you can configure various paths and source and target languages by modifying a few variables. Feel free to use it, share it or do whatever please you.
The script: https://dl.dropboxusercontent.com/u/64718434/NLP/giza_script.sh
The config file that goes with it: https://dl.dropboxusercontent.com/u/64718434/NLP/cfg.gizacfg

1. Fabio T. says:
  
  April 8, 2015 at 15:48
  
  With culpable delay.. thanks! I still can’t believe how many people still reach this blog post and ask for help. Your contribution will surely be helpful.
  
Pingback: GIZA++およびMGIZAの使い方 – Liberal Life
dereje mulugeta says:

February 3, 2017 at 07:35

hay i am using some different local language and i have the file in word format how can i do the alignment ?. I see that you are using files who have .src and .trg file extensions.

dereje mulugeta says:

February 26, 2017 at 12:59

hi fabio i have got the word alignment file by using mgiza .A3.final but where can i get the .ti files that are found on the giza ++ . I need the dictionary file . should i use a script ??

Bragunetzki says:

April 2, 2024 at 16:19

Hello! This seems like an old post, so I don’t know if I’ll get a response, but I’ll try anyway. When changing the config file, you mention that “you only have to change “.src” and “.trg” with the correct language strings: “it”, “en”, “fr”, etc.”

I don’t quite understand what you mean here. Aren’t “.trg” and “.src” parts of the filenames that will be used during alignment? The files created with previous commands all reference “.src” and “.trg”, so the config should reference them as well.

I can make the config file work with giza, but now I’m wondering if I missed a separate language setting somewhere, and my alignment isn’t actually generated correctly.

1. Fabio T. says:
  
  April 2, 2024 at 20:17
  
  Hi! What I meant is that you will likely have a bunch of corpus_raw.XX files. In the instructions I use .src for the source language you care about and .trg for the target. In this way you know the order of each when running the commands.
  
  Say you want to align from English to French. You’ll have corpus_raw.en and corpus_raw.fr in the scripts above, not .src and .trg
  They will be .en and .fr respectively for each of the lines of code I put above (if you do it like I did)
  
  But in the end it’s up to you. It was just a suggestion on my side 😁
  