A simple implementation, using the Weka machine learning framework and a couple of other tools, that solves Task 3 of SemEval 2010, Cross-lingual Word Sense Disambiguation.
I did not participate in the competition, since Prof. Roberto Navigli assigned me this project after the deadline had passed, but I used exactly the same information and data the participants had. Here is the code, without the corpora of course (they amount to a few GB of data).
The competition worked this way: a team could either choose a single language to be tested against English, or join the multilingual subtask (English versus all five other languages). A team could also choose to be evaluated as “best” or “out of five”. Click here for details.
My program covers only the bilingual task (but could easily be extended to the multilingual one), and I reached good results on both the “best” and “out of five” evaluations.
Here I describe how to perform a GIZA/MGIZA word alignment using the Europarl parallel corpora for this task.
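
To give an idea of why word alignment matters for this task, here is a minimal Python sketch of how aligned sentence pairs can yield candidate translations for an ambiguous English word. It is an illustration only, not the code of this project: the function name `translation_counts`, the toy sentences, and the `i-j` alignment format (source index, target index, as commonly produced after symmetrizing alignments) are all assumptions, and raw GIZA output would first need to be converted into that form.

```python
from collections import Counter


def translation_counts(src_sentences, tgt_sentences, alignments, target_word):
    """Count the target-language words aligned to `target_word`.

    Hypothetical helper: each alignment line is assumed to be a
    space-separated list of "i-j" pairs (source index, target index).
    """
    counts = Counter()
    for src, tgt, align in zip(src_sentences, tgt_sentences, alignments):
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        for pair in align.split():
            i, j = (int(x) for x in pair.split("-"))
            if src_tokens[i].lower() == target_word:
                counts[tgt_tokens[j].lower()] += 1
    return counts


# Toy English-Italian data, invented for illustration only.
src = ["the bank approved the loan", "we sat on the river bank"]
tgt = ["la banca ha approvato il prestito",
       "eravamo seduti sulla riva del fiume"]
align = ["0-0 1-1 2-2 2-3 3-4 4-5", "0-0 1-1 2-2 3-2 4-5 5-3"]

counts = translation_counts(src, tgt, align, "bank")
print(counts)
```

On real data, each distinct translation collected this way could serve as a candidate sense label for a classifier, with the surrounding English words as features.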