A simple implementation, built on the Weka machine learning framework and a couple of other tools, that solves Task 3 of SemEval 2010: Cross-Lingual Word Sense Disambiguation.

I did not participate in the competition, since Prof. Roberto Navigli assigned this project to me after the deadline had passed, but I used exactly the same information and data the participants had. Here is the code, without the corpora of course (they amount to a few GB of data).

The competition worked like this: a team could either choose a single language to be tested against English (the bilingual subtask) or join the multilingual subtask (English versus all five other languages). A team could also choose whether to be evaluated as “best” or “out of five”. See the task description for details.

My program covers only the bilingual subtask (though it could easily be extended to the multilingual one), and it reached good results on both the “best” and “out of five” evaluations.

Here I describe how to perform a GIZA/MGIZA word alignment on the Europarl parallel corpora for this task.
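To give an idea of how the alignments can be used afterwards, here is a minimal sketch in Python (not taken from the project code) that reads word alignments and counts how a target English word was translated. It assumes the usual GIZA++ `A3` output, where each sentence pair takes three lines: a comment line, the foreign sentence, and the English tokens each followed by the 1-based positions they align to. The function names `parse_a3_record` and `translation_counts` and the sample sentence are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Matches groups such as "bank ({ 2 })" on the alignment line of a record.
# Assumption: the line follows the common GIZA++ A3 layout.
_GROUP = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_a3_record(foreign_line, alignment_line):
    """Return (english_word, [aligned foreign words]) pairs for one record."""
    foreign = foreign_line.split()
    pairs = []
    for i, (word, positions) in enumerate(_GROUP.findall(alignment_line)):
        if i == 0 and word == "NULL":
            continue  # the leading NULL entry holds unaligned foreign words
        aligned = [foreign[int(p) - 1] for p in positions.split()]
        pairs.append((word, aligned))
    return pairs


def translation_counts(records, focus_word):
    """Count the foreign words (or phrases) aligned to focus_word."""
    counts = Counter()
    for foreign_line, alignment_line in records:
        for word, aligned in parse_a3_record(foreign_line, alignment_line):
            if word == focus_word and aligned:
                counts[" ".join(aligned)] += 1
    return counts


# Hypothetical record: foreign sentence plus its alignment line.
records = [
    ("die Bank schloss", "NULL ({ }) the ({ 1 }) bank ({ 2 }) closed ({ 3 })"),
]
print(translation_counts(records, "bank"))
```

Counts like these are one plausible way to collect candidate translations for an ambiguous word; how they feed into the classifier is a separate design choice.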
