Wednesday, September 26, 2007

Compiling GIZA++

Compiling GIZA++ can be a pain, since the available source was written for an older version of gcc. Newer gcc releases are stricter about standard C++ template syntax and semantics, so the old GIZA++ does not compile on gcc 4.2. After a lot of fixes I was able to get GIZA++ working, but the word class creation tool mkcls proved a very tough nut to crack. Luckily, I found a link to source code that compiles with gcc 4.2 here on the StatMT site. Hope that helps anybody looking for GIZA++ and mkcls.

Saturday, September 15, 2007

Batch conversion of doc files to text files

It's kind of irritating when you need text files to run language processing applications on and your corpora are in the form of Word documents. Here is a way to convert those doc files to txt files without having to open the document editor and do a 'Save As' for each corpus document: OpenOffice macros to the rescue. You can write a macro that saves the file as a text file. The macro can then be invoked from the command line by starting OpenOffice in invisible mode. And you can wrap all this in a nice shell script to do any filtering/cleaning after saving them as text files. A note of caution: I observed that OpenOffice saves the text files asynchronously, so the file might not be available yet when the following script lines try to process it. Better to pause for a while, while OpenOffice saves the document. You can find more about this nifty timesaver here:

http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html?page=2
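Here is a rough sketch of the wrapper idea in Python instead of shell, mostly to show the "pause while OpenOffice saves" part. It rests on several assumptions: the macro name Standard.MyConversions.SaveAsText is hypothetical (use whatever you called yours), the macro is assumed to write a .txt file next to the .doc, and the -invisible flag is the old OpenOffice spelling (newer builds may want --invisible). Instead of sleeping for a fixed time, it polls until the output file exists and its size has stopped changing.

```python
import os
import subprocess
import time


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Return True once `path` exists, is non-empty, and has the same
    size on two consecutive polls; return False if `timeout` expires."""
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False


def build_command(doc_path, soffice="soffice",
                  macro="Standard.MyConversions.SaveAsText"):
    """Build the command line that runs the macro on one document.
    The macro name is a placeholder (assumption), not a built-in."""
    url = "macro:///%s(%s)" % (macro, os.path.abspath(doc_path))
    return [soffice, "-invisible", url]


def convert_all(doc_dir):
    """Convert every .doc in `doc_dir`; return the .txt paths that appeared.
    Assumes the macro saves <name>.txt alongside <name>.doc."""
    converted = []
    for name in sorted(os.listdir(doc_dir)):
        if not name.lower().endswith(".doc"):
            continue
        doc_path = os.path.join(doc_dir, name)
        txt_path = os.path.splitext(doc_path)[0] + ".txt"
        subprocess.call(build_command(doc_path))
        # OpenOffice may still be writing after the command returns.
        if wait_for_file(txt_path):
            converted.append(txt_path)
    return converted
```

Calling convert_all("/path/to/corpus") would then give you the list of text files that are safe to filter or clean. Note that an empty document produces an empty text file, which this check treats as "not ready" until the timeout runs out.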