Saturday, September 15, 2007

Batch conversion of doc files to text files

It's kind of irritating when you need text files to run language processing applications on, and your corpora are in the form of Word documents. Here is a way to convert those doc files to txt files without having to open the document editor and do a 'Save As' for each corpus document: OpenOffice macros to the rescue. You can write a macro that saves the open document as a text file. The macro can then be invoked from the command line by starting OpenOffice in invisible mode. And you can wrap all this in a nice shell script that does any filtering/cleaning after the documents are saved as text. A note of caution: I observed that OpenOffice saves the text files asynchronously, so the file might not yet be available when the following lines of the script try to process it. It is better to pause for a while and give OpenOffice time to finish saving the document. You can find more about this nifty timesaver here:

http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html?page=2
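
To make the caution about asynchronous saving concrete, here is a minimal Python sketch of a wrapper that waits for the text file instead of sleeping for a fixed time. Several things in it are assumptions of mine, not part of the setup above: the macro name `Standard.MyConversions.SaveAsText` is hypothetical (use whatever you named your macro), the macro is assumed to take the document path as its argument and to write a `.txt` file next to the original, and the `soffice -invisible "macro:///..."` form is the invocation style used by OpenOffice 2.x, which may differ in other versions.

```python
import os
import subprocess
import sys
import time


def build_command(doc_path, macro="Standard.MyConversions.SaveAsText",
                  soffice="soffice"):
    """Build the argument list that asks OpenOffice to run a macro on one file.

    The macro name is hypothetical; the macro itself must already exist in
    your OpenOffice installation.
    """
    macro_url = "macro:///%s(%s)" % (macro, os.path.abspath(doc_path))
    return [soffice, "-invisible", macro_url]


def wait_until_saved(path, timeout=30.0, poll=0.5):
    """Return True once `path` exists, is non-empty, and has stopped growing.

    The file counts as saved when two consecutive polls see the same
    non-zero size. Returns False if that does not happen within `timeout`
    seconds (so a genuinely empty output file is reported as not saved).
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False


def convert(doc_path, timeout=30.0):
    """Convert one document and return the path of the resulting text file."""
    # Assumption: the macro writes <name>.txt next to <name>.doc.
    txt_path = os.path.splitext(doc_path)[0] + ".txt"
    subprocess.call(build_command(doc_path))
    if not wait_until_saved(txt_path, timeout):
        raise RuntimeError("no text file appeared for %s" % doc_path)
    return txt_path


if __name__ == "__main__":
    # Usage: python convert_docs.py corpus/*.doc
    for doc in sys.argv[1:]:
        print(convert(doc))
```

Polling the file size is only a heuristic, but it avoids both failure modes of a fixed pause: waiting too long on small documents and not long enough on large ones. Any filtering or cleaning can then safely run on the path that `convert` returns.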
