Thursday, October 04, 2007

Information Retrieval Books

A list of good books on IR is here:
http://researchonsearch.blogspot.com/2005/12/information-retrieval-textbooks.html

Wednesday, September 26, 2007

Compiling GIZA++

Compiling GIZA++ can be a pain, since the available source was written for an older version of gcc. Because newer gcc releases comply more strictly with standard C++ template syntax and semantics, the old GIZA++ does not compile on gcc 4.2. With a lot of fixes, I was able to get GIZA++ working, but the word class creation tool mkcls proved a very tough nut to crack. Luckily, I found a link to source code that compiles with gcc 4.2 here on the StatMT site. Hope that helps anybody looking for GIZA++ and mkcls.

Saturday, September 15, 2007

Batch conversion of doc files to text files

It's kind of irritating when you need text files to run language processing applications on, but your corpora are in the form of Word documents. Here is a way to convert those doc files to txt files without having to open the document editor and do a 'Save As' for each corpus document: OpenOffice macros to the rescue. You can write a macro that saves the file as a text file. The macro can then be invoked from the command line by starting up OpenOffice in invisible mode, and you can wrap all this in a nice shell script to do any filtering/cleaning after the files are saved as text. A note of caution: I observed that OpenOffice saves the text files asynchronously, so the file might not yet be available for processing by the script lines that follow. It is better to pause for a while to let OpenOffice finish saving the document. You can find more about this nifty timesaver here:

http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html?page=2
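
The "pause for a while" step can be made a little more robust by polling until the text file exists and has stopped growing. Below is a minimal sketch of that idea, not the script from the linked article. The macro name Standard.MyConversions.SaveAsText is hypothetical (use whatever library, module and macro name you created), the sketch assumes the macro writes the output next to the input with a .txt extension, and the -invisible flag is assumed to match the OpenOffice version in use.

```shell
#!/bin/sh
# Sketch: convert .doc files to text through an OpenOffice macro, then wait
# for the asynchronously written output before any post-processing.

# wait_for_file PATH [TIMEOUT_SECONDS]
# Returns 0 once PATH exists, is non-empty, and has the same size on two
# consecutive polls one second apart; returns 1 if the timeout expires first.
wait_for_file() {
    path=$1
    timeout=${2:-30}
    elapsed=0
    last_size=-1
    while [ "$elapsed" -lt "$timeout" ]; do
        if [ -s "$path" ]; then
            # tr strips the padding some wc implementations add
            size=$(wc -c < "$path" | tr -d ' ')
            if [ "$size" -eq "$last_size" ]; then
                return 0
            fi
            last_size=$size
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# convert_doc DOC_PATH
# ASSUMPTION: the macro name below is hypothetical, and the macro is assumed
# to save DOC_PATH with its .doc extension replaced by .txt.
convert_doc() {
    doc=$1
    txt="${doc%.doc}.txt"
    soffice -invisible "macro:///Standard.MyConversions.SaveAsText($doc)"
    wait_for_file "$txt" 60
}
```

A loop such as `for f in corpus/*.doc; do convert_doc "$f" || echo "timed out: $f" >&2; done` would then run the conversion over a whole directory, with any filtering/cleaning added after the `convert_doc` call.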

Saturday, August 11, 2007

Attending Conferences

Here is an interesting post: Conferences: Costs and Benefits. I have never been to a conference, but having sat through many a talk, it seems that only a fraction of the time spent is really useful. Still, I should attend at least one conference before I comment. Undoubtedly, going around asking questions is not something you can do by reading conference proceedings.

Saturday, July 07, 2007

You and your Research

I recently reread, after a long time, a highly enlightening and inspiring talk by Richard Hamming about what it takes to do significant research. Find it here.

Sunday, July 01, 2007

Writing code at compile time

How do you write a C program which, at compile time, allows you to input another program at the terminal? When you run the first program, the code that you input at the terminal should execute.

Something like:

bash#gcc -o 1 1.c
#include <stdio.h>
int main(void)
{
printf("Let a thousand ideas bloom :)");
}
^D
bash#./1
Let a thousand ideas bloom :)


So write the code for 1.c. Hint: Assume gcc and any UNIX variant as OS.

As it turns out, the solution is a one-liner.

#include "/dev/tty"

While compiling, the preprocessor tries to open and read from the file /dev/tty, just as it would for any other include such as stdio.h. Since /dev/tty is the terminal, you can now input the second program at the terminal. It gets compiled and, bingo, you execute code written at compile time.

Saturday, June 30, 2007

Where does science begin?

Haven't you seen a lot of scientific work built on fundamental assumptions? A problem I have is in accepting the unintuitive assumptions found in a lot of research. You find many of these, especially in a nascent field like Natural Language Processing. They are not like Euclidean axioms, which seem pretty reasonable. Yet research without such assumptions seems to be an impossibility. The problem is then to determine the right set of axioms to serve as a basis for building the theory. Some insights I found in a PhD thesis:

Any philosophical system, any science has to start with assumptions, axioms which cannot be really proved or disproved, which are fundamentally arbitrary but hopefully convincing. In his Tractatus Logico-Philosophicus, Wittgenstein [1918] writes that the only true philosophy would be to utter proven scientific facts, to use nothing but defined symbols of a defined formalism – i.e. to renounce on metaphysics and thus on philosophy:

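Die richtige Methode der Philosophie wäre eigentlich die: Nichts zu sagen, als was sich sagen lässt, also Sätze der Naturwissenschaft – also etwas, was mit Philosophie nichts zu tun hat–, und dann immer, wenn ein anderer etwas Metaphysisches sagen wollte, ihm nachzuweisen, dass er gewissen Zeichen in seinen Sätzen keine Bedeutung gegeben hat. ... (Wittgenstein 1918: 85, § 6.53)
[English: The correct method of philosophy would really be this: to say nothing except what can be said, that is, propositions of natural science, something that has nothing to do with philosophy, and then, whenever someone else wanted to say something metaphysical, to show him that he had given no meaning to certain signs in his propositions.]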

Wittgenstein is aware that the problems with this suggestion are, however, that every definition necessitates a definition of the defining terms until we reach the unprovable maxims. If we refuse to accept these fundamental maxims, the cornerstones of meaning, we cannot state anything and are condemned to remain silent.

Wovon man nicht sprechen kann, darüber muss man schweigen. (Wittgenstein 1918: 85, § 7)
[English: Whereof one cannot speak, thereof one must be silent.]

These maxims have transcendental, metaphysical quality, only they make any meaning possible and can thus instantiate our questions and answers in life.

Wir fühlen, dass, selbst wenn alle möglichen wissenschaftlichen Fragen beantwortet sind, unsere Lebensprobleme noch gar nicht berührt sind. Freilich bleibt dann eben keine Frage mehr; und eben dies ist die Antwort. (Wittgenstein 1918: 85, § 6.52)
[English: We feel that even when all possible scientific questions have been answered, our problems of life have still not been touched at all. Of course there is then no question left, and just this is the answer.]

Only the transcendental character of metaphysical philosophy can really give answers and assert meaning. If we only utter scientific proven facts we can only replace meaningless utterances with one another. E.g. in semantics we can step from language to metalanguage to meta-meta-etc.-language, but this does not bring us an inch closer to real meaning. On the other hand, because we cannot define the maxims we use, we remain incompetent about them nevertheless. Again, Wittgenstein’s famous quote applies:

Wovon man nicht sprechen kann, darüber muss man schweigen. (Wittgenstein 1918: 85, § 7)
[English: Whereof one cannot speak, thereof one must be silent.]

We are therefore in principle disqualified from speaking, from stating anything meaningful or even “scientific”. If we accept a minimal set of maxims on which everybody agrees science seems to be possible nevertheless, as long as we can base everything on these maxims.


A New Beginning

For a long time this blog lived a dormant existence under the name 'Machine Learning Chronicle'. Over the last year, caught between baffling Gaussian tosses and intricate kernel machines which go by the fancy name of Support Vector Machines, I have not been able to write much. Ideas are not restricted to one genre or science, and hence I thought of making this blog more broad-based, covering more topics in science and technology. So, here I rechristen this blog as 'Let a thousand ideas bloom!' - for ideas are the heart of science.

Over the last year, I have picked up interests in information retrieval, natural language processing, cognitive sciences and the Web 2.0 phenomenon. In addition, programming, physics and space talk are perennial favourites of mine. So that is what this place is about ... facts, thoughts and ideas.

So let me set the ball rolling ...