I’ve just been playing with the C++ API to Hadoop’s HDFS. All on my newly-installed virgin Linux box, so no baggage! I encountered a few problems, all of which proved straightforward to fix, but which may highlight issues of possible interest to their documentation folks. Recording here while it’s relatively fresh in the mind.
1: I’ve downloaded hadoop, now how do I install it? The docs and README tell me nothing; there’s no INSTALL. I played with quick start in the download directory, but obviously that’s not something you want to keep on doing! Fortunately someone on the wiki tells me: I just move the whole caboodle to /usr/local and set up the paths. And a dedicated hadoop user as suggested there makes sense.
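For the record, the path setup amounts to something like the following. This is a minimal sketch rather than an official install procedure: the HADOOP_HOME variable name and the commented-out commands are assumptions based on the wiki advice, and the privileged steps are left as comments so nothing here touches the system.

```shell
# Privileged one-off steps, shown as comments only (assumed, adjust to taste):
#   sudo useradd -m hadoop                  # dedicated hadoop user
#   sudo mv hadoop-x.y.z /usr/local/hadoop  # x.y.z is a placeholder version
#   sudo chown -R hadoop /usr/local/hadoop

# Assumed install location, matching the /usr/local advice above.
export HADOOP_HOME=/usr/local/hadoop

# Put the hadoop launcher script on the PATH.
export PATH="$PATH:$HADOOP_HOME/bin"
```

Dropping the two export lines into the hadoop user's shell profile saves retyping them in every session.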
2: Now “hadoop” works and emits a usage message, but as soon as I try to do something it fails. OK, my virgin Linux box doesn’t have a JVM installed; just go ahead and install it. The fact that “hadoop” had produced the usage message had led me to suppose it was installed: misleading until “file hadoop” revealed it to be a script!
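A quick sanity check before blaming hadoop could look like this. It is only a sketch: check_java is a made-up helper name, and it assumes a JVM is found either through JAVA_HOME or on the PATH.

```shell
# check_java is a hypothetical helper, not part of Hadoop.
# It reports where a Java runtime was found, or fails if there is none.
check_java() {
    if [ -n "${JAVA_HOME:-}" ] && [ -x "${JAVA_HOME}/bin/java" ]; then
        echo "java found via JAVA_HOME: ${JAVA_HOME}"
    elif command -v java >/dev/null 2>&1; then
        echo "java found on PATH: $(command -v java)"
    else
        echo "no java found; install a JVM before running hadoop" >&2
        return 1
    fi
}

# Run the check once; a missing JVM is reported but does not abort the shell.
check_java || true
```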
3: How do I tell hadoop where to keep its filesystem? Quickstart tells me
Format a new distributed-filesystem:
$ bin/hadoop namenode -format
and the wiki is similar. But neither of them tells me where in my filesystem it’ll start writing! If I have to RTFM for that without a clue where in TFM to start, it rather defeats the purpose of a quick start! OK, run it as my newly-minted hadoop user, so filesystem protections protect me from anything I can’t wipe-and-start-again if it seems to be writing to lots of places I don’t want.
Turns out it created stuff in /tmp (which is fine for now, though I think some of what it created is supposed to be persistent). Also lots of log files, in hadoop’s logs dir – which is also fine just so long as I know where they are! Takes a bit more browsing the wiki to find how to configure it – at Yahoo’s tutorial pages!
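The setting behind the /tmp behaviour is, as far as I can tell, the hadoop.tmp.dir property, which the namenode and datanode directories default to living under. The sketch below writes a small override file into a scratch directory. CONF_DIR and DATA_DIR are placeholders invented for illustration, and the file name is an assumption: older releases read conf/hadoop-site.xml, newer ones conf/core-site.xml.

```shell
# Placeholders, not real Hadoop variables: point CONF_DIR at hadoop's conf
# directory and DATA_DIR at somewhere persistent once this looks right.
CONF_DIR="${CONF_DIR:-./conf-sketch}"
DATA_DIR="${DATA_DIR:-/var/lib/hadoop}"

mkdir -p "$CONF_DIR"

# Write a site override that moves hadoop.tmp.dir out of /tmp.
# File name assumed; use hadoop-site.xml on older releases.
cat > "$CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$DATA_DIR</value>
  </property>
</configuration>
EOF
```

The namenode needs formatting again after a change like this, since the old filesystem image stays behind in /tmp.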
4: Where are all the files? Lots of find and locate required ‘cos they’re not under Hadoop’s /src and /lib directories, and there isn’t an /include! The C++ API has its own directory as an apparent afterthought.
5: Trial and error required to compile the HelloWorld C sample program. I ended up with the following makefile to record paths. Not a problem, but perhaps the docs page could use it:
CFLAGS= -g -O0 -Wall -c
INCLUDES= -I /usr/local/hadoop/src/c++/libhdfs \
	-I /usr/lib/jvm/java-6-sun-184.108.40.206/include/ \
	-I /usr/lib/jvm/java-6-sun-220.127.116.11/include/linux/
LDPATH= -L /usr/local/hadoop/c++/Linux-i386-32/lib/ \
	-L /usr/lib/jvm/java-6-sun-18.104.22.168/jre/lib/i386/client
LIBS= -lhdfs -ljvm

sample: sample.o
	$(CC) -o sample sample.o $(LDPATH) $(LIBS)

sample.o: sample.c
	$(CC) $(CFLAGS) $(INCLUDES) sample.c
6: Finally, I needed to set library path and CLASSPATH. Throwing the kitchen sink at the latter, as recommended in the scanty docs, I end up with the ugly but functional: