Running PageRank on Hadoop


This note summarizes how to set up Hadoop and run the PageRank algorithm on it: build the code, start Hadoop, load the input, run the job, and retrieve the output.

1. Prepare the code:
a) compile the Java code
# javac -classpath /lhome/hadoop/hadoop/hadoop-0.20.2-core.jar:/lhome/hadoop/hadoop/hadoop-0.20.2-tools.jar:/lhome/hadoop/hadoop/lib/commons-cli-1.2.jar:/lhome/hadoop/hadoop/lib/commons-logging-1.0.4.jar -d bin/ src/examples/org/apache/hadoop/examples/pagerank/*.java
b) package the class files into a jar file
# jar -cvf lib/pagerank.jar -C bin/ .
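
The two commands above can be wrapped in a small helper so that the long classpath is easier to read and reuse. This is only a sketch: the function names hadoop_classpath and build_pagerank are made up for illustration, HADOOP_HOME is assumed to be /lhome/hadoop/hadoop as in the classpath above, and the mkdir -p line is an added assumption that bin/ and lib/ may not exist yet.

```shell
#!/bin/sh
# Hypothetical helper around step 1; the function names are illustrative only.

# Assumed install location, taken from the classpath used in step 1a.
HADOOP_HOME="${HADOOP_HOME:-/lhome/hadoop/hadoop}"

# Print the compile-time classpath: the same four jars as in step 1a.
hadoop_classpath() {
    printf '%s:%s:%s:%s\n' \
        "$HADOOP_HOME/hadoop-0.20.2-core.jar" \
        "$HADOOP_HOME/hadoop-0.20.2-tools.jar" \
        "$HADOOP_HOME/lib/commons-cli-1.2.jar" \
        "$HADOOP_HOME/lib/commons-logging-1.0.4.jar"
}

# Compile the PageRank sources into bin/ and package them as lib/pagerank.jar.
build_pagerank() {
    mkdir -p bin lib || return 1   # assumption: the directories may be missing
    javac -classpath "$(hadoop_classpath)" -d bin/ \
        src/examples/org/apache/hadoop/examples/pagerank/*.java || return 1
    jar -cvf lib/pagerank.jar -C bin/ .
}
```

After sourcing the file (or pasting the functions into a shell), run build_pagerank from the project root.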

2. Start up Hadoop:
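a) create a logical volume for the input (-L409600 is in megabytes, i.e. 400 GB), then format it as ext4, mount it, and check the result with df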
# lvcreate -L409600 -n ClueWeb09 vg_xen
# mkdir /mnt/ClueWeb09
# mkfs -t ext4 /dev/vg_xen/ClueWeb09
# mount -t ext4 /dev/vg_xen/ClueWeb09 /mnt/ClueWeb09
# df
b) start up Hadoop
Create a script "start_hadoop.sh" that prepares the Java environment, cleans HDFS, and tests the Hadoop setup, then run it:
# sh start_hadoop.sh
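
The contents of start_hadoop.sh are not listed in this note. As a rough sketch of what such a script might look like for Hadoop 0.20.2, the version below makes several assumptions: the JAVA_HOME default is a placeholder path, HADOOP_HOME is taken from the classpath in step 1, "clean the HDFS" is interpreted as removing the input and output left by a previous run, and "test the setup" is interpreted as printing a dfsadmin report. The function name start_hadoop is hypothetical.

```shell
#!/bin/sh
# start_hadoop.sh: hypothetical sketch; the original script is not shown here.

# Prepare the Java environment. The JAVA_HOME default is a placeholder.
JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/default-java}"
# Hadoop location, taken from the classpath in step 1.
HADOOP_HOME="${HADOOP_HOME:-/lhome/hadoop/hadoop}"
PATH="$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH"
export JAVA_HOME HADOOP_HOME PATH

start_hadoop() {
    hadoop="$HADOOP_HOME/bin/hadoop"
    if [ ! -x "$hadoop" ]; then
        echo "hadoop not found under $HADOOP_HOME" >&2
        return 1
    fi

    # Start the HDFS and MapReduce daemons.
    "$HADOOP_HOME/bin/start-all.sh" || return 1
    # Block until the NameNode leaves safe mode and HDFS accepts writes.
    "$hadoop" dfsadmin -safemode wait || return 1

    # Clean HDFS. Assumption: "clean" means removing the input and output
    # of a previous run; a missing path is not treated as an error.
    for old in toy-graph-input toy-graph-output; do
        "$hadoop" fs -rmr "$old" 2>/dev/null || true
    done

    # Test the setup: report capacity and the DataNodes that are alive.
    "$hadoop" dfsadmin -report
}

# Run only when invoked under the script's own name, so the function
# can also be sourced without starting anything.
case "$0" in
    *start_hadoop.sh) start_hadoop ;;
esac
```

Removing toy-graph-input and toy-graph-output up front matters because "hadoop fs -put" refuses to overwrite an existing destination, and a MapReduce job normally fails if its output directory already exists.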

3. Prepare the input:
# hadoop fs -put /mnt/ClueWeb09/toy.graph-txt toy-graph-input
# hadoop fs -cat toy-graph-input

4. Execute the program:
# hadoop jar lib/pagerank.jar org.apache.hadoop.examples.pagerank.NaivePageRank toy-graph-input toy-graph-output

5. Retrieve the output:
# hadoop fs -get toy-graph-output toy-graph.out
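
Because toy-graph-output is a job output directory, the local copy toy-graph.out normally holds one part-* file per reducer. As a sketch of how to inspect the result, the hypothetical top_ranked function below prints the highest-ranked nodes. It assumes each output line is a node ID, a tab, and the rank value (the default TextOutputFormat layout); the actual layout written by NaivePageRank is not described in this note, so adjust the sort key if it differs.

```shell
#!/bin/sh
# Hypothetical helper: print the N highest-ranked lines of a retrieved output.
# Assumed line format: <node-id><TAB><rank>[<TAB>...]

top_ranked() {
    dir=$1          # local output directory, e.g. toy-graph.out
    n=${2:-10}      # number of lines to print (default 10)
    # -g (general numeric sort, a GNU/BSD extension) is used so that ranks
    # written in scientific notation, such as 1.0E-4, sort correctly.
    cat "$dir"/part-* |
        LC_ALL=C sort -t "$(printf '\t')" -k2,2gr |
        head -n "$n"
}
```

For example, "top_ranked toy-graph.out 5" lists the five lines with the largest value in the second column.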