Hadoop
Last Modified: 02/10 9:02
Description
From the Hadoop wiki: Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Version
- 0.19.0
Authorized Users
- circe account holders
Platforms
- circe cluster
Local Documentation
Modules
Hadoop requires the following module and its prerequisite to run:
- apps/hadoop/0.19.0
- apps/jdk/1.6.0_10.x86_64
To run Hadoop on the cluster, ensure that you use module initadd to make the changes persistent. See Modules for more information.
Submitting Jobs
First, you will need a Hadoop-aware application to run within the Hadoop environment. For this example, we will use the WordCount? Java program cited in the Hadoop documentation:
Save this example code to a file called WordCount?.java:
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
We will use this code against Hadoop to determine the word frequencies in a file (a listing of how many times a particular word exists in the file)
Now, we shall compile it and create a jar:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Next, we create the submission script which starts hadoop and submits a job using our jar file. (Read the comments below for details on how this works.)
#!/bin/bash
#$ -N hadoop_run
#$ -pe hadoop 10
#$ -j y
#$ -o output.$JOB_ID
#$ -l h_rt=00:10:00,a_hadoop=true
#$ -cwd
. $HADOOP_HOME/conf/hadoop-sge.sh
# This configures the environment and starts the Hadoop Cluster.
hadoop_start
# I took a text copy of the King James Version of the Bible as test data (http://www.o-bible.org/download/kjv.txt)
# Removed all verse prefixes, punctuation, and forced everything to lowercase, like this:
# cat kjv.txt | awk '{ $1=NP; print }' | sed 's/[\.,-\!\?]//g' | tr "[:upper:]" "[:lower:]" > properkjv.txt
# Copy the input file into the HDFS filesystem
hadoop fs -put ./properkjv.txt properkjv.txt
# Running the hadoop task(s) here. I am specifying the jar, class, input, and output:
hadoop jar ./wordcount.jar org.myorg.WordCount properkjv.txt output
# Copying the output files from the HDFS filesystem
hadoop fs -get output hadoop-output.$JOB_ID
# Stops the Hadoop cluster.
hadoop_end
With the above run, the result data would be stored in ./hadoop-output-$JOB_ID/part-00000. A sample:
aaronites 2 aarons 31 abaddon 1 abagtha 1 abana 1 abarim 4 abase 4 abased 4 abasing 1 abated 6 abba 3 abda 2 abdeel 1 abdi 3 abdiel 1 abdon 8 abednego 15