The NameNode maintains its state in the fsimage file; every operation performed after the fsimage was created is tracked in the edit log. When something is requested from the NameNode, it first looks in the fsimage (loaded in RAM, so it is fast) and then in the edit log (on local disk).
NameNode state = FsImage + edit log (the changes made after the last fsimage was created).
Secondary NameNode:
In Hadoop 1 there was the concept of a Secondary NameNode (SNN), which kept a checkpoint of the NameNode. The Secondary NameNode downloads the edit logs and fsimage from the NameNode and merges them to form a new fsimage (called a checkpoint). It does not push the new fsimage to the primary NameNode (PNN); the PNN fetches it itself.
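The checkpoint idea ("new fsimage = old fsimage + replayed edit log") can be sketched in plain Java. This is an illustration only: the real fsimage holds serialized namespace metadata, not a set of path strings, and the paths and operations below are made up.

```java
import java.util.*;

public class CheckpointSketch {
    // Replays an edit log over a namespace snapshot; the merged result models the new fsimage.
    static Set<String> checkpoint(Set<String> fsimage, List<String> editLog) {
        Set<String> merged = new TreeSet<>(fsimage);
        for (String edit : editLog) {
            String[] parts = edit.split(" ", 2);   // e.g. "create /user/hdfs/def.txt"
            if (parts[0].equals("create")) merged.add(parts[1]);
            else if (parts[0].equals("delete")) merged.remove(parts[1]);
        }
        return merged; // the merged snapshot is the checkpoint
    }

    public static void main(String[] args) {
        Set<String> fsimage = new TreeSet<>(Set.of("/user/hdfs/abc.txt"));
        List<String> edits = List.of("create /user/hdfs/def.txt", "delete /user/hdfs/abc.txt");
        System.out.println(checkpoint(fsimage, edits)); // [/user/hdfs/def.txt]
    }
}
```

The point of the sketch: the fsimage alone is stale; only fsimage plus the replayed edit log gives the current state, which is exactly what the SNN materializes as a checkpoint.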
7. Check the input values to be passed to an available sample app.
To check the expected inputs for a sample, run it with incomplete arguments and it prints its usage:
[hdfs@sandbox /]$ yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.1.2.3.0.0-2557.jar wordcount output
Usage: wordcount <in> [<in>...] <out>
[hdfs@sandbox /]$
8. Run one of the available samples
Word count, single file:
[root@sandbox /]# su hdfs
[hdfs@sandbox /]$ hadoop fs -copyFromLocal abc.txt /user/hdfs/
[hdfs@sandbox /]$ yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.1.2.3.0.0-2557.jar wordcount abc.txt def.txt
15/08/17 09:46:47 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/17 09:46:47 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/17 09:46:47 INFO input.FileInputFormat: Total input paths to process : 1
15/08/17 09:46:48 INFO mapreduce.JobSubmitter: number of splits:1
15/08/17 09:46:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0005
15/08/17 09:46:48 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0005
15/08/17 09:46:48 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0005/
15/08/17 09:46:48 INFO mapreduce.Job: Running job: job_1439801989509_0005
15/08/17 09:46:54 INFO mapreduce.Job: Job job_1439801989509_0005 running in uber mode : false
15/08/17 09:46:54 INFO mapreduce.Job:  map 0% reduce 0%
15/08/17 09:46:59 INFO mapreduce.Job:  map 100% reduce 0%
15/08/17 09:47:04 INFO mapreduce.Job:  map 100% reduce 100%
15/08/17 09:47:04 INFO mapreduce.Job: Job job_1439801989509_0005 completed successfully
15/08/17 09:47:04 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=51
        FILE: Number of bytes written=253029
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=150
        HDFS: Number of bytes written=29
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2762
        Total time spent by all reduces in occupied slots (ms)=3086
        Total time spent by all map tasks (ms)=2762
        Total time spent by all reduce tasks (ms)=3086
        Total vcore-seconds taken by all map tasks=2762
        Total vcore-seconds taken by all reduce tasks=3086
        Total megabyte-seconds taken by all map tasks=690500
        Total megabyte-seconds taken by all reduce tasks=771500
    Map-Reduce Framework
        Map input records=1
        Map output records=6
        Map output bytes=56
        Map output materialized bytes=51
        Input split bytes=118
        Combine input records=6
        Combine output records=4
        Reduce input groups=4
        Reduce shuffle bytes=51
        Reduce input records=4
        Reduce output records=4
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=87
        CPU time spent (ms)=1250
        Physical memory (bytes) snapshot=341491712
        Virtual memory (bytes) snapshot=1648259072
        Total committed heap usage (bytes)=264241152
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=32
    File Output Format Counters
        Bytes Written=29
[hdfs@sandbox /]$ hadoop fs -ls /user/hdfs/def.txt
Found 2 items
-rw-r--r--   1 hdfs hdfs  0 2015-08-17 09:47 /user/hdfs/def.txt/_SUCCESS
-rw-r--r--   1 hdfs hdfs 29 2015-08-17 09:47 /user/hdfs/def.txt/part-r-00000
[hdfs@sandbox /]$ hadoop fs -cat /user/hdfs/def.txt/part-r-00000
ajay 2
alok 1
amar 1
anand 2
Similarly, we can run the other available samples, such as wordmean, wordmedian, and wordstandarddeviation.
Word count, multiple files: pass the path of the directory containing the files.
[hdfs@sandbox /]$ hadoop fs -copyFromLocal ijk.txt /user/hdfs/anand/
[hdfs@sandbox /]$ hadoop fs -ls /user/hdfs/anand
Found 2 items
-rw-r--r--   1 hdfs hdfs 27 2015-08-17 10:18 /user/hdfs/anand/ijk.txt
-rw-r--r--   1 hdfs hdfs 32 2015-08-17 10:16 /user/hdfs/anand/jkl.txt
[hdfs@sandbox /]$ yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.1.2.3.0.0-2557.jar wordcount /user/hdfs/anand/ /user/hdfs/outp
15/08/17 10:19:54 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/17 10:19:54 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/17 10:19:54 INFO input.FileInputFormat: Total input paths to process : 2
15/08/17 10:19:55 INFO mapreduce.JobSubmitter: number of splits:2
15/08/17 10:19:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0009
15/08/17 10:19:55 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0009
15/08/17 10:19:55 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0009/
15/08/17 10:19:55 INFO mapreduce.Job: Running job: job_1439801989509_0009
15/08/17 10:20:00 INFO mapreduce.Job: Job job_1439801989509_0009 running in uber mode : false
15/08/17 10:20:00 INFO mapreduce.Job:  map 0% reduce 0%
15/08/17 10:20:06 INFO mapreduce.Job:  map 100% reduce 0%
15/08/17 10:20:13 INFO mapreduce.Job:  map 100% reduce 100%
15/08/17 10:20:13 INFO mapreduce.Job: Job job_1439801989509_0009 completed successfully
15/08/17 10:20:13 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=102
        FILE: Number of bytes written=379609
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=307
        HDFS: Number of bytes written=56
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=6985
        Total time spent by all reduces in occupied slots (ms)=4195
        Total time spent by all map tasks (ms)=6985
        Total time spent by all reduce tasks (ms)=4195
        Total vcore-seconds taken by all map tasks=6985
        Total vcore-seconds taken by all reduce tasks=4195
        Total megabyte-seconds taken by all map tasks=1746250
        Total megabyte-seconds taken by all reduce tasks=1048750
    Map-Reduce Framework
        Map input records=2
        Map output records=10
        Map output bytes=99
        Map output materialized bytes=108
        Input split bytes=248
        Combine input records=10
        Combine output records=8
        Reduce input groups=7
        Reduce shuffle bytes=108
        Reduce input records=8
        Reduce output records=7
        Spilled Records=16
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=139
        CPU time spent (ms)=1740
        Physical memory (bytes) snapshot=551854080
        Virtual memory (bytes) snapshot=2480885760
        Total committed heap usage (bytes)=398458880
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=59
    File Output Format Counters
        Bytes Written=56
[hdfs@sandbox /]$ hadoop fs -ls /user/hdfs/outp
Found 2 items
-rw-r--r--   1 hdfs hdfs  0 2015-08-17 10:20 /user/hdfs/outp/_SUCCESS
-rw-r--r--   1 hdfs hdfs 56 2015-08-17 10:20 /user/hdfs/outp/part-r-00000
[hdfs@sandbox /]$ hadoop fs -cat /user/hdfs/outp/part-r-00000
ajay 2
alok 1
amar 1
anand 3
sanjay 1
savita 1
tushar 1
WordMean
[hdfs@sandbox /]$ yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.1.2.3.0.0-2557.jar wordmean abc.txt ghi
15/08/17 09:59:51 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/17 09:59:52 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/17 09:59:52 INFO input.FileInputFormat: Total input paths to process : 1
15/08/17 09:59:52 INFO mapreduce.JobSubmitter: number of splits:1
15/08/17 09:59:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0006
15/08/17 09:59:53 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0006
15/08/17 09:59:53 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0006/
15/08/17 09:59:53 INFO mapreduce.Job: Running job: job_1439801989509_0006
15/08/17 09:59:59 INFO mapreduce.Job: Job job_1439801989509_0006 running in uber mode : false
15/08/17 09:59:59 INFO mapreduce.Job:  map 0% reduce 0%
15/08/17 10:00:04 INFO mapreduce.Job:  map 100% reduce 0%
15/08/17 10:00:11 INFO mapreduce.Job:  map 100% reduce 100%
15/08/17 10:00:11 INFO mapreduce.Job: Job job_1439801989509_0006 completed successfully
15/08/17 10:00:11 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=39
        FILE: Number of bytes written=252993
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=150
        HDFS: Number of bytes written=18
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2876
        Total time spent by all reduces in occupied slots (ms)=4070
        Total time spent by all map tasks (ms)=2876
        Total time spent by all reduce tasks (ms)=4070
        Total vcore-seconds taken by all map tasks=2876
        Total vcore-seconds taken by all reduce tasks=4070
        Total megabyte-seconds taken by all map tasks=719000
        Total megabyte-seconds taken by all reduce tasks=1017500
    Map-Reduce Framework
        Map input records=1
        Map output records=12
        Map output bytes=174
        Map output materialized bytes=39
        Input split bytes=118
        Combine input records=12
        Combine output records=2
        Reduce input groups=2
        Reduce shuffle bytes=39
        Reduce input records=2
        Reduce output records=2
        Spilled Records=4
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=102
        CPU time spent (ms)=1430
        Physical memory (bytes) snapshot=341884928
        Virtual memory (bytes) snapshot=1672888320
        Total committed heap usage (bytes)=264241152
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=32
    File Output Format Counters
        Bytes Written=18
The mean is: 4.333333333333333
[hdfs@sandbox /]$ hadoop fs -ls /user/hdfs/ghi
Found 2 items
-rw-r--r--   1 hdfs hdfs  0 2015-08-17 10:00 /user/hdfs/ghi/_SUCCESS
-rw-r--r--   1 hdfs hdfs 18 2015-08-17 10:00 /user/hdfs/ghi/part-r-00000
[hdfs@sandbox /]$ hadoop fs -cat /user/hdfs/ghi/part-r-00000
count 6
length 26
WordMedian
[hdfs@sandbox /]$ yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.1.2.3.0.0-2557.jar wordmedian /user/hdfs/abc.txt wordmedianout
15/08/17 10:26:48 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/17 10:26:49 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/17 10:26:49 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/08/17 10:26:49 INFO input.FileInputFormat: Total input paths to process : 1
15/08/17 10:26:50 INFO mapreduce.JobSubmitter: number of splits:1
15/08/17 10:26:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0010
15/08/17 10:26:50 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0010
15/08/17 10:26:50 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0010/
15/08/17 10:26:50 INFO mapreduce.Job: Running job: job_1439801989509_0010
15/08/17 10:26:56 INFO mapreduce.Job: Job job_1439801989509_0010 running in uber mode : false
15/08/17 10:26:56 INFO mapreduce.Job:  map 0% reduce 0%
15/08/17 10:27:01 INFO mapreduce.Job:  map 100% reduce 0%
15/08/17 10:27:08 INFO mapreduce.Job:  map 100% reduce 100%
15/08/17 10:27:09 INFO mapreduce.Job: Job job_1439801989509_0010 completed successfully
15/08/17 10:27:09 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=26
        FILE: Number of bytes written=252719
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=150
        HDFS: Number of bytes written=8
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2974
        Total time spent by all reduces in occupied slots (ms)=3426
        Total time spent by all map tasks (ms)=2974
        Total time spent by all reduce tasks (ms)=3426
        Total vcore-seconds taken by all map tasks=2974
        Total vcore-seconds taken by all reduce tasks=3426
        Total megabyte-seconds taken by all map tasks=743500
        Total megabyte-seconds taken by all reduce tasks=856500
    Map-Reduce Framework
        Map input records=1
        Map output records=6
        Map output bytes=48
        Map output materialized bytes=26
        Input split bytes=118
        Combine input records=6
        Combine output records=2
        Reduce input groups=2
        Reduce shuffle bytes=26
        Reduce input records=2
        Reduce output records=2
        Spilled Records=4
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=114
        CPU time spent (ms)=1350
        Physical memory (bytes) snapshot=358936576
        Virtual memory (bytes) snapshot=1683902464
        Total committed heap usage (bytes)=264241152
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=32
    File Output Format Counters
        Bytes Written=8
The median is: 4
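wordmedian also works on word lengths. A simplified local sketch follows; note the real example derives the median from cumulative counts of each length, and the input line here is hypothetical:

```java
import java.util.*;

public class WordMedianSketch {
    // Sorts all word lengths and picks the middle element (upper middle for even counts),
    // a simplification of how the wordmedian example locates the median.
    static int medianWordLength(String text) {
        List<Integer> lengths = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            lengths.add(word.length());
        }
        Collections.sort(lengths);
        return lengths.get(lengths.size() / 2);
    }

    public static void main(String[] args) {
        // Hypothetical input: lengths 4,4,4,4,5,5 -> median 4.
        System.out.println(medianWordLength("ajay alok ajay amar anand anand")); // 4
    }
}
```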
WordStandardDeviation
[hdfs@sandbox /]$ yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.1.2.3.0.0-2557.jar wordstandarddeviation /user/hdfs/jkl.txt worddev
15/08/17 10:30:24 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/17 10:30:24 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/17 10:30:25 INFO input.FileInputFormat: Total input paths to process : 1
15/08/17 10:30:25 INFO mapreduce.JobSubmitter: number of splits:1
15/08/17 10:30:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0012
15/08/17 10:30:25 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0012
15/08/17 10:30:26 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0012/
15/08/17 10:30:26 INFO mapreduce.Job: Running job: job_1439801989509_0012
15/08/17 10:30:32 INFO mapreduce.Job: Job job_1439801989509_0012 running in uber mode : false
15/08/17 10:30:32 INFO mapreduce.Job:  map 0% reduce 0%
15/08/17 10:30:37 INFO mapreduce.Job:  map 100% reduce 0%
15/08/17 10:30:42 INFO mapreduce.Job:  map 100% reduce 100%
15/08/17 10:30:42 INFO mapreduce.Job: Job job_1439801989509_0012 completed successfully
15/08/17 10:30:42 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=56
        FILE: Number of bytes written=253199
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=150
        HDFS: Number of bytes written=29
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2857
        Total time spent by all reduces in occupied slots (ms)=3110
        Total time spent by all map tasks (ms)=2857
        Total time spent by all reduce tasks (ms)=3110
        Total vcore-seconds taken by all map tasks=2857
        Total vcore-seconds taken by all reduce tasks=3110
        Total megabyte-seconds taken by all map tasks=714250
        Total megabyte-seconds taken by all reduce tasks=777500
    Map-Reduce Framework
        Map input records=1
        Map output records=18
        Map output bytes=264
        Map output materialized bytes=56
        Input split bytes=118
        Combine input records=18
        Combine output records=3
        Reduce input groups=3
        Reduce shuffle bytes=56
        Reduce input records=3
        Reduce output records=3
        Spilled Records=6
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=88
        CPU time spent (ms)=1290
        Physical memory (bytes) snapshot=349061120
        Virtual memory (bytes) snapshot=1669423104
        Total committed heap usage (bytes)=265289728
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=32
    File Output Format Counters
        Bytes Written=29
The standard deviation is: 0.4714045207910346
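wordstandarddeviation computes the population standard deviation of the word lengths, sqrt(E[x^2] - mean^2). A local plain-Java sketch (the input line is hypothetical; the actual contents of jkl.txt are not shown above):

```java
public class WordStdDevSketch {
    // Population standard deviation of word lengths: sqrt(E[x^2] - mean^2),
    // matching the count / length / square counters the example job accumulates.
    static double stdDevWordLength(String text) {
        long count = 0, sum = 0, sumSq = 0;
        for (String word : text.trim().split("\\s+")) {
            int len = word.length();
            count++;
            sum += len;
            sumSq += (long) len * len;
        }
        double mean = (double) sum / count;
        return Math.sqrt((double) sumSq / count - mean * mean);
    }

    public static void main(String[] args) {
        // Hypothetical input: lengths 4,4,4,4,5,5 -> sqrt(2/9) ~ 0.4714
        System.out.println(stdDevWordLength("ajay alok ajay amar anand anand"));
    }
}
```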
Now, let's build a sample of our own:
1. WordCount
Objective: to count the number of times each word occurs.
Input file: a file containing the text to be counted.
Steps:
1. Create a Java project.
2. Create a Java class WordCount.java:
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(org.myorg.WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
3. Include the jars hadoop-common-2.7.1.jar and hadoop-mapreduce-client-core-2.7.1.jar.
4. Export the project as a jar: right-click on the project > Export > JAR, e.g. wc.jar.
5. Copy the jar to a folder of your choice on the sandbox machine using WinSCP.
6. Create the input file in HDFS.
7. (An initial failure to locate the job classes was fixed by adding job.setJarByClass(org.myorg.WordCount.class); as shown in the driver above.)
8. Run the command: yarn jar /anand/wc.jar org.myorg.WordCount /user/hdfs/jkl.txt qqq
To start the MapReduce job:
[hdfs@sandbox anand]$ yarn jar /anand/wc.jar org.myorg.WordCount /user/hdfs/jkl.txt qqq
15/08/17 11:48:05 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/17 11:48:05 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/17 11:48:05 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/08/17 11:48:05 INFO input.FileInputFormat: Total input paths to process : 1
15/08/17 11:48:06 INFO mapreduce.JobSubmitter: number of splits:1
15/08/17 11:48:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0018
15/08/17 11:48:06 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0018
15/08/17 11:48:06 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0018/
15/08/17 11:48:06 INFO mapreduce.Job: Running job: job_1439801989509_0018
15/08/17 11:48:13 INFO mapreduce.Job: Job job_1439801989509_0018 running in uber mode : false
15/08/17 11:48:13 INFO mapreduce.Job:  map 0% reduce 0%
15/08/17 11:48:19 INFO mapreduce.Job:  map 100% reduce 0%
15/08/17 11:48:25 INFO mapreduce.Job:  map 100% reduce 100%
15/08/17 11:48:25 INFO mapreduce.Job: Job job_1439801989509_0018 completed successfully
15/08/17 11:48:25 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=74
        FILE: Number of bytes written=253051
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=150
        HDFS: Number of bytes written=29
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3150
        Total time spent by all reduces in occupied slots (ms)=3191
        Total time spent by all map tasks (ms)=3150
        Total time spent by all reduce tasks (ms)=3191
        Total vcore-seconds taken by all map tasks=3150
        Total vcore-seconds taken by all reduce tasks=3191
        Total megabyte-seconds taken by all map tasks=787500
        Total megabyte-seconds taken by all reduce tasks=797750
    Map-Reduce Framework
        Map input records=1
        Map output records=6
        Map output bytes=56
        Map output materialized bytes=74
        Input split bytes=118
        Combine input records=0
        Combine output records=0
        Reduce input groups=4
        Reduce shuffle bytes=74
        Reduce input records=6
        Reduce output records=4
        Spilled Records=12
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=87
        CPU time spent (ms)=1290
        Physical memory (bytes) snapshot=354729984
        Virtual memory (bytes) snapshot=1664598016
        Total committed heap usage (bytes)=264241152
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=32
    File Output Format Counters
        Bytes Written=29
To see the output file names:
[hdfs@sandbox anand]$ hadoop fs -ls /user/hdfs/qqq
Found 2 items
-rw-r--r--   1 hdfs hdfs  0 2015-08-17 11:48 /user/hdfs/qqq/_SUCCESS
-rw-r--r--   1 hdfs hdfs 29 2015-08-17 11:48 /user/hdfs/qqq/part-r-00000
To see the output file content:
[hdfs@sandbox anand]$ hadoop fs -cat /user/hdfs/qqq/part-r-00000
ajay 2
alok 1
amar 1
anand 2
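The map/reduce logic of WordCount.java can be sanity-checked locally with plain Java, no Hadoop needed (the input line is hypothetical). One detail worth noticing in the counters of this run: Combine input records=0, whereas the bundled wordcount run showed Combine input records=6 — our driver never sets a combiner class.

```java
import java.util.*;

public class LocalWordCount {
    // Simulates the Map phase: tokenize a line and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Simulates shuffle + Reduce: group by key and sum the values.
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical line consistent with the part-r-00000 contents above.
        System.out.println(reduce(map("ajay alok ajay amar anand anand")));
        // {ajay=2, alok=1, amar=1, anand=2}
    }
}
```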
2. Dictionary order
Objective: to group all the words encountered in the input file by the dictionary order of their first letter.
Input file: a file containing the text to be sorted.
MapReduce:
[hdfs@sandbox anand]$ yarn jar dictionary.jar org.demo.Dictionary /user/hdfs/nm.txt out_new23
15/08/18 06:24:03 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/18 06:24:03 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/18 06:24:03 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/08/18 06:24:04 INFO input.FileInputFormat: Total input paths to process : 1
15/08/18 06:24:04 INFO mapreduce.JobSubmitter: number of splits:1
15/08/18 06:24:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0039
15/08/18 06:24:04 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0039
15/08/18 06:24:04 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0039/
15/08/18 06:24:04 INFO mapreduce.Job: Running job: job_1439801989509_0039
15/08/18 06:24:10 INFO mapreduce.Job: Job job_1439801989509_0039 running in uber mode : false
15/08/18 06:24:10 INFO mapreduce.Job:  map 0% reduce 0%
15/08/18 06:24:15 INFO mapreduce.Job:  map 100% reduce 0%
15/08/18 06:24:21 INFO mapreduce.Job:  map 100% reduce 100%
15/08/18 06:24:21 INFO mapreduce.Job: Job job_1439801989509_0039 completed successfully
15/08/18 06:24:22 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=3994
        FILE: Number of bytes written=260909
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1884
        HDFS: Number of bytes written=1901
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3329
        Total time spent by all reduces in occupied slots (ms)=3104
        Total time spent by all map tasks (ms)=3329
        Total time spent by all reduce tasks (ms)=3104
        Total vcore-seconds taken by all map tasks=3329
        Total vcore-seconds taken by all reduce tasks=3104
        Total megabyte-seconds taken by all map tasks=832250
        Total megabyte-seconds taken by all reduce tasks=776000
    Map-Reduce Framework
        Map input records=14
        Map output records=350
        Map output bytes=3288
        Map output materialized bytes=3994
        Input split bytes=117
        Combine input records=0
        Combine output records=0
        Reduce input groups=26
        Reduce shuffle bytes=3994
        Reduce input records=350
        Reduce output records=26
        Spilled Records=700
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=89
        CPU time spent (ms)=1380
        Physical memory (bytes) snapshot=352882688
        Virtual memory (bytes) snapshot=1665654784
        Total committed heap usage (bytes)=265289728
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1767
    File Output Format Counters
Output:
[hdfs@sandbox anand]$ hadoop fs -cat /user/hdfs/out_new23/part-r-00000
- - - -
A and and and as and and and abode a and Ajanta and are a as a also and all attract are a are and and all a after and a a and and and and
B been binding Baba by Bhagat Bose boast beloved Buddha Bose Brahmaputra bloom biggest
C country country country country country civilisations caves country culture Chandra C.V centuries coasts Chand Chandra China can country
D democracy Dr droughts described diversity delight
E earth Ellora
F fields from frontiers for few Fort from Fatehpur fit fertile fields fed
G gods guard Gandhi greats Gangetic Gandhi given Ganga gods Godavari
H have her has have Himalayas has has have has Homi hindrance here
I It is India India is is in in in is India In it in is inherited India is is Its
J Jawaharlal Jagdish
K Krishna Kashmir Kaveri Kajuraho
L languages like Lajpat Lala like leaders like land land literature land lakes land like
M most many my my modern mountains many mighty my many Mahatma Mahatma mighty My most my Mahal
N Netaji Nilgiris Nehru natural north Narmada Nath
O of one of Ooty of on over of of our Ours on on oceans our of of of oldest of one of of of
P Prem people paradise place proud produced Pratap populous Patel places produced persons produced
Q Qutab
R Rabindra region Ramman Rai Rana Red running religions rivers rivers
S Sikri second Shiva sides Saratchandra speak science South same Singh spirit Subhash spirit Sardar sides Shivaji; secular state
Special Characters . , , , , , . , . . , , , , . , , , . . , , , , . . , , , , . . . . . , . . , . , , , , . .
T the the the The that three the the through the There Tagore the the through the together tourists' The Taj the The the the the the the temples the the the The the the times The the the the temples the The
U unity us
V various Valley villages valleys
W world world warriors we worship We world We world which without wonders which
Y yet Yamuna
Source code:
package org.demo;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Dictionary {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text one = new Text();
        private Text word = new Text();
        private Text specialCharacterKey = new Text("Special Characters");
        String t;

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String temp;
                t = tokenizer.nextToken();
                temp = new String(t);
                // Strip a trailing punctuation character and emit it under the special-character key.
                if (t.endsWith(",") || t.endsWith(".") || t.endsWith("\"")) {
                    t = t.substring(0, t.length() - 1);
                    context.write(specialCharacterKey, new Text(temp.substring(temp.length() - 1, temp.length())));
                }
                // Strip only the leading quote; a trailing quote was already removed above.
                if (t.startsWith("\"")) {
                    t = t.substring(1);
                    context.write(specialCharacterKey, new Text(temp.substring(0, 1)));
                }
                word.set(t.substring(0, 1).toUpperCase());
                one.set(t);
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String sum = "";
            for (Text val : values) {
                sum = sum + val.toString() + " ";
            }
            context.write(key, new Text(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "dictionary");
        job.setJarByClass(org.demo.Dictionary.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
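The mapper's grouping idea (key = uppercased first letter, value = the word) can be checked locally without Hadoop. This simplified sketch skips the special-character handling, and the sentence in main is made up:

```java
import java.util.*;

public class DictionarySketch {
    // Groups each word under its uppercased first letter and concatenates the group,
    // mirroring what the Dictionary mapper and reducer produce together.
    static SortedMap<String, String> group(String text) {
        SortedMap<String, String> groups = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            String key = word.substring(0, 1).toUpperCase();
            groups.merge(key, word + " ", String::concat);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sentence; words land under A, B, C, I, and M.
        System.out.println(group("India is my country and all Indians are my brothers"));
    }
}
```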
3. Expenses
Objective: to calculate the aggregate expenditure for each category, e.g. Grocery, Medicine, etc.
Input file:
[hdfs@sandbox root]$ hadoop fs -cat /user/hdfs/exp.txt
Mobile 1000
Oil 5000
Medicine 2000
Medicine 1000
Grocery 3000
Grocery 4000
Medicine 1000
Flour 2000
Mobile 2000
Medicine 1000
Education 40000
Grocery 2000
Grocery 2000
Medicine 2000
Drinks 5000
Mobile 2000
Medicine 3000
Grocery 7000
Education 20000
Medicine 1000
Mobile 2000
Medicine 1000
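The Expense job's source is not listed here, but the aggregation it is meant to perform (sum the amounts per category) can be sketched in plain Java. This is a simplified local version; the sample lines below are a hypothetical subset of the input file above:

```java
import java.util.*;

public class ExpenseSketch {
    // Parses "<category> <amount>" lines and sums the amounts per category.
    static SortedMap<String, Long> aggregate(List<String> lines) {
        SortedMap<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            totals.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Mobile 1000", "Oil 5000", "Medicine 2000",
                "Medicine 1000", "Grocery 3000", "Grocery 4000");
        System.out.println(aggregate(lines));
        // {Grocery=7000, Medicine=3000, Mobile=1000, Oil=5000}
    }
}
```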
MapReduce:
[hdfs@sandbox anand]$ yarn jar expenses.jar org.demo.Expense /user/hdfs/exp.txt out_new25
15/08/18 06:27:39 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/08/18 06:27:39 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/08/18 06:27:40 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/08/18 06:27:40 INFO input.FileInputFormat: Total input paths to process : 1
15/08/18 06:27:40 INFO mapreduce.JobSubmitter: number of splits:1
15/08/18 06:27:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439801989509_0041
15/08/18 06:27:41 INFO impl.YarnClientImpl: Submitted application application_1439801989509_0041
15/08/18 06:27:41 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1439801989509_0041/
15/08/18 06:27:41 INFO mapreduce.Job: Running job: job_1439801989509_0041
15/08/18 06:27:47 INFO mapreduce.Job: Job job_1439801989509_0041 running in uber mode : false
15/08/18 06:27:47 INFO mapreduce.Job:  map 0% reduce 0%
15/08/18 06:27:52 INFO mapreduce.Job:  map 100% reduce 0%
15/08/18 06:27:57 INFO mapreduce.Job:  map 100% reduce 100%
15/08/18 06:27:57 INFO mapreduce.Job: Job job_1439801989509_0041 completed successfully
15/08/18 06:27:57 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=334
FILE: Number of bytes
written=253589
FILE: Number of read
operations=0
FILE: Number of large read
operations=0
FILE: Number of write
operations=0
HDFS: Number of bytes read=441
HDFS: Number of bytes
written=68
HDFS: Number of read
operations=6
HDFS: Number of large read
operations=0
HDFS: Number of write
operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied
slots (ms)=2795
Total time spent by all reduces
in occupied slots (ms)=2983
Total time spent by all map
tasks (ms)=2795
Total time spent by all reduce
tasks (ms)=2983
Total vcore-seconds taken by
all map tasks=2795
Total vcore-seconds taken by
all reduce tasks=2983
Total megabyte-seconds taken by
all map tasks=698750
Total megabyte-seconds taken by
all reduce tasks=745750
Map-Reduce Framework
Map input records=23
Map output records=23
Map output bytes=282
Map output materialized
bytes=334
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=334
Reduce input records=23
Reduce output records=5
Spilled Records=46
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=85
CPU time spent (ms)=1260
Physical memory (bytes)
snapshot=339275776
Virtual memory (bytes)
snapshot=1672081408
Total committed heap usage
(bytes)=264241152
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=323
File Output Format Counters
Bytes Written=68
Output:
[hdfs@sandbox anand]$ hadoop fs -cat /user/hdfs/out_new25/part-r-00000
Education 60000
Grocery 29000
Medicine 12000
Mobile 7000
Other 5000
[hdfs@sandbox anand]$
Source Code:
package org.demo;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Expense {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text groceryKey = new Text("Grocery");
private static IntWritable groceryValue;
private Text medicalKey = new Text("Medicine");
private static IntWritable medicalValue;
private Text mobileKey = new Text("Mobile");
private static IntWritable mobileValue;
private Text educationKey = new Text("Education");
private static IntWritable educationValue;
private Text otherKey = new Text("Other");
private static IntWritable otherValue;
private String exp;
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
if (tokenizer.countTokens() == 2) {
exp = tokenizer.nextToken();
if (Arrays.asList("Grocery", "Flour", "Oil").contains(exp)) {
groceryValue = new IntWritable(Integer.parseInt(tokenizer.nextToken().toString()));
context.write(groceryKey, groceryValue);
} else if (exp.equals("Medicine")) {
medicalValue = new IntWritable(Integer.parseInt(tokenizer.nextToken().toString()));
context.write(medicalKey, medicalValue);
} else if (exp.equals("Mobile")) {
mobileValue = new IntWritable(Integer.parseInt(tokenizer.nextToken().toString()));
context.write(mobileKey, mobileValue);
} else if (exp.equals("Education")) {
educationValue = new IntWritable(Integer.parseInt(tokenizer.nextToken().toString()));
context.write(educationKey, educationValue);
} else {
otherValue = new IntWritable(Integer.parseInt(tokenizer.nextToken().toString()));
context.write(otherKey, otherValue);
}
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Expenses");
job.setJarByClass(org.demo.Expense.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
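The category folding done by the Expense mapper (Grocery, Flour and Oil collapse into one bucket; any unrecognized category lands in Other) can be sanity-checked with a small plain-Java sketch. The ExpenseCheck class and its helpers are illustrative only, not part of the job:

```java
import java.util.*;

public class ExpenseCheck {
    // Maps a raw category to the bucket the Expense mapper uses:
    // Grocery/Flour/Oil fold into "Grocery"; unknown categories into "Other".
    static String bucket(String category) {
        if (Arrays.asList("Grocery", "Flour", "Oil").contains(category)) return "Grocery";
        if (Arrays.asList("Medicine", "Mobile", "Education").contains(category)) return category;
        return "Other";
    }

    // Aggregates "<category> <amount>" lines the same way the job does,
    // skipping malformed lines that do not have exactly two tokens.
    static Map<String, Integer> aggregate(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer t = new StringTokenizer(line);
            if (t.countTokens() == 2) {
                String key = bucket(t.nextToken());
                int amount = Integer.parseInt(t.nextToken());
                totals.merge(key, amount, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("Oil 5000", "Flour 2000", "Grocery 3000", "Drinks 5000");
        System.out.println(aggregate(sample)); // {Grocery=10000, Other=5000}
    }
}
```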
4. Weather
Objective - Retrieve the maximum and minimum temperature from a weather report.
Input file:
A weather report in format as below:
23907 20150712 2.423 -98.08 30.62 34.1 21.7 27.9 27.8 0.0 26.94 C 50.9 23.8 34.8 -9999.0 -9999.0 -9999.0 0.175 0.238 -99.000 -99.000 -99.000 28.6 27.9 -9999.0 -9999.0 -9999.0
23907 20150713 2.423 -98.08 30.62 35.5 22.6 29.1 28.8 0.0 28.29 C 52.5 24.8 36.0 -9999.0 -9999.0 -9999.0 0.169 0.233 -99.000 -99.000 -99.000 29.3 28.5 -9999.0 -9999.0 -9999.0
23907 20150714 2.423 -98.08 30.62 36.0 21.8 28.9 28.3 0.0 28.69 C 52.3 24.1 35.6 -9999.0 -9999.0 -9999.0 0.163 0.228 -99.000 -99.000 -99.000 29.4 28.7 -9999.0 -9999.0 -9999.0
23907 20150715 2.423 -98.08 30.62 34.5 22.0 28.3 27.7 0.0 28.70 C 52.1 24.3 35.4 90.3 11.4 49.1 0.157 0.223 -99.000 -99.000 -99.000 29.5 28.8 -9999.0 -9999.0 -9999.0
MapReduce:
[hdfs@sandbox anand]$ yarn jar weather.jar org.demo.MyMaxMin /user/hdfs/weather_data.txt out_new25
Output:
Source Code:
package org.demo;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {
//Mapper
/**
 * MaxTemperatureMapper is a static class extending the Mapper abstract class
 * with four Hadoop generic types: LongWritable, Text, Text, Text.
 */
public static class MaxTemperatureMapper extends
Mapper<LongWritable, Text, Text, Text> {
/**
 * @method map
 * This method takes the input as a text record. Skipping the first five
 * tokens, the 6th token is taken as temp_max and the 7th as temp_min.
 * Records with temp_max > 35 or temp_min < 10 are passed to the reducer.
 */
@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {
//Converting the record (single line) to String and storing it in a String variable line
String line = Value.toString();
//Checking if the line is not empty
if (!(line.length() == 0)) {
//date
String date = line.substring(6, 14);
//maximum temperature
float temp_Max = Float
.parseFloat(line.substring(39, 45).trim());
//minimum temperature
float temp_Min = Float
.parseFloat(line.substring(47, 53).trim());
//if maximum temperature is greater than 35 , its a hot day
if (temp_Max > 35.0) {
// Hot day
context.write(new Text("Hot Day " + date),
new Text(String.valueOf(temp_Max)));
}
//if minimum temperature is less than 10 , its a cold day
if (temp_Min < 10) {
// Cold day
context.write(new Text("Cold Day " + date),
new Text(String.valueOf(temp_Min)));
}
}
}
}
//Reducer
/**
 * MaxTemperatureReducer is a static class extending the Reducer abstract class
 * with four Hadoop generic types: Text, Text, Text, Text.
 */
public static class MaxTemperatureReducer extends
Reducer<Text, Text, Text, Text> {
/**
 * @method reduce
 * This method takes the key and list of values from the mapper; each key
 * (a hot or cold day) carries a single temperature, which is written out.
 */
public void reduce(Text Key, Iterable<Text> Values, Context context)
throws IOException, InterruptedException {
//the new-API reduce signature takes an Iterable; with an Iterator parameter
//the method would never override the base class and the identity reducer would run
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}
}
/**
* @method main
* This method is used for setting all the configuration properties.
* It acts as a driver for map reduce code.
*/
public static void main(String[] args) throws Exception {
//reads the default configuration of cluster from the configuration xml files
Configuration conf = new Configuration();
//Initializing the job with the default configuration of the cluster
Job job = new Job(conf, "weather example");
//Assigning the driver class name
job.setJarByClass(org.demo.MyMaxMin.class);
//Key type coming out of mapper
job.setMapOutputKeyClass(Text.class);
//value type coming out of mapper
job.setMapOutputValueClass(Text.class);
//Defining the mapper class name
job.setMapperClass(MaxTemperatureMapper.class);
//Defining the reducer class name
job.setReducerClass(MaxTemperatureReducer.class);
//Defining input Format class which is responsible to parse the dataset into a key value pair
job.setInputFormatClass(TextInputFormat.class);
//Defining output Format class which is responsible to parse the dataset into a key value pair
job.setOutputFormatClass(TextOutputFormat.class);
//setting the second argument as a path in a path variable
Path OutputPath = new Path(args[1]);
//Configuring the input path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
//Configuring the output path from the filesystem into the job
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//deleting the output path automatically from hdfs so that we don't have to delete it explicitly
OutputPath.getFileSystem(conf).delete(OutputPath, true);
//exiting the job only if the flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
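Since the job's substring offsets assume the original fixed-width file layout, an easy way to check the hot/cold rule on a sample line is a whitespace-split sketch: token 6 is the day's maximum temperature, token 7 the minimum. The WeatherCheck class below is illustrative only, not part of the job:

```java
public class WeatherCheck {
    // Classifies one report line: fields are whitespace-split, field 2 is the
    // date, field 6 the max temperature, field 7 the min. The rule matches the
    // mapper: max > 35 means a hot day, min < 10 a cold day.
    static String classify(String line) {
        String[] f = line.trim().split("\\s+");
        String date = f[1];
        float max = Float.parseFloat(f[5]);
        float min = Float.parseFloat(f[6]);
        if (max > 35.0f) return "Hot Day " + date + " " + max;
        if (min < 10.0f) return "Cold Day " + date + " " + min;
        return "Normal Day " + date;
    }

    public static void main(String[] args) {
        // second sample line from the input file above (truncated)
        String line = "23907 20150713 2.423 -98.08 30.62 35.5 22.6 29.1";
        System.out.println(classify(line)); // Hot Day 20150713 35.5
    }
}
```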
5. CommonFriend Finder:
Objective - Find the common friends of each pair of people.
Input:
A:B C D
B:A C D E
C:A B D E
D:A B C E
E:B C D
MapReduce:
[hdfs@sandbox anand]$ yarn jar commonFriend.jar org.demo.Finder /user/hdfs/friendlist.txt count_new13
Output:
Source Code:
package org.demo;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Finder {
public static String commonString(String str1, String str2) {
String res="";
char[] arr1 = str1.toCharArray();
char[] arr2 = str2.toCharArray();
for (int i = 0; i < arr1.length; i++) {
for (int j = 0; j < arr2.length; j++) {
if (arr1[i] == arr2[j]) {
res=res+arr1[i] ;
}
}
}
return res;
}
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
private static Text keyMapper = new Text();
String person;
String friends;
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
person=line.substring(0, 1);
friends=line.substring(2, line.length());
char[] ch = friends.toCharArray();
for(int i=0; i<ch.length;i++) {
if(ch[i]<person.toCharArray()[0])
keyMapper.set(ch[i]+person);
else
keyMapper.set(person+ch[i]);
context.write(keyMapper, new Text(friends));
}
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
ArrayList<String> l = new ArrayList<String>();
for (Text val : values) {
l.add(val.toString());
}
String res=l.get(0);
for (String s: l) {
res=commonString(res, s);
}
context.write(key, new Text(res));
}
}
public static void main(String[] args) throws Exception {
//mapper and reducer emit the same key/value types, so only the job-level output classes need to be set
Configuration c = new Configuration();
Job job = new Job(c, "Finder");
job.setJarByClass(org.demo.Finder.class);
//expected output key value for both reducer and mapper
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
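The reducer's intersection step can be tried on its own: for the pair key "AB", A's friend list contains B, C, D and B's contains A, C, D, E, so the common friends are C and D. A standalone sketch of that character-level intersection (the CommonFriendsCheck class mirrors commonString and is illustrative only):

```java
public class CommonFriendsCheck {
    // Character-level intersection of two compact friend strings,
    // mirroring commonString in the Finder job.
    static String common(String a, String b) {
        StringBuilder res = new StringBuilder();
        for (char c : a.toCharArray()) {
            if (b.indexOf(c) >= 0) res.append(c);
        }
        return res.toString();
    }

    public static void main(String[] args) {
        // pair key "AB": A's friends "BCD", B's friends "ACDE"
        System.out.println(common("BCD", "ACDE")); // CD
    }
}
```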
6. Different length word counter
Objective - Find the number of words of each length.
Input File:
A text file whose words are to be counted by length.
MapReduce:
[hdfs@sandbox anand]$ yarn jar wc_diff.jar org.demo.AnalyzeText /user/hdfs/tesxt.txt count_new10
Output:
Source Code:
package org.demo;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class AnalyzeText {
public static class Map extends Mapper<LongWritable, Text, IntWritable, Text> {
private static IntWritable length = new IntWritable();
private static Text word = new Text();
String val;
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
val=tokenizer.nextToken();
length.set(val.length());
word.set(val);
context.write(length, word);
}
}
}
public static class Reduce extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
public void reduce(IntWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (Text val : values) {
sum++;
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
//needed because the mapper's output key/value types differ from the reducer's
Configuration c = new Configuration();
Job job = new Job(c, "AnalyzeText");
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setJarByClass(org.demo.AnalyzeText.class);
//expected output key value for both reducer and mapper
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
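The grouping the AnalyzeText job performs can be reproduced locally: the mapper emits (length, word) and the reducer counts each group. A plain-Java sketch (the LengthCountCheck class is illustrative only, not part of the job):

```java
import java.util.*;

public class LengthCountCheck {
    // Counts words by length, the same grouping AnalyzeText produces:
    // each map entry is (word length, number of words of that length).
    static Map<Integer, Integer> countByLength(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        StringTokenizer t = new StringTokenizer(text);
        while (t.hasMoreTokens()) {
            counts.merge(t.nextToken().length(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByLength("the quick brown fox jumps over the lazy dog")); // {3=4, 4=2, 5=3}
    }
}
```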
To resolve the dependencies of a Hadoop project, include the jars using Maven repositories. To do so, follow the steps below:
pom.xml
A. Create a Maven project
B. Define the artifacts below in pom.xml with the versions you need
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bogotobogo.hadoop</groupId>
<artifactId>wordcount</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>wordcount</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.2</version>
</dependency>
</dependencies>
</project>