12/10/2014 Testing MultiOutputFormat based MapReduce | Ashok Agarwal
Ashok Agarwal
Testing MultiOutputFormat based MapReduce
11 Thursday Sep 2014
POSTED BY ASHOK AGARWAL IN BIG DATA
Big Data, Hadoop, MapReduce
In one of our projects, we were required to generate a per-client file as the output of a MapReduce job, so that each client could see and analyze their own data.
Suppose you receive daily stock price files.
For 9/8/2014: 9_8_2014.csv
9/8/14,MSFT,47
9/8/14,ORCL,40
9/8/14,GOOG,577
9/8/14,AAPL,100.4
For 9/9/2014: 9_9_2014.csv
9/9/14,MSFT,46
9/9/14,ORCL,41
9/9/14,GOOG,578
9/9/14,AAPL,101
And so on:
9/10/14,MSFT,48
9/10/14,ORCL,39.5
9/10/14,GOOG,577
9/10/14,AAPL,100
9/11/14,MSFT,47.5
9/11/14,ORCL,41
9/11/14,GOOG,588
9/11/14,AAPL,99.8
9/12/14,MSFT,46.69
9/12/14,ORCL,40.5
9/12/14,GOOG,576
9/12/14,AAPL,102.5
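Each record follows the layout date,symbol,price, so a single line can be broken into its fields with a plain split. A minimal standalone sketch (illustration only, not part of the MapReduce code):

```java
public class RecordParseDemo {
    public static void main(String[] args) {
        // A sample record in the date,symbol,price layout shown above
        String line = "9/8/14,MSFT,47";
        String[] tokens = line.split(",");
        String symbol = tokens[1]; // second field: stock symbol
        String price = tokens[2];  // third field: closing price
        System.out.println(symbol + " " + price); // MSFT 47
    }
}
```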
We want to analyze each stock's weekly trend. To do that, we need to produce per-stock data.
The mapper code below reads the records supplied line by line by TextInputFormat and splits each CSV record. The map output key is the stock symbol and the value is the price.
package com.jbksoft;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyMultiOutputMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] tokens = line.split(",");
        // tokens[0] = date, tokens[1] = symbol, tokens[2] = price
        context.write(new Text(tokens[1]), new Text(tokens[2]));
    }
}
The reducer code below creates a file for each stock.
package com.jbksoft;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

import java.io.IOException;

public class MyMultiOutputReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    public void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The key (stock symbol) becomes the base name of the output file
            mos.write(NullWritable.get(), value, key.toString());
        }
    }

    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();
    }
}
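To see what the reducer achieves, here is a small in-memory analogue with no Hadoop dependency (an illustration, not the original post's code): values are grouped under their stock symbol, just as the shuffle groups them and MultipleOutputs routes each group into a file named after the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupBySymbolDemo {
    public static void main(String[] args) {
        String[] records = {"9/8/14,MSFT,47", "9/8/14,GOOG,577", "9/9/14,MSFT,46"};
        // Group prices by symbol, mimicking shuffle + per-key output files
        Map<String, List<String>> bySymbol = new TreeMap<>();
        for (String rec : records) {
            String[] t = rec.split(",");
            bySymbol.computeIfAbsent(t[1], k -> new ArrayList<>()).add(t[2]);
        }
        System.out.println(bySymbol); // {GOOG=[577], MSFT=[47, 46]}
    }
}
```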
The driver for the code is below. Note that LazyOutputFormat is used so that the default (empty) part files are not created: output files come into existence only when MultipleOutputs actually writes to them.
package com.jbksoft;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class MyMultiOutputTest {
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Path inputDir = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(MyMultiOutputTest.class);
        job.setJobName("My MultipleOutputs Demo");

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setMapperClass(MyMultiOutputMapper.class);
        job.setReducerClass(MyMultiOutputReducer.class);

        FileInputFormat.setInputPaths(job, inputDir);
        FileOutputFormat.setOutputPath(job, outputDir);

        // LazyOutputFormat avoids creating empty default part-r-NNNNN files
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        job.waitForCompletion(true);
    }
}
The command for executing the above code (compiled and packaged as a jar):
aagarwal-mbpro:~ ashok.agarwal$ hadoop jar test.jar com.jbksoft.MyMultiOutputTest
aagarwal-mbpro:~ ashok.agarwal$ ls -l /Users/ashok.agarwal/dev/HBaseDemo/output
total 32
-rwxr-xr-x 1 ashok.agarwal 1816361533 25 Sep 11 11:32 AAPL-r-00000
-rwxr-xr-x 1 ashok.agarwal 1816361533 20 Sep 11 11:32 GOOG-r-00000
-rwxr-xr-x 1 ashok.agarwal 1816361533 20 Sep 11 11:32 MSFT-r-00000
-rwxr-xr-x 1 ashok.agarwal 1816361533 19 Sep 11 11:32 ORCL-r-00000
-rwxr-xr-x 1 ashok.agarwal 1816361533 0 Sep 11 11:32 _SUCCESS
aagarwal‐mbpro:~ ashok.agarwal$
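The file names in the listing follow the MultipleOutputs convention baseName-r-NNNNN, where the base name is the string passed as the third argument to mos.write (here, the stock symbol) and NNNNN is the zero-padded reducer partition number. A quick sketch of the naming scheme (the format string is an illustration of the pattern, not Hadoop's actual code):

```java
public class OutputNameDemo {
    public static void main(String[] args) {
        // MultipleOutputs appends "-r-" plus the 5-digit reducer number
        String name = String.format("%s-r-%05d", "AAPL", 0);
        System.out.println(name); // AAPL-r-00000
    }
}
```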
A test case for the above code can be created using MRUnit; in it, the MultipleOutputs instance used by the reducer needs to be mocked.
Source: https://erashokagarwal.wordpress.com/2014/09/11/testing-multioutputformat-based-mapreduce/