First, the similarities:
In both technologies, the input data is typically divided into a large number of smaller data units; we can simply call them records for the purpose of this article. The map operation is responsible for transforming each record individually. It only ever needs to see one item at a time, which is a very powerful assumption because it allows very easy parallelization. Reduce, on the other hand, works by applying a transformation to records against each other, or in other words it derives information from multiple items at a time. This makes parallelization a bit more difficult, but it still may be possible depending on the way the data is organized. So, roughly speaking, in both cases map performs a scalar transformation on the input sequence and reduce aggregates it.
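As a rough sketch of this idea in Clojure, with plain numbers standing in for records:

(->> [1 2 3 4 5]      ; the input "records"
     (map #(* % %))   ; map: transform each record individually
     (reduce +))      ; reduce: combine the transformed records
;=> 55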
Now, the differences:
Hadoop map seems to be a more general case of Clojure map, specifically with regard to argument cardinality.
In Clojure, map always produces a sequence of the same length as the one it received. For example, we can increase every number in the input sequence by exactly ten:
(map #(+ % 10) [1 2 3 4 5])
;=> (11 12 13 14 15)
The same can be written in Hadoop like this:
public void map(Text key, IntWritable value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter)
        throws IOException {
    output.collect(key, new IntWritable(value.get() + 10));
}
It becomes obvious just by looking at the code above that the number of output records for each input record depends solely on the number of times the collect method has been called. For example, we can decide to completely ignore input records with values in some specific range. We can achieve this by making a small change in the code:
public void map(Text key, IntWritable value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter)
        throws IOException {
    // ignore records not larger than 3
    if (value.get() > 3) {
        output.collect(key, value);
    }
}
This is something our Clojure map function cannot do. Admittedly, it could transform unwanted items to nil and leave it to the calling function to remove them, but that is not exactly the same thing. The function we actually need here is filter:
(filter #(> % 3) [1 2 3 4 5])
;=> (4 5)
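For comparison, the nil-based workaround mentioned above might look something like this sketch:

;; map unwanted items to nil, then strip the nils afterwards
(remove nil? (map #(when (> % 3) %) [1 2 3 4 5]))
;=> (4 5)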
Following the same train of thought, we can notice that the number of output items can also be larger than the number of input items:
public void map(Text key, IntWritable value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter)
        throws IOException {
    for (int i = 0; i < value.get(); i++) {
        output.collect(key, value);
    }
}
Although Clojure's map by itself cannot produce more output items than it has input items, we can use a little trickery to achieve this. Instead of producing multiple separate items, we will produce sequences and then flatten the final sequence of sequences:
(apply concat (map #(repeat % %) [1 2 3 4 5]))
Or, a bit more elegantly:
(mapcat #(repeat % %) [1 2 3 4 5])
;=> (1 2 2 3 3 3 4 4 4 4 5 5 5 5 5)
That was about Map - the differences are mainly with regard to the input and output argument cardinality. Now we will focus on the Reduce part.
Clojure reduce aggregates the result by applying the given function to the first two items, then applying the same function to that result and the next item, and so on, until it runs out of input items. In the next example we will find the minimum element of a sequence:
(reduce min [4 2 1 5 3])
What happens here:
- Function min is applied to 4 and 2; the result is 2.
- min is then applied to the previous result, 2, and the next element, 1; the result is 1.
- min(1,5) = 1
- min(1,3) = 1
- We have no more elements, so the result is 1
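We can see the same chain of intermediate results directly with reductions, which works like reduce but returns every intermediate value:

(reductions min [4 2 1 5 3])
;=> (4 2 1 1 1)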
On the other hand, an equivalent Hadoop reduce method would look something like this:
public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output,
                   Reporter reporter) throws IOException {
    int min = Integer.MAX_VALUE;
    while (values.hasNext()) {
        int current = values.next().get();
        if (current < min) {
            min = current;
        }
    }
    output.collect(key, new IntWritable(min));
}
Again, Hadoop reduce seems to be a somewhat more general case than its Clojure equivalent, since the order in which the input items will be processed isn't fixed, even if the order of the input items themselves is.
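This difference only stops mattering when the reducing operation is associative and commutative, as min happens to be; in that case the processing order cannot affect the result. A small Clojure sketch:

;; min does not care about processing order
(reduce min [4 2 1 5 3])           ;=> 1
(reduce min (shuffle [4 2 1 5 3])) ;=> 1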
However, if we focus on the part that matters, namely the way we think about our programs, all the differences in terminology between the two technologies fade. Even if the same keywords do not map exactly one-to-one, the simple and most important similarity remains: Map processes records individually and Reduce combines them. Both are necessary steps in processing huge piles of data, even more so since this way of dividing the operations also allows the process to be parallelized to some extent.
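As a rough illustration of that last point, the map half of the little pipeline sketched at the beginning can be parallelized in Clojure almost for free, for example with pmap:

;; pmap applies the per-record transformation in parallel;
;; reduce still combines the results afterwards
(reduce + (pmap #(* % %) [1 2 3 4 5]))
;=> 55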