Thursday, June 6, 2013

Map and Reduce - Conceptual differences Between Clojure and Hadoop

In this article I will explain the differences and the similarities in the concepts of map and reduce between the two very popular platforms. This is not a comparison of Clojure and Hadoop, the two are largely incomparable as one is a programming language and the other one a data processing framework. It is also not a performance benchmark, so if you are looking for tables with statistics on arbitrarily chosen operations, you won't find it here. This is merely an article on what do the words map and reduce mean within the scope of these two technologies.

First, the similarities:

In both technologies, input data is typically divided in a large number of smaller data units, we can simply call them records for the purpose of this article. Map operation is responsible for the transformation of each record individually. It only ever needs to know one item at a time, which is a very powerful assumption as it allows very easy parallelization. Reduce, on the other hand, works by applying the transformation on records against each other or in other words it derives information from multiple items at a time. This makes parallelization a bit more difficult, but it still may be possible depending the way the data is organized. So, roughly said, in both cases map performs a scalar transformation on the input sequence and reduce aggregates it.

Now, the differences:

Hadoop map seems to be a more general case of Clojure map, specifically with regard to argument cardinality.

In Clojure, map always produces a sequence of the same length it received. For example, we can increase every number in the input sequence by exactly ten:

The same can be written in Hadoop as this:

It becomes obvious just by looking at the code above that the number of output records for each input record depends solely on the number of times the collect method has been called. For example, we can decide to completely ignore input records with values in some specific range. We can achieve this by making a small change in the code:

This is something our Clojure map function cannot do. Admittedly it can transform unwanted items to nil and leave it to the caller function to remove them, but that is not exactly the same thing. The function we need here is filter:

Following the same train of thought, we can notice that the number of output items can also be larger:

Although, all by itself Clojure map cannot produce more output items than it has input items, we can use some trickery to achieve this. Instead of producing separate multiple items, we will produce sequences and then flatten the final sequence of sequences:

Or a bit more elegant:

That was about Map - the differences are mainly with regard to the input and output argument cardinality. Now we will focus on the Reduce part.

Clojure reduce aggregates the result by sequentially applying the given function on the current item, then applying the same function on the result and the next item and so on, until it runs out of input items. In the next example we will find the minimum element in a sequence:

What happens here:

  1. Function min is applied on 4 and 2, the result is 2
  2. Now min is applied to the previous result which is 2 and the next element which is 1, the result is 1.
  3. min(1,5) = 1
  4. min(1,3) = 1
  5. We have no more elements, so the result is 1

On the other hand, an equivalent Hadoop reduce method would look something like this:

Again, it seems that Hadoop reduce is a bit more general case than its Clojure equivalent, since the order in which the input items will be processed isn't fixed, even if the order of the input items is.

However, if we focus on the part that matters and that is the way we think about our programs all the differences in terminology between the two technologies fade. Even if the same keywords do not exactly map one-to-one the simple and most important similarity remains: Map is processing records individually and Reduce is combining them - both of which are the necessary steps in processing of huge piles of data, even more so since this way of dividing the operations also allows the process to be paralellized to some extent.