Showing posts with label clojure.

Thursday, June 6, 2013

Map and Reduce - Conceptual differences Between Clojure and Hadoop

In this article I will explain the differences and similarities between the concepts of map and reduce on two very popular platforms. This is not a comparison of Clojure and Hadoop; the two are largely incomparable, as one is a programming language and the other a data processing framework. It is also not a performance benchmark, so if you are looking for tables with statistics on arbitrarily chosen operations, you won't find them here. This is merely an article on what the words map and reduce mean within the scope of these two technologies.

First, the similarities:

In both technologies, input data is typically divided into a large number of smaller data units, which we can simply call records for the purpose of this article. The map operation is responsible for transforming each record individually. It only ever needs to know one item at a time, which is a very powerful assumption, as it makes parallelization very easy. Reduce, on the other hand, works by applying the transformation to records against each other; in other words, it derives information from multiple items at a time. This makes parallelization a bit more difficult, but it may still be possible, depending on the way the data is organized. So, roughly speaking, in both cases map performs a scalar transformation on the input sequence and reduce aggregates it.

Now, the differences:

Hadoop map seems to be a more general case of Clojure map, specifically with regard to argument cardinality.

In Clojure, map called with a single input sequence always produces a sequence of the same length as the one it received. For example, we can increase every number in the input sequence by exactly ten:
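
A minimal sketch of such a call; the input vector is chosen here purely for illustration:

```clojure
;; Illustrative input, chosen for this sketch.
(def numbers [1 2 3 4 5])

;; Add ten to every element; the result has exactly as many items as the input.
(map #(+ % 10) numbers)
;; => (11 12 13 14 15)
```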



The same can be written in Hadoop as this:



It becomes obvious just by looking at the code above that the number of output records for each input record depends solely on the number of times the collect method has been called. For example, we can decide to completely ignore input records with values in some specific range. We can achieve this by making a small change in the code:
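
Hadoop's own API is Java, so what follows is only a Clojure model of the idea and not Hadoop code. The names run-mapper, add-ten and add-ten-outside-range are hypothetical, and the range 3 to 5 is arbitrary. Each record is handed to a mapper function together with a collect callback, and the size of the output is simply the number of times collect was called:

```clojure
;; Hypothetical model of a collect-style mapper; this is NOT the Hadoop API.
(defn run-mapper
  "Calls (mapper record collect) for every record and returns, in order,
  everything the mapper passed to collect."
  [mapper records]
  (let [out     (atom [])
        collect (fn [value] (swap! out conj value))]
    (doseq [record records]
      (mapper record collect))
    @out))

;; One collect call per record: behaves like Clojure's map.
(defn add-ten [x collect]
  (collect (+ x 10)))

;; Records from 3 to 5 (inclusive) are ignored by never calling collect for them.
(defn add-ten-outside-range [x collect]
  (when-not (<= 3 x 5)
    (collect (+ x 10))))

(run-mapper add-ten [1 2 3 4 5 6])
;; => [11 12 13 14 15 16]

(run-mapper add-ten-outside-range [1 2 3 4 5 6])
;; => [11 12 16]
```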



This is something our Clojure map function cannot do. Admittedly it can transform unwanted items to nil and leave it to the caller function to remove them, but that is not exactly the same thing. The function we need here is filter:
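
A sketch of that combination, with an arbitrarily chosen range: filter drops the unwanted records and map transforms whatever is left.

```clojure
;; Illustrative input, chosen for this sketch.
(def numbers [1 2 3 4 5 6])

;; Keep only the values outside the (arbitrary) range 3..5, then add ten.
(map #(+ % 10)
     (filter #(not (<= 3 % 5)) numbers))
;; => (11 12 16)
```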



Following the same train of thought, we can notice that the number of output items can also be larger:



Although Clojure map cannot, all by itself, produce more output items than it has input items, we can use some trickery to achieve this. Instead of producing multiple separate items, we will produce sequences and then flatten the final sequence of sequences:
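
One way this might look, as a sketch with made-up input, where every item yields two output values:

```clojure
;; Illustrative input, chosen for this sketch.
(def numbers [1 2 3])

;; Each input item becomes a two-element vector: the item and the item plus ten.
(map (fn [x] [x (+ x 10)]) numbers)
;; => ([1 11] [2 12] [3 13])

;; flatten turns the sequence of vectors into one sequence, twice as long as the input.
(flatten (map (fn [x] [x (+ x 10)]) numbers))
;; => (1 11 2 12 3 13)
```

Note that flatten removes every level of nesting, so this only works as intended when the produced items are not collections themselves.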



Or, a bit more elegantly:
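
One candidate for the shorter form is mapcat from clojure.core, which maps and concatenates in a single step. The sketch below uses made-up input:

```clojure
;; mapcat applies the function to each item and concatenates the resulting sequences.
(mapcat (fn [x] [x (+ x 10)]) [1 2 3])
;; => (1 11 2 12 3 13)
```

Unlike flatten, mapcat removes only one level of nesting, so items that are themselves collections survive intact.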



That was about Map - the differences are mainly with regard to the input and output argument cardinality. Now we will focus on the Reduce part.

Clojure reduce aggregates the result by applying the given function to the first two items, then applying the same function to that result and the next item, and so on, until it runs out of input items. In the next example we will find the minimum element in a sequence:
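
A minimal sketch, using the input sequence 4, 2, 1, 5, 3 that the walkthrough of the steps refers to:

```clojure
(def numbers [4 2 1 5 3])

;; Fold the whole sequence down to its smallest element.
(reduce min numbers)
;; => 1

;; reductions returns every intermediate result, which makes the individual steps visible.
(reductions min numbers)
;; => (4 2 1 1 1)
```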



What happens here:

  1. min is applied to 4 and 2; the result is 2.
  2. min is then applied to the previous result, 2, and the next element, 1; the result is 1.
  3. min(1,5) = 1
  4. min(1,3) = 1
  5. There are no more elements, so the result is 1.

On the other hand, an equivalent Hadoop reduce method would look something like this:



Again, it seems that Hadoop reduce is a bit more general case than its Clojure equivalent, since the order in which the input items will be processed isn't fixed, even if the order of the input items is.
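
Why the processing order can be left open is easy to see for min, which is associative and commutative. The sketch below is plain Clojure, not Hadoop, and the chunk size of 2 is arbitrary. It reduces chunks of the input separately and then combines the partial results, and it also reduces a shuffled copy of the input:

```clojure
(def numbers [4 2 1 5 3])

;; Reduce each chunk on its own: (4 2) (1 5) (3) give the partial minima 2, 1 and 3.
(def partial-minima
  (map #(reduce min %) (partition-all 2 numbers)))

;; Combining the partial results yields the same answer as one sequential pass.
(reduce min partial-minima)
;; => 1

;; Processing the items in a random order does not change the result either.
(reduce min (shuffle numbers))
;; => 1
```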

However, if we focus on the part that matters, which is the way we think about our programs, all the differences in terminology between the two technologies fade. Even if the same keywords do not map exactly one-to-one, the simple and most important similarity remains: Map processes records individually and Reduce combines them. Both are necessary steps in processing huge piles of data, even more so since dividing the operations this way also allows the process to be parallelized to some extent.

Sunday, January 8, 2012

Clojure Extension for Chrome

Do you like Clojure? Do you often read blogs and articles about it? I know I do. And every time I see an interesting Clojure snippet I rush to open my REPL to try it out. To do that I have to execute the following steps:
1. Press Logo key to open my console.
2. cd to any project
3. lein repl
4. Copy the code snippet from the webpage and paste it in the REPL.
5. Wow, cool!
Your own steps to run the code may be different and probably more efficient. However, in any case, they probably involve opening a REPL and copy-pasting code from the page into it.

What if you do this way too often to repeat all the steps? What if you are just plain lazy? All of us Clojurians know that lazy is good, therefore I decided to jump on the problem and try to make it easier. What if, I asked myself, I could just select a piece of Clojure code right in my browser, right click it and choose something like Eval to get my result back? For that functionality I would need to write a browser plugin, specifically a Chrome extension.

As it turned out, writing Chrome extensions is a pretty easy task. It is not that much different from writing a regular web application: write some HTML files, put some JavaScript in them to do the work, and add an Ajax call to communicate with the backend service. The only significant difference is in the various security rules which the extension has to satisfy. Anyway, the official documentation is pretty good and the examples are even better.
As for the backend service, I used the source of the popular online REPL tryclojure, written by Anthony Grimes, and modified it slightly to replace the Noir framework with an old version of Compojure which I had already used in some of my older apps. Then I unmodified it because Noir is way better.

Anyway, after a few afternoons of hacking, the results of which can be seen on goranjovic/chromeclojure on GitHub if anyone cares, I published the app. The Clojure backend is hosted on Heroku on a free plan and the Chrome extension itself is where all Chrome extensions live: on Google's Chrome Web Store. Click on the link, install it and try it out on this very blog post. I included some Clojure snippets below just for that purpose.

A simple snippet that calculates the meaning of life:
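
Any expression that evaluates to 42 will do for this test. One possibility, given only as a stand-in:

```clojure
;; A stand-in expression chosen for this sketch: six times seven.
(* 6 7)
;; => 42
```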


Go on, select the snippet, right click it and choose Eval as Clojure. If everything is installed and working properly, you should see a Chrome notification containing the result 42. If you, for whatever reason, don't really like notifications and find them annoying, you can change the extension to use plain old alerts to show the result. To do this:
1. Open Wrench Menu > Tools > Extensions
2. Find Clojure Extension
3. Click the Options link
4. Choose between offered response methods and click Save. Currently it is just notification and alert, but more options may come in future releases.

A somewhat less simple snippet, which converts the number 12345 into a sequence of its digits in base 12:
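
A single form that performs this conversion could be sketched as follows. It repeatedly takes the remainder modulo 12 and prepends it to a list, so the most significant digit ends up first:

```clojure
;; conj on a list adds at the front, so digits accumulate most significant first.
(loop [n 12345, digits ()]
  (if (zero? n)
    digits
    (recur (quot n 12) (conj digits (rem n 12)))))
;; => (7 1 8 9)
```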



Again, use the extension to eval the code. It should return (7 1 8 9).

It is also possible to evaluate several forms sequentially, which is useful if you have one or several def-ed functions and one call which evaluates what is needed. This seems useful for the previous example. Select both Clojure forms and evaluate them as before.
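
As a sketch of such a pair of forms, the conversion can be wrapped in a function (the name digits is chosen here for illustration) and followed by a call to it:

```clojure
(defn digits
  "Returns the digits of the positive integer n in the given base,
  most significant digit first."
  [n base]
  (loop [n n, acc ()]
    (if (zero? n)
      acc
      (recur (quot n base) (conj acc (rem n base))))))

(digits 12345 12)
;; => (7 1 8 9)
```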



Of course, you don't have to evaluate them both at once. You could have evaluated the function definition first and then the function call in a different request. Each user has a session and can use it for defs within the time limit.

Naturally, not everything can be evaluated this way. For example, if you try to evaluate the following snippet, you will get an error, since require isn't allowed for security reasons and the jar is probably unavailable to the backend service.



So, unfortunately, it is not possible to use this feature to evaluate snippets like the ones on the Noir website. But then again, sometimes it's better not to be lazy and fire up leiningen.

I hope you like my extension and find it useful in your browsing. If you find any bugs, feel free to report them on the project's Issues page. If you have any ideas on how to extend the plugin's functionality, feel free to comment, or fork my repo and add them yourself.

Happy hacking!

Tuesday, September 21, 2010

Developing From Console

In my previous post I promised to explain how I developed an application while (mostly) avoiding bloated development tools, so here it comes...

First, a brief explanation of the app itself: it is a genetic algorithm application implemented in Clojure, whose goal is to solve a mathematical game. The game solver (my program) gets one target number and six more numbers. The goal is to use these six numbers to create an equation which evaluates to the target number. In this article I will focus on the development environment rather than on the domain problem or the program itself, so you can see more details about the game and get the source on the project page: http://code.google.com/p/genetic-my-number (In case you got to this blog from there, just continue reading, and see how it was made :))

Ok, I briefly explained the problem which my application solves; now to get to the post title. Yeah, most of the tools I used to develop it are command line based. Well, actually all but two. Unix-like environments offer a great deal of useful command line tools, so it was a piece of cake to find and install all of them; however, most are (as far as I know) available on the Windows platform as well.

So, let's see those tools:

Editor - Vim, with a limited Clojure plugin, meaning that I only used the syntax colouring feature and rainbow parens, not nailgun or any other automatic runner. (I could also have used an easier editor like nano; I just like vim more.)

Source Versioning - since I hosted my source in a Mercurial repository, the best and simplest client turned out to be the default console based Mercurial client, hg.

Build Management and Running - Maven2, default console based (I'm sure leiningen is great, and sure will try it out, but I really don't see anything wrong with maven either). Clojure plugin for Maven offers a clojure:repl command, which creates a REPL with all the maven dependencies set up on the classpath without trouble.

Diff tool - Meld (a GUI app!) - simple, lightweight, intuitive, and looks nice... Much easier to work with than default diff, vimdiff and (personal opinion) aesthetically more appealing than kdiff. It also serves the purpose to show that choosing console tools over graphical ones shouldn't be just for geekness sake, but based on their actual usefulness.

Web Browser - Chrome and Firefox, obvious choice for both testing web applications and reading online documentation (also add whatever browser you wish to test against)...

Music Player (C'mon, who doesn't occasionally need music while programming :)) - Moc, Music On Console (the geekest of them all) a command line based music player that does exactly what it says, plays music - no lyrics plugin, no funky visualisations, no music store support - it just plays the damn music... :-)

These tools (and the command line terminal itself) were the only toolset I used while writing the genetic algorithm application, and I wrote it in a remarkably short time. Why is that? First, we shouldn't forget the choice of programming language and platform. Clojure is a modern Lisp, which means that it gives huge flexibility and makes it possible to write a lot of functionality in a short time and in a small number of lines of code. Second, the tools I chose are all lightweight: they load in a millisecond, save a file just as fast, and writing a file doesn't trigger a rebuild. So I just got rid of all those annoying seconds spent waiting for Eclipse to finish some automatic task, not to mention the time needed to set up all the plugins. There were many times when I had more trouble integrating a tool with Eclipse than setting up the tool itself, e.g. when trying to use Google Web Toolkit with both the Maven GWT plugin and the Eclipse GWT plugin. So, my solution is quite simple: use an editor for editing. Period. Use a build tool for building, a version tool for versioning, etc. "Write (or in my case: Use) programs that do one thing and do it well." This principle isn't anything new; it's been known as a key part of the Unix philosophy, so it isn't really a surprise that most development tools that satisfy it come from the Unix world.

Also, as I mentioned in my previous post, there is a tendency in IDE design to encompass the functionality of the system's window manager within the IDE window, so basically we have to switch between working on different files, the REPL, the web browser etc. both within the IDE and within the system window manager. With the toolkit I used, the only switching I had to do was between terminal windows and a browser. Even the Meld tool would pop up when launched from the console, and then after seeing the diff and merging changes I would simply close it.


Of course, there are drawbacks to this approach. Some technologies, especially proprietary ones or those that have an enterprisey approach (or both!), simply demand a specific environment. But whenever I have the luxury of choosing what to work with, it will be whatever is comfortable and makes me most productive.