One of the common questions I get from people about my new book MapReduce Design Patterns is “why did you write it?” In this post, I’ll explain the reasons, as well as what MapReduce design patterns are, why they need to exist, and why the time is right.
Before getting into MapReduce design patterns, let’s talk about what a design pattern is. A design pattern in software engineering has the following properties:
- General: the pattern strives to be domain independent
- Reusable: the pattern is applicable to a number of different problems
- Cannot be transformed directly into code: the pattern is a template for problem solving, not a solution
- Follows best practices: there may be a number of ways to solve a problem, but likely the pattern is the best practice for that type of solution
Outside of MapReduce, design patterns provide a number of benefits to a community of software engineers. They:
- Get the developer 80% of the way there and save some time. Knowing how to solve the problem is the majority of the battle; the developer just needs to tailor it to the domain-specific use case. The 80% rule seems to be a good middle ground. If you make the pattern any more specific, it won’t have general applicability. If it is less specific, it’s not really useful at all.
- Pass knowledge from experts to beginners. New engineers can benefit from the lessons their predecessors learned. If experts spend the time to document a pattern, they can save themselves time in the future while supporting a broader audience.
- Provide a common language for solutions. If problems and solutions are named, a community has a common language to discuss challenges and implementation. This enables more efficient and effective communication among members of a team.
- Make the intent of code easier to understand. When you implement a pattern to solve one of your problems, it’ll be easier for other people who know the pattern to understand your code. While some software solutions are complex by necessity, when they follow a recognized template, it is far easier for others to understand them.
What is a MapReduce design pattern? Well, it’s all of the things above, but in the context of MapReduce. MapReduce is a rather constraining framework in which you have to express your solutions in terms of “map” and “reduce”. In return, you get the benefits of abstracted parallelism and fault tolerance. The paradigm may be limiting, but it is far easier to work with: the list of different ways to solve problems is relatively short in comparison to object-oriented patterns.
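To make the “map” and “reduce” constraint concrete, here is a minimal sketch in plain Python. It is not the Hadoop API and not an example from the book. It simulates the three steps in memory for a word count, and every name in it (`run_job`, `shuffle`, `word_count_mapper`, and so on) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key. A real framework does this step for you."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key, over all values seen for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Hypothetical driver: map, then shuffle, then reduce, all in memory."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# The only problem-specific code: one map function and one reduce function.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    return sum(counts)


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog"]
    print(run_job(lines, word_count_mapper, word_count_reducer))
```

The point of the sketch is the division of labor: the developer supplies only the mapper and the reducer, and everything else is generic. In a real framework that generic part is also where the parallelism and fault tolerance live, which this in-memory version does not attempt to show.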
So why write the book on MapReduce design patterns now? I have taught a number of courses on Hadoop, mentored several Hadoop newbies, and explained how to do things in MapReduce to more general audiences. Explaining certain approaches over and over again became really tedious, and I found there was a general lack of good, centralized, and authoritative documentation that I could point someone to. At the start, I thought writing such documentation would save me some time, since I assumed I was just one person in one company with this problem. I soon realized that my situation was not unique in the Hadoop user community. Hadoop is sufficiently mature that there are now enough experts and new users for a guide to design patterns to become useful.
Not too early…
You may be asking whether the release of this book is too early in Hadoop’s evolution. After all, prematurely building design patterns can be a waste of time.
First, Hadoop as a project has changed significantly in the past few years, but recent changes are not as radical as they once were. For example, there was a major split between the old and the new MapReduce APIs, with the new API offering a revamped interface but lacking several utilities. For quite a while, deciding which one to use was very awkward. With the release of 1.0, a significant increase in users for mission-critical purposes, mature commercial support from several companies, and more, Hadoop has now gotten to the point where it has to be stable.
Second, at this point users have had time to determine what works well and what does not. I could have come up with a bunch of patterns that nobody has ever used before, but there would be no point to it. With a more mature community of experts that has repeatedly identified and solved problems, the most common solutions have developed into design patterns.
A few other sources have written about MapReduce design patterns. My favorite is a blog post by Ilya Katsov, which is closest in spirit to my book. I think his approach to patterns is very similar to mine, reaffirming my belief that this is an important topic. Next is Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, one of the first books written on Hadoop and MapReduce. It covers several design patterns, but for the most part sticks to the text-related domain of problems. I think this book was a great start, but it is not general and domain-independent. Then, there are the countless mailing list posts, StackOverflow answers, blog posts, etc. that document little tidbits here and there. Nothing in the book MapReduce Design Patterns is new or novel. But having useful patterns diluted with marginal ones has made the gems harder to find, necessitating a guide that separates the signal from the noise.
The time is right…
The Hadoop community is ready for an authoritative source of MapReduce design patterns, which I hope the contents of my book can either be or inspire. Here are the reasons why this is all coming together:
- It’s not too early
- Groups of engineers are building patterns independently, but having a hard time sharing them with the rest of the community
- There are tons of new Hadoop users every day who could leverage experts’ documentation
- MapReduce is a new way of thinking that may not be intuitive to everyone right away, so some ways to solve problems may sneak up on people
- MapReduce design patterns provide a foundation for higher-level abstractions such as Pig, Hive, and who knows what else will come next
Hopefully I’ve convinced you that MapReduce design patterns are a good thing. This really has to be a community effort in the long run. Get the word out! Blog about new patterns that you discover, or perhaps talk about them at a Hadoop conference or local Hadoop meetup. This will be even more crucial as Hadoop continues to change. With the nature of data shifting towards even more challenging formats such as audio, imagery, video, and bio, we’ll see some new patterns crop up to tackle the challenges of each. New libraries, tools, and abstractions will be built that will make some of the current patterns useless and in turn will open the door to completely new patterns. They may also simply enable existing patterns to be implemented more elegantly. Also, with the advent of YARN, and with the rise of other Hadoop ecosystem components, the list of useful patterns for Hadoop will expand beyond MapReduce.
The only way we can keep up is to make the commitment as a community to documenting, discussing, and refining patterns for the greater good, much like the object-oriented programming community has done to great success.