Making Hadoop MapReduce Work with a Redis Cluster


Redis is a very cool open-source key-value store that can add instant value to your Hadoop installation. Since keys can contain strings, hashes, lists, sets and sorted sets, Redis can be used as a front end to serve data out of Hadoop, caching your ‘hot’ pieces of data in memory for fast access when they are needed again.

Read more »

The History of Hadoop: From Small Starts to Big Data


Named after a toy elephant belonging to developer Doug Cutting’s son, Hadoop has proven over the past decade to be the little platform that could. From its humble beginnings as an open-source search engine project created by Cutting and Mike Cafarella, Hadoop has evolved into a robust platform for Big Data storage and analysis.

Read more »

Disruptive Data Science – Transforming Your Company into a Data Science-Driven Enterprise


Big Data is the latest technology wave impacting C-level executives across all areas of business, but amid the hype, there remains confusion about what it all means. The name emphasizes the exponential growth of data volumes worldwide (collectively, 2.5 exabytes per day in the latest estimate I saw from IDC), but more nuanced definitions of Big Data incorporate the following key tenets: diversification, low latency, and ubiquity.

Read more »

Hadoop Vaidya: Performance advisor for Hadoop Map/Reduce Jobs

Vaidya: (in Sanskrit) An expert (versed in his own profession, esp. in medical science), skilled in the art of healing.

It’s been a few years since I open-sourced Hadoop Vaidya as a “contrib” project under Apache Hadoop. It is a rule-based performance diagnostic framework for MapReduce jobs in which each rule (aka diagnostic test) identifies a specific problem with the job’s performance or scalability, or even a best-practice violation, and suggests a solution.

Read more »

Meet the “Team of Rivals” Building Greenplum HD


When our company was acquired by EMC in July of 2010, we could have easily been scooped up and monetized as a pretty nice data warehousing business for our parent company. They decided to do the opposite. EMC’s leadership believed in our team and our vision for leading the Big Data analytics industry and decided to double down on their investment.

Read more »

Why The Time is Right for MapReduce Design Patterns


One of the common questions I get from people about my new book MapReduce Design Patterns is “why did you write it?” In this post, I’ll explain the reasons, as well as what MapReduce design patterns are, why they need to exist, and why the time is right.

Read more »

Moving Beyond Restaurant Recommendations to Predictive Location Analytics

Map matrix via ESRI.

The big news around maps in recent months has been the battle between Google Maps and Apple’s new alternative service for iOS 6. As smartphones increasingly become the all-in-one personal computer, organizer, assistant, and navigator, both companies want to build their own beachheads to establish platform control.

Read more »

Top Picks for Hadoop World 2012


Hadoop World 2012 is just around the corner, kicking off next Tuesday, October 23, in New York City. This will be my third consecutive time at Hadoop World and it has been exciting to watch the ecosystem change and evolve over the past few years.

Read more »

Hadoop MapReduce Can Transform How You Build Top-Ten Lists

Cat icon by Marie Coons via The Noun Project.

It seems like websites, magazines, and TV shows everywhere are building top-ten lists (or top-k lists) these days: the top ten science fiction movies of all time, the best places to live, and so on. Top-ten lists are not only a lot of fun because of our seemingly primal need to create categories and hierarchies; they can actually be a useful way to analyze your data.

Read more »

Towards a Unified In-Situ Analytics System

Photo by Michael Mandiberg via Flickr. (CC BY-SA 2.0)

With ever-growing data sets produced by user-generated online content and activity, and by machine-generated data from server logging and network traffic monitoring, enterprise customers want the best of both worlds. They want to perform complicated interactive queries and sophisticated reporting easily, using existing BI tool sets.

Read more »