It has been a few years since I open sourced Hadoop Vaidya as a “contrib” project under Apache Hadoop. It is a rule-based performance diagnostic framework for MapReduce jobs: each rule (aka diagnostic test) identifies a specific problem with a job’s performance or scalability, or a violation of a best practice, and suggests a solution.
The project has seen only limited activity in the open source community, due to usability issues in the testing framework and the difficulty of keeping the diagnostic rule logic consistent with the constantly evolving Hadoop code base. Of course, you can also blame my own lack of active involvement in improving the state of the project.
Around 2009/10, I was actively involved in productizing the tool at Yahoo!. Internal users at Yahoo! were using Vaidya to analyze thousands of jobs every day and to discover ways to improve them. It proved to be a very effective tool for the many users at Yahoo! who didn’t have much insight into configuring and tuning their MapReduce programs. Vaidya also proved very valuable to cluster administrators, helping them identify underperforming jobs that wasted the cluster’s shared resources and recognize common user mistakes in the construction of MapReduce jobs. Recognizing those mistakes allowed admins to certify the overall quality of jobs before on-boarding them onto the production cluster.
Now, as a proud member of the Greenplum HD team, I have an opportunity to revisit my initial work on Hadoop Vaidya and rectify some of its key problems. Our work has progressed quickly, and we’re planning to make it available to Greenplum’s users as part of the upcoming release of Greenplum HD, which is built using Apache Hadoop. Here are a few immediate goals:
- Integrate Hadoop Vaidya with the Job Tracker History UI for users to conveniently invoke and view the Vaidya analysis report for each job
- Add more Vaidya diagnostic tests/rules to enable more comprehensive job analysis
In its current state, Hadoop Vaidya is an extensible framework that allows users to write their own tests/rules for analyzing MapReduce applications. It uses the Job Configuration and Job History logs as input for this analysis, but moving forward we plan to integrate the tool with data from Greenplum Command Center, a management and monitoring platform for both GPDB and GPHD. This will enable Vaidya to incorporate more sources of information from the cluster into its analysis, such as daemon/user logs, audit logs, job queue information and system metrics. It will also enable real-time analysis of MapReduce jobs while they are running. The tool will also be hosted on our 1,000-node Analytics Workbench (AWB), so that partners and research institutions using the cluster can take advantage of Vaidya in their own analysis.
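To make the idea of a rule concrete, here is a minimal sketch in Python. It is not Vaidya's actual API (Vaidya itself is written in Java), and every name in it is hypothetical: `JobStats`, `DiagnosticRule`, `MapSpillRule`, `run_rules`, the counter keys and the threshold values are all assumptions made for illustration. It only shows the general shape described above: a rule reads values taken from the job configuration and history, scores how severe one specific problem is, and offers a suggestion when the score crosses a threshold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobStats:
    """Hypothetical container for values parsed from job config and history logs."""
    config: Dict[str, str]
    counters: Dict[str, int]


class DiagnosticRule:
    """Hypothetical base class: one rule checks for one specific problem."""
    name = "unnamed"
    threshold = 0.5  # an impact at or above this value counts as a failed check

    def evaluate(self, job: JobStats) -> float:
        """Return an impact score in [0, 1]; higher means a bigger problem."""
        raise NotImplementedError

    def prescription(self) -> str:
        """Return the advice shown to the user when the check fails."""
        raise NotImplementedError


class MapSpillRule(DiagnosticRule):
    """Illustrative heuristic only: flags jobs that spill far more records than they emit."""
    name = "map-side-spill"
    threshold = 0.3

    def evaluate(self, job: JobStats) -> float:
        # Counter keys are assumed names, not taken from a real history log.
        output = job.counters.get("MAP_OUTPUT_RECORDS", 0)
        spilled = job.counters.get("SPILLED_RECORDS", 0)
        if output == 0:
            return 0.0
        extra = max(0, spilled - output)
        return min(1.0, extra / output)

    def prescription(self) -> str:
        return "Consider a larger map-side sort buffer or emitting fewer records."


def run_rules(job: JobStats, rules: List[DiagnosticRule]) -> List[dict]:
    """Evaluate every rule and build a simple report, one entry per rule."""
    report = []
    for rule in rules:
        impact = rule.evaluate(job)
        failed = impact >= rule.threshold
        report.append({
            "rule": rule.name,
            "impact": impact,
            "failed": failed,
            "advice": rule.prescription() if failed else "",
        })
    return report
```

With this shape, adding a new check means writing one more small class and appending it to the list passed to `run_rules`; the reporting code does not change.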
I am quite excited and confident about this development, as the same leadership team that made Vaidya a success at Yahoo! (Milind Bhandarkar and Apurva Desai) is eager to expand upon the project and offer it to Greenplum HD’s users.
Stay tuned for more blog posts in this series, where I will describe the extensible design of the Vaidya framework, the existing diagnostic rules, and how to write new rules.