For The Win
Kaggle Transforms Data Science Into Competitive Sport.
In the popular imagination, competitive sports and data science have little in common. Those who work with data on a daily basis know that parsing, mining, refining, and developing predictive algorithms from big data is a game with some of the highest stakes imaginable, but they rarely get the glory. Over the past two years, Kaggle has changed that, developing an online platform for machine learning competitions that invites data scientists to compete against their peers for cash prizes and glory.
At the fourth Big Data for the Public Good seminar on Wednesday, May 16, 2012, hosted by Code for America and sponsored by Greenplum, a division of EMC, Jeremy Howard, Kaggle’s President and Chief Scientist, explained the genesis of the start-up that “makes data science a sport” by inviting data scientists to work with real-world data sets and address significant problems.
“There’s this amazing mismatch between the people that have the data and the problems and those who know how to use it,” Howard explained. “All of my buddies in research, around data mining, data science, machine learning — they say their number one problem is that they can’t get access to datasets, they can’t get access to real-world problems. Kaggle was created to deal with this mismatch through a crowdsourcing-based approach.”
This global platform for machine learning competitions began in April 2010 with a contest to develop a more effective model for predicting a patient’s HIV progression. At the time, the state of the art in prediction, after four years of dedicated research, had a 70% accuracy rate. With limited clinical information and a data set of 1,000 patients, data scientists from various backgrounds competed for 96 days to develop a more accurate predictive model.
“Participants had three months to not only get to the point where they could match the state of the art after four years of research,” said Howard, “but go beyond that and improve the prediction accuracy rate. After only one and a half weeks, the world’s best research had been easily surpassed.” By the end of the competition, the accuracy rate was increased to 77%. “We discovered that this approach of running machine learning competitions can actually advance technology in amazing ways.”
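For readers unfamiliar with the metric behind those 70% and 77% figures, an "accuracy rate" is simply the fraction of cases a model predicts correctly. A minimal sketch, with entirely made-up labels (the actual competition data and models are not shown here):

```python
# Illustrative only: how a prediction accuracy rate is computed --
# the fraction of patients whose outcome the model got right.
def accuracy(predictions, actuals):
    """Fraction of predictions that match the true outcomes."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

# Hypothetical outcomes: 1 = HIV progression, 0 = no progression
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
print(accuracy(predicted, actual))  # 0.8, i.e. an 80% accuracy rate
```

In the HIV contest, raising this number from 0.70 to 0.77 meant correctly classifying seven additional patients out of every hundred.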
A year later, Kaggle was approached by the European Space Agency, the Royal Astronomical Society, and NASA to run a contest to develop an algorithm that accounted for gravitational lensing (the distribution of matter that bends light from the source) when mapping dark matter. It was a problem that had long stymied the finest scientific minds in the field. As noted on The White House’s website, “in less than a week, Martin O’Leary, a PhD student in glaciology, had crafted an algorithm that outperformed the state-of-the-art algorithms most commonly used in astronomy for mapping dark matter.” By the end of the three-month competition, 15 participants had surpassed previous results by up to 300%.
Since then, participants in Kaggle contests have outperformed betting markets, developed a more accurate system to predict chess outcomes, forecasted bodily injury liability for Allstate, and, for a $3 million prize, developed algorithms to identify patients likely to be admitted to a hospital within the following year.
Howard cites the diversity of participants and approaches as a primary reason for the high rates of success. “[The] data scientists who are entering them are from all over the world,” Howard said. “They’re coming from academia, they’re coming from research organizations, they’re coming from commercial organizations.”
“Interestingly, the most successful participants tend to be not from machine learning, not from statistics, but from engineering and physics — people who apply real data every day to solving real problems. These people apply a wide range of different techniques. So you’ve got a diversity of people using a diversity of approaches, and together they can come up with new ways of doing things that haven’t been thought of before.”
Jeremy Howard, Kaggle
Data scientists are enticed not only by the cash prizes, but also by the opportunity to work with data sets and solve problems they otherwise wouldn’t have access to. “Hackers love to hack at interesting problems and interesting code,” he said. During the talk’s introduction, Jack Madans, Program Coordinator for Code for America, spoke to the hacker ethos that entices developers to hackathons, data dives, and Kaggle competitions. “This is cognitive surplus in action,” said Madans, citing Clay Shirky’s concept that the Internet allows society to harness the skills and the trillions of hours of free time enjoyed by residents of the developed world.
But there’s another fundamental reason for Kaggle’s success: the competitive impulse it harnesses. Participants are ranked on a real-time leaderboard that drives them to push harder to succeed. Howard said, “every day you can see how you’re going on the leaderboard — ‘how predictive is my predictive model?’ You can see if people have passed you or if you’ve passed other people, and it creates this amazing ability to push people further than they think they can go.”
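The leaderboard mechanic Howard describes is simple to picture: each submission is scored against held-out data, each participant keeps their best score, and the rankings update in real time. A minimal sketch, with invented names and scores:

```python
# A minimal sketch of a real-time leaderboard: each submission is
# scored, and participants are re-ranked immediately.
submissions = {}  # participant -> best score so far (higher is better)

def submit(participant, score):
    """Record a submission, keeping each participant's best score."""
    best = submissions.get(participant)
    if best is None or score > best:
        submissions[participant] = score

def leaderboard():
    """Participants ranked from best score to worst."""
    return sorted(submissions.items(), key=lambda kv: kv[1], reverse=True)

submit("alice", 0.70)
submit("bob", 0.74)
submit("alice", 0.77)   # alice improves and retakes the lead
print(leaderboard())    # [('alice', 0.77), ('bob', 0.74)]
```

Seeing a rival edge ahead by a fraction of a point is exactly the feedback loop Howard credits with pushing participants “further than they think they can go.”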
While Kaggle’s appeal for research organizations and corporations is self-evident, the model’s potential to serve the social good is equally promising. Howard noted that it’s free to start a competition on Kaggle that serves a social cause, providing a unique opportunity for organizations that lack the resources and staff to gain predictive insight from the data available to them. He cited the EMC Data Science Global Hackathon for Air Quality Prediction as one such example. The weekend-long competition provided participants with EPA Air Quality Index data for Chicago, Illinois, and called for a locally based early warning system that could accurately predict dangerous pollutant levels on an hourly basis.
Not only did the participants develop a solution to the problem at hand, they created models that could then be applied to other locations. The competitive hackathon worked with “a data set which is local in scope,” Howard said, “[which] you can use at a local level, yet you can also take the results and apply them really powerfully throughout the world.”
In the case of social organizations and causes, the mismatch between skilled practitioners and available data is pronounced. “What I’ve discovered is that with cause organizations where they don’t have people working on this stuff, they often don’t see the forest for the trees,” he said. Weekend-long civic hackathons and data dives such as those hosted by Code for America and DataKind, the organization founded by previous Big Data for the Public Good speaker Jake Porway, begin to address this divide. But given the limited duration of such events, attendees are only able to begin identifying problems posed by the data.
“There’s a place for hackathons and data dives,” Howard said. Attendees often get to “the point where they understand a problem that can be solved that then could use some machine learning.” In Howard’s view, this is where Kaggle’s data competition model comes into the picture, provoking data practitioners from a wide variety of backgrounds to develop predictive models for money, glory, and most significantly, the greater good.