Data science is a team sport that thrives upon collaboration, quick iteration, and a healthy amount of collegial competitiveness. These characteristics also drive development in the open source software community. So it’s fitting that Greenplum announced the release of Chorus, its social platform for collaboration on predictive analytics projects, as an open source project last week at the Strata Conference in New York City.
The OpenChorus project aims to develop a platform for collaborative data science with Greenplum customers, data science practitioners, open source developers, and a variety of like-minded partners, while facilitating an open dialogue about the future of predictive analytics. Datastream spoke with Logan Lee, Director of Product Management at Greenplum, about the company’s reasons for opening Chorus’s code to the community, and the company’s goals for the future of the product and the practice of data science.
Can you speak to the motivations behind Greenplum’s decision to release Chorus as an open source project?
Greenplum takes its opportunity to advance the practice of Data Science seriously. Chorus represents an important component of our effort to improve this practice. For Chorus to become a truly transformative platform for collaborative data science, it must be open and reflect a vision beyond Greenplum. By opening it, we hope to include others within the Data Science community, who can bring their own ideas and contribute to improving a practice that will profoundly affect wide-ranging industries and causes worldwide.
While our primary objective is to facilitate contribution to Chorus that its users will value, we also wanted to ensure that Chorus’ source code is useful to others beyond our vision. We decided to license it under the Apache License 2.0 because we believe that it provides the right opportunity and protections for both contributors to our project and those who would re-use our code for other purposes. We think that’s the right approach for the type of product that we have.
What sorts of contributions and collaborations does Greenplum expect?
We believe contribution is likely to come from partners in the ecosystem and customers of the product, but we also welcome others with like-minded goals for the practice of data science to get involved. We expect contributions will focus on making communities more productive, creating access to data, and improving the value of applications through social use.
At Strata, Greenplum also announced three partnerships that we think reflect these types of contributions. The first is with Kaggle, which comes back to the idea of community being transformative. The goal of our partnership is to tackle the short supply of, and heavy demand for, data scientists. Leveraging Chorus to connect both sides transforms the way people can get assistance with their Big Data problems, and also allows Kaggle’s elite data scientists to expand the market for their highly sought skills.
The second, the data piece, is a collaboration with Gnip, a leading provider of social media data streams. We agree with Gartner, which stated that one of the top technology trends for 2013 will be an increased focus on joining public data sources with internal data assets. This first effort allows Chorus users to automatically access and import historical Twitter data from Gnip’s APIs into their own Greenplum database in order to join datasets and perform sophisticated analysis.
The third is a collaboration with Tableau, which makes an application for advanced visual data analysis. As with many aspects of Data Science, there are opportunities to re-use an organization’s Tableau assets across teams and to gain productivity during the iterative stages of authoring visual analysis. Integrating the Tableau workflow into Chorus creates the visibility and access that make that opportunity possible.
How do you see the open sourcing of Chorus and these collaborations contributing to the product in the long-term?
First, it’s important to clarify that, independent of our open sourcing of Chorus’ code and our investment in facilitating the community project, Greenplum continues to build and stand behind Chorus as a product. When customers use the Chorus product Greenplum produces with this code, they will benefit from our investment in its continued quality and from the support we provide for any issue they experience.
We believe our three recent partnerships will create new opportunities and solve real problems for our users. We fully intend to continue building on them, based on what our customers tell us.
The open sourcing of Chorus’ code will accelerate the value the product provides, thanks to an increasing number of hands producing it, which will broaden the range of perspectives and experience that drive its vision. While we continue to build Chorus to solve the problems our customers tell us are important to their Data Science efforts, our partners and customers are now empowered to contribute improvements themselves should they choose to.
We’re not just opening the code; we’re investing in establishing a community to foster onboarding, dialogue, and successful implementations in the product by providing resources and people that facilitate those interactions. That starts with our OpenChorus project page, which is the place to get started if you’re interested in contributing.
We also believe customers of Chorus have an opportunity, should they choose, to privately extend and modify Chorus in ways that are meaningful to them and to operate their own version of the product. While we know this isn’t for everyone, it’s important to point out that it’s a valid option.