Tooling is starting to tame Hadoop

OVUM VIEW

Summary

The key hurdles to mainstream enterprise adoption of Hadoop are the platform's immaturity, compounded by a lack of tooling and skills. However, with major IT data platform providers having already staked their claims to Hadoop, attention is shifting toward infrastructure and data-consumption tooling. Highlights include VMware’s announcement of initiatives to make Hadoop virtualization-friendly, Teradata Aster and Hortonworks’ announcement of new bridges to the SQL world, and releases by Hadoop-native tool providers such as Datameer and Karmasphere that will enable the consumption and development of data-driven analytics applications on Hadoop. These announcements, coinciding with the recent Hadoop Summit developer conference, provide fresh evidence that the Hadoop tooling ecosystem is starting to materialize, beginning with data-access tooling. We believe that Hadoop tooling releases will reach critical developer mass within 12 months, but progress will be slower at the infrastructure level; VMware’s announcement is an important early step in that direction.

Hadoop may be a moving target, but the platform is getting increasingly “packaged”

Hadoop has been a classic early-adopter market, with early practitioners improvising as they defined the Hadoop platform. The diversity of sub-projects within the Apache Hadoop project reflects the energy still being devoted to staking out what Hadoop can and should be, from job coordination and scheduling to event logging, data serialization, metadata management, and more.

While the Apache project is a meritocracy that does not necessarily screen projects for their commercial potential, decisions are being made by committee to impose some order on the chaos. It began with the definition and release of a 1.0 version, which, ironically, is less current than the 0.23 release. The Apache Hadoop community is now trying to add some predictability to the roadmap, outlining what should go into the 2.0 platform tentatively planned for 2013.

For enterprises, however, it is the vendor community, not the Apache Foundation, that will impose de facto order. There are now roughly half a dozen commercially supported Hadoop distributions that vary in their content: different mixes and versions of Apache Hadoop sub-projects and, in some cases, proprietary content that either replaces or adds to what Apache offers. Excluding MapR, a proprietary variant that customers choose for its higher performance, most first-wave enterprise adopters won’t judge commercial distributions by which Apache project modules are included. Instead, their decision will hinge on product support and external value-add, such as data integration, training, and pre-packaged analytics functions.

While commercial distributions will never become carbon copies of one another, Ovum expects the core content of commercial Hadoop platforms to settle down within 18 months.

Tooling will be the brass ring

The next logical step in Hadoop’s market development is the emergence of a tooling ecosystem. The Hadoop market is in the first stage of commercial tooling development, with a handful of offerings from firms that are emerging from stealth or shipping their second generation of releases. Recent news includes:

  • Datameer, which provides a spreadsheet that carries several hundred analytics functions operating natively in Hadoop, has released versions of its product priced for workgroup and individual adoption.
  • Karmasphere, which offers data-driven application development tools for accessing, querying, and analyzing Hadoop data, has added a collaborative development front end, matching what EMC Greenplum recently released with Chorus (see Ovum’s “EMC spotlights App Development side of Big Data”).
  • Teradata Aster and Hortonworks are partnering on the development of SQL-H, a new option for providing SQL access to Hadoop. SQL-H places a bet on HCatalog, a still-nascent Hadoop sub-project that remains in the incubation stage (for a generic illustration of SQL access to Hadoop data, see the sketch after this list).
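
For readers who have not yet seen SQL-style access to Hadoop in practice, the sketch below shows roughly what it looks like today through the widely used Hive JDBC interface, where table definitions live in the Hive/HCatalog metastore while the data itself sits in HDFS. This is an illustration only, not the SQL-H product: the driver class, connection URL, and the weblogs table are assumptions for the sketch and will vary by Hadoop distribution and Hive version.

    // Illustrative sketch only: generic SQL access to Hadoop-resident data via Hive JDBC.
    // The driver class, connection URL, and the "weblogs" table are placeholder assumptions;
    // they vary by Hadoop distribution and Hive version, and this is not the SQL-H API.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HadoopSqlSketch {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver (the class name differs between HiveServer releases).
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

            // Connect to a Hive warehouse; host, port, and database here are placeholders.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();

            // An ordinary SQL query; Hive compiles it into MapReduce jobs over files in HDFS,
            // using table metadata held in the Hive/HCatalog metastore.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }

            rs.close();
            stmt.close();
            conn.close();
        }
    }

Tools such as SQL-H aim to make this kind of access richer and better integrated with existing SQL environments, with HCatalog supplying the shared table metadata.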

This tooling is necessary to lower the technical-skills bar and to improve access to the data. Although tools alone won’t solve the skills deficit, they will help, just as they did in driving the evolution of the business intelligence (BI) and data warehousing software markets, which faced similar challenges when they emerged. Admittedly, Hadoop and Big Data in general throw some unique challenges into the mix, such as changing how analytics is conducted and, with it, creating the need for a new breed of near-genius: the data scientist (see Ovum’s “Where are the Big Data Skills?”).

Ovum therefore expects that the upcoming wave of venture-backed Big Data firms will focus on tooling that makes the platform easier to use, and we expect the tooling market to hit critical mass in the next 12 months.

Making Hadoop a first-class citizen of IT infrastructure

While tools will make Hadoop more consumable to the business, there remains the urgency of getting Hadoop to fit within traditional data center practices and technologies. Here, Hadoop is at a much earlier stage of evolution because early adopters largely implemented Hadoop as an island that was managed differently from their other internal systems.

One of the big questions is virtualization, a practice that is well established in the enterprise. While early adopters have focused on scale-out rather than server utilization, mainstream enterprises have more finite resources and frequently face budgetary constraints on new capital purchases, such as buying new nodes for a cluster. VMware’s announcements that it will work with the Apache project to make Hadoop virtualization-aware, and that it will launch its own open-source project (Serengeti) for virtualizing Hadoop clusters, are important steps toward making Hadoop a first-class citizen of enterprise IT infrastructure. This follows a recent EMC announcement extending support for the Hadoop HDFS file system to its Isilon scale-out managed storage clusters (see Ovum’s “EMC brings NAS storage to Hadoop HDFS”).

These are only the early steps, and several things must now happen. Hadoop must develop hooks to enterprise security, identity and access management, performance and service-level management, and other infrastructure tools and practices. In turn, Hadoop may force adaptations to some of those practices as it assumes roles such as replacing near-line or offline archiving. Because 2012 is the year that access- and consumption-oriented tooling will begin to emerge, we expect 2013 to be the year that infrastructure-management vendors start to engage en masse with the Hadoop community.

APPENDIX

Disclaimer

All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the publisher, Ovum (an Informa business).

The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.