Hadoop’s growing pains



These are boom times for the Hadoop community. Over the past year, major players such as IBM, Oracle, EMC, and Teradata have staked their claims to Hadoop support. Independents Cloudera and Hortonworks, whose businesses are solely Hadoop-focused, have raised around $60 million in financing, while venture firm Accel Partners has just announced a $100 million fund for Big Data startups. At the recent Hadoop World annual conference, "for hire" messages appeared on the final slides of many presentations. Clearly advancing out of its incubation with Internet giants such as Google, Yahoo, and Facebook, Hadoop is now being positioned for the enterprise mainstream. That will happen, but there are speed bumps on the road: talent is still scarce, the tooling is rudimentary, and the definition of the core stack is still being settled. Nonetheless, for organizations that can either recruit professionals with Hadoop and MapReduce skills or spare the time and money to train existing staff, the time is ripe to begin piloting.

The boom times for Hadoop should sound familiar

This is a heady period for Hadoop, as success stories are materializing that show its power and potential. The draw of Hadoop is that it is designed to store huge volumes of data without the overhead of relational databases. It is also designed for economic deployment on commodity hardware and provides the elasticity to support flexible scale-out: if you already know how to set up and maintain a Hadoop cluster, it is not difficult to simply add more commodity servers and storage. The business cases for Hadoop are those for Big Data: in many instances it pays to analyze larger sets of data to produce better analytic results. Yahoo realized dramatic improvements in click-through rates (which in turn drove higher ad revenue) once it replaced its MySQL data warehouse with a Hadoop installation that was able to retain far more data.

The world has woken up to the possibilities of Hadoop, and in the past year major platform providers have planted their stake and venture capital is rushing in. The catch is that talent is still scarce and the technology is evolving. Right now it is a sellers’ market when it comes to recruiting Hadoop and MapReduce talent.

If this story sounds familiar, it should. Rewind the tape to the dot-com bubble. Java was the emerging Internet application development language, with the promise of write once, run anywhere. Tooling was similarly emergent, J2EE had not yet codified the Java enterprise stack, and entry-level Java programmers coming out of university were commanding six-figure salaries.

The situation resolved itself with the emergence of the Java enterprise platform, a new generation of tooling and application servers, and the effects of the laws of supply and demand as developers rushed to learn Java. When the dot-com bubble burst and the post-9/11 recession reared its ugly head, that talent shortage became a surplus. The Hadoop story should follow a similar scenario, although this time around, we hope that the economy will actually improve.

Hadoop is still a diamond in the rough

Hadoop’s ability to store large amounts of data, and the MapReduce framework’s approach to harnessing the power of parallel processing, are proven. The platform is still not known for speed, which for now relegates Hadoop to being a largely offline, batch-oriented system. Clearly, performance must improve if Hadoop is to become broadly useful to organizations. At Hadoop World, Facebook demonstrated the enhancements it has made to transform HBase, Hadoop’s columnar data store, into a real-time system; those enhancements are only now starting to make their way into the Apache project.
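The MapReduce approach mentioned above can be illustrated with a minimal, self-contained sketch. This is not Hadoop's Java API; it is a hypothetical single-process simulation of the three phases (map, shuffle, reduce) using a word count, the canonical MapReduce example, to show why the model parallelizes so naturally: map and reduce calls are independent per record and per key.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs from each input record.
    # Each record is processed independently, so this phase can
    # be spread across many machines.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key; each key is
    # likewise independent and parallelizable.
    return {key: sum(values) for key, values in grouped.items()}

# Illustrative input standing in for weblog records
logs = ["error timeout", "error retry", "ok"]
counts = reduce_phase(shuffle(map_phase(logs)))
```

In a real Hadoop cluster, the framework handles the distribution, shuffle, and fault tolerance; the developer supplies only the map and reduce functions.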

Tooling is still rudimentary, requiring developers to work mostly at the command line; for instance, tooling that automates the generation of MapReduce programs is only starting to emerge. Compounding the challenge, Apache Hadoop projects are proliferating; BigTop, a new effort to standardize APIs between modules, has only just been formed.

The Hadoop stack is still being defined

The definition of the core Hadoop stack is still a matter of contention. While there are many Apache projects covering different aspects of Hadoop (from the file system to data serialization, data-mining libraries, and job management), vendors such as IBM and EMC/Greenplum are developing their own alternatives to the file system, a critical building block of the stack. Admittedly, the debate over replacing open source components, or adding proprietary extensions to them, is common to any emerging open source technology; there are Red Hat extensions to Linux, for instance, that are now accepted as mainstream. In Hadoop's case, the technologies in question are so new that the market has not yet had a chance to deliver its judgment on where the dividing line between open source and proprietary technology lies.

The outcome of this process of defining the core Hadoop stack will be critical for several reasons if a market is to develop. First, enterprises must understand what the core platform is so they can compare vendor offerings; second, the core platform must be defined so that third-party tools and applications can run against it.

Ovum expects that a de facto standard Hadoop core platform will materialize within 12–18 months. The pressure to draw third-party support will force an informal vendor consensus on what components constitute a minimum Hadoop implementation.

The future is not purely Hadoop

Hadoop will not replace SQL data warehousing; instead it will complement it by extending the bounds of analytics. The conventional wisdom today is that Hadoop is used for exploratory analysis, while SQL data warehouses are used for reports or analytics that are run repeatedly. As enterprises gain experience with Hadoop, they will come to regard this view as oversimplified; in the long run, Hadoop will provide the ability to perform new kinds of analytics on new forms of data.

Additionally, Hadoop will not be the last word in NoSQL data stores. Alternatives such as Cassandra, MongoDB, Couchbase, and others are starting to emerge. Admittedly, Hadoop has a significant head start in timing and venture funding. However, these other data stores have different strengths, especially for processing documents; by this time next year, first-generation products will be available for evaluation.

Start piloting now, if you can find the talent

The progression of Hadoop is much like that of Java or C# a decade ago. The platform is a moving target and skills are scarce, but the success stories are impressive. Ultimately, enterprises that developed Internet applications had to acquire the talent; as talent became available, so did tooling that made the new technologies easier to use.

Enterprises that can find value in analyzing data types such as weblogs, call detail records, point-of-sale transactions, sensor data, rich media, or content coming from messages or social networks should be proactive about piloting. If your organization cannot easily find Hadoop/MapReduce programmers, it may prove worthwhile to budget for training a select few.

However, early experience with Hadoop could prove invaluable, if for no other reason than to begin understanding how to draw value from the mass of data that has traditionally been off limits to SQL-based data warehouses.



All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher, Ovum (an Informa business).

The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.