Cloudera plots its path forward with enterprise data hub strategy

OVUM VIEW

Summary

Two years ago, we wondered what Hadoop would be when it grew up. Back then, the question was which components defined the Hadoop stack. Today, the question is more about Hadoop’s role in the enterprise analytics ecosystem. At the recent Strata/Hadoop World conference in New York, Cloudera announced a new enterprise data hub strategy. More than an update of the enterprise data warehouse, the hub becomes the logical point where data lands and analytics are managed. Cloudera backs its claim with recent enhancements supporting new workloads, along with the beginnings of some key data security and platform backup capabilities. But many blanks remain to be filled, in data security for example, before Cloudera and the underlying Hadoop platform can fully support this strategy. Nonetheless, it is a logical path for a company seeking to transcend commodity product markets, although at some risk to its partnerships. Cloudera’s next step is to elaborate more clearly what the hub means and what elements it will deliver.

Plotting its own Hadoop path

The release of Apache Hadoop 2.0 has accentuated the differences between Cloudera and its biggest Hadoop rivals (Hortonworks and MapR). The differences are not simply about technology, but about the role that their platform implementations will play in an enterprise’s analytic computing environment.

Hortonworks’ pure Apache open source strategy is positioned as highly partner-friendly and commodity-oriented. By contrast, MapR is competing to become the high-performance Big Data warehouse, which will ultimately place it in competition with SQL incumbents. Cloudera’s enterprise data hub strategy could similarly place it on a collision course with SQL incumbents.

Besides adding redundancy to the NameNode, the highlight of the Apache Hadoop 2.0 release is the new YARN framework for allocating resources to different workloads. Developed in response to the need for Hadoop to diversify beyond MapReduce, YARN marks a significant milestone in Hadoop’s evolution into a more versatile platform. With the initial release, YARN allocates resources to different workloads, but does not have full dynamic resource management capabilities.

Hortonworks, which led development of YARN, will apply the framework universally across all workloads running on its platform. By contrast, Cloudera’s support is measured; it contends that YARN is not suited for running continuous workloads, such as search or streaming, and is best suited for job-like workloads (which have beginnings and ends) like MapReduce and Impala (interactive SQL). But even with Impala, Cloudera will implement supplemental resource management.

Cloudera’s YARN stance typifies its positioning, which is that Hadoop is an enabling technology that needs other pieces to play a central role in managing Big Data. It is the latest manifestation of “embrace and extend” strategies historically associated with vendors such as Microsoft, with the difference that this time, it is about open source technology. Cloudera supports the core Apache platform, but does not necessarily use all of the Apache Hadoop components for every function of Hadoop. Its strategy dates back to Cloudera Manager, which picked up where the open source project left off (and well before Hortonworks’ championing of the Ambari open source APIs for management tools to reach into Hadoop’s kernel).

Looking beyond Hadoop

While Cloudera’s core platform is based on Hadoop, it doesn’t simply view itself as competing with other Hadoop pure-plays. That’s where the enterprise data hub positioning comes in.

Cloudera maintains that as the Hadoop platform grows more robust, it provides a more economical and scalable environment for landing data and managing analytics. Cloudera doesn’t intend to replace SQL EDWs, but intends to claim more (and new) workloads. With SQL EDWs extending their analytics management umbrellas to Hadoop, Cloudera may as well plant its stake to avoid being relegated to a low-margin commodity product.

Cloudera’s goals of becoming an enterprise data hub are fully consistent with Ovum’s vision that Big Data – and Hadoop – must become first-class citizens with IT, the data center, and the enterprise. That means that Hadoop must be integrated into, rather than operated apart from, the IT data platform environment.

Cloudera’s highly ambitious goals

Given the status of the Hadoop platform, it’s fair game to ask whether Cloudera has jumped the gun. It has taken early steps toward enlarging its own Hadoop footprint with Sentry, an open source project for applying role-based authorization for creating Hive metadata tables; Navigator, which provides the beginnings of a data lineage and auditing capability; and new support for UDFs (user-defined functions), which could eventually close the gap with players like Teradata or SAS for in-database analytics. But these are only first steps; SQL still has a huge edge in security, performance, and in-database analytics.

Cloudera will have plenty of competition here. SQL DW players, like Teradata, Oracle, Microsoft, SAP, and IBM, have similar aims, as do BI tool players like SAS, QlikTech, Tableau, Pentaho, and others. Cloudera’s strategy must, of necessity, rest on lower prices than SQL incumbents, but with enough value-add that it avoids being relegated to commodity status.

Not a reincarnation of the “galactic” enterprise data warehouse

There is nothing carved in stone that requires Hadoop to retain all data. For instance, streaming data that is not necessarily persisted (in Hadoop, or anywhere else) will become increasingly prominent as applications (especially with the Internet of Things) proliferate. And in many cases, company policy and regulatory compliance may require purging of older data.

Ramifications for Cloudera’s partners

This is the trickiest issue. Cloudera faces some fancy footwork to avoid alienating partners. While Hortonworks has cultivated a reputation as an OEM partner (thanks to high-profile relationships with Microsoft and Teradata), Cloudera has similar arrangements to protect with IBM and Oracle.

Cloudera’s primary threat to EDW incumbents lies in soaking up new data sources and analytic workloads, as the Hadoop platform will prove more economical once the skills base and tooling catch up with the SQL world. Its ace in the hole should be its role as a nonthreatening partner to BI players, focusing on data and server management and leaving the analytics choreography and federation to them.

APPENDIX

Author

Tony Baer, Principal Analyst, Software – Information Management

tony.baer@ovum.com

Disclaimer

All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the publisher, Ovum (an Informa business).

The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.