Cloudera maps SQL into Hadoop

OVUM VIEW

Summary

Ovum has long believed that if Big Data implementation is to attain mainstream enterprise adoption, platforms, practices, and skillsets must align with what is already present in IT and the data center. For Hadoop, one of the hurdles is making the platform accessible to SQL developers. Although hardly the first to attempt to bridge the Hadoop and SQL worlds, Cloudera has developed a framework that will enable SQL to work natively inside Hadoop, rather than simply extracting data from it. Impala, Cloudera’s new framework, is an alternative to MapReduce that works with Hive using SQL commands, not Java. Significantly, Impala is not part of the Hadoop project, making it all the more likely that the SQL face of Hadoop will become one of the layers where vendors differentiate from the core platform.

The right tool and platform for the job

Hadoop’s claim to fame is its sheer power and scale, but for enterprise use it remains a diamond in the rough. Tooling is crude and the platform is largely restricted to batch operations, usually through MapReduce, which is designed for scale rather than fast performance. Although HBase, Hadoop’s table store, is designed for interactivity, it is a rudimentary database with few, if any, optimizations. As a result, Hadoop is accessible either through programmatic approaches or through frameworks designed to automate the extraction of SQL views of data into the familiar setting of an SQL data warehouse. While Hadoop is already used heavily for fraud detection, search and web recommendation engines, and market basket analysis, adding BI access is the brass ring that would get it over the hump to the enterprise mainstream.

Ultimately, the laws of supply and demand will alleviate the shortage of Hadoop programmers. There are, however, many reasons why SQL access would provide the best of both worlds. Beyond BI on steroids, it could enable new forms of analytics that take advantage of the power, scale, and breadth of Hadoop data, and it would complement, not replace, existing enterprise BI and data warehouses. But Hadoop must first become SQL-developer-friendly.

Cloudera takes SQL inside Hadoop

Cloudera’s Impala framework enables realtime SQL querying of Hadoop and provides an alternative to MapReduce. While MapReduce uses programmatic commands (usually Java), Impala accepts only SQL, and it requires Hive to be present to provide the metadata layer against which queries can be targeted. Impala was designed for high performance, using massively parallel, shared-nothing approaches similar to those employed by Advanced SQL analytics platforms such as EMC Greenplum, IBM Netezza, HP Vertica, Teradata Aster, and others. Cloudera will also subsequently make available Enterprise RTQ, a subscription add-on upgrade that provides commercial support and management via Cloudera Manager. Cloudera claims that RTQ, running the Impala framework, will execute up to four times faster for data stored on HDFS, and up to 30 times faster for data stored in HBase.
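To make the contrast between programmatic, MapReduce-style processing and declarative SQL more concrete, the following minimal Python sketch computes the same per-region total both ways. It is illustrative only: the sample `sales` records and the `map_phase`, `reduce_phase`, `totals_programmatic`, and `totals_sql` helpers are hypothetical, and Python's built-in `sqlite3` module is used purely as a stand-in SQL engine. It is not Impala, Hive, or Hadoop code.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) records.
sales = [("east", 100), ("west", 250), ("east", 50), ("west", 25), ("north", 75)]


def map_phase(records):
    """Emit (key, value) pairs, as a hand-written map function would."""
    for region, amount in records:
        yield region, amount


def reduce_phase(pairs):
    """Group values by key and sum them, as a hand-written reduce function would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def totals_programmatic(records):
    """Programmatic style: the developer writes the map and reduce logic."""
    return reduce_phase(map_phase(records))


def totals_sql(records):
    """Declarative style: one SQL statement; sqlite3 is only a stand-in engine."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        ).fetchall()
        return dict(rows)
    finally:
        conn.close()


if __name__ == "__main__":
    print(totals_programmatic(sales))
    print(totals_sql(sales))
```

Both functions return the same totals. The difference is who writes the processing logic: in the first case the developer codes it by hand, while in the second the engine derives it from the SQL statement and the table metadata.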

Impala is still early technology, with the first generation now in public beta. It lacks columnar support, which could accelerate analytics performance even further, but this is expected to be added in a near-term update. More importantly, because it relies on Hive, it lacks a cost optimizer, which in the long run will be necessary for regular production use.
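As a rough illustration of why a cost optimizer matters in production, the Python sketch below estimates the cost of different join orders for the same three-table query and picks the cheapest. Everything in it is an assumption made for illustration: the table names, row counts, and selectivities are invented, and the cost model (the sum of estimated intermediate result sizes for a left-deep plan) is a deliberately simplified textbook model, not a description of how Impala, Hive, or any specific product works.

```python
from itertools import permutations

# Hypothetical row counts and join selectivities, for illustration only.
ROWS = {"orders": 1_000_000, "customers": 10_000, "regions": 10}
SELECTIVITY = {
    frozenset(("orders", "customers")): 1 / 10_000,
    frozenset(("customers", "regions")): 1 / 10,
    # No direct join predicate between these two: treated as a cross product.
    frozenset(("orders", "regions")): 1.0,
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    size = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Estimated size after joining `table` to everything joined so far.
        size *= ROWS[table]
        for prev in joined:
            size *= SELECTIVITY[frozenset((prev, table))]
        joined.append(table)
        cost += size
    return cost


def best_plan(tables):
    """Exhaustively pick the join order with the lowest estimated cost."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    tables = ("orders", "customers", "regions")
    for order in permutations(tables):
        print(order, plan_cost(order))
    print("cheapest:", best_plan(tables))
```

Under these assumed statistics, join orders that combine the two small tables first produce far smaller intermediate results than orders that begin with the cross product. An engine without a cost optimizer executes whichever order the query happens to specify.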

Cloudera is not the first to map SQL into Hadoop. Hadapt offers an approach that squeezes SQL tables inside standard HDFS 64-MByte file blocks, and it also claims high-performance interactive query support. For now, however, most other offerings keep Hadoop at arm’s length, emphasizing data integration connectivity between SQL platforms and Hadoop, with the conventional wisdom in the enterprise data warehouse (EDW) world relegating Hadoop to an exploration platform. Ovum doesn’t expect this situation to last.

Hadoop market enters the real world

Impala is key to Cloudera’s long-term repositioning as an analytics platform provider that will supplement but not replace SQL data warehouses. Cloudera’s trump card is its leadership in the commercial Hadoop distribution market, where it has a four-year head start on the rest of the pack. Cloudera has thrown down the gauntlet to rivals by staking its claim to making SQL a first-class citizen inside Hadoop. As analytics loads shift over time, with Hadoop playing a more prominent role in performing analytics, Ovum expects Cloudera’s rivals and “co-opetition” to follow suit. For instance, Teradata Aster could use SQL-H as the connective tissue. All of the other Advanced SQL players that already advertise connectivity with Hadoop are likely to start promoting features that address how analytics loads are shared in a hybrid, managed SQL/NoSQL environment.

Technically, Impala is not a fork of the Apache Hadoop project because it is being developed outside the Hadoop project. However, it signifies that going forward, much of the value-add will come from outside the “core” Apache stack, however that is defined (interpretations vary by vendor). For now, HCatalog, which is under the Apache umbrella but is only in incubation, is backed primarily by Hortonworks and Teradata.

For enterprises, choosing a Hadoop distribution will rest not simply on which layers of the stack are supported, but on the value-added layers and on the third-party vendor platform, solution, and tool integrations that are supported.

Hadoop is going to look a lot more like the SQL database market, which is rooted in common assumptions about database architecture, but where products differ in their dialect of SQL, in how the storage engine is used, and in how third-party tools and solutions play into that engine.

APPENDIX

Author

Tony Baer, Principal Analyst, Ovum Enterprise Solutions

tony.baer@ovum.com

Further reading

2013 Trends to Watch: Big Data, Ovum, October 2012

Disclaimer

All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the publisher, Ovum (an Informa business).

The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.