Pentaho expands Big Data coverage



Pentaho, an open-source BI vendor, recently unveiled a partnership with DataStax, a professional software and support organization for the Cassandra NoSQL datastore. Together, the vendors will offer a Cassandra-based analytics solution that allows Pentaho’s Kettle visual extract, transform, and load (ETL) tool to interface with Cassandra for easy bi-directional movement and analysis of NoSQL data. While data movement and transformation tasks are mundane in the SQL world, NoSQL data stores still require a lot of custom coding for simple tasks. Kettle’s release is well-timed and clearly targeted, and by the end of this year Ovum expects other BI vendors to start offering similar tooling.

Wasn’t Pentaho already supporting Big Data?

Pentaho was one of the earliest BI vendors to support Big Data analysis, and in October 2010 the company began to support Hadoop in Pentaho Kettle, a visual environment that helps end users graphically construct ETL jobs. At the time, the Big Data components of Kettle were available only to paying customers, not in the open-source version of Kettle. However, in early 2012 the company open-sourced all Big Data capabilities in Pentaho Kettle and added support for NoSQL datastores including HBase, Cassandra, and MongoDB. At the same time, Pentaho changed Kettle’s open-source license from LGPL to Apache. The intention behind the license change was to be fully compatible with the Apache license used by Hadoop and several of the leading NoSQL databases. The Apache license is incompatible with the LGPL license with respect to distributing a combined work, which makes Apache projects hesitant to embed or distribute any LGPL-licensed code.

Pentaho’s goal was to remove the barriers to Kettle’s adoption and to fuel the widespread adoption of Big Data technology, including Hadoop and NoSQL. Ovum believes that Kettle could well accelerate the “operationalization” of Big Data. As Big Data developers, analysts, and data scientists across companies gain experience with Kettle, they may choose it over other programmatic approaches. Pentaho’s endgame could then be to pitch the full Pentaho Business Analytics suite at these organizations.

Cassandra support expands Pentaho’s horizons

With its support for Cassandra, Pentaho has demonstrated its commitment to expanding Kettle to other NoSQL datastores. To date, the Hadoop ecosystem has overshadowed all other distributions in the NoSQL world. However, many less-hyped NoSQL data stores have evolved in parallel. One such popular distribution is Cassandra, which is targeted at live interactive retrieval of data from Hadoop, replacing HDFS as the file store. Cassandra is ideal for large data sets that change very often and require real-time or very low-latency lookups. Because Cassandra has a smaller development community than Hadoop, its ecosystem still lacks the basic tooling that Hadoop possesses. A tool that easily helps users move data from relational systems into Cassandra, or move data out of it, is a welcome addition to the stable.
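To illustrate the gap a visual tool like Kettle fills, the sketch below shows the kind of hand-written glue code that moving relational data into a Cassandra-style layout otherwise requires. It is illustrative only: an in-memory SQLite table stands in for the relational source, and a plain Python dict stands in for a Cassandra column family (a real pipeline would write through a Cassandra client instead).

```python
import sqlite3

# Toy relational source: orders stored one row per (customer, order).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_id TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("acme", "o1", 120.0),
    ("acme", "o2", 80.0),
    ("globex", "o9", 42.5),
])

# Reshape into a Cassandra-style wide row: one row per customer,
# with one column per order. A dict stands in for the real datastore.
column_family = {}
for customer, order_id, total in conn.execute(
        "SELECT customer, order_id, total FROM orders"):
    column_family.setdefault(customer, {})[order_id] = total

print(column_family)
# {'acme': {'o1': 120.0, 'o2': 80.0}, 'globex': {'o9': 42.5}}
```

Even this trivial extract-and-reshape step is bespoke code; a visual ETL tool turns it into a configured job that non-programmers can build and maintain.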

NoSQL as an ETL tool for Big Data is a valid use-case

While it is still too early to predict what use-cases will best justify investment in NoSQL databases, one of the most popular is to use NoSQL databases as an ETL tool for Big Data. Organizations are now exploring the option of using NoSQL databases alongside traditional ETL tools, each handling different kinds of data, and ultimately bringing a refined version back to the central data store for further manipulation and analysis.

Pentaho is certainly not the only vendor to provide tools for the ETL use-case. For users who are well versed in the Hadoop ecosystem, open-source projects such as Pig, Oozie, Hive, Flume, and Avro are good starting points. Most of these projects attempt to abstract the map/reduce steps and reduce the need for custom programming for simple ETL tasks.
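To see what these projects abstract away, consider the map/reduce shape of even a trivial aggregation. The pure-Python sketch below (illustrative, not actual Hadoop code) mimics the map, shuffle, and reduce phases that a one-line Hive or Pig query would otherwise generate for a word count:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group emitted values by key, as the framework would
    # do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values.
    return {key: sum(values) for key, values in groups.items()}

records = ["to be or not to be", "be quick"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'to': 2, 'be': 3, 'or': 1, 'not': 1, 'quick': 1}
```

Tools like Hive compile a declarative query down to exactly this kind of pipeline, which is why they lower the barrier for routine ETL work.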

Several BI and data-management vendors also provide tools for easy data integration from their native environments into Hadoop.

  • IBM BigInsights provides connectors for DB2, Netezza, IBM Smart Analytics System, and InfoSphere Warehouse.
  • Oracle provides four different adapters that help move data between any Hadoop instance and Oracle Database 11g and R.
  • SnapLogic offers its HDFS Snap technology for getting data in and out of Hadoop, using Pig, Hive, and Flume.
  • Informatica’s Informatica for Hadoop helps get data from Hadoop/HDFS into a variety of databases and analytics platforms.

Pentaho’s Hadoop approach is simple yet effective

Pentaho’s approach to Big Data differs from that of other BI vendors in the market. Unlike the mega-vendors, Pentaho does not have any ambitions to offer its own version of Hadoop or make a play for the database layer. Unlike the data-management vendors, Pentaho does not aim to offer pre-built connectors that take Hadoop data and write target-specific formats for a large number of analytics and database solutions.

Instead, Pentaho is aiming to get Kettle bundled with every download of Hadoop, Cassandra, MongoDB, and other NoSQL software. The vendor believes that users will inherently choose Kettle over programmatic approaches to Big Data ETL, which could make Pentaho the analytics solution of choice.



All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the publisher, Ovum (an Informa business).

The facts of this report are believed to be correct at the time of publication but cannot be guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such Ovum can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.