Informatica 9.5 platform natively engages Hadoop

OVUM VIEW

Summary

As predicted in the Ovum report “Informatica’s 2012 strategy”, Informatica is this year placing strong emphasis on extending its data integration to support Big Data, including Hadoop. While the new Informatica 9.5 platform adds a number of capabilities and enhancements across the board, Ovum believes that Hadoop integration is a major highlight. New capabilities include the ability to conduct data profiling on a mass scale, to replicate real-time data to Hadoop, and to integrate Ultra Messaging with the broader Informatica Platform. While other players such as Oracle and IBM have built connectors for exchanging Hadoop data with relational targets, Informatica is the first major data-integration provider to move some of its data-integration capabilities directly inside Hadoop. Ovum expects that Informatica’s move won’t go unanswered for long.

Integrating “big transaction data” is a major theme

Although the Informatica 9.5 platform is not solely about Hadoop integration, the four “V”s of Big Data (volume, variety, velocity, and value) are common themes of the release. Informatica uses the term “big transaction data” to emphasize that this release is intended not only for early adopters of Big Data, but also for the mainstream of its installed base (large enterprises) that are managing transactional data both on-premises and in the cloud. Examples of capabilities that address these themes include:

  • Data discovery. Informatica Data Explorer, which highlights anomalies in structured and unstructured information and detects dependencies, now detects business entities and automates the profiling of hundreds of sources at a time for enterprise-scale projects.
  • Natural-language-based probabilistic data parsing. Informatica Data Quality adds text-based entity-identification capabilities that are aimed at deciphering variably structured social media. On the horizon, Ovum expects that this technology could be extended to provide probabilistic data cleansing of text entities, an approach we predicted would emerge (see Ovum report “Data Quality and Big Data: From Discovery to Precision”).
  • Integration of the recently acquired WisdomForce technology for higher-performance change data capture (CDC) and continuous replication of current data (Informatica would call it “real-time” replication).
  • Data virtualization to reduce or eliminate the need to move data, which can be critical for data sets of tens or hundreds of terabytes and more.
  • Integration with Ultra Messaging. With this release, Informatica Ultra Messaging is integrated with the rest of the data platform, which allows organizations, typically capital markets trading firms, to selectively strip off key data from high-speed feeds and load it into data warehouses for historical analytics.
  • Data masking now has a unified console for managing policies for both persistent and dynamic data masking, and is also integrated with data discovery to detect sensitive data that might be hidden.

Going native with Hadoop

For now, with the exception of IBM and Oracle, many data platform providers are keeping Hadoop at arm’s length. They view Hadoop primarily as a staging area where some preliminary MapReduce sorts are performed, with result data sets loaded to Advanced SQL analytics platform targets where the real analytics on all kinds of data, including variably structured data, are performed. In Ovum’s 2012 Big Data Trends to Watch report, we predicted that Hadoop would very soon become a prime analytics platform in its own right. We predicted that 2012 would see a repeat of the script that played out in 1995-96, when modern data warehousing and BI emerged and tooling became available that made analytics and OLAP data stores easier for the mass of SQL developers to use.

Informatica has taken on the challenge. Following up on the release last year of HParser that provided one of the first commercial tools to make it easier to organize raw Hadoop data, the 9.5 platform provides additional core capabilities to make it easier for enterprises to adopt Hadoop.

Key enhancements in the 9.5 release include replication: Hadoop can be managed as a source or target by Informatica’s replication tools, including Informatica Data Replication, Fast Clone, and PowerExchange CDC. Hadoop is handled by the same engine that manages relational data stores, both on- and off-premises. As a result, Hadoop can be treated not only as a source, but also as a near-line data warehouse that complements higher-performance interactive or real-time data stores.

Informatica is also introducing additional pre-built transforms native to Hadoop, beyond HParser. These include identity resolution, profiling, matching, cleansing, natural-language processing, and data transformation, and will be complemented by a visual IDE (integrated development environment) that reduces or eliminates the need for knowledge of Hadoop’s Pig scripting language. Specifically, these transformations will work inside Hadoop with Hadoop data and automate generation of MapReduce, Hive, or Pig scripts to organize and manage data. These capabilities will be in beta in the July timeframe. Currently, Informatica is supporting the Cloudera, Hortonworks, and MapR Hadoop distributions.

Lowering the barrier of entry for Hadoop

Informatica has taken the lead in developing tooling that tames Hadoop for mainstream SQL developers. For now, on the open-source side, most of the tooling has been specialized and aimed at early adopters with Hadoop skills. The Informatica 9.5 platform extends tooling to bring integration of Hadoop and manipulation of its data into more familiar environments. On its own, however, the 9.5 platform does not eliminate the need for staff who understand how Hadoop distributes data, nor does it deploy and run large scale-out clusters (for those not using appliances or cloud services), and it does not close the knowledge gap in data science.

Informatica is clearly the first to start bringing Hadoop inside a familiar data-integration platform, and Ovum expects that by year-end IBM and Oracle will respond in kind.
