EMC brings NAS to Hadoop HDFS

Until now, the conventional wisdom for Big Data was that premium storage subsystems such as storage area networks (SANs) or network attached storage (NAS) would not be cost effective for the huge data sets that are involved. EMC aims to turn the equation on its head by adding native HDFS support to its Isilon NAS systems. This is part of a new wave of releases and rebranding for EMC Greenplum’s analytic database offerings that tie in Apache Hadoop. Additionally, the new release provides fresh proof that the Apache Hadoop stack is becoming not only the formal but the de facto standard Hadoop platform.

EMC refines Greenplum Big Data product branding and targeting

EMC has announced a reordering of its Greenplum products that bundle the NoSQL Hadoop data store. The offerings include Greenplum HD, which bundles the Apache Hadoop stack, and Greenplum MR, an edition aimed at what EMC terms “more advanced” users. Greenplum MR OEMs the MapR system, which adapts Hadoop by replacing the core Apache HDFS file system with a proprietary alternative. Both are also offered in appliance form as Greenplum DCA. This represents a rebranding and refinement of product targeting. Whereas the MR edition (previously called Greenplum HD Enterprise Edition) earlier appeared to be the mainstream offering, it is now being targeted at higher-end use cases.

EMC infuses core storage DNA into Big Data

Since EMC bought Greenplum and subsequently announced its Hadoop strategy, the question has been when the other shoe would finally drop: when would EMC finally bring its core storage DNA into the product? As part of the current announcement, EMC is also adding an option for Greenplum HD by offering native support of the Apache HDFS file system – but not yet the MapR alternative – within its Isilon NAS offering.

Big Data would seem like an ideal use case for premium storage as it involves lots of data. However, Big Data (especially the types of data that are collected in Hadoop) is often “low density” data where the value is not in individual “records” such as individual log files, but in the aggregate, where patterns can be discovered. Furthermore, HDFS was designed to store three replicas of the data, seemingly making premium storage unaffordable for such projects. Not surprisingly, Hadoop’s pioneers have promoted the Internet data center commodity computing model, with endless scale-out clusters of x86 servers and direct attached disks, and storage incumbents have until now been locked out of the market.

With the new release, EMC is making the case that if you adopt a sufficiently robust storage system, you do not need to make all those copies of the data that you normally would in HDFS. In effect, you might pay more for the storage per petabyte, but for only a third of the overall data volume.
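The arithmetic behind that claim can be sketched roughly as follows. This is an illustrative sketch only: the function names are hypothetical, the prices are placeholders rather than EMC or commodity list prices (EMC has not disclosed pricing), and it assumes the premium system holds a single logical copy while ignoring whatever protection overhead the array itself adds.

```python
def raw_capacity_pb(logical_pb: float, copies: int) -> float:
    """Raw storage needed to hold `logical_pb` of data kept in `copies` copies."""
    return logical_pb * copies


def storage_cost(logical_pb: float, copies: int, price_per_raw_pb: float) -> float:
    """Total storage spend for a data set; the price is a hypothetical placeholder."""
    return raw_capacity_pb(logical_pb, copies) * price_per_raw_pb


def break_even_price_ratio(commodity_copies: int = 3, premium_copies: int = 1) -> float:
    """How many times more per raw petabyte premium storage can cost
    before it becomes more expensive than replicated commodity storage."""
    return commodity_copies / premium_copies


if __name__ == "__main__":
    logical = 2.0  # petabytes of actual data (made-up figure)
    print(raw_capacity_pb(logical, 3))   # raw capacity with three HDFS replicas
    print(raw_capacity_pb(logical, 1))   # raw capacity with a single copy
    print(break_even_price_ratio())      # premium price multiple at break-even
```

On storage spend alone, and under these assumptions, the premium option comes out ahead only if its price per raw petabyte is less than three times the commodity price; server, operations, and licensing costs are outside this sketch.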

EMC also makes the case that storage systems like Isilon not only cut the level of complexity in setting up and operating clustered NAS systems, but that they can extend similar levels of simplicity to storing Hadoop data. The question is whether this is cost effective; as is customary in the storage sector, EMC is not disclosing pricing. The problem, however, is that rival Oracle has published pricing for its Big Data Appliance (which has its own direct storage). EMC is now competing outside its traditional storage niche and needs to play by those rules. Given that it is introducing a new architecture that at first glance appears more expensive, the onus is on EMC Greenplum to make the case that its approach ultimately delivers greater business value, lower total cost of ownership (TCO), or both.

EMC’s offerings broaden the spectrum of Hadoop options

At first glance, offerings from EMC and Oracle would appear to promote the choice of scale-up versus scale-out architecture for Big Data, and Hadoop in particular. While scale-out implies the ability to expand Hadoop clusters one commodity server (and attached disk) at a time, EMC’s and Oracle’s offerings are more scale-up in that they promote self-contained optimized systems where scaling requires the addition of larger increments. Although EMC promotes Isilon as a scale-out storage architecture, the package may also include Greenplum DCA appliances, which come in four-node increments. Those increments are still larger than adding a commodity server or two, and definitely more expensive.

In reality, the choice of Hadoop deployment architecture is not a black-and-white matter of scale-up versus scale-out. Instead, it is a spectrum that encompasses cloud service providers that prepackage Hadoop clusters, SaaS providers that package Big Data applications, systems integrators or outsourcers that set up and maintain scale-out clusters, and appliances that provide the optimized, engineered scale-up alternative. There is not going to be any single answer or architecture that will apply to all organizations.

EMC’s move bolsters momentum for core Apache Hadoop stack

At first glance, it appeared that EMC announced a reversal of strategy. When it introduced its first commercial Hadoop offerings, it appeared that the MapR edition was going to be EMC’s primary focus as it was more developed as a packaged product. Reflecting that, it was branded Greenplum HD Enterprise Edition, while the open source alternative was branded Community Edition.

EMC has rebranded and clarified its targeting. The former Community Edition is now Greenplum HD, and is intended as EMC’s mainstream offering. It is the one that has received the Isilon support. By contrast, the former Greenplum HD Enterprise Edition is rechristened Greenplum MR, and is targeted at more advanced customers. With Oracle’s recent OEMing of Cloudera’s Apache open source Hadoop distribution as part of the Oracle Big Data Appliance, the trend becomes unmistakable. The Apache Hadoop stack with HDFS is becoming the de facto standard stack, with forks (such as the MapR or IBM GPFS file systems) becoming the outliers. Other forks will still occur, but they will not be at the basic file system level.

This has major implications for growing a third-party ecosystem, which now has a recognizable common target to write against.