Steve's Data Science & Big Analytics Blog: March 2014

Tuesday, March 25, 2014

Apache Spark and the Future of MapReduce

The days of using the MapReduce framework for big data processing may be numbered.

Spark, an in-memory processing framework designed to work with the Hadoop Distributed File System (HDFS), has now become a top-level Apache project. This is great news for Spark, as it should give the project some stability while it continues to grow and gain popularity among Hadoop users.

As this article points out, Spark has many advantages over MapReduce. It is much faster than MapReduce for most applications because it can keep working data in memory instead of writing intermediate results to disk between steps. It is also easier to program. Perhaps most interesting, it is well suited to emerging big data applications, including machine learning and real-time processing.
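To make the "easier to program" point a little more concrete, here is a minimal sketch in plain Python. It uses neither Spark nor Hadoop. The toy `lines` data and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are my own illustrative assumptions. It only contrasts the shape of the two programming styles on a word count: explicit map, shuffle, and reduce phases versus a single chained expression over data held in memory.

```python
from collections import Counter, defaultdict

# Toy input (hypothetical data, standing in for lines of a file on HDFS).
lines = ["spark and hadoop", "hadoop and mapreduce"]

# MapReduce style: the job is split into explicit phases. In a real
# Hadoop job, intermediate results between phases are written to disk.
def map_phase(lines):
    # Emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the grouped values for each word.
    return {key: sum(values) for key, values in groups.items()}

mapreduce_counts = reduce_phase(shuffle_phase(map_phase(lines)))

# Chained in-memory style: the same word count as one expression,
# which is closer in spirit to how Spark programs tend to read.
chained_counts = dict(Counter(word for line in lines for word in line.split()))

print(mapreduce_counts == chained_counts)
```

Both versions produce the same counts. The difference is how much scaffolding the programmer has to write, and whether the data has to leave memory between steps.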

While Spark is a fascinating project with great prospects, MapReduce still has some advantages as the dominant Hadoop programming model. Spark still cannot do everything that MapReduce can, and MapReduce may be better at handling batch processing applications.

Tuesday, March 18, 2014

Amazon and IBM vs. Open Source Hadoop

Forrester Research: Hadoop vendor chart

This interesting editorial from ReadWrite comments on the current state of Apache Hadoop, an open source software framework for storing and analyzing large data sets. Hadoop has filled a niche in enterprise-grade data analysis, and has more or less become a "cornerstone of any flexible future data management platform". It is attractive to businesses because of its low cost, high efficiency, and open source nature.

A report from Forrester Research on Hadoop vendors shows bigger companies such as IBM and Amazon as having the long-term strategic advantage. This is presumably because they have more resources to research, develop, produce, and sell Hadoop-related software and services.

As the author of this editorial points out, these large companies have taken advantage of the Hadoop trend without contributing to the development of the framework. As a result, the ones who actually have a strategic edge are the developers, because they shape the direction the project takes. The companies that contribute to the collaborative project will ultimately gain the strategic edge as vendors because they will be able to influence the future of Hadoop.