Steve's Data Science & Big Analytics Blog

Tuesday, May 6, 2014

The White House Tackles Big Data

A report that was officially commissioned by the President seeks to examine the opportunities and challenges posed by current and future big data technologies. This new report details how different groups like the government, businesses, and private citizens use big data technology and how they will eventually be affected by their widespread use.

This article mentions that big data has the potential to change our lives for the better, but privacy issues remain a significant danger. To address these concerns, the report concludes with six recommendations: The creation of a consumer privacy bill for data collection, new national standards on preventing and reporting data breaches, extend privacy protections to non-U.S. citizens, ensure data collected in schools is only used to increase educational outcomes, update the Electronic Communications Privacy Act to reflect the new technologies of cloud computing and mobile data, and to prevent big data for being used for discriminatory purposes.

Tuesday, April 22, 2014

The Senate Passes the DATA Act

The Digital Accountability and Transparency (DATA) act, which was introduced to the Senate last year, was unanimously passed on April 10th. The act is an overhaul of the previous information reporting system used by various government agencies, which was paper-based and not always easily accessible by the public. The DATA act sets government-wide standards for data reporting, and will require all agencies to publish information online at USASpending.gov. This fully searchable, public website will make government spending more transparent to everyone, and even offers visualization tools that display government spending trends.

Source: USASpending.gov

This bill still needs to be passed by the House of Representatives and signed by the President, but it is expected to be approved with no issues.

Once implemented, the website will become a resource for businesses, the government, and private citizens, as spending data will be freely available to download. As this article mentions, big data analytics can be used on these data sets to generate insights into government spending patterns. Opportunities for such uses are numerous, and could include ways to find and address waste in federal spending. Perhaps this could be the platform to react to the growing national debt problem.

Tuesday, April 15, 2014

Data Visualization From the U.S. Census Bureau

Data visualization tools are graphical and sometimes even interactive portrayals of data sets. Due to their ease of use and visual nature, demand for visualization tools is rising and are considered by many businesses to be an integral part of big data analysis. Last December, the Census Bureau released its own data visualization tool, the Census Explorer. This interactive online map shows various demographic information based on their data sets, including median household income, educational level, and home ownership rate.

A home ownership map of Washington D.C., based off of 2012 survey data (census.gov)

As this blog post from the Census Bureau points out, you can explore data at the state, county, and even the neighborhood level. You can also manipulate the data sets you are able to see to compare data across time periods, although the oldest data available is from 1990.

Tuesday, April 8, 2014

Oracle's NoSQL Opportunity

Oracle has long been known for its mySQL relational database management system (RDBMS). A recent trend in the world of database management is NoSQL (not only SQL), which is a database that can handle unstructured data. NoSQL also describes any database that does not use the relational model to store information. This article from Forbes mentions that Oracle is missing out on a huge opportunity with the growing popularity of NoSQL. It is important to note that Oracle does indeed have a NoSQL product, the Oracle NoSQL Database, but their market presence is limited compared to their competitors.

Oracle: 4% market share in 2012 (wikibon.org)

NoSQL is beginning to gain traction with enterprise customers, and the market is quickly growing. Most successful NoSQL projects like MongoDB are open source and have a limited reach to businesses. Oracle is a huge company and is already has an established presence is the business world. They have preexisting relationships with millions of traditional Oracle SQL users. If Oracle was to put more resources to developing a better NoSQL product, or even acquire a few startups that have a better product, then they are in a good position to gain market share and be at the forefront of the NoSQL trend.

Tuesday, March 25, 2014

Apache Spark and the Future of MapReduce

The days of using the MapReduce framework for big data processing may be numbered.

Spark, an in-memory framework designed to work with the Hadoop Distributed File System (HDFS), has now become an official Apache project. This is great news for Spark, as it ensures that the project will gain some stability as it continues to grow and popularize among users of Hadoop.

As this article points out, Spark has many advantages over MapReduce. It is much faster than MapReduce for most applications because it is in-memory. It is also relatively easier to program. Perhaps most interesting is that it is also primed to handle future big data applications - including machine learning and real time processing.

While Spark is a fascinating project with great prospects, MapReduce still has some advantages as the dominant Hadoop programming model. Spark still cannot do everything that MapReduce can, and MapReduce may be better at handling batch processing applications.

Tuesday, March 18, 2014

Amazon and IBM vs. Open Source Hadoop

Forrester Research: Hadoop vendor chart

This interesting editorial from ReadWrite comments on the current state of Apache's Hadoop, which is an open source software framework for analyzing and storing large data sets. Hadoop has filled a niche in enterprise grade data analysis, and has more or less become a "cornerstone of any flexible future data management platform". It is attractive to businesses because of its low cost, high efficiency, and open source nature.

A report from Forrester Research on Hadoop vendors shows the bigger companies like IBM and Amazon as having the long term strategic advantage. This is presumably because they have more resources to research, develop, produce, and sell Hadoop related software and services.

As the author of this editorial points out, these large companies have taken advantage of the Hadoop trend without contributing to the development of the framework. As a result, the ones who actually have a strategic edge is the developers because they shape the direction that the project will go in. The companies that contribute to the collaborative project will ultimately gain the strategic edge as vendors because they will be able to influence the future of Hadoop.

Tuesday, February 25, 2014

Machine Learning Algorithm Enables Customer Support of the Future

Mobile intelligence company Carrier IQ has announced a machine learning algorithm that will potentially revolutionize customer support for your mobile device. Carrier IQ, which sells support applications to mobile network operators, is using machine learning and Big Data analysis to automatically detect common customer problems. The company's 'IQ care' software suite mines a large amount of device data, and uses that Big Data to look for trends that are not obvious from simple statistical analysis. The information gathered from this process is used to automatically diagnose individual device problems, and a cell provider is able to support 50 million reporting devices simultaneously using this software. The two main things this software looks for is to differentiate network problems from device problems, and to find resource draining applications on your phone that will negatively impact user experiences. The ultimate goals for cell phone carriers that use this technology is to increase customer satisfaction, stop unnecessary device returns, and to allow customers to easily troubleshoot their own devices.

Tuesday, February 18, 2014

Amazon Offers On-Demand Big Data Analysis

In a promising sign for the future of Big Data Analytics, Internet giant Amazon is now offering a hosted version of the R programming language.

R is a programming language that is mainly used for statistical analysis. Although it has been around since the 1990s, it has recently become popular for its use in Big Data mining and analysis. Amazon is offering 'Revolution R Enterprise 7' (RRE) on its cloud service, which is a commercialized version of the R language developed by Revolution Analytics. RRE will run on either Windows or Linux platforms.

While R itself is free and open source, Revolution Analytics claims that its service offers "higher performance and higher scalability for dealing with large data sets". There is also a version of RRE that will run on multiple processors simultaneously. However, this service doesn't come cheap - the base rate for RRE starts at 1.25 per hour per core.

Personally, it is difficult to see the added value of this service. While there is a certain level of convenience by using this on demand service, not to mention the knowledgeable tech support that Revolution provides, the appeal for this product is limited at best. Revolution themselves acknowledges this by saying that it would be more cost effective to buy their product outright for any data set exceeding 1TB.

In essence, this service offers a convenient way for more people to get in on the Big Data trend and test out the R programming language, but faster and more cost effective alternatives will likely arise with the increasing popularity of the R language and Big Data Analytics in general.

Wednesday, February 12, 2014

The Booming Big Data Market

Big Data has been a fast emerging trend in technology, and a new report shows that it has started to take off in the business world – and in a big way. The market for Big Data Analytics climbed to 18.6 Billion in 2013, a 58 percent increase in spending over 2012. Market forecasts show that the spending spree won’t slow down in the near future, with Big Data services projected to rake in 50 Billion USD by 2017.

But why are companies suddenly willing to invest in Big Data? As the report mentions, the rapid maturing of this new field has boosted confidence in Big Data products and services among all kinds of businesses. These products have also become increasingly more secure, and have allowed for better privacy capabilities. The widening availability and feasibility of using the Apache Hadoop framework for Big Data processing has also helped improve the popularity of Big Data services. As the Big Data market continues to mature, more businesses will utilize Big Data services as more polished applications emerge, and as security and data privacy issues are addressed.

Tuesday, February 4, 2014

Sentiment Analysis, Social Media, & the Super Bowl

On the day of the 10 year anniversary of Facebook, most of us recognize how social media has revolutionized the way we communicate online.

Social networking has been proven to be popular with the public at large over the years, with no signs of the trend slowing down. Consequently, businesses have adopted social media as a way to keep in touch with their customers. More recently, the corporate world has begun to harness the power of social media differently - through gathering and analyzing the large quantities of opinionated data available on social media sites concerning their company or products.

So how do they do this? Companies can use software tools that gather and analyze data on the web using a process called sentiment analysis. Sentiment analysis, also known as opinion mining, is a process that looks at large volumes of text in order to gather specific information, called sentiments. Sentiments are defined as personal opinions or feelings that an author has concerning a particular subject. Ultimately, the goal of sentiment analysis is to derive the writer’s attitudes, opinions, and conveyed emotions in a particular piece of writing.

A good reputation is critical for any company or brand, and the viral nature of social media communcations makes online repuation management even more critical. In this week's Super Bowl, many of the advertisers took to the web to monitor reactions to the commercial while they were happening. Volkswagen set up their own social media "war room" to monitor major trending topics Twitter during the Super Bowl. They then reacted to the trends to create related tweets and videos on the spot. This reinforced connection with the customers, while at the same time effectively managing their brand on social media.

Social Sentiment: Are You Listening?

Sentiment analysis is a field in computer science that seeks to derive opinions from text based data. In this video, IBM's Jonathan Taplin explains sentiment analysis and its relation to social media and 'big data'.