Apache Spark's success: Overhyped or preordained?
After an absence of about a year, and a stint as Research Director at the now-defunct Gigaom
Research, I've returned to ZDNet to cover Big Data. The year went by pretty quickly, but a number
of things have changed:
- SQL-on-Hadoop has become ubiquitous, to the point that almost every Hadoop and relational
  database vendor has its own solution.
- Industry consolidation has begun. Companies like Jaspersoft, Pentaho, Hadapt, RainStor and
  Revolution Analytics have been acquired or soon will be.
- YARN and Hadoop 2.x now have all the mindshare, and old-school MapReduce is in retreat.
But one change that has become especially noteworthy is the degree to which Apache Spark has
captured the attention and excitement of the industry.
Spark can run independently of Hadoop or as a YARN application on a Hadoop cluster. In the latter
configuration it can read data in the Hadoop Distributed File System (HDFS) and can then enable a
range of workloads to be carried out on that data. Spark SQL enables a HiveQL-compatible SQL
execution environment; Spark's MLlib enables machine learning; Spark Streaming provides for
high-speed stream processing of data; and GraphX provides for graph processing.
See Spark run
In addition to the familiarity that Spark SQL provides, Spark code can be written in Scala, Java and
Python. Spark can (but does not have to) process data in memory, distributed across the RAM of
its cluster's nodes. Getting a sample application running in Spark is fairly
straightforward. That, combined with its memory-based, non-batch processing capabilities, provides
interactive experimentation and near-instant gratification - something that has not been the norm in
the Hadoop world.
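To make the idea of distributed, in-memory processing a little more concrete, here is a minimal sketch in plain Python. It is not Spark code and uses no Spark API; the helper names (partition, count_words, word_count) and the sample lines are made up for illustration. It only mimics the pattern: data is split into partitions held in memory, each partition is processed on its own, and the partial results are merged.

```python
from collections import Counter
from functools import reduce


def partition(records, num_partitions):
    """Split records round-robin into in-memory partitions.

    Each partition stands in for the slice of data one cluster node
    would hold in RAM (an illustrative assumption, not Spark's scheme).
    """
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(lines):
    """Per-partition step: count the words in the lines held by one 'node'."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(records, num_partitions=3):
    """Count words per partition, then merge the partial counts."""
    partitions = partition(records, num_partitions)
    # On a real cluster these per-partition counts would run in parallel;
    # here they simply run one after another.
    partials = [count_words(p) for p in partitions]
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical sample input.
lines = [
    "spark runs on yarn",
    "spark reads hdfs",
    "yarn runs on hadoop",
]
totals = word_count(lines)
print(totals)
```

In PySpark, the same word count is usually expressed with the RDD operations flatMap, map and reduceByKey, with the engine rather than the program deciding how the data is partitioned across nodes.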
That relatively friction-free experience, even if at the command line, can be intoxicating. And
intoxicated the industry is. While Spark is still quite new and several people have reported to me
that it's not ready for prime time, industry support for Spark is intense. Cloudera has promised to
re-platform most Hadoop ecosystem components in its distribution onto Spark. MapR includes Spark in
its distro and Hortonworks, once a Spark holdout, has jumped on the bandwagon as well, including
Spark in HDP (Hortonworks Data Platform), its own Hadoop distribution.
Getting started is easy
While neither Amazon's Elastic MapReduce nor Microsoft's Azure HDInsight cloud Hadoop services
include Spark automatically, both companies have enabled installation of Spark via custom script
steps that simply require specifying a URL when a cluster is created. Both companies also provide
samples and tutorials that make it easy to run quick-and-dirty Scala code or SQL queries.
And if none of that works for you, then Databricks, the company founded by Spark's creators, has its
Databricks Cloud offering (something you might wish to call Spark as a Service, if that didn't
overload an already well-worn acronym) in the wings.
Some companies, like Paxata and ClearStory Data, have built their products on Spark. Others, like
Platfora, have deployed new product capabilities that have dependencies on, and certain
integrations with, the Apache Software Foundation project. Adoption of Spark in the enterprise may
be low so far, but industry adoption is formidable.
So what happens next with Spark? Some in the industry have predicted that Spark's popularity and
its ability to run without Hadoop mean it may overtake Hadoop. Others, myself included, are more
skeptical of that, given that HDFS alone has become enough of a standard to keep Hadoop
entrenched, and YARN allows challengers to run as applications on the cluster.
Irrational exuberance?
In general, vendors seem so far ahead of customers on Spark that it's almost worrisome. If Spark
isn't yet stable and robust enough for big enterprise production jobs, if even the companies that
have standardized on Spark say they have had to write their own enhancements to make it work for
them (something I have been told by important vendors in the Big Data space), then is Spark just
hype?
Readiness is in the eye of the beholder. Robin Bloor of Bloor Research, a well-respected industry
analyst firm, once told me this (and I'm paraphrasing): when platforms get beyond a certain critical
mass of support, they eventually become what the hype has made them out to be. In other words,
belief in the quality of a platform tends to self-fulfill. Once the industry commits to something, it
creates an imperative around getting it stable and well-performing, even if the committers
themselves have to pitch in.
We're now a bit more than three months into the year; I saw my first Mr. Softee truck yesterday, a
sure sign that Spring has finally come to New York. Before the big Christmas tree goes up in
Rockefeller Center at the end of the year, Spark seems likely to achieve at least some of its own self-
fulfilling maturation and reliability. There are a bunch of shopping days to go in the interim; let's wait
and see the outcome.
http://zdnet.com.feedsportal.com/c/35462/f/675847/s/4524d71f/sc/15/l/0L0Szdnet0N0Carticle0Cspar
ks0Esuccess0Eoverhyped0Eor0Epreordained0C0Tftag0FRSSbaffb68/story01.htm