Industry’s First and Only Scale-Out Storage Solution
with Native Hadoop Integration
Accelerating the Benefits
of Hadoop for the
Enterprise
Reducing Risk
End-to-End Data Protection
Organizational
Knowledge/Experience
Here’s what we’re going to cover in today’s session. <Walk through the agenda.>
To start things off today, let’s look at “The Big Data Opportunity”
<This slide gives you the opportunity to tell the audience that we are in the era of big data.> I’m sure you’ve seen some of the articles in the press about “Big Data”. It seems as if everyone is talking about it. Some of you are probably living it today. There’s lots of interest in it, but many aren’t exactly sure what they should be doing about it. Big Data has been recognized the world over for the potential impact it can have. Gartner has said that enterprises that embrace Big Data will outperform their peers financially by 20%. <click>
Make no mistake about it: the era of Big Data is here.
Over the next decade, the explosion of data will introduce not only massive challenges for IT, but massive opportunities for business. In fact, we’ve seen a number of our customers use Big Data to transform their business. Let’s look at just a few examples.
Healthcare: Hospitals are implementing EMR (electronic medical records) and enabling access to larger volumes of historical patient data. Big Data analytics infrastructures enable doctors and hospitals to leverage this EMR data to find patterns in the success of various treatments for patients with a variety of characteristics. Through the ability to store and analyze massive volumes of patient data, doctors are discovering more effective treatment options targeted at the specific characteristics of their patients.

Financial services: Banks and investment institutions have always been focused on the use of data in all of their operations. Now Big Data brings the ability to run predictive analytics, enabling these organizations to determine how their balance sheets can be affected by a variety of different market forces. For example, if the Euro drops 20%, how will that affect the bank’s balance sheet and its ability to borrow or lend money?

Utilities: The implementation of Advanced Metering Infrastructure is generating massive amounts of data on the distribution and consumption of energy by commercial institutions and businesses. Utility companies can leverage these new forms of data to predict service failures and more quickly detect energy theft.
Now let’s look at Hadoop and its role in Big Data analytics.
To harness the full power of Big Data assets, Big Data analytics are increasingly important. With Big Data analytics, organizations can leverage their Big Data assets to uncover new, emerging trends and identify potential business opportunities. With these powerful tools, businesses can tap into their Big Data assets and potentially discover new ways to gain competitive advantage. In sum, these technologies help organizations become more agile, identify opportunities, and respond faster.

Recent technology trends, including the growth of the Internet, have generated an immense and growing wave of Big Data that will require your Big Data storage and analytics platforms to scale significantly to handle the volume, velocity, and variety of this data. To underscore this, IDC recently projected that the amount of data managed by enterprises will increase 50x by 2020, and that 80% or more of this data will be unstructured, file-based data. With this as the backdrop, let’s look at the emergence of Hadoop.
Hadoop was developed 5-6 years ago specifically to address the need for Big Data analytics. At the time, Hadoop development was being driven by the big Internet companies like Yahoo! and Google, which were amassing huge amounts of unstructured data and needed a new way to analyze it, because traditional approaches couldn’t handle this new Big Data challenge. The development of Hadoop was pioneered by Doug Cutting, a former Yahoo! engineer.

Hadoop consists of two key elements: the Hadoop Distributed File System (HDFS), which handles the storage component of the system, and MapReduce, which handles the compute function.

Today, Hadoop is an open-source initiative, very similar to Linux, backed by a large open-source development community that collaborates on Apache Hadoop. As with Linux, there are a number of approved or authorized Apache Hadoop distributions, including EMC Greenplum’s Greenplum HD. <You may also want to note that Hadoop got its name from Doug Cutting’s son’s toy elephant. This also explains the elephant that is often depicted on materials relating to Apache Hadoop.> Now let’s look at why Hadoop is so important.
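<For a technical audience, the division of labor between HDFS and MapReduce can be made concrete. This is a purely illustrative sketch in Python of the map/reduce pattern for the classic word-count job, not Hadoop’s actual Java API:>

```python
from collections import defaultdict

def map_phase(documents):
    # Map step: each input record emits (key, value) pairs;
    # here, (word, 1) for every word in every document split.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce step: group the pairs by key, then
    # combine the values for each key (here, by summing counts).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["Big Data is here", "Big Data analytics"]
print(reduce_phase(map_phase(docs)))
# → {'big': 2, 'data': 2, 'is': 1, 'here': 1, 'analytics': 1}
```

<In real Hadoop, the mappers and reducers run in parallel across the cluster, with HDFS supplying the input splits and storing the output; the pattern, however, is the same.>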
One reason Hadoop has emerged as an important technology is that it is an innovative Big Data analytics engine designed specifically for massively large data volumes. With it, organizations can greatly reduce the time required to derive valuable insight from an enterprise’s dataset. By adopting Hadoop to store and analyze massive data volumes, enterprises gain an agile new platform to deliver new insights and identify new opportunities to accelerate their business. Hadoop has also been designed to tackle analytics for unstructured data. This is significant because unstructured data is the dominant area of data growth projected for the foreseeable future. Now let’s look at how the adoption of Hadoop is evolving.
<This slide will automatically build to the next slide>
The initial, early adopters of Hadoop were largely the big Internet companies, along with a number of universities and research organizations. These early adopters were very technical and research-oriented. Typically, Hadoop was deployed in a lab environment, outside the domain of any traditional enterprise IT department. Often, these early deployments were very much a do-it-yourself effort involving the assembly of systems from commodity components. It wasn’t unusual, especially in academic environments, for a small army of research assistants to be needed to keep the system running. <advance to next slide>
Now, flash forward 5-6 years, and we are seeing Hadoop beginning to go mainstream in enterprise environments across a wide range of industries. Increasingly, IT executives and line-of-business managers are looking to leverage the Big Data assets within their organizations to identify new opportunities and accelerate their business. Related to this, we are seeing the emergence of a new role in organizations: the data scientist. These organizations are also keenly interested in integrating Hadoop and its infrastructure into their overall IT environment so that they can protect the data and manage it with their standard IT processes. They are also more interested in acquiring and deploying proven Hadoop solutions than in building a do-it-yourself project. While Hadoop offers great potential value to organizations, it is not without certain challenges that need to be addressed. Let’s look at these.
In this section, we’re going to identify and describe the key technology challenges of Hadoop, especially when deployed using direct-attached storage (DAS).
One challenge associated with traditional Hadoop deployments is that they have largely been done on a dedicated infrastructure, not integrated with or connected to any other applications: in effect, a siloed environment, often outside the realm of the IT team. This poses a number of inefficiencies and risks. <click>

A well-recognized issue with traditional Hadoop deployments is the single-point-of-failure problem with the Hadoop NameNode. In a Hadoop environment, a single NameNode manages the Hadoop file system. If it goes down, the Hadoop environment immediately goes off-line. <Click to next build slide>
Another issue with traditional Hadoop environments is the lack of enterprise-level data protection. Typical Hadoop deployments do not have rigorous backup and recovery capabilities such as snapshots, or data replication capabilities for disaster recovery (DR) purposes. <click>

Traditional Hadoop deployments on direct-attached storage (DAS) are also extremely inefficient. It’s not unusual for a DAS environment to operate at a 30-35% storage utilization rate (or less). Compounding this inefficiency is the fact that data is mirrored (the HDFS default is 3 copies). In addition to being storage-inefficient, this type of infrastructure is very management-intensive. <click>

Another issue with Hadoop running on direct-attached storage is that server and storage resources must be increased together in lock-step. For example, if more storage resources are required, a new server must be deployed (and vice versa). This rigidity adds further inefficiency. A final issue is the manual import/export of data required in a traditional Hadoop environment. In addition to consuming time and bandwidth, the Hadoop data in typical environments cannot be accessed or shared by other enterprise applications due to the lack of industry-standard protocol support.

To address these challenges and to enable enterprises to begin realizing the benefits of Hadoop quickly and easily, EMC has recently introduced an exciting new Hadoop solution. <click to advance to next slide>
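<If the audience wants the efficiency point made concrete, here is a back-of-the-envelope sketch, with hypothetical capacity numbers, of how much unique data a DAS cluster actually holds once you account for Hadoop’s default 3x replication and a given utilization rate:>

```python
def unique_data_tb(raw_tb, replication=3, utilization=1.0):
    # Unique (user-visible) data = raw capacity, scaled by the
    # fraction actually in use, divided by the replication factor.
    return raw_tb * utilization / replication

# 300 TB raw, fully utilized, with HDFS's default 3 copies of each block:
print(unique_data_tb(300))                     # → 100.0 TB of unique data
# The same 300 TB raw at the ~33% utilization typical of DAS:
print(round(unique_data_tb(300, 3, 0.33), 1))  # → 33.0 TB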
With the new EMC solution, which incorporates EMC Isilon scale-out NAS storage, organizations can deploy Hadoop on a highly scalable platform that easily integrates with other enterprise applications and workflows. <click>
The new EMC solution also eliminates the single-point-of-failure issue. We do this by enabling all nodes in an EMC Isilon storage cluster to act, in effect, as NameNodes. This greatly improves the resiliency of your Hadoop environment. The EMC solution for Hadoop also provides reliable, end-to-end data protection for Hadoop data, including snapshotting for backup and recovery and data replication (with SyncIQ) for disaster recovery.

Our new Hadoop solution also takes advantage of the outstanding efficiency of EMC Isilon storage systems. With our solutions, customers can achieve 80% or more storage utilization. EMC Hadoop solutions can also scale easily and independently: if you need to add more storage capacity, you don’t need to add another server (and vice versa). With EMC Isilon, you also get the added benefit of linear increases in performance as the system scales.

EMC also recently announced that we are the first vendor to integrate HDFS (the Hadoop Distributed File System) into our storage solutions. This means that with EMC Isilon storage, you can readily use your Hadoop data with other enterprise applications and workloads, while eliminating the need to manually move data around as you would with direct-attached storage.
EMC is the industry’s first and only storage vendor to provide native Hadoop integration with scale-out storage. Our solution is designed to deliver a number of key benefits. Our end-to-end approach helps enterprises deploy a proven Hadoop solution quickly, so that you can begin benefiting from this powerful technology right away. Our solution reduces risk and increases data protection. Another advantage of EMC’s Hadoop solution is that we have a significant amount of knowledge and expertise about Big Data analytics that you can leverage (we’ll cover this in more detail later in the presentation). Now let’s take a closer look at the EMC solution to see how we’re able to deliver on these benefits.
Scale-out software architectures make commodity hardware work. You don’t want to be in the hardware business; hardware is being commoditized. <Graphic shows the accelerating growth of commodity hardware as prices continue to fall.>
Main points:
- With a shared, node-based architecture, any node can go down and any other node can take over for it: N-way resiliency.
- Isilon stripes data vertically across all nodes.
- If a drive were to fail, we rebuild the data across the available free space of the cluster.
- Isilon can deliver protection levels unprecedented in the storage industry: N+1 through N+4, quadruple parity protection.
- The cluster can sustain up to four simultaneous failures (4 drives or 4 nodes).
- Since each node in the cluster participates in rebuilding a small piece of the data in parallel, we can rebuild lost drives faster than anyone in the industry, easily rebuilding a 250GB drive in minutes rather than hours.
________________________________________________________
Example narration:
Let’s talk a little about reliability. First, as a shared, node-based architecture, any node can go down and any other node can take over for it. We call this N-way resiliency. Second, we do data protection very uniquely. Take this oil-and-gas file. A user hits “save” <click> and the file is sent to the cluster and striped vertically across all nodes. Each node takes a small part of the file; it is distributed across the entire cluster. We also do this with parity, or ECC. If a drive were to fail, we rebuild the data across the available free space of the cluster, rather than on some dedicated parity drive or within some RAID group of drives. <click> Moreover, we can deliver protection levels unprecedented in the storage industry: N+2, akin to RAID 6 or RAID-DP, all the way through N+4, or quadruple parity protection. So we can sustain up to four simultaneous failures in our solution and remain protected. These are industry-leading data protection levels, not previously seen in storage but achieved with Isilon. Finally, since each node in the cluster participates in rebuilding a small piece of the data in parallel, we can rebuild lost drives faster than anyone in the industry, easily rebuilding a 250GB drive in minutes rather than hours.
This minimizes your window of risk when you have failed components.
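<To illustrate the trade-off behind those protection levels, here is a small sketch, using a hypothetical stripe width rather than Isilon’s actual on-disk layout, comparing the capacity overhead of N+1 through N+4 parity:>

```python
def parity_overhead(stripe_width, parity_units):
    # Fraction of each stripe consumed by parity when data is
    # striped across `stripe_width` units with N+parity protection.
    if not 0 < parity_units < stripe_width:
        raise ValueError("parity units must be between 1 and stripe_width - 1")
    return parity_units / stripe_width

# On a hypothetical 10-wide stripe, tolerating more simultaneous
# failures costs proportionally more raw capacity:
for parity in range(1, 5):
    print(f"N+{parity}: survives {parity} failure(s), "
          f"{parity_overhead(10, parity):.0%} overhead")
```

<The point for the audience: even quadruple parity costs far less raw capacity than the 3x mirroring of a DAS deployment, while protecting against more simultaneous failures.>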
EMC’s enterprise Hadoop solution combines the power of EMC Greenplum HD, EMC’s Apache Hadoop distribution, with EMC Isilon scale-out NAS storage. The Greenplum HD software, depicted here at the top of the diagram, provides the compute function, while the Isilon storage (depicted at the bottom of the diagram) provides the storage function in the EMC Hadoop solution. Note that the Hadoop Distributed File System (HDFS) is integrated into the OneFS operating system used by EMC Isilon storage systems. Together, this provides a comprehensive Hadoop solution that is easy to implement and manage; it is also highly efficient, reliable, and highly scalable. Our Hadoop solution can also be easily augmented with additional EMC Greenplum technologies to expand your data analytics capabilities (these will be discussed later in the presentation). Now let’s look at how the EMC Hadoop solution is packaged.
EMC’s Hadoop solution is available in two basic configurations:

1. EMC Greenplum HD software + EMC Isilon storage
2. An EMC Greenplum “Data Computing Appliance” + EMC Isilon storage

In the first configuration, the customer provides their own x86 server hardware, which is then loaded with Greenplum HD (packaged as software-only). The server is then connected to the EMC Isilon scale-out NAS. In the second configuration, an EMC Greenplum Data Computing Appliance (an x86 server appliance pre-loaded with Greenplum HD software) connects to the Isilon scale-out NAS storage platform. With either offering, enterprises can deploy and implement a comprehensive Hadoop solution quickly and easily. Now let’s look at the underlying software architecture of the solution with our Data Computing Appliance.
This slide illustrates the architecture of EMC’s enterprise Hadoop solution based on our Greenplum Data Computing Appliance (DCA). Starting at the bottom, you’ll note that the solution incorporates EMC Isilon storage, which connects to the DCA over the HDFS protocol. Within the DCA, you’ll note:
- The pluggable storage layer
- The MapReduce layer of Hadoop (which provides the compute function)
- Standard Hadoop tools such as Pig and Hive
- Advanced tools through Greenplum Chorus (which will be described in more detail in a few minutes)

This solution provides a number of advantages over traditional Hadoop deployments:
- Easier and more reliable: EMC’s end-to-end approach removes the pain associated with building out a Hadoop cluster from scratch, which is required with other distributions.
- A purpose-built Hadoop infrastructure: Enterprises can deploy a Hadoop cluster quickly while eliminating the risk associated with the typical hardware and software configuration process.
- A key component of a unified analytics platform: Greenplum HD is a core component of Greenplum’s Unified Analytics Platform, which is designed to answer the Big Data analytics needs of the agile enterprise by delivering business value through analytical insights.

As a packaged and supported solution from EMC, you can also take advantage of EMC’s extensive support and services:
- Enterprise Hadoop support: Rely on EMC to provide 24x7 worldwide support with the industry’s largest Hadoop support infrastructure.
- Proven at scale: Certified by EMC to remove the guesswork associated with Hadoop deployments.

Now, let me introduce my colleague from EMC’s Greenplum team to describe additional ways we can help you address your Big Data analytics needs.
Greenplum is working with an amazing group of customers to help them pursue business value from analytics and participate in this era of Big Data. These industry leaders and innovative thinkers are doing extraordinary things with our platform. As you can see, we are working with companies in many industries and verticals, everything from finance to retail to telecom to the Internet. Regardless of the sector, companies using Greenplum are innovating in new ways.
Our expansive partner network ensures you protect your existing investments while having the opportunity to leverage the best available technology. Greenplum has deep partnerships with industry-leading organizations such as SAS Institute, MicroStrategy, and Informatica. We are also working with emerging partners, including Karmasphere, Datameer, and Predixion, who are doing new and interesting things with Hadoop and Big Data. Finally, we are fortunate to work with a number of leading application providers, like Silver Spring Networks and ClickFox, who leverage Greenplum as a powerful back-end technology. Greenplum is proud to work with this extraordinary partner ecosystem.
You have heard us say that Greenplum is not just a database; well, Greenplum is not just about technology, either. Data science teams are an emerging practice, making amazing things happen with Big Data on behalf of their organizations. Greenplum is committed to the future of data science. We are working with leading universities on developing data science curricula and programs. And we are investing in the community. We recently announced, with the help of several partners, a 1,000-node Hadoop cluster called the Greenplum Analytics Workbench for Hadoop, the only one of its kind in the industry. We will always have community editions of our software available for free, and we continue to invest in the practice by creating and publicizing events like the Data Science Summits. We also have our own data science practice, staffed with PhDs who have expertise in leading analytics tools. This team works every day with our customers, advancing their projects and enabling new things from data.