By 2015, Organizations that Build a Modern Information Management System Will Outperform their Peers Financially by 20 Percent.
– Gartner, Mark Beyer, “Information Management in the 21st Century”
To frame up my talk, I chose this quote from Mark Beyer of Gartner: “By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent.”

Whether it’s opening up new business opportunities or outperforming your competitors by 20% or more, the important point is that big data technologies offer very real and compelling BUSINESS and FINANCIAL value to go along with TECHNOLOGY that is able to do things never before possible.
So let’s set some context before digging into the Modern Data Architecture. While overly simplistic, this graphic represents the traditional data architecture:
- A set of data sources producing data
- A set of data systems to capture and store that data: most typically a mix of RDBMS and data warehouses
- A set of custom and packaged applications, as well as business analytics, that leverage the data stored in those data systems

This architecture is tuned to handle TRANSACTIONS and data that fits into relational database tables.

[CLICK] Fast-forward to recent years, and this traditional architecture has become PRESSURED with New Sources of data that aren’t handled well by existing data systems. In the world of Big Data, we’ve got classic TRANSACTIONS as well as New Sources of data that come from what I refer to as INTERACTIONS and OBSERVATIONS.

INTERACTIONS come from such things as Web Logs, User Click Streams, Social Interactions & Feeds, and User-Generated Content including video, audio, and images.

OBSERVATIONS tend to come from the “Internet of Things”. Sensors for heat, motion, and pressure, and the RFID and GPS chips within such things as mobile devices, ATMs, automobiles, and even farm tractors, are just some of the “things” that output Observation data.
To get a sense of the scope of these NEW SOURCES of data, let’s look at some stats from IDC.

[CLICK] According to IDC, 2.8 ZB of data were created and replicated in 2012. A Zettabyte, for those unfamiliar with the term, is 1 BILLION Terabytes.

[CLICK] 85% of that is from New Sources of Data.

[CLICK] Out of that 85%, machine-generated data is a key driver in the growth, and just that one new source of data is expected to grow by 15X by 2020.

[CLICK] Fast-forward to 2020 and we’ll have 40 Zettabytes of data in the digital universe! This represents 50-fold growth from the beginning of 2010.

[CLICK] Needless to say, wrestling that scale of data is like this poor guy trying to wrestle a champion Sumo athlete: overwhelmed and outmatched, to say the least. Fortunately, your data architecture need not be outmatched.
As the volume of data has exploded, Enterprise Hadoop has emerged as a peer to traditional data systems. The momentum for Hadoop is NOT about revolutionary replacement of traditional databases. Rather, it’s about adding a data system uniquely capable of handling big data problems at scale, and doing so in a way that integrates easily with existing data systems, tools, and approaches.

This means it must interoperate with every layer of the stack:
- Existing applications and BI tools
- Existing databases and data warehouses, for loading data to / from the data warehouse
- Development tools used for building custom applications
- Operational tools for managing and monitoring

Mainstream enterprises want to get the benefits of new technologies in ways that leverage existing skills and integrate with existing systems.
So I’d like to walk you through a solution architecture focused on how new and existing data sources flow through this modern data architecture. The architecture starts with two major areas of data processing that are very familiar to enterprises:
1. Business Transactions & Interactions
2. Business Intelligence & Analytics

Enterprise IT has been connecting these systems via classic Data Integration and ETL processing for many years in order to deliver STRUCTURED and REPEATABLE business analytics. The business determines the questions to ask, and IT collects and structures the data needed to answer those questions.

[CLICK] As we’ve discussed, New Data Sources representing Interactions and Observations have come onto the scene, and Enterprise Hadoop has appeared as a new system capable of capturing ALL of this multi-structured data in one place. Hadoop acts as a “Data Lake”, if you will. Some call it a Data Reservoir, a Catch Basin, a Data Refinery, or the foundation for a Data Hub & Spoke architecture. Regardless of name, it’s a place where ALL data can be brought together and then flexibly aggregated and transformed into useful formats that help fuel new insights for the business. Structure and schema are applied when needed, NOT as a prerequisite before landing the data.

[CLICK] The next step is about getting the data into the right format for the people and applications that need it. Some folks will earmark subsets of the Data Lake for data scientists, researchers, or particular departments to interact with. Tools like Hive and HBase are commonly used for interacting with Hadoop data directly (see the sketch after this walkthrough). Others will directly integrate Enterprise Hadoop with Business Intelligence & Analytics solutions so they can obtain a 360° view of their customers and enhance their ability to more accurately understand the customer Interactions that lead to, or inhibit, their Transactions. Still others will run complex analytic models and calculations of key parameters in Hadoop and flow the results into online applications, with the goal of more accurately targeting customers with the best and most relevant offers, for example.

[CLICK] And to achieve a closed-loop analytics system, companies are leveraging Hadoop to cost-effectively retain large volumes of data for long periods of time. Keeping an active archive of the past 10 years of historical retail data enables companies to blend that data with 10 years of weather data, so they can analyze the impact of weather on the “Black Friday” selling season, for example.

The result? Customers now have an agile data architecture that enables them to maximize the value from ALL of their data: transactions + interactions + observations.
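To make the “interact with Hadoop data directly” point concrete, here is a minimal sketch of querying the Data Lake through Hive’s JDBC interface. It assumes the hive-jdbc driver is on the classpath; the host name, credentials, and the “clickstream” table are hypothetical placeholders, not anything from the talk.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DataLakeQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (requires hive-jdbc on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // "hive-host", "analyst", and "clickstream" are illustrative placeholders
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // Schema-on-read: the table definition was applied when the raw
            // logs were registered, not before the data landed in HDFS
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS views " +
                "FROM clickstream GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```

The point of the sketch is the access pattern: ordinary SQL tooling pointed at data that landed in the lake untransformed.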
So as mainstream enterprises begin to store ALL of their data in one place, they will increasingly want to create applications that interact with that data in a wide variety of ways. While classic batch-oriented MapReduce is powerful, it’s just one of many application types people need.

[CLICK] Interactive SQL solutions running on or next to Hadoop have gotten lots of press over recent months. Online data systems that store their data in HDFS are on the rise, as are Streaming and Complex Event Processing solutions and Graph Processing. In-Memory Data Processing is another area. Even classic HPC Message Passing Interface apps are storing data in HDFS.
The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it.

The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.

[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop.

[CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They’re adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service.

[CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are, or soon will be, leveraging YARN. For example, engineers at Yahoo have shared open source code that enables Twitter’s Storm to run on YARN. Apache Giraph is a graph processing system that is YARN-enabled. Spark is an in-memory data processing system built at Berkeley that was recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
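As an illustration of what “running natively IN Hadoop” means in practice, here is a minimal sketch of the YARN client API that frameworks like these use to acquire cluster resources. The application name, queue, and ApplicationMaster command are hypothetical placeholders.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster's ResourceManager
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask YARN for a new application id and submission context
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("hello-yarn");   // placeholder name
        ctx.setQueue("default");

        // Describe the ApplicationMaster container: the command it runs
        // and the memory/cores it needs
        ContainerLaunchContext amContainer =
            Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(
            Collections.singletonList("java -jar my-app-master.jar")); // placeholder
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore

        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
    }
}
```

The key design point is the division of labor: each framework asks YARN for containers, and YARN arbitrates those requests across every framework sharing the cluster, which is what makes the mixed-workload, no-stovepipes story possible.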
Hadoop 1.0 was architected for the large web properties. Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years now, it is a more mature version of Hadoop that has been architected for broader use by mainstream enterprises. The main focus for this next generation has been the broader enterprise, which has very explicit requirements that are a little different from those of the typical web properties who first adopted Hadoop. Some of those requirements forced the community to rethink the approach. Plus, our experience running Hadoop at Yahoo provided much insight into how we could architect things to make them better.

Some of the critical features are listed here; go through them. Highlight the workloads and explain how 2.0 is engineered to meet these exacting demands. There is a graphic to help illustrate. We have moved beyond just batch…
Since Enterprise Hadoop lies at the heart of the next-generation data architecture, it needs to provide the services and features that make it an enterprise-viable data platform.

At the center, we start with Apache Hadoop for distributed file storage and data processing (a la HDFS, MapReduce, and YARN).

[CLICK] Beyond that core, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, security, etc. The community has been hard at work in both the 1.0 and 2.0 lines of Hadoop addressing these needs.

[CLICK] And on top of this, we need to provide data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and enable people and other systems to access the data easily. This is where components like Apache Hive for SQL access, HCatalog for describing and managing your tables within Hadoop, Pig for script-based data processing, HBase for online data serving, and Sqoop and Flume for getting data into Hadoop fit in.

[CLICK] It’s also important… I would argue equally important… to make the platform easy to operate. Components like Apache Ambari for provisioning, management, and monitoring of the cluster, Oozie for job & workflow scheduling, and a new framework called Apache Falcon for Data Lifecycle Management fit here.

[CLICK] So all of that: Core and Platform Services, Data Services, and Operational Services come together into what I think of as “Enterprise Hadoop”.

[CLICK] Ensuring that Enterprise Hadoop can be flexibly deployed across operating systems and virtual environments like Linux, Windows, and VMware is important. Targeting Cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important. And the ability to provide Enterprise Hadoop pre-configured within a hardware appliance, like Teradata’s Big Analytics Appliance, helps enterprises deploy Hadoop quickly, easily, and in a familiar way.
As mentioned previously, SQL for Hadoop has been a hot topic for the past 6 months or so, and rightly so. There are easily millions of people with SQL skills who would like to leverage those skills as they look to gain insight and value from data stored in Hadoop.

With that as backdrop, at the beginning of the year the Stinger Initiative was rolled out. Its focus was to rally the Apache Hive community around two goals: making Hive 100X faster, so it can handle interactive querying use cases, and making Hive more SQL-compliant, so its BI use cases are richer. Oh, and by the way, this work needs to happen in a way that PRESERVES Hive’s awesome capability of processing ginormous data sets.

As part of the Stinger Initiative, a new data processing framework has emerged as a sibling to MapReduce. This project is called Apache Tez, and it handles the interactive querying use cases for Hive by eliminating the needless HDFS writes that have traditionally slowed Hive down. Since Tez is built on YARN, interactive SQL querying use cases can now run natively IN Hadoop and coexist nicely with classic MapReduce processing, yielding predictable performance and SLAs for apps running in the cluster.
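For a sense of what Stinger means to an end user, here is a hedged sketch, reusing the JDBC pattern shown earlier, of pointing an existing Hive session at Tez. It assumes a Hive build with Tez support wired in; the host, credentials, and the “sales” table are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TezQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // Same HiveQL, different engine: route this session's queries
            // through Tez instead of MapReduce
            stmt.execute("SET hive.execution.engine=tez");

            // A multi-stage query like this avoids the intermediate HDFS
            // writes it would incur as a chain of MapReduce jobs
            stmt.execute(
                "SELECT store, SUM(amount) AS total " +
                "FROM sales GROUP BY store ORDER BY total DESC LIMIT 10");
        }
    }
}
```

Note that nothing about the query itself changes; the speedup comes from the execution plan Tez builds underneath it.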
Everybody’s adopting Hadoop as a data processing platform because it accepts any kind of data and can process at almost any scale. But as people adopt Hadoop and throw all this data on it, they start to find other challenges. For example, how do you ensure data is being processed reliably? How do you know you’re not keeping data that is too old? If you process data globally, how do you deal with multi-datacenter replication?

The challenge is that the tools that exist for Hadoop, including Oozie, DistCp, and others, operate at a very low level, so you need expert developers to build and test data processing solutions. This sort of custom development takes a lot of time and money, and it’s error-prone because you’re dealing at such a low level. Still, everybody does it this way because there aren’t real alternatives. I see a lot of people who use custom scripts to delete files when they get too old (a minimal example of that brittle approach follows below). This approach has a lot of drawbacks. Hadoop traditionally doesn’t provide native tools that solve problems like retention, anonymization, reprocessing, and other needs.

Falcon solves this by letting developers work at a much higher level of abstraction. Falcon provides native APIs for data processing, retention, replication, and more that abstract away low-level concerns like scheduling and the mechanical details of replication. With Falcon, developers do more, do it more easily, and avoid common mistakes. Avoiding common mistakes is probably the most important thing. Data management on Hadoop is not easy, and Falcon was developed by engineers who worked on large-scale data management at Yahoo, complete with all the battle scars that brings. Falcon has a lot of practical lessons learned baked into its APIs, ready for developers to simply use.

Question: What data lifecycle management needs do you have in your environment?
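To show the low level that Falcon abstracts away, here is a minimal sketch of the kind of hand-rolled retention job people write today against the Hadoop FileSystem API. The path and the 30-day window are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NaiveRetention {
    public static void main(String[] args) throws Exception {
        // Anything older than 30 days gets deleted
        long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;

        FileSystem fs = FileSystem.get(new Configuration());
        // "/data/raw/clickstream" is an illustrative placeholder path
        for (FileStatus status : fs.listStatus(new Path("/data/raw/clickstream"))) {
            if (status.getModificationTime() < cutoff) {
                // No audit trail, no late-data handling, no coordination with
                // downstream consumers: the gaps a declarative Falcon
                // retention policy is meant to cover
                fs.delete(status.getPath(), true);
            }
        }
    }
}
```

Every team that writes one of these ends up rediscovering the same edge cases, which is exactly the argument for pushing retention into a shared, policy-driven framework.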
- Operators can firewall the cluster without end-user access to a “gateway node”
- Users see one cluster end-point that aggregates capabilities for data access, metadata, and job control
- Provides perimeter security to make Hadoop security setup easier
- Enables integration with enterprise and cloud identity management environments
- Verification: verify the identity token; SAML, propagation of identity
- Authentication: establish identity at the Gateway; authenticate with LDAP + AD
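Here is a hedged sketch of what that single end-point looks like to a client: a plain HTTPS call to WebHDFS routed through the gateway, which handles LDAP/AD authentication before anything reaches the cluster. The host, the “sandbox” topology name, and the credentials are hypothetical placeholders, and the snippet assumes the gateway’s TLS certificate is already trusted by the JVM.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Base64;

import javax.net.ssl.HttpsURLConnection;

public class GatewayListing {
    public static void main(String[] args) throws Exception {
        // Single gateway end-point in front of the cluster; host, topology
        // name ("sandbox"), and credentials are illustrative placeholders
        URL url = new URL("https://knox-host:8443/gateway/sandbox"
                + "/webhdfs/v1/data?op=LISTSTATUS");

        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        String creds = Base64.getEncoder()
                .encodeToString("analyst:secret".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + creds);

        // The gateway authenticates against LDAP/AD, then proxies the
        // request to WebHDFS inside the firewall
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The client never learns the topology of the cluster behind the gateway; it only ever sees the one perimeter URL.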
One thing I’ve learned in my last 10 years of working in the enterprise open source arena is that it’s best to think of “Community” in a broad way. In the Hadoop space, there is clearly the open source community. Without the innovative Apache open source technology, none of us would be here today. There’s also the end user community, which spans the tech-savvy early adopter types as well as the more pragmatic and conservative adopters. The third piece is the broader ecosystem that integrates with, extends, enhances, and builds on the core technology.
Now let’s expand the scope to include ALL of the sponsors! I love this slide because it is very BUSY! The cool thing is that we have almost 70 sponsors that provide really nice coverage across all layers of the data stack. This is a great sign that the Hadoop market is maturing quite nicely!
Hadoop Wave ONE started in 2006 and did a GREAT job at Web-scale Batch-oriented data processing. A vibrant community and strong enterprise interest propelled Hadoop across the Chasm at the end of 2012.
The 2nd wave of Hadoop has started and it will continue to fuel Hadoop on its path through mainstream adoption. Everyone in this room is at the forefront of a movement that will have lasting impact across the industry. Hadoop has the opportunity to process half the world’s data. There’s still a lot of work to be done.
Where are we? Where does it go from here? What’s next?

Community:
- New projects incubated: Falcon, Knox, and more
- Hadoop 2 and the YARN-based architecture coming in for a landing (beta vote)
- Certification of YARN-based applications – Hortonworks just announced

Ecosystem:
- VCs invested $1.4B in Big Data companies in 2012, and 2013 is even bigger (now huge investment into tools for accessing data in Hadoop, indicating “it has arrived”)
- Virtually every provider that touches data in any shape has brought Hadoop in
- Job postings

Commercial adoption:
- Projects going live at scale
- Amazing commercial use cases that you will hear more about, e.g. Cardinal Health, Home Depot? Phenomenal examples of application to healthcare