2. "By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent."
– Mark Beyer, Gartner, "Information Management in the 21st Century"
3. Traditional Data Architecture… Pressured
[Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications) sit atop DATA SYSTEMS (traditional repos: RDBMS, EDW, MPP), fed by DATA SOURCES, both traditional sources (RDBMS, OLTP, OLAP, POS systems) and new sources (sentiment, clickstream, geo, sensor, …). Operational tools (manage & monitor) and dev & data tools (build & test) surround the stack.]
5. Modern Data Architecture Enabled
[Diagram: the same stack of APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications), DATA SYSTEMS (traditional repos: RDBMS, EDW, MPP), and DATA SOURCES (traditional sources: RDBMS, OLTP, OLAP, POS systems; new sources: sentiment, clickstream, geo, sensor, …), now with an ENTERPRISE HADOOP PLATFORM alongside the traditional repos, integrated with operational tools (manage & monitor) and dev & data tools (build & test).]
6. Agile "Data Lake" Solution Architecture
1. Capture all data  2. Process & structure  3. Distribute results  4. Feedback & retain
[Diagram: sources (Web, mobile, CRM, ERP, point of sale; logs & text data, sentiment data, structured DB data, clickstream data, geo & tracking data, sensor & machine data) feed Business Transactions & Interactions and, via classic data integration & ETL and the Enterprise Hadoop Platform, Business Intelligence & Analytics outputs such as dashboards, reports, and visualization.]
7. Key Requirement of a "Data Lake"
Store ALL DATA in one place…
…and interact with that data in MULTIPLE WAYS
[Diagram: BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, OTHER… all atop HDFS (redundant, reliable storage)]
8. YARN Takes Hadoop Beyond Batch
Applications run "IN" Hadoop versus "ON" Hadoop…
…with predictable performance and quality of service
[Diagram: applications running natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (e.g. search), all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage)]
9. Ex. SQL-IN-Hadoop with Apache Hive
Stinger Initiative focus areas:
- Make Hive 100X faster
- Make Hive SQL compliant
[Diagram: Business Analytics and custom apps issue SQL to HIVE, which runs on MAPREDUCE or TEZ atop YARN and HDFS2]
10. Making Hadoop Enterprise Ready
[Diagram: the Enterprise Hadoop Platform, deployable on OS/VM, cloud, or appliance, layers CORE (distributed storage & processing), PLATFORM SERVICES (enterprise readiness: high availability, disaster recovery, security, and snapshots), DATA SERVICES (store, process, and access data), and OPERATIONAL SERVICES (manage & operate at scale)]
12. Managing and Processing Data at Scale and Across Datacenters
[Diagram: four datacenters (UA2, LHR1, UJ1, HKG1), each running ad servers, click servers, and beacon servers; UA2 also hosts a fraud service and global RTFB, and UJ1 a billing service and download servers. Raw logs land in per-datacenter clusters (UA2-Ruby, UA2-Global, LHR1-Emerald, UJ1-Topaz), with summaries in HKG1-Opal.]
15. Ecosystem Completes the Puzzle
Data Systems
Applications, Business Tools, & Dev Tools
Infrastructure & Systems Management
16. Thank You to Our Sponsors
Data Systems
Applications, Business Tools, & Dev Tools
Infrastructure & Systems Management
17. Hadoop Wave ONE: Web-Scale Batch Apps (2006 to 2012)
Source: Geoffrey Moore, Crossing the Chasm
[Chart: technology adoption curve (relative % of customers over time): innovators/technology enthusiasts and early adopters/visionaries, who want technology & performance; then THE CHASM; then the early majority/pragmatists, late majority/conservatives, and laggards/skeptics, who want solutions & convenience]
18. Hadoop Wave TWO: Broad Enterprise Apps (2013 & beyond: batch, interactive, online, streaming, etc.)
Source: Geoffrey Moore, Crossing the Chasm
[Chart: the adoption curve again, from innovators and early adopters (who want technology & performance) across THE CHASM to the early majority, late majority, and laggards (who want solutions & convenience)]
Editor's Notes
Thank you all for attending Hadoop Summit! For those who have attended previous Hadoop Summits: welcome back! For those new to Hadoop Summit: welcome to the Hadoop herd! I'd like to spend the next 30 minutes focused on Hadoop's opportunity to power modern enterprise data architectures. I've seen a lot of open source technologies and waves of IT change during my days at JBoss, Red Hat, SpringSource and VMware, but I've not seen anything quite like this Hadoop wave. We're clearly at the forefront of a movement of something BIG, so savor the moment!

Title: Hadoop Powers Modern Enterprise Data Architectures

Big data is everywhere and in many formats. We see it in commercials. We hear it in conversations over coffee. It is an expanding topic in the boardroom. At the center of the big data discussion is Apache Hadoop, which has evolved from a tool for web-scale early adopters into an enterprise data platform that addresses the needs of mainstream businesses. In this talk Shaun Connolly, VP Corporate Strategy for Hortonworks, will discuss how Hadoop has given rise to a next-generation enterprise data architecture that is uniquely capable of storing, refining, and deriving new business insights from ALL types of data in a way that complements existing enterprise systems and tools.

Connolly will walk through how enterprises are utilizing Hadoop to refine and explore multi-structured information and enrich their applications with new insights. He will look at real-world use cases where Hadoop has helped produce more business value, augment productivity, or identify new and potentially lucrative opportunities. Over the coming years, Hadoop could be in a position to process more than half the world's data. While there is much work to be done to achieve this lofty goal, Connolly will highlight how the community and broader solution ecosystem have made great strides toward solidifying Hadoop's place within the enterprise.
Gartner talks about how the IT landscape is being changed by the Nexus of Forces: namely Mobile, Social, Cloud, and Information (aka Big Data). Hadoop is clearly an Information Management technology, but if you think about it, Hadoop has its massive legs in Mobile, Social, and Cloud. It's certainly a unique technology!

To frame up my talk, I chose this quote from Mark Beyer of Gartner: "By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent."

Whether it's opening up new business opportunities or outperforming your competitors by 20% or more, the important point is that big data technologies offer very real and compelling BUSINESS and FINANCIAL value to go along with the innovative TECHNOLOGY that is able to do things never before possible.

What I ALSO like about this quote is that it's NOT a new quote. It was made about 1.5 years ago, in late 2011!
Let's set some context before digging into the Modern Data Architecture. While overly simplistic, this graphic represents the traditional data architecture:
- A set of data sources producing data
- A set of data systems to capture and store that data: most typically a mix of RDBMS and data warehouses
- A set of custom and packaged applications, as well as business analytics, that leverage the data stored in those data systems

Your environment is undoubtedly more complicated, but conceptually it is likely similar. This architecture is tuned to handle TRANSACTIONS and data that fits into a relational database.

[CLICK] Fast-forward to recent years, and this traditional architecture has become PRESSURED by new sources of data that aren't handled well by existing data systems. So in the world of big data, we've got classic TRANSACTIONS and new sources of data that come from what I refer to as INTERACTIONS and OBSERVATIONS.

INTERACTIONS come from such things as web logs, user click streams, social interactions & feeds, and user-generated content including video, audio, and images.

OBSERVATIONS tend to come from the "Internet of Things". Sensors for heat, motion, and pressure, and RFID and GPS chips within such things as mobile devices, ATM machines, automobiles, and even farm tractors, are just some of the "things" that output observation data.
So let's consider those NEW SOURCES of data and get a sense of the scope involved by considering some stats from IDC.

[CLICK] According to IDC, 2.8 ZB of data were created and replicated in 2012. A zettabyte, for those unfamiliar with the term, is 1 billion terabytes.

[CLICK] 85% of that is from new sources of data.

[CLICK] Out of that 85%, machine-generated data is a key driver of the growth, and that one new source of data alone is expected to grow 15X by 2020.

[CLICK] Fast-forward to 2020 and we'll have 40 zettabytes of data in the digital universe! That represents 50-fold growth from the beginning of 2010.

[CLICK] Needless to say, wrestling that scale of data is like this poor guy trying to wrestle a champion sumo athlete: overwhelmed and outmatched, to say the least. I've been using this graphic for the past 10 years or so. Given the world of big data we live in, I just had to trot this picture out once more. It just says it all, doesn't it?
As the volume of data has exploded, we've seen organizations acknowledge that not all data belongs in a traditional data system. The drivers are both cost and technology. As volumes grow, database licensing costs as well as the corresponding hardware costs can become prohibitive. And traditional databases are not ideal for handling very large datasets of varying data types. People want to store data quickly in its RAW format and apply structure and a schema later, after it's been processed a bit more.

Enter Enterprise Hadoop as a peer to traditional data systems. The momentum for Hadoop is NOT about replacing traditional databases. Rather, it's about adding Hadoop to handle this big data problem, and doing so in a way that integrates easily with existing data systems, tools, and approaches. This means it must interoperate with:
- Existing applications and BI tools
- Existing databases and data warehouses, for loading data to / from the data warehouse
- Development tools used for building custom applications
- Operational tools for managing and monitoring

Mainstream enterprises want to get the benefits of new technologies in ways that leverage existing skills and integrate with existing systems.
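The "store raw now, apply structure later" idea above is often called schema-on-read, in contrast to a database's schema-on-write. A minimal sketch of the contrast, using only stdlib Python; the record layout, field names, and in-memory "landing zone" are invented stand-ins, not any Hadoop API:

```python
import json

# Schema-on-write (traditional RDBMS): records must match a fixed schema
# before they can be stored. Schema-on-read (Hadoop-style): land raw bytes
# first, impose structure only when a consumer reads them.

raw_landing_zone = []  # stands in for raw files landed in HDFS


def land_raw(record_bytes):
    """Capture data as-is: no validation, no schema, ingest never blocks."""
    raw_landing_zone.append(record_bytes)


def read_with_schema(fields):
    """Apply structure at read time; each consumer picks the fields it needs."""
    for blob in raw_landing_zone:
        try:
            record = json.loads(blob)
        except ValueError:
            continue  # unparseable records are skipped at read, not rejected at ingest
        yield {f: record.get(f) for f in fields}


land_raw(b'{"user": "alice", "page": "/home", "ms": 42}')
land_raw(b'{"user": "bob", "page": "/cart"}')  # missing a field: still lands
land_raw(b'not json at all')                   # garbage: still lands

clicks = list(read_with_schema(["user", "page"]))
```

The point of the sketch is the asymmetry: `land_raw` can never fail, while all the schema decisions live in `read_with_schema` and can differ per consumer.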
In order to illustrate how Hadoop fits within the broader enterprise data architecture, I prefer to use a data flow diagram rather than the classic stack diagram we just covered. We are seeing many customers that want to deploy what we've been referring to as a "Data Lake" solution architecture, which puts them in a position to maximize the value from ALL of their data: transactions + interactions + observations.

At the highest level, we have three major areas of data processing, the first two of which are familiar to most enterprises:
1. Business Transactions & Interactions
2. Business Intelligence & Analytics

Enterprise IT has been connecting systems via classic data integration and ETL processing, as illustrated in Step 1 above, for many years in order to deliver STRUCTURED and REPEATABLE analysis. In this step, the business determines the questions to ask, and IT collects and structures the data needed to answer those questions.

[CLICK] As we've discussed, new data sources representing interactions and observations have come onto the scene, and Enterprise Hadoop has appeared as a new system capable of capturing ALL of this multi-structured data in one place. Hadoop acts as a "Data Lake", if you will. Some call it a data reservoir, a catch basin, a data refinery, or the foundation for a data hub & spoke architecture. Regardless of name, it's a place where ALL data can be brought together, then flexibly aggregated and transformed into useful formats that help fuel new insights for the business. Structure and schema are applied when needed, NOT as a prerequisite before landing the data.

[CLICK] The next step is about getting the data in the right format to those who need it. Some folks will cordon off ponds of data, to keep with our metaphor, for data scientists, researchers, or particular departments to interact with specific data of interest.
Tools like Hive and HBase are commonly used for interacting with Hadoop data directly. Mainstream enterprises also benefit from integrating Enterprise Hadoop with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics, opening up the ability to get a richer and more informed 360° view of customers, for example. By directly integrating Enterprise Hadoop with Business Intelligence & Analytics solutions, companies can enhance their ability to more accurately understand the customer behaviors (aka interactions) that lead to or inhibit their transactions.

Moreover, systems focused on Business Transactions & Interactions can benefit. Complex analytic models and calculations of key parameters can be performed in Hadoop and flow downstream to fuel the online data systems powering business applications, with the goal of more accurately targeting customers with the best and most relevant offers, for example.

[CLICK] Since Hadoop is great at cost-effectively retaining large volumes of data for long periods of time, feedback loops enable a valuable closed-loop analytics system. Retaining the past 10 years of historical "Black Friday" retail data, for example, can benefit the business, especially if it's blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless, if you have a platform that can do it cost-effectively and at scale.

A couple of final points before I move on:
1. Capturing all data in Hadoop does not mean that your existing transaction and analytics applications need to be forklifted to run on top of Hadoop. The point here is that you can ALSO store in Hadoop the data that's in those systems.
Yes, the data gets stored twice, but the flexibility and agility gained far exceed the incremental expense, especially given the commodity nature of the hardware Hadoop uses.

2. One final point on the Data Lake: the goal isn't to fill up Lake Superior right away. Most companies start with a small lake of data needed for targeted applications and, over time, direct more and more streams of data into the lake. Let success beget more success.
So as mainstream enterprises begin to store ALL of their data in one place, there's a clear and growing desire to work with that data using not only classic, batch-oriented MapReduce but a much wider range of interaction patterns.

[CLICK] Interactive SQL solutions running on or next to Hadoop have gotten lots of press over recent months. Online data systems that store their data in HDFS are on the rise, as are streaming and complex event processing solutions and graph processing. In-memory data processing is another area. Even classic HPC Message Passing Interface apps are storing data in HDFS.

The point here is that as enterprises store all data in one place, they increasingly need to interact with that data in a wide variety of ways.
We are facing an exciting generational change in the Hadoop space. The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the job management and task tracking capabilities built into it.

The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes all of that cluster resource management so that MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for "Yet Another Resource Negotiator".

[CLICK] YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop.

[CLICK] Why is that important? Because businesses want the ability to run more applications on their Hadoop data, and to do so with predictable performance and quality of service. Mixed workload management enables customers to protect against one application or user hogging cluster resources and starving the other applications running in the Hadoop cluster.

[CLICK] Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of other open source projects that are, or soon will be, leveraging YARN. Apache Tez is a new framework that I'll cover in a bit. Folks at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN-enabled.
Spark is an in-memory data processing system built at Berkeley that’s been recently contributed as an Apache Software Foundation project. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
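The mixed-workload idea behind YARN can be sketched in a few lines: one negotiator hands out containers from a shared pool, capping any single application so others aren't starved. This is a deliberately simplified single-process illustration, not YARN's actual scheduler; the container counts, per-app cap, and application names are all invented:

```python
# Toy sketch of YARN-style resource negotiation: several frameworks share one
# pool of containers, and a per-application cap keeps any one workload from
# hogging the cluster. All numbers and names here are made up.

class ResourceNegotiator:
    def __init__(self, total_containers, max_share=0.5):
        self.total = total_containers
        self.free = total_containers
        self.max_share = max_share   # no app may hold more than this fraction
        self.granted = {}            # app name -> containers currently held

    def request(self, app, wanted):
        """Grant as many containers as the cap and free pool allow."""
        held = self.granted.get(app, 0)
        cap = int(self.total * self.max_share)
        allow = max(0, min(wanted, cap - held, self.free))
        self.granted[app] = held + allow
        self.free -= allow
        return allow                 # may be less than asked for

    def release(self, app, count):
        """Return containers to the shared pool when tasks finish."""
        count = min(count, self.granted.get(app, 0))
        self.granted[app] -= count
        self.free += count


rm = ResourceNegotiator(total_containers=100)
batch = rm.request("mapreduce-batch", 80)  # asks for 80, capped at 50% = 50
sql = rm.request("tez-interactive", 40)    # still gets its 40 containers
```

Without the cap, the batch job would have taken 80 of 100 containers and the interactive query would have queued behind it; with it, both workloads make progress on the same cluster.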
As I just mentioned, SQL for Hadoop has been a hot topic for the past 6 months or so, and rightly so. There are easily millions of people with SQL skills who would like to leverage those skills as they look to gain insight and value from data stored in Hadoop.

With that as backdrop, at the beginning of the year the Stinger Initiative was rolled out. Its focus was to rally the Apache Hive community around the goals of making Hive 100X faster, so it can handle those interactive querying use cases, and making Hive more SQL compliant, so its BI use cases are richer. Oh, and by the way, this work needs to happen in a way that PRESERVES Hive's awesome capability of processing ginormous data sets. Eric14 will cover the details of where the Stinger effort stands; it's made awesome progress.

What I wanted to highlight here is that, as part of the Stinger Initiative, a new data processing framework has appeared to help handle the interactive querying use cases for Hive. This project is called Apache Tez, and it helps eliminate needless HDFS writes that have traditionally slowed down Hive. Instead of a complex DAG of MapReduce steps, Tez enables a Map-Reduce-Reduce paradigm that is much faster. The net-out is that interactive SQL querying use cases can now run natively IN Hadoop, since Tez is built on YARN. This helps ensure that interactive queries and classic MapReduce processing can coexist nicely within the same cluster, with predictable performance and SLAs.
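To see why Map-Reduce-Reduce matters, consider a query such as `SELECT page, COUNT(*) ... GROUP BY page ORDER BY cnt DESC LIMIT 1`: it needs an aggregation step AND a ranking step. Chained MapReduce would write the GROUP BY output to HDFS and launch a second job to rank it; a Tez-style pipeline feeds one reducer's output straight into the next. A toy in-memory sketch of the dataflow (the events and page names are made up, and nothing here is distributed):

```python
from collections import defaultdict

# Map-Reduce-Reduce in miniature: map, shuffle, count, then a second
# reduce (top-1) chained directly, with no intermediate materialization.

events = [("alice", "/home"), ("bob", "/cart"), ("carol", "/home")]


def map_phase(records):
    """Emit (page, 1) for every click, like a mapper would."""
    for _user, page in records:
        yield page, 1


def shuffle(pairs):
    """Group values by key, as the shuffle stage does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_count(grouped):
    """First reducer: the GROUP BY / COUNT(*)."""
    for page, ones in grouped.items():
        yield page, sum(ones)


def reduce_top1(counts):
    """Second reducer, fed directly: the ORDER BY ... LIMIT 1."""
    return max(counts, key=lambda kv: kv[1])


top_page = reduce_top1(reduce_count(shuffle(map_phase(events))))
```

In classic chained MapReduce, the boundary between `reduce_count` and `reduce_top1` would be an HDFS write plus a whole new job launch; eliminating that boundary is the speedup Tez is after.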
So Enterprise Hadoop lies at the heart of the next-generation data architecture. Let's outline what's required in and around Hadoop in order to make it easy for the enterprise to use and consume.

At the center, we start with Apache Hadoop for distributed file storage and data processing (a la HDFS, MapReduce, and YARN).

[CLICK] In order to enable Hadoop within mainstream enterprises, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, security, etc. The community has been hard at work in both the 1.0 and 2.0 lines of Hadoop addressing these needs. There are also new incubator projects, such as Apache Knox (which Eric will cover later), for improving user access to Hadoop clusters.

[CLICK] On top of this, we need to provide data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and enable people and other systems to access the data easily. This is where components like Apache Hive for SQL access, HCatalog for describing and managing tables within Hadoop, Pig for script-based data processing, HBase for online data serving, and Sqoop and Flume for getting data into Hadoop fit in.

[CLICK] It's also important (I would argue equally important) to make the platform easy to operate. Components like Apache Ambari for provisioning, management, and monitoring of the cluster, Oozie for job & workflow scheduling, and a new framework called Apache Falcon for data lifecycle management fit here.

[CLICK] All of that, Core and Platform Services, Data Services, and Operational Services, comes together into what I think of as "Enterprise Hadoop".

[CLICK] Ensuring that Enterprise Hadoop can be flexibly deployed across operating systems and virtual environments like Linux, Windows, and VMware is important. Targeting cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important.
So is the ability to provide Enterprise Hadoop pre-configured within a hardware appliance, like Teradata's Big Analytics Appliance, which helps enterprises deploy Hadoop quickly, easily, and in a familiar way.
With that as backdrop, I'd like to talk about the need for better data lifecycle management capabilities in Hadoop clusters. To do so, I'd like to welcome Mohit Saxena, the VP and Technology Founder of InMobi, to the stage. For those unfamiliar with InMobi, they are a company focused on mobile advertising and were recently named one of MIT Technology Review's 50 Disruptive Companies. InMobi has been using Hadoop for many years, and their technologists have been very active code contributors in the Apache Hadoop community. I've asked Mohit to join us today to share a little bit about how and why InMobi uses Hadoop, and some thoughts on how his team handles the challenge of managing data at scale and across datacenters.

[SHAUN shakes Mohit's hand and CLICKS to next slide]
[SHAUN] Mohit, we've got a high-level diagram of your data processing architecture. Why don't you set some context for InMobi by sharing some of the impressive business metrics and Hadoop cluster metrics behind this picture?

[MOHIT]
- ~1.5 trillion ads requested per year
- 20 billion messages streamed per year
- 2 billion monetization events
- 6 clusters ranging from 40 to 250 nodes each
- 20 million Hadoop jobs submitted by users
- 2 billion MapReduce slots used in Hadoop

[SHAUN] Pretty impressive solution architecture! One of the common questions I get from enterprise customers is how to deal with data lifecycle management in Hadoop environments. You and your team addressed those needs by creating a framework that you ultimately contributed to the Apache Software Foundation as Apache Falcon.

[TRANSITION TO NEXT SLIDE]
[SHAUN] Please share the story behind Falcon for the audience.

[MOHIT] Discuss what problems you were looking to address with the technology that ultimately became Falcon: specifically, how to handle such things as orchestrating data ingest and data processing pipelines, disaster recovery, and data retention scenarios. Also share why you decided to contribute the project to Apache.

[SHAUN] Everybody, please join me in thanking Mohit for joining us today and sharing his story. It's amazing to see how companies like InMobi can help accelerate the process of making Hadoop a more enterprise-viable data platform.
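One of the retention scenarios mentioned above boils down to a simple policy: keep each dataset's partitions only for a configured window and evict the rest. A toy sketch of that idea follows; the dataset names, partitions, and retention windows are invented, and real Falcon policies are declared as entity definitions rather than code like this:

```python
from datetime import datetime, timedelta

# Toy retention policy in the spirit of what a data lifecycle tool automates:
# each dataset keeps partitions for its own window; older ones are evicted.
# Dataset names, partition dates, and windows are all made up.

RETENTION = {
    "raw-clickstream": timedelta(days=90),
    "billing-summaries": timedelta(days=365),
}


def expired_partitions(partitions, dataset, now):
    """Return partition labels older than the dataset's retention window."""
    window = RETENTION[dataset]
    return [label for label, created in partitions if now - created > window]


now = datetime(2013, 6, 26)
parts = [
    ("2013-01-01", datetime(2013, 1, 1)),  # ~6 months old
    ("2013-06-01", datetime(2013, 6, 1)),  # ~3 weeks old
]
to_evict = expired_partitions(parts, "raw-clickstream", now)
```

The value of centralizing this, as Falcon does, is that retention, replication, and ingest schedules become declared policy instead of ad hoc cleanup scripts scattered across clusters.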
I've been in enterprise open source for almost a decade. One thing I've learned along the way is that it's best to think of "community" in a broad way. In the Hadoop space, there is clearly the open source community; without the innovative Apache open source technology, none of us would be here today.

For really impactful and industry-changing open source technologies, there's also the end user community. This community spans the tech-savvy early adopter types as well as the more pragmatic and conservative adopter types who want a more "whole solution". The third piece is the broader ecosystem that integrates with, extends, enhances, and builds on the technology.

One of the reasons I asked Mohit from InMobi to come on stage and share his story is that InMobi is a great example of an end user that is VERY ACTIVE in the open source community. This room is filled with people across these three areas, and each of these perspectives is CRITICALLY IMPORTANT if Hadoop is to be all it can be. So my simple ask of you is: GET INVOLVED, in whatever way makes sense for you and your business.
The ecosystem plays a critical role in rounding out solution architectures around Apache Hadoop. This slide outlines three major layers of the data stack and conveniently lists the Hadoop Summit platinum sponsors. Starting from the bottom, we have infrastructure and systems management. Above that, we have data management systems, data movement, and integration solutions. At the top, we have development tools, business tools, and applications that ride on top.

I'd like to thank Cisco, Microsoft, Kognitio, IBM, Teradata, Datameer, Karmasphere, Platfora, SAS, and Splunk for being platinum sponsors! I also want to thank Yahoo for co-hosting this event with Hortonworks!
Now let's expand the scope to include ALL of the sponsors! I love this slide because it is very BUSY! The cool thing is that we have almost 70 sponsors providing really nice coverage across all layers of the data stack. That's a great sign that the Hadoop market is maturing quite nicely!
So I'd like to end my session with a quick summary of where the Hadoop market stands today. Hadoop Wave ONE started in 2006 and did a GREAT job at web-scale, batch-oriented data processing. A vibrant community and strong enterprise interest propelled Hadoop across the chasm at the end of 2012.
The second wave of Hadoop has started, and it will continue to fuel Hadoop on its path through mainstream adoption. Everyone in this room is at the forefront of a movement that will have a lasting impact across the industry. As Rob mentioned in his opening remarks, Hadoop has the opportunity to process half the world's data. There's still a lot of work to be done. My simple ask of you is: GET INVOLVED, in whatever way makes sense for you and your business. Thank you, and have a great conference!