With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that connects the systems powering business transactions and business intelligence. Hadoop is uniquely capable of storing, aggregating, and refining multi-structured data sources into formats that fuel new business insights. Organizations that embrace solution architectures focused on maximizing the value from ALL data will put themselves in a position to drive more business, enhance productivity, or discover new and lucrative business opportunities. Over the coming years, Hadoop could be in a position to process more than half the world’s data. There is still much work to be done, however, if Hadoop is to achieve this lofty goal. In this talk Shaun Connolly, VP Corporate Strategy for Hortonworks, will look at Hadoop’s role in the enterprise architecture and how it complements existing enterprise systems.
4. What is Big Data?
BIG DATA = Transactions + Interactions + Observations
[Diagram: data volume and variety grow together. Megabytes of ERP data (purchase records and detail, payment records, offer details and history); gigabytes of CRM data (segmentation, customer touches, support contacts, external demographics, business data feeds); terabytes of WEB data (web logs, user clickstreams, A/B testing, behavioral targeting, dynamic pricing, dynamic funnels, search marketing, affiliate networks); petabytes of BIG DATA (mobile web, sentiment, SMS/MMS, speech to text, social interactions and feeds, spatial and GPS coordinates, sensors/RFID/devices, user-generated content, product/service logs, HD video, audio, and images). Horizontal axis: increasing data variety and complexity.]
5. Big Data Market Drivers
Business
1 Enable new business models & drive faster growth (20%+)
2 Find insights for competitive advantage & optimal returns
Technical
3 Data continues to grow exponentially
4 Data is increasingly everywhere and in many formats
5 Traditional solutions not designed for new requirements
Financial
6 Cost of data systems, as % of IT spend, continues to grow
7 Cost advantages of commodity hardware & open source
7. Next-Generation Data Architecture
[Diagram: unstructured data sources (log files, exhaust data, social media, sensors/devices, DB data) flow into an Enterprise Hadoop Platform; classic data integration & ETL connects it with Business Transactions & Interactions systems (CRM, ERP, web, mobile, point of sale) and with Business Intelligence & Analytics (dashboards, reports, visualization, …).]
1. Capture Big Data  2. Process & Structure  3. Distribute Results  4. Feedback & Retain
8. Making Hadoop Enterprise Ready
[Diagram: the Enterprise Hadoop Platform layers Operational Services (manage & operate at scale) and Data Services (store, process, and access data) on top of the Hadoop Core (distributed storage & processing) and Platform Services for enterprise readiness (HA, DR, snapshots, security, …); it deploys on OS/VM, cloud, or appliance.]
9. Existing Data Architecture
[Diagram: traditional data sources (RDBMS, OLTP, OLAP) from OLTP and POS systems feed a data systems layer of traditional repositories (RDBMS, EDW, MPP), which serves the applications layer (business analytics, custom applications, enterprise applications); dev & data tools support build & test, and operational tools support manage & monitor.]
10. An Emerging Data Architecture
[Diagram: the same stack, with an Enterprise Hadoop Platform added alongside the traditional repositories (RDBMS, EDW, MPP) in the data systems layer, and with new sources (web logs, email, sensors, social media) and mobile data joining the traditional sources (RDBMS, OLTP, OLAP) from OLTP and POS systems.]
12. Interoperating With Your Tools
[Diagram: the emerging architecture with its interoperability points: Microsoft applications at the applications layer, the Enterprise Hadoop Platform beside the traditional repositories, and Teradata Viewpoint among the operational tools; data sources span traditional (RDBMS, OLTP, OLAP), new (web logs, email, sensors, social media), and mobile data from OLTP and POS systems.]
14. Hadoop Common Patterns of Use
[Diagram: business cases demand "right-time" access to data: batch (Refine), interactive (Explore), and online (Enrich), all served by the Enterprise Hadoop Platform sitting atop big data: transactions, interactions, observations.]
15. Operational Data Refinery (Refine | Explore | Enrich)
[Diagram: transform & refine ALL sources of data; also known as a data reservoir or catch basin. 1. Capture data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media); 2. Process it in the Enterprise Hadoop Platform; 3. Distribute & retain results into data systems (RDBMS, EDW, MPP traditional repositories) and on to applications (business analytics, custom applications, enterprise applications).]
16. Big Data Exploration & Visualization (Refine | Explore | Enrich)
[Diagram: leverage the "data lake" to perform iterative investigation for value. 1. Capture data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media); 2. Process it in the Enterprise Hadoop Platform alongside RDBMS, EDW, and MPP repositories; 3. Explore & visualize directly from applications (business analytics, custom applications, enterprise applications).]
17. Application Enrichment (Refine | Explore | Enrich)
[Diagram: create intelligent applications: collect data, create analytical models, and deliver them to online apps. 1. Capture data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media); 2. Process & compute in the Enterprise Hadoop Platform alongside RDBMS, EDW, MPP, and NoSQL systems; 3. Deliver the model to custom and enterprise applications.]
18. Big Data: Optimize Outcomes at Scale
Media: optimize content
Intelligence: optimize detection
Finance: optimize algorithms
Advertising: optimize performance
Fraud: optimize prevention
Retail / wholesale: optimize inventory turns
Manufacturing: optimize supply chains
Healthcare: optimize patient outcomes
Education: optimize learning outcomes
Government: optimize citizen services
Source: Geoffrey Moore, Hadoop Summit 2012 keynote presentation.
19. Market Transitioning into Early Majority
[Diagram: the technology adoption life cycle, plotted as relative % of customers over time: innovators (technology enthusiasts), early adopters (visionaries), the CHASM, early majority (pragmatists), late majority (conservatives), and laggards (skeptics). Early-market customers want technology & performance; mainstream customers want solutions & convenience.]
Source: Geoffrey Moore, Crossing the Chasm.
20. At Hortonworks, we believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.
Welcome to Hadoop Summit and enjoy the conference!
Editor's notes
Title: Hadoop's Role in the Enterprise Architecture
Thank you all for attending Hadoop Summit! I'd like to spend the next 30 minutes focused on Hadoop's opportunity to power next-generation data architectures. I've been involved in open source for many years, having worked at JBoss back in 2004, then at Red Hat through 2008. After that I joined SpringSource and ultimately VMware through 2011. So I've seen a lot of open source technologies and waves of excitement and passionate users. But I've not seen anything quite like this Big Data and Hadoop phenomenon.
So our backdrop is BIG DATA.

Gartner report, 12 October 2012 (http://www.gartner.com/id=2195915): "Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016." Big data has become a major driver of IT spending, and the benefits to organizations of adding big data to their information management and analytics infrastructure will force a more rapid cycle of replacing existing solutions.

IDC study (http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf): IDC projects that the digital universe will reach 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010. According to the study, 2.8 ZB of data will have been created and replicated in 2012. Machine-generated data is a key driver in the growth of the world's data, which is projected to increase 15x by 2020.

So the topic of big data is increasingly important. But like any presentation these days about Big Data, we've got to start off with a definition, right? I kinda like to describe Big Data using a simple equation. As I see it, Big Data = Transactions + Interactions + Observations. Meaning, it not only spans your current highly structured transactional data sources; it also includes new forms of data that represent interactions (i.e. website interactions, social interactions, etc.) and observations (i.e. data coming off of sensors, devices, etc.). So, for all the burgeoning data scientists in the audience, there's your equation!
For the visual thinkers out there, let's expand our mathematical model to show some concrete examples.

ERP, SCM, CRM, and transactional Web applications are classic examples of systems processing Transactions. Highly structured data in these systems is typically stored in SQL databases.

Interactions are about how people and things interact with each other or with your business. Web logs, user clickstreams, social interactions & feeds, and user-generated content are classic places to find Interaction data.

Observational data tends to come from the "Internet of Things". Sensors for heat, motion, and pressure, plus RFID and GPS chips within such things as mobile devices, ATM machines, and even aircraft engines, provide just some examples of "things" that output Observation data.

Most folks would agree that video is "big" data. The analysis of what's happening in that video (i.e. what you, me, and others are doing in it) may not be "big", but it is valuable and it does fit under our umbrella. Moreover, business data feeds and publicly available data sets are also "big data", so we should not limit our thinking to just data that flows through an organization. The mortgage-related data you may have could benefit from being blended with external data found in Zillow, for example. The government, for its part, has the Open Data Initiative, which means that more and more data is being made publicly available.

One of the use cases I find interesting is predictive policing, where state and local law enforcement applies analytics to crime databases and other publicly available data to help predict where and when pockets of crime might spring up. These proactive analytics efforts have yielded real reductions in crime!

Anyhow, this is what Big Data means to me; hopefully it makes sense to you.
The market drivers for big data span business, technical, and financial concerns.

From a business perspective, the promise of big data is to find insights for competitive advantage, enable new business models, or optimize existing ones. From a technical perspective, as we discussed, volumes of data continue to grow, and data is very multi-structured in nature, which poses a challenge for traditional systems that inherently assume a relational row/column structure. And from a financial perspective, while the cost of data systems continues to grow, the rise of commodity hardware and open source platforms like Hadoop enables an economic model that makes it possible to gather large volumes in one place and process them in a way that does not break the bank.

So, we've covered an overview of big data and the market drivers behind why it's important. Your CIO, like many these days, believes it's a top-3 initiative and has tasked you with coming up with a strategy.
So how many of you feel like this poor guy getting started with his big data strategy? Well, let's start off with a look at a next-generation data architecture that leverages new platforms like Hadoop in a way that integrates with your existing systems.
So I'd like to talk about how Hadoop can fit within a broader enterprise data architecture, with the goal of maximizing the value from ALL of your data: transactions + interactions + observations. At the highest level, I see three broad areas of data processing: Business Transactions & Interactions; Business Intelligence & Analytics; and the Big Data Refinery.

Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1 above, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask, and IT collects and structures the data needed to answer those questions.

The "Big Data Refinery", as highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. The Big Data Refinery provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with it.

A popular example of big data refining is processing web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn or more effective creation of personalized offers. More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business.
In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value.

With that as backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way opens up the ability for businesses to get a richer and more informed 360° view of customers, for example. By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to the transactions. Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery: complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel runtime models powering business applications, with the goal of more accurately targeting customers with the best and most relevant offers, for example.

Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical "Black Friday" retail data, for example, can benefit the business, especially if it's blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost-effectively and at scale.
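As a tiny illustration of that blending idea, here is a Python sketch of joining historical retail revenue with third-party weather data by date, the kind of multi-source aggregation a refinery job would run at scale. All records, dates, and field names here are hypothetical:

```python
# Hypothetical historical sales records (what a retailer might retain in the refinery).
sales = [
    {"date": "2011-11-25", "store": "A", "revenue": 120000},
    {"date": "2011-11-25", "store": "B", "revenue": 95000},
    {"date": "2012-11-23", "store": "A", "revenue": 150000},
]

# Hypothetical third-party weather feed, keyed by date.
weather = {"2011-11-25": "snow", "2012-11-23": "clear"}

# Blend: total revenue per weather condition, joining the two sources on date.
revenue_by_weather = {}
for sale in sales:
    condition = weather.get(sale["date"], "unknown")
    revenue_by_weather[condition] = revenue_by_weather.get(condition, 0) + sale["revenue"]
```

At cluster scale this join would be expressed in Pig or Hive rather than a Python loop, but the shape of the computation is the same.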
So enterprise Hadoop lies at the heart of the next-generation data architecture. Let's outline what's required in and around Hadoop in order to make it easy for the enterprise to use and consume.

At the center, we start with Apache Hadoop for distributed file storage and processing (via MapReduce). In order to enable Hadoop within mainstream enterprises, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, security, etc. On top of this, we need to provide data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and enable people and other systems to access the data easily. This is where components like Apache Hive, Pig, HBase, HCatalog, and other tools fit.

Making it easy for data workers is important, but it's also important to make the platform easier to operate. Components like Apache Ambari, which address provisioning, management, and monitoring of the cluster, are important here.

So all of that, core and platform services, data services, and operational services, comes together into a vision of "enterprise Hadoop". Ensuring that the Enterprise Hadoop Platform can be flexibly deployed across operating systems and virtual environments like Linux, Windows, and VMware is important. Targeting cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important. And the ability to provide enterprise Hadoop pre-configured within a hardware appliance, like Teradata's Big Analytics Appliance, helps pull Hadoop into enterprises as well.
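To make the "distributed storage and processing" core concrete, here is a minimal single-process Python sketch of the map/shuffle/reduce flow that Hadoop MapReduce actually distributes across a cluster. The function names and sample records are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

logs = ["error disk full", "warn disk slow", "error network down"]
counts = reduce_phase(shuffle(map_phase(logs)))
```

The point of Hadoop is that the map and reduce steps run in parallel on the nodes holding the data, so the same program shape scales from three log lines to petabytes.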
While overly simplistic, this graphic represents what we commonly see as a general data architecture: a set of data sources producing data; a set of data systems to capture and store that data, most typically a mix of RDBMSs and data warehouses; and a set of applications that leverage the data stored in those systems. These could be packaged BI applications (Business Objects, Tableau, etc.), enterprise applications (e.g. SAP), or custom applications (e.g. custom web applications), ranging from ad-hoc reporting tools to mission-critical enterprise operations applications.

Your environment is undoubtedly more complicated, but conceptually it is likely similar.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets).

Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It is not replacing the database; it is a complement, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with existing applications (such as Tableau, SAS, and Business Objects); existing databases and data warehouses, for loading data to and from the warehouse; development tools used for building custom applications; and operational tools for managing and monitoring.
In October 2010, I attended the Hadoop World event in New York City, where Larry Feinsmith of JP Morgan Chase gave a keynote presentation. Larry provided great insight into how JP Morgan Chase was using Hadoop: great, creative use cases! But the point that stuck with me long after the event was the importance of figuring out how Hadoop can and should be integrated with existing IT investments. While Larry said he loves the innovation happening in the open source community, he also said that enterprises like JP Morgan Chase will not throw away all of their existing investments! They want approaches that deliver the benefits of new technologies while leveraging existing skills and integrating with existing systems.
It is for that reason that we focus on HDP interoperability across all of these categories.

Data systems: HDP is endorsed and embedded with SQL Server, Teradata, and more. BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects, and more. Development tools: for .NET developers, Visual Studio, used to build more than half the custom applications in the world, certifies with HDP to enable Microsoft app developers to build custom apps with Hadoop; for Java developers, Spring for Apache Hadoop enables quickly and easily building Hadoop-based applications with HDP. Operational tools: integration with System Center and with Teradata Viewpoint.
So, if I haven't made it crystal clear for you yet, maybe this visual will get the point across. Enterprise Hadoop makes a great tag team with your existing tools to enable a next-generation data architecture, positioning you to refine and explore vast quantities of multi-structured data and to enrich the applications and services that drive your business.
So now that we've covered the overall architecture and how Hadoop fits, let's discuss the patterns of use that we're seeing for Hadoop. At a high level, we describe the three key patterns as Refine, Explore, and Enrich. Refine captures data into the platform and transforms (or refines) it into the desired formats. Explore is about creating lakes of data that you can interactively surf through to find valuable insights. Enrich is about leveraging analytics and models to influence your online applications, making them more intelligent. So while some categorize Hadoop as just a batch platform, it is increasingly being used and evolving to serve a wide range of usage patterns that span batch, interactive, and online needs. Let me cover these patterns in a little more detail.
Across all of our user base, we have identified just three usage patterns; sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore, and Enrich.

The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for its analytics applications while leveraging its existing data warehousing and analytics tools. Using the graphic here: in step 1 data is pulled from a variety of sources into the Hadoop platform, in step 2 it is processed, and in step 3 it is loaded into a data warehouse for analysis by existing BI tools.
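The Refine pattern can be sketched in miniature. Assuming a hypothetical log format of `timestamp user_id url status`, this Python snippet distills raw web-log lines into a small per-user summary and emits a warehouse-ready CSV, standing in for what a Pig or Hive job would do over terabytes:

```python
import csv
import io

# Hypothetical raw web-log lines: "timestamp user_id url status".
raw_logs = [
    "2012-06-13T10:00:01 u1 /home 200",
    "2012-06-13T10:00:02 u2 /cart 200",
    "2012-06-13T10:00:03 u1 /checkout 500",
    "2012-06-13T10:00:04 u1 /checkout 200",
]

# Step 2, "process": distill per-user hit and server-error counts.
summary = {}
for line in raw_logs:
    _, user, _, status = line.split()
    stats = summary.setdefault(user, {"hits": 0, "errors": 0})
    stats["hits"] += 1
    if status.startswith("5"):
        stats["errors"] += 1

# Step 3, "distribute": write a small CSV a warehouse loader could ingest.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["user_id", "hits", "errors"])
for user, stats in sorted(summary.items()):
    writer.writerow([user, stats["hits"], stats["errors"]])
```

The refined output (a few rows per user) is orders of magnitude smaller than the raw logs, which is exactly why it fits comfortably back into the warehouse.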
The second use case is what we refer to as data exploration; this is the use case most commonly in question when people talk about "data science". In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case, you've seen all the BI tool vendors rally to add support for Hadoop (most commonly HDP) as a peer to the database, allowing rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store. To use the graphic: in step 1 data is pulled into HDP, in step 2 it is stored and processed, and in step 3 it is surfaced directly into the analytics tools for the end user.
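In the Explore pattern the queries are ad hoc, so the useful primitive is a pivot you can point at any dimension of the data lake. A small Python sketch with made-up event records (the field names and values are hypothetical):

```python
from collections import Counter

# Hypothetical event records held in the "data lake".
events = [
    {"channel": "web", "region": "EU", "action": "view"},
    {"channel": "mobile", "region": "US", "action": "buy"},
    {"channel": "web", "region": "US", "action": "buy"},
    {"channel": "web", "region": "EU", "action": "buy"},
]

def explore(records, dimension):
    # Ad-hoc pivot: count events along whatever dimension is of interest.
    return Counter(r[dimension] for r in records)

# Iterative investigation: pivot by channel, then drill into buyers by region.
by_channel = explore(events, "channel")
by_region = explore([e for e in events if e["action"] == "buy"], "region")
```

In practice this kind of iterative slicing is what BI tools issue against Hadoop (typically as Hive queries); the value is that you never had to decide the dimensions up front, as a warehouse schema would force you to.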
The final use case is called application enrichment. This is about incorporating data stored in HDP to enrich an existing application, for example an online application in which we want to surface custom information to a user based on their particular profile. If a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a related product that you sell. Large web companies such as Facebook are very sophisticated in their use of this approach. In the diagram: data is pulled from disparate sources into HDP in step 1, stored and processed in step 2, and then your applications interact with it directly in step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
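A minimal sketch of that enrichment loop, assuming a hypothetical model of per-category offer weights computed in HDP and a user profile of recent search counts; the online application scores the candidates and surfaces the best offer:

```python
# Hypothetical model "delivered" from the refinery: a weight per category.
offer_weights = {"home_renovation": 0.9, "electronics": 0.4, "sports": 0.2}

def best_offer(profile):
    # Score each candidate category against the user's recent search counts
    # and surface the highest-scoring offer, or None if nothing matches.
    scores = {
        category: weight * profile.get(category, 0)
        for category, weight in offer_weights.items()
    }
    category = max(scores, key=scores.get)
    return category if scores[category] > 0 else None

# A user who has been searching heavily for home-renovation topics.
offer = best_offer({"home_renovation": 5, "sports": 1})
```

The heavy lifting (building `offer_weights` from billions of interactions) happens in Hadoop offline; the online app only does this cheap lookup-and-score at request time, which is what keeps the pattern viable for low-latency serving.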
When all is said and done, the ultimate goal of big data processing is to optimize outcomes at scale. Geoffrey Moore, author of Crossing the Chasm, gave these good examples across various vertical industries.
And speaking of Geoffrey Moore, let me close out by covering where Hadoop is from a crossing-the-chasm perspective. Based on our engagement with enterprise customers, we believe Hadoop has transitioned into the early majority and is therefore being used by more mainstream enterprises. Horizontal patterns of use emerge in this stage, as well as what Geoffrey Moore calls "bowling pins", or vertical solutions. The net of it is that enterprise Hadoop offers exciting promise, but it is still early in its maturity cycle. You can do a lot with the technology, but there's more to do to harden it for broader mainstream adoption.
And with that, let me close out with the guiding vision we have at Hortonworks.