DOWNLOAD the whitepaper here: http://hortonworks.com/wp-content/plugins/download-monitor/download.php?id=71
As an organization laser-focused on developing, distributing and supporting Apache Hadoop for enterprise customers, we have been fortunate to have a unique vantage point.
We’re delighted to share with you these slides and our new whitepaper ‘Apache Hadoop Patterns of Use’. The patterns discussed in the slides and whitepaper are:
Refine: Collect data and apply a known algorithm to it in a trusted operational process.
Explore: Collect data and perform iterative investigation for value.
Enrich: Collect data, analyze and present salient results for online apps.
We hope you enjoy the content.
2. Existing Data Architecture
[Diagram: traditional data sources (RDBMS, OLTP, OLAP, POS systems) feed data systems (RDBMS, EDW and MPP traditional repositories), which in turn serve applications (business analytics, custom applications, enterprise applications), flanked by dev & data tools for build & test and operational tools for manage & monitor.]
3. Next-Generation Data Architecture
[Diagram: the same layered architecture, with an enterprise Hadoop platform added to the data systems alongside the traditional repositories, and new sources (web logs, email, sensors, social media) joining the traditional data sources.]
4. Hadoop Common Patterns of Use
[Diagram: three business cases for "right-time" access to data, Refine (batch), Explore (interactive) and Enrich (online), all built on the Hortonworks Data Platform over big data: transactions, interactions, observations.]
5. Operational Data Refinery
[Diagram: the Refine pattern. Transform and refine all sources of data; also known as a data reservoir or catch basin. Step 1: capture data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media). Step 2: process it in the Hortonworks Data Platform. Step 3: distribute and retain the results to the RDBMS, EDW and MPP repositories for use by business analytics, custom and enterprise applications.]
6. Big Data Exploration & Visualization
[Diagram: the Explore pattern. Leverage the "data lake" to perform iterative investigation for value. Step 1: capture data from traditional and new sources. Step 2: process it in the Hortonworks Data Platform. Step 3: explore and visualize it directly from business analytics, custom and enterprise applications.]
7. Application Enrichment
[Diagram: the Enrich pattern. Create intelligent applications: collect data, create analytical models and deliver them to online apps. Step 1: capture data from traditional and new sources. Step 2: process and compute over it in the Hortonworks Data Platform, alongside RDBMS, EDW, MPP and NoSQL systems. Step 3: deliver the model to custom and enterprise applications.]
While overly simplistic, this graphic represents what we commonly see as a general data architecture: a set of data sources producing data; a set of data systems to capture and store that data, most typically a mix of RDBMS and data warehouses; and a set of applications that leverage the data stored in those data systems. These could be packaged BI applications (Business Objects, Tableau, etc.), enterprise applications (e.g. SAP) or custom applications (e.g. custom web applications), ranging from ad-hoc reporting tools to mission-critical enterprise operations applications. Your environment is undoubtedly more complicated, but conceptually it is likely similar.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop – and HDP in particular – being introduced as a complement to traditional approaches. It is not replacing the database; as a complement, it must integrate easily with existing tools and approaches. This means it must interoperate with:
Existing applications, such as Tableau, SAS and Business Objects
Existing databases and data warehouses, for loading data to and from the warehouse
Development tools used for building custom applications
Operational tools for managing and monitoring
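As one hedged example of that kind of integration, the sketch below reaches into HDFS from an ordinary Python program over WebHDFS, HDFS's REST interface; the namenode host, port and paths are placeholders, and authentication (e.g. Kerberos) is omitted for brevity.

```python
# Minimal sketch: an existing tool reading HDFS over the WebHDFS REST API.
# Host, port, and paths are placeholder assumptions.
import requests

NAMENODE = "http://hdp-namenode.example.com:50070"

# List a directory in HDFS.
listing = requests.get(NAMENODE + "/webhdfs/v1/data/weblogs",
                       params={"op": "LISTSTATUS"})
for entry in listing.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"])

# Read a file from HDFS (WebHDFS redirects the read to a datanode).
content = requests.get(NAMENODE + "/webhdfs/v1/data/weblogs/part-00000",
                       params={"op": "OPEN"})
print(content.text[:200])
```

Anything that can speak HTTP and JSON can integrate this way, which is part of why adding Hadoop does not require replacing the systems already in place.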
Now that we've covered the overall architecture and how Hadoop fits, let's discuss the patterns of use that we're seeing for Hadoop. At a high level, we describe the 3 key patterns as Refine, Explore, and Enrich. Refine captures data into the platform and transforms (or refines) it into the desired formats. Explore is about creating lakes of data that you can interactively surf through to find valuable insights. Enrich is about leveraging analytics and models to influence your online applications, making them more intelligent. So while some categorize Hadoop as just a batch platform, it is increasingly being used, and evolving, to serve a wide range of usage patterns that span batch, interactive, and online needs. Let me cover these patterns in a little more detail.
Across all of our user base, we have identified just 3 separate usage patterns; sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore and Enrich. The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for its analytics applications while leveraging its existing data warehousing and analytics tools. Using the graphic here: in step 1, data is pulled from a variety of sources; in step 2, it is brought into the Hadoop platform and processed; and in step 3, it is loaded into a data warehouse for analysis by existing BI tools.
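As a minimal sketch of the kind of distillation job that might run inside Hadoop for step 2, here is a Hadoop Streaming mapper and reducer in Python; the log format, field positions and script name are assumptions, and in practice this step is often expressed in Pig or Hive instead.

```python
#!/usr/bin/env python
# Sketch of the "Refine" step: distill raw web logs (assumed format:
# "ISO-timestamp URL ..." per line) into daily page-view counts small
# enough to load into an existing data warehouse.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 2:
            continue                          # skip malformed lines
        day, url = fields[0][:10], fields[1]  # e.g. "2013-04-23"
        print("%s|%s\t1" % (day, url))        # composite key before the tab

def reducer():
    current_key, count = None, 0
    for line in sys.stdin:
        key, n = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key.replace("|", "\t"), count))
            current_key, count = key, 0
        count += int(n)
    if current_key is not None:
        print("%s\t%d" % (current_key.replace("|", "\t"), count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A job like this would be launched with the hadoop-streaming jar, passing the script as both the mapper and the reducer, and the small summarized output could then be exported into the warehouse with a tool such as Sqoop.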
A second use case is what we would refer to as Data Exploration; this is the use case in question most commonly when people talk about "Data Science". In the simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case, you've seen the BI tool vendors rally to add support for Hadoop, and most commonly HDP, as a peer to the database, allowing rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store. To use the graphic: in step 1, data is pulled into HDP; in step 2, it is stored and processed; and in step 3, it is surfaced directly into the analytics tools for the end user.
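As a sketch of what step 3 can look like when tools talk to Hadoop directly, the snippet below runs an exploratory query against data that never left HDP; it assumes a HiveServer2 endpoint and the PyHive client library, and the host name, table and columns are hypothetical.

```python
# Sketch of the "Explore" pattern: query data kept in Hadoop directly,
# without first moving it into a warehouse. Host, table, and columns
# are hypothetical.
from pyhive import hive

conn = hive.connect(host="hdp-master.example.com", port=10000)
cursor = conn.cursor()

# Iteratively refine questions against the full, raw dataset (the "data lake").
cursor.execute("""
    SELECT referrer_domain, COUNT(*) AS visits
    FROM   raw_weblogs
    WHERE  event_date >= '2013-01-01'
    GROUP  BY referrer_domain
    ORDER  BY visits DESC
    LIMIT  20
""")

for referrer, visits in cursor.fetchall():
    print(referrer, visits)   # feed into a notebook, chart, or BI tool
```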
The final use case is called Application Enrichment. This is about incorporating data stored in HDP to enrich an existing application. This could be an online application in which we want to surface custom information to a user based on their particular profile. For example, if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a related product that you sell. Large web companies such as Facebook are very sophisticated in their use of this approach. In the diagram, this is about pulling data from disparate sources into HDP in step 1, storing and processing it in step 2, and then interacting with it directly from your applications in step 3, typically in a bi-directional manner (e.g. request data, return data, store the response).
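Below is a minimal sketch of that bi-directional interaction, assuming the analytical models built in HDP are published into HBase and read by the online application with the happybase client; the table, column family and key names are hypothetical.

```python
# Sketch of the "Enrich" pattern: an online app reads a precomputed offer
# (a model output published from Hadoop) from a low-latency store, and
# writes the user's response back for the next model-building run.
# Table, column family, and key names are hypothetical.
import happybase

connection = happybase.Connection("hdp-hbase.example.com")
offers = connection.table("user_offers")

def offer_for(user_id):
    # Step 3 (deliver model): fetch the offer precomputed for this user.
    row = offers.row(user_id.encode("utf-8"))
    return row.get(b"model:offer_id", b"default-offer").decode("utf-8")

def record_response(user_id, offer_id, clicked):
    # Feedback loop: store the response so the next batch run can learn from it.
    offers.put(user_id.encode("utf-8"), {
        b"feedback:last_offer": offer_id.encode("utf-8"),
        b"feedback:clicked": b"1" if clicked else b"0",
    })

print(offer_for("user-42"))
record_response("user-42", "home-renovation-promo", clicked=True)
```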