3. Outline
• What is the Hadoop Data Reservoir (HDR)?
• Requirements and Solutions
• Hadoop Data Reservoir in Practice
• Demo
• Q&A
4. What is the Hadoop Data Reservoir (HDR)?
• Central Hadoop cluster for the enterprise
• Serves as the Storage and the Source of data for
self-service business analytics
• Provides Processing for data preparation and
advanced analytics
The Hadoop Data Reservoir
eliminates data silos, reduces costs,
and makes business analytics agile.
6. HDR is Not a Replacement for the EDW
• EDWs require upfront planning
• EDWs require major ongoing IT
maintenance and staffing
• EDWs are not self-service
7. HDR Origin: Interviews with Enterprise IT
• Platfora interviewed over 200
enterprise IT professionals working
with Hadoop
• Summer 2011 through early 2012
• Topic of interview: challenges using
Hadoop for business intelligence &
analytics
8. What is Your Vision for Hadoop?
• “I want Hadoop to be the central repository of all the data people
need.”
• “We shouldn’t have to plan too much before we store data.”
• “Cost should only be a minor factor in how long we keep data around.”
• “I want to give everyone access to the data and break down the existing
silos. But it needs to be secure.”
• “IT would not have to be involved in day-to-day management.”
9. Out on a Limb
• “I’m a bit out on a limb here. I pushed to use Hadoop to collect data that we
were dropping before. But now it’s taking way more time to make use of it
than I expected.”
10. The Missing Link to HDR
[Diagram: the Hadoop Data Reservoir (unbounded, flexible) feeds “software
defined” data marts (automatic, fast, iterative), which in turn feed
web-based business intelligence]
Performance, Self-Service, and Security
12. Queries must be consistently fast
• Modern data discovery BI applications are driving more and more queries
all the time; each move results in a new query.
• A single HDR user should not be able to impact other users simply
because they asked the wrong question.
• “We’re addicted to sub-second. If it takes longer than that for any
reason, something is wrong.”
13. Most Queries are Straightforward, but Big
“What’s the trend of female visitors clicking on ads on the autos channel
over time?”
• Sources: traffic logs, advertising logs, clicks, user demographics
• Big Hadoop cluster: 2.4 PB total, spanning months of data
• 700M records/day, 400 GB/day, 2B user records
• Processing the answer could touch 10s of billions of records.
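The question above is a plain filter-and-count; the only hard part is scale. A minimal Python sketch of the logic (the record fields and sample values are hypothetical, and in the HDR this same scan would touch tens of billions of rows):

```python
from collections import Counter

# Hypothetical click records already joined with user demographics.
clicks = [
    {"user": "u1", "gender": "F", "channel": "autos",  "month": "2013-01"},
    {"user": "u2", "gender": "M", "channel": "autos",  "month": "2013-01"},
    {"user": "u3", "gender": "F", "channel": "autos",  "month": "2013-02"},
    {"user": "u4", "gender": "F", "channel": "sports", "month": "2013-02"},
    {"user": "u5", "gender": "F", "channel": "autos",  "month": "2013-02"},
]

# "Trend of female visitors clicking on ads on the autos channel over time":
# filter on gender and channel, then count clicks per month.
trend = Counter(
    c["month"] for c in clicks
    if c["gender"] == "F" and c["channel"] == "autos"
)

print(dict(trend))  # {'2013-01': 1, '2013-02': 2}
```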
14. Solution: Aggregate Tables Stored In-Memory
• Pre-calculated summary
tables, summarizing data to a
coarser grain
• Dramatically reduces data
required to answer a question
• Keeps redundant processing
off the batch system (Hadoop)
• Keeps summary data in memory to provide sub-second access
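The idea of the aggregate table can be shown in a few lines. A minimal sketch (toy data; not Platfora's implementation) rolling raw per-day facts up to a coarser grain so queries never touch the base rows:

```python
from collections import defaultdict

# Raw fact rows at full grain: (day, channel, clicks). The summary table
# rolls these up to channel-only grain, so a query scans a handful of
# rows instead of billions.
raw = [
    ("2013-06-01", "autos",  120),
    ("2013-06-01", "sports",  80),
    ("2013-06-02", "autos",  150),
    ("2013-06-02", "sports",  90),
]

summary = defaultdict(int)
for day, channel, clicks in raw:
    summary[channel] += clicks   # drop the day dimension

# The summary lives in memory; "total clicks by channel" is now a lookup.
print(summary["autos"])  # 270
```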
16. Finding Data in the Reservoir
• The Hadoop Distributed File System (HDFS) is organized like other
common file systems: a directory structure
• Datasets in HDFS could be a single file or 10,000+ files, commonly
organized by directory (e.g. Sales, Shipments, Web Logs, Sentiment Info,
Customer Interactions, Demographics)
• Business users must be able to find data to answer their questions
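A quick sketch of the file-to-dataset gap (the paths are hypothetical): in HDFS a "dataset" is really a directory of part files, so a business-facing catalog has to map directories to named datasets rather than expose individual files:

```python
from collections import Counter

# Hypothetical HDFS-style paths: one dataset may span thousands of files.
paths = [
    "/data/weblogs/2013/06/01/part-00000",
    "/data/weblogs/2013/06/01/part-00001",
    "/data/weblogs/2013/06/02/part-00000",
    "/data/demographics/users.csv",
]

# A browsable catalog: dataset name (top-level directory) -> file count.
catalog = Counter(p.split("/")[2] for p in paths)

print(dict(catalog))  # {'weblogs': 3, 'demographics': 1}
```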
17. Aggregations Must Be Fully Automatic
• Building aggregate tables requires planning and up-
front decisions
• Must choose the metrics, dimensions, granularity
• In practice, this is an iterative process, and the first
attempt is usually wrong
• Aggregate tables must be maintained
• Each time new data arrives
• Sliding window tables (e.g. last 30 days): data in, data out
For HDR to be self-service, this must be
automatic.
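The "data in, data out" maintenance of a sliding-window table can be done incrementally, with no full rebuild. A minimal sketch, assuming a simple per-day click total (toy numbers, not Platfora's mechanism):

```python
from collections import deque

class SlidingWindowSum:
    """Incrementally maintained sliding-window aggregate (e.g. last N days)."""

    def __init__(self, days):
        self.window = deque(maxlen=days)
        self.total = 0

    def add_day(self, clicks):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # data out: oldest day expires
        self.window.append(clicks)          # data in: newest day arrives
        self.total += clicks
        return self.total

agg = SlidingWindowSum(days=3)
for clicks in [100, 120, 90, 150]:
    agg.add_day(clicks)

print(agg.total)  # 360 -- the first day (100) has rolled off
```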
18. Drilling Through the Aggregation
Netflow example:
• Aggregate tables: Application, # of Machines, # of Flows, Total Flow
Size (KB). 100MB compressed; answers in milliseconds (fast).
• Raw data in Hadoop: Source IP Address, Destination IP Address,
Application, Packets, Bytes. 26B records/month, 400GB compressed;
answers in hours or days (slow).
• “What happened between 10:03-10:04am?”
Need to “drill through the aggregation” to get more detail, or add
dimensionality. And it needs to be self-service.
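Drill-through in miniature: answer from the small in-memory aggregate when the question fits its grain, and fall back to a (slow) scan of the raw records when the user asks for a dimension that was aggregated away. A hedged sketch with invented fields and values, not Platfora's implementation:

```python
# Raw flow records in Hadoop: full grain, slow to scan.
raw_flows = [
    {"ts": "10:03:12", "src": "10.0.0.1", "bytes": 500},
    {"ts": "10:03:40", "src": "10.0.0.2", "bytes": 700},
    {"ts": "10:04:05", "src": "10.0.0.1", "bytes": 300},
]

# Pre-aggregated per-minute totals: coarser grain, no source-IP dimension.
per_minute = {"10:03": 1200, "10:04": 300}

def total_bytes(minute, src=None):
    if src is None:
        # Question fits the aggregate's grain: fast in-memory lookup.
        return per_minute[minute]
    # The src dimension was aggregated away, so drill through to the
    # raw records back in Hadoop.
    return sum(f["bytes"] for f in raw_flows
               if f["ts"].startswith(minute) and f["src"] == src)

print(total_bytes("10:03"))                   # 1200 (fast path)
print(total_bytes("10:03", src="10.0.0.1"))   # 500  (drill-through)
```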
19. Augmenting Datasets
• Users must be able to augment data with
sources outside of the HDR
• E.g. market research or demographics
• Commonly needs to be combined at the raw
level, before data is aggregated
21. Modern Data Security Requirements
• Hadoop provides:
• File and directory based permissions (like Unix)
• Secure authentication (via Kerberos)
• However, enterprises require a finer level of data
security control
• Datasets – could be one or many files, spanning directories
• Columns – datasets likely have many columns, with
different security permissions
• Rows – can span many files, and directories
• Solution must abstract file-level security and
enforce a finer level of control
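What "abstracting file-level security" looks like in practice: a policy applied to the dataset as a whole, masking columns and filtering rows regardless of which files they live in. A minimal illustrative sketch with an invented policy shape (not any real product's API):

```python
# Dataset rows, possibly drawn from many files and directories.
rows = [
    {"user": "u1", "region": "EU", "ssn": "123-45-6789", "spend": 40},
    {"user": "u2", "region": "US", "ssn": "987-65-4321", "spend": 75},
]

# A hypothetical dataset-level policy: which columns this user may see,
# and which rows.
policy = {
    "allowed_columns": {"user", "region", "spend"},   # ssn is masked
    "row_filter": lambda r: r["region"] == "EU",      # EU analysts only
}

def apply_policy(rows, policy):
    visible = [r for r in rows if policy["row_filter"](r)]
    return [{k: v for k, v in r.items() if k in policy["allowed_columns"]}
            for r in visible]

secured = apply_policy(rows, policy)
print(secured)  # [{'user': 'u1', 'region': 'EU', 'spend': 40}]
```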
22. Strong and Secure; Collaborative Sharing
• In a self-service model, security must be strong
and clear
• End-users will need to understand what they can
access and what they can’t
• Security administrators must be able to enforce
security centrally, down to the raw data
• As a centralized system, HDR must integrate
with directory services for authentication and
group membership
24. Platfora: Interest-Driven Pipeline™
[Diagram: the Hadoop Data Reservoir (unbounded, flexible) feeds “software
defined” data marts (automatic, fast, iterative), which in turn feed
web-based business intelligence]
Performance, Self-Service, and Security
25. Edmunds.com
• Beta participant since January 2013
• Moved to Hadoop because of explosive data growth and promise of agility
• Web, mobile, visitor demographic data
• Use case: optimize the matching of visitors with the cars they are
looking for
• Correlating browsers with the cars they are actually buying
• Platfora has made big data accessible to the business
• Increased access from 5 to 50 users
• Decreased time to value from months to hours
Sidebar: Founded in 1966 “for the purpose of publishing new and used
automotive pricing guides to assist automobile buyers.” Online innovators:
first auto information website; True Market Value®, True Cost to Own®, and
My Car Match.
“Before, if we wanted access to Hadoop data, we wouldn’t even try.
With Platfora our analysts can access anything they need.”
27. Introducing Platfora’s Integrated Platform
• Vizboard: web-based business intelligence application
• Lens: scale-out, in-memory data mart & processing engine
• Dataset: automated Hadoop data refinery
Powerful Closed-loop Analysis of Big Data
28. Summary
• The Hadoop Data Reservoir vision is driven from
requirements of enterprise Hadoop users
• HDR eliminates data silos, reduces costs, and
makes business analytics agile
• To make HDR a reality, it needs to provide:
• Performance
• Self-service
• Security
Editor’s notes
Introduction to me. What I do.
First, let me explain what Hadoop is: Apache Hadoop is an open source software project, based on techniques originally published by Google, that enables the distributed processing of large data sets across clusters of commodity servers. Hadoop provides an inexpensive, massively scalable solution for storing structured and unstructured raw data. The Hadoop Data Reservoir is a vision of what Hadoop can be for your enterprise.
Before I go any further, I’d like to make sure I describe what the HDR is not. And sometimes this gets confused.
Upfront planning: what data will we collect? How will the data be modeled to answer our business questions? How will we make access to the data fast for all of our users? (The questions are almost endless.) Ongoing maintenance: when will we refresh the data in the EDW? When datasets change, do we start over? Self-service: it should be obvious that EDWs are the domain of the IT team. But the vision of the HDR implies self-service; when we see what is required, we’ll see that this is no easy task.
How did we come to the concept of the HDR? The vision came out through the interviews. Story: we developed a script of questions. People were at different places in their cycle. These were not data scientists, and not people that had built their application on Hadoop (LinkedIn “People I know”). Cross section of industries: online media, financial services (banks and credit cards), federal government, retail, ecommerce, etc.
But the reality was that none of these interviewees had reached the vision of the HDR. In fact, this is my image of the folks we were talking to. Talk about the enlightened IT user.
What is the thing that goes in between the HDR and the end user? The challenges with the Hadoop Data Reservoir: there is a missing link between the massive amount of raw data stored in the HDR and access for business users. Access has been self-limited to expert users who know data modeling and SQL. IT teams must perform expensive ad-hoc data extractions into existing infrastructure. Access to the data in the HDR must be high performance, self-service, and secure.
Should be about 1:40pm
Despite data size, queries must be fast. It’s not that queries just needed to be fast; they needed to be consistently fast. Modern tools require the ability to ask successive questions. As the centralized resource, you have many, many questions being asked at once. The problem is that when someone asks the wrong question in Hadoop, it impacts everyone.
Explain the media company data. Desire to get a 360° view of the customer on their site. A straightforward question such as the one posed here potentially requires touching 10s of billions of records to process the answer.
Highly scalable architecture. Merv Adrian, a few months ago: “One of the biggest technical challenges for BI in the Big Data era is deciding what is in memory. Fractal Cache does that efficiently and automatically.” “The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.” – Ralph Kimball
You’ve heard of “drilling down” on something, or even drilling up. Use example of Region -> States -> Metro -> City -> Stores. Back to the Netflow example of our interviewee: he had 26B rows of raw data in Hadoop, per month. We built aggregate tables which reduced the grain and removed dimensionality, and made our work really fast. But what happens if, in our self-service Data Reservoir, the end user wants to get more detail from the raw data in Hadoop? We can’t just query it directly, because it will take too long, and I won’t have a rich set of metrics or dimensions to use to answer questions. I need to be able to drill through the aggregation. And since the HDR is self-service, I need to be able to do this without involving my colleagues in IT.
Example of making sure data doesn’t get away.
Platfora addressed the challenges of the HDR with the Interest-Driven Pipeline. Platfora software instantly transforms raw data in Hadoop into interactive, in-memory business intelligence; no ETL or data warehouse required. Platfora is a full stack of technology that spans from raw data in the Hadoop Data Reservoir all the way to BI and analytics for the end user. In the past this would require at least three separate products. Platfora is the first product to completely rebuild the traditional business analytics stack from the ground up.
Platfora is made of three components, and none of these is more important than another; they all work together seamlessly. Platfora puts a very pretty face on Hadoop: a stunningly beautiful web-based BI interface. MAKES HADOOP DATA BEAUTIFUL. A scale-out, in-memory data processing engine. MAKES HADOOP DATA FAST. Platfora drives Hadoop like a work engine, automatically generating and pushing jobs to Hadoop to do the heavy lifting without needing experts. MAKES HADOOP USABLE. These components work together: based on what the user needs in the BI layer, the Lenses are automatically refined, and the Hadoop data refinery does the heavy lifting without needing programming. Story: as we were working on the early designs for the product, we thought about the old world that users were complaining about: three separate layers, each with heavy expert intervention in between. It reminded us of the way phones used to work. Remember managing contacts? iPhone analogy. Vertically integrated.