Big Data brings big promise, but also big challenges. Chief among them: delivering value to business stakeholders who are not data scientists!
Qlik Sense and Big Data
Making Big Data Relevant for the Business User
Bob Hardaway – Solution Architect
2 October, 2014
And now they coming, yeah, now they coming
Out from the shadows
To take me to the club because they know
That I shut this down, 'cause they been watching all my windows
They gathered up the wall and listening
You understand, they got a plan for us
I bet you didn't know that I was dangerous
– "Dangerous," by the band Big Data
Intelligence Community Comprehensive National Cybersecurity Initiative Data Center (ICCNCSIDC)
"Capable of processing all forms of communication, including the complete contents of private emails, cell phone calls, and Internet searches, as well as all types of personal data trails—parking receipts, travel itineraries, bookstore purchases, and other digital 'pocket litter'."
Big Data comes with big challenges
[Diagram: the Big Data bottleneck: Big Data → Data Scientists → Reports → Business Users]
"Many organizations lack the skills required to exploit big data."
"Most of these skills are in short supply and rare in the market at large."
"Data science encompasses hard skills."
Source: Gartner Big Data Hype Cycle Report, 2013
Qlik relieves the Big Data bottleneck
[Diagram: the bottleneck relieved: Data Scientists still deliver Reports, while Business Users reach Big Data directly through Analytics & Discovery]
QlikView's user-centric Business Discovery approach gives decision-makers direct access to the benefits of Big Data.
Big Data happens in every era of history
• Paper: a medium to write down ideas and information, but not enough writers to disseminate them
• Print: the technology to distribute information, but no place to store it
• Computer: a place to store information, but one that can't keep up with computing requirements
• Internet: distributed computing globally, but too many emails to read
We always create more than we can consume!
The Internet of Things (IoT)
• Cisco estimates 50B connected devices by 2020
• Intel says 15B by 2015
• Uber adds 70,000 drivers per week
• Airbnb had 42M bookings last year
• Zipcar lets you reserve a parking space anywhere
The Physical Web: a Google project to "de-App" devices
"People should be able to walk up to any smart device – a vending machine, a poster, a toy, a bus stop, a rental car – and not have to download an app first." – Scott Jenson
Quantifying Big Data
"Bigness is the least important thing … it's the insights that can be gained from interactions vs. transactions … the customer experience vs. the value of what was purchased." – Stephen Brobst, CTO, Teradata
• Velocity: real-time streaming data; high volumes at low latency; complexity in processing, analysis, and deriving insights (e.g., 12 TB/day across 80 servers; 32 billion rows per day)
• Volume: very large data sets, on the order of hundreds of TB to PBs (e.g., 4 PB of storage; 75 TB of compressed data processed per day; 7,500+ analytical jobs per day; 15 TB per day at a 1:7 compression ratio)
• Variety: structured and unstructured data living together (OLTP, DW, and data marts alongside text, audio, video, click streams, log files, etc.), plus images, flat files, and DNA data (e.g., 4 TB of TIFF images converted to 11 million PDF files using Hadoop in under 24 hours)
A Less Alliterative Definition
• Big Data is about analyzing ALL your data, ALL the time
– Traditional BI systems operate on assumptions and limited data sets that preclude true discovery and insight
– The same questions get asked over and over
• The cost of analysis has always been the limiting factor for Business Intelligence
– Solutions have to be justified before they are deployed
• Big Data is about storing everything cheaply and letting the user look for value
• Big Data is about driving the business based on data
• Big Data doesn't solve every problem, but it does put the user in charge of the process
Hadoop – A Brief History
• 2006: Doug Cutting joins Yahoo, which estimates that a billion-page index will cost $500k to build and $30k/month to support. Hadoop grows out of Nutch, a distributed search platform inspired by Google's published paper on GFS.
• 2008: Hadoop is promoted to a top-level Apache project; production search-index creation time drops from 12 days to 8 hours. A 1,400-node Yahoo cluster sorts 500 GB in 59 seconds. Cloudera launches.
• 2011: Yahoo spins its remaining Hadoop team out into Hortonworks. The 3rd Hadoop World conference attracts 2,300 developers, vs. 275 the first time.
• 2013: Cloudera adds real-time search, based on Lucene (also created by Cutting).
• 2014: Apache Spark becomes the most contributed-to Hadoop-related project; the focus shifts to real-time analytics.
Big Data is much more than just storage
[Diagram: a layered Big Data stack: Big Data Cache + BI Infrastructure; Big Data Exploration, DW/ETL Pre-processing; Extreme Analytic Engines]
Prepare for Big Data business demands:
1. Advanced Data Management
2. Transformation and Exploration
3. Advanced Analytic Capability
4. Real-Time Agility
Popular “Big Data” Myths
• You need to have Ga-zinga-bytes to deploy a Big Data solution
– The typical Cloudera cluster is 15-20 nodes with < 10 TB of data
– Hadoop storage is 300-400% cheaper than an EDW
• Hadoop is all you need
– Hadoop is an enabling technology that provides the foundation for Big Data solutions
– The focus today is on data management
• The RDBMS is dead
– The RDBMS is still critical, but not for high-volume, low-quality analytics
• QlikView can't handle Big Data
– In reality, it is a human who can't handle Big Data
– It's all about the use case
– Direct Discovery is a unique approach
Gartner Top Big Data Challenges
[Chart: Gartner's top Big Data challenges]
You need to determine your goals/objectives.
Qlik can help you with these challenges.
Turn Big Data (lots of dots) Into Small Data (Insights)
The Value in Big Data Comes from Context and Relevance
More History
They’re both the same number of bricks!
The same volume of data, same schema.
You choose what is relevant to your analysis.
More Categories
               Hard Disk      Solid State     Random Access
               Drives (HDD)   Storage (SSD)   Memory (RAM)
Speed (t/TB)   3300s          1000-300s       1s
Price ($/TB)   $50            $500            $4,500
• Keep data in memory when the value obtained from processing it is high
• Leave data on disk when it is inactive or the value from processing it is low
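To make the trade-off concrete, here is a rough worked example using only the figures in the table above (illustrative, not from the source deck): scanning a 10 TB working set takes about 33,000 s (roughly 9 hours) from HDD at 3,300 s/TB, under 2 hours from SSD at ~650 s/TB, and about 10 s from RAM. The price of that speed: holding the same 10 TB costs about $500 on HDD, $5,000 on SSD, and $45,000 in RAM. That asymmetry is exactly why you pay for memory only where the processing value is high.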
[Chart: data value vs. data size]
The Big Data Value Chain
Fine, Big Data is here. But what are the Big Data use cases that matter to my business?
Initially Hadoop Came About to Reduce Costs
• How cheaply?
– By one estimate, running a 75-node, 300 TB Hadoop cluster costs $1.05 million over 3 years
– Comparable capacity in an RDBMS may cost 2.5x that for the same time period (see the worked numbers after this list)
• This type of savings means companies can keep 'more' or all of their data
• Hadoop is for storage, not analytics
– Data storage remains the most common use case for Hadoop
• Example:
– Expedia is moving from DB2 to Cloudera, with expected savings of approximately $100 million per year
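In per-TB terms (a back-of-the-envelope using only the estimates above): $1.05M / 300 TB ≈ $3,500 per TB over three years for the Hadoop cluster, versus 2.5 × $1.05M ≈ $2.6M, or roughly $8,750 per TB, for the RDBMS alternative.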
But Big Data Technologies are Evolving Rapidly
• 2010 – Download Apache Hadoop, cobble together surplus hardware, hire a couple of Java developers
• 2012 – CDH 4 from Cloudera reduces deployment time from days to minutes
• 2013 – AWS introduces Elastic MapReduce (EMR)
• 2014 – Google counters with Google Compute Engine (GCE)
• Platform vendors cover more than just Hadoop-like capabilities
– MapReduce for large-scale batch processing
– NoSQL for real-time, ad hoc query with operational performance
– Spark/Solr/Impala for real-time analytics
– R integration for deep predictive/advanced analytics
– All need a delivery agent (a.k.a. a visualization tool) to bring the benefit to the business
Big Data Use Cases are About Finding Value
• Internet (Expedia)
– Search index generation
– User engagement behavior
– Targeting / advertising optimizations
– Recommendations
• BioMed (CareFusion)
– Computational biomedical systems
– Bioinformatics
– Data mining and genome analysis
• Financial (MetLife / Wells Fargo)
– Prediction models
– Fraud analysis
– Portfolio risk management
• Telecom (British Telecom / Deutsche Telekom)
– Call data records
– Set-top & DVR streams
• Social (Facebook)
– Recommendations
– Network graphs
– Feed updates
• Enterprise Operations
– Email and image processing
– Robust ETL
– Data archival
– Natural language processing
• Media & Entertainment (DIRECTV)
– Customer 360
– Marketing campaigns
• Agriculture (ADM)
– Process "agri" streams
– Mineral management
• Image (Corbis)
– Geo-spatial processing
• Education (State of …)
– Systems research
– Statistical analysis of the web
Big Data Ecosystem is Much More Than Just Hadoop
• Data visualization, statistical & in-memory analytics
• Open-source distributed processing frameworks
• Big Data analytic appliances
• Massively parallel processing platforms
• Big Data integration
• Packaged MapReduce platforms
[Vendor logos shown include BigInsights & Streams, Big Data Appliance, HANA, and Splunk]
Insight Comes from Big Data, in Context
[Diagram: the landscape spans batch to real-time: Hadoop, NoSQL databases, SAP HANA, Google BigQuery, advanced analytics, and the platform vendors]
Leveraging QlikView for Big Data Discovery
Define Your Use Case
• A hybrid approach that
– provides any and all business stakeholders with a simple but powerful environment for exploring data, without
– limiting or filtering what data is available for analysis
• Follow the value
– Start with simple questions: what data do we already have that we are not making good use of today?
– Let your business decide where the exploration goes
• The technologies are cost-effective, flexible, and designed for a business-first methodology
QlikView Direct Discovery
• Combines the associative capabilities of the QlikView in-memory dataset with a query model where:
– The aggregated query result is passed back to a QlikView object without being loaded into the QlikView data model
– The result set is still part of the associative experience
– You retain the capability to drill to detail records
[Diagram: a QlikView application pairs its in-memory data model, populated by batch loads, with Direct Discovery queries against the external source]
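To make the mechanics concrete, here is a minimal sketch of a Direct Discovery load script (QlikView 11.2+ syntax); the connection name, table, and fields are hypothetical, not from the deck:

// Connect to the Big Data source (DSN name is hypothetical)
ODBC CONNECT TO 'HadoopHive';

// DIMENSION fields are loaded into the in-memory associative model;
// MEASURE fields stay in the source and are aggregated on demand;
// DETAIL fields appear only when drilling to detail records
DIRECT QUERY
    DIMENSION
        ProductCategory,
        Region,
        OrderDate
    MEASURE
        SalesAmount,
        Quantity
    DETAIL
        OrderComments
    FROM SalesFact;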
Complementing Hadoop and EDW co-existence
[Diagram: QlikView sits over both Hadoop and the Data Warehouse (aggregates), using batch loads and Direct Discovery against each]
• Broad application against Hadoop to discover new trends
• Deep application against the Data Warehouse to confirm and take action
• Move highly valuable data to the EDW for more broad accessibility
• Point QlikView to the new source
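As a companion to the Direct Discovery sketch above, the batch-load half of the hybrid might look like the following: pre-aggregated rows pulled from the EDW into the in-memory model (connection, table, and field names are again illustrative):

// Batch load: pull pre-aggregated rows from the EDW into memory
// (all names here are illustrative, not from the source deck)
ODBC CONNECT TO 'EnterpriseDW';

SalesAgg:
SQL SELECT Region,
           OrderMonth,
           SUM(SalesAmount) AS TotalSales
FROM DW.SalesSummary
GROUP BY Region, OrderMonth;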
Big Data Business Needs
DATA: clinical, claims, monitoring, others
• Descriptive Analytics: How are we doing? (e.g., "How many claims did we pay today?")
• Predictive Analytics: What might happen in the future? (e.g., "Which of tomorrow's claims might be requesting an Emergency Room (ER) admission?")
• Prescriptive Analytics: What is the best course of action given objectives, requirements & constraints? (e.g., "What would be effective steps to reduce the probability of an ER admission?")
QlikView is a leader in Descriptive Analytics but barely plays in Predictive and Prescriptive. Radically different algorithmic and visualization concepts are needed to play in that arena.
King.com: Big Data in Action
• 1.6B rows of data per day in Hadoop
– 211M rows per day extracted for analysis in QlikView
• Customer browsing activity:
– Player interactions within each game
– Many additional metrics
• Results: marketing ROI of campaigns achieved for the first time (# of players, # of games played, time played, etc.)
The Bloor Group writes in "Why In-Memory Technology Will Dominate Big Data" (available from the Kognitio download site, http://www.kognitio.com/information-center/reports/):
"If the goal is to accelerate BI activities dramatically, the natural approach is to have an in-memory processing resource that can be used where it makes a difference, flowing the data from disk through SSD to memory in order to support those BI workloads. In other words, data is kept in memory when the value obtained from processing it is high, and data stays on disk when it is inactive or the value from processing it is low."