BI Isn’t Big Data, Big Data Isn’t BI
September, 2015
Mark Madsen
www.ThirdNature.net
@markmadsen
© Third Nature Inc.
Summary
Common uses and commodity technology
lead to
Novel practices
lead to
Different data and different technology needs
lead to
New architectures
lead to
Common uses and commodity technology
Our ideas about
information and
how it’s used are
outdated.
How We Think of Users
Our design point is the 
passive consumer of 
information.
Proof: methodology
▪ IT role is requirements, 
design, build, deploy, 
administer
▪ User role is run reports
Self‐serve BI is not like 
picking the right doughnut 
from a box.
How We Want Users to 
Think of Us
How We Think of Users
What Users Really Think
We think of BI as publishing, an old metaphor.
Publishing has value, but may not be actionable.
Planning data strategy means understanding the 
context of data use so we can build infrastructure
[Diagram: Monitor → Analyze exceptions → Analyze causes → Decide → Act, with exits labeled No problem, No idea and Do nothing]
We need to focus on what people do with information
as the primary task, not on the data or the technology.
General model for organizational use of data
[Diagram: Monitor → Analyze exceptions → Analyze causes → Decide → Act, with exits labeled No problem, No idea and Do nothing]
Act within the process
Usually real-time to daily
Origin of BI and data warehouse concepts
The general concept of a 
separate architecture for BI 
has been around longer, but 
this paper by Devlin and 
Murphy is the first formal 
data warehouse architecture 
and definition published.
“An architecture for a business and
information system”, B. A. Devlin,
P. T. Murphy, IBM Systems Journal,
Vol.27, No. 1, (1988)
Origins: in 1988 there was only big hair.
▪ No real commercial email, public internet barely started
▪ Storage state of the art: 100MB, cost $10,000/GB
▪ Oracle Applications v1 GL released; SAP goes public, 
enters US market
▪ Unix is mostly run by long‐haired freaks
▪ Mobile was this
This is the context: scarcity of data, of system resources, of automated 
systems outside core financials, of money to pay for storage.
General model for organizational use of data
[Diagram: the loop Monitor → Analyze exceptions → Analyze causes → Decide → Act, plus a Collect new data step; exits labeled No problem, No idea and Do nothing]
Act on the process
Usually days/longer timeframe
You need to be able to support both paths
[Diagram: the loop Monitor → Analyze exceptions → Analyze causes → Decide → Act, plus a Collect new data step]
Act on the process
Act within the process
Conventional BI, addition of EDM
Causal analysis, “data science”
The usage models for conventional BI
[Diagram: the loop Monitor → Analyze exceptions → Analyze causes → Decide → Act, plus a Collect new data step; exits labeled No problem, No idea and Do nothing]
Act on the process
Usually days/longer timeframe
Act within the process
Usually real-time to daily
This is what we’ve been
doing with BI so far: static
reporting, dashboards,
ad-hoc query, OLAP
The usage models for analytics and “big data” 
[Diagram: the loop Monitor → Analyze exceptions → Analyze causes → Decide → Act, plus a Collect new data step; exits labeled No problem, No idea and Do nothing]
Act on the process
Usually days/longer timeframe
Act within the process
Usually real-time to daily
Analytics and big data are
focused on new use
cases: deeper analysis,
causes, prediction,
optimizing decisions
This isn’t ad-hoc,
reporting, or OLAP.
When you first give people access to information 
that was unavailable…
OH GOD
I can see into forever
After a while it becomes the new normal
As practices evolve based on new capabilities…
A new level of complexity develops on top of the older, now better understood processes, leading to new data and analysis needs.
I never said the
“E” in EDW meant
“everything”…
What do you
mean, “Just
doughnuts?”
The data warehouse vs business agility
All the data
Common, typed, tabular data
The bottleneck is you
It’s going to get a lot worse
Not E
E
Conclusion: any methodology built on the premise that you 
must know and model all the data first is untenable 
Old market says: There’s nothing wrong with what 
you have, just keep buying new products from us
The emerging big data market has an answer…
The data lake
The data lake after a little while
TANSTAAFL
When replacing the old 
with the new (or ignoring 
the new over the old) you 
always make tradeoffs, 
and usually you won’t see 
them for a long time.
Technologies are not 
perfect replacements for 
one another. Often not 
better, only different.
“Big data is unprecedented.”
‐ Anyone involved with big data in even the 
most barely perceptible way
We’ve been here before
Source: Bill Schmarzo, EMC
“Big” is well supported by databases now
Source: Noumenal, Inc.
Orders of magnitude: 20 years ago TB, today PB
Shifts in data availability by orders of magnitude 
necessitate new means of managing and using it.
Analytics embiggens the data volume problem
Many of the processing problems are O(n²) or worse, so even
moderate data volumes can be a problem for DB‐based platforms
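A minimal sketch of why quadratic work bites at moderate volumes. It assumes a hypothetical analytic task that compares every record with every other record (for example a similarity calculation); the function name is illustrative and not from the talk.

```python
def pairwise_comparisons(n: int) -> int:
    """Distinct pairs among n records: n * (n - 1) / 2, which grows as O(n^2)."""
    return n * (n - 1) // 2


# Ten times more records means roughly a hundred times more work.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} records -> {pairwise_comparisons(n):>13,} pairs")
```

A retrieval query over the same records touches each row about once, which is the contrast the slide draws between BI and analytics.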
Much of the big data value comes from analytics
BI is a retrieval problem, not a computational problem.
Five basic things you can do with analytics
▪Prediction – what is most likely to happen?
▪Estimation – what’s the future value of a variable?
▪Description – what relationships exist in the data?
▪Simulation – what could happen?
▪Prescription – what should you do?
Most people do not need special technology
[Chart: number of people (y-axis) against bigness of data (x-axis)]
The distribution of data
size is about normal, yet
these guys set the tone of
the market today.
Analytics: This is really raw data under storage
[Chart: number of jobs (y-axis) against bigness of data (x-axis)]
Microsoft study of 174,000 analytic
jobs in their cluster: median size ???
Working data for analytics most often not big
[Chart: number of jobs (y-axis) against smallness of data (x-axis); callout: 14 GB]
An (overly) Simple Division of the Problem Space
[2×2 grid: computation (little to lots) against data volume (little to lots)]
▪ Little analytics, little data: The entry point; SAS, SMP databases, even OLAP cubes can work
▪ Little analytics, big data: The BI/DW space, for the most part, with work done in databases
▪ Big analytics, little data: Specialized computing, modeling problems: supercomputing, GPUs
▪ Big analytics, big data: Complex math over large data volumes requires shared nothing architectures
What makes data “big”?
Very large amounts
Hierarchical structures
Nested structures
Linked structures
Encoded values
Non‐standard (for a 
database) types
Deep structure
Human authored text
“big” is better off being defined as “complex” or “hard to manage”
Categorizing the measurement data we collect
The convenient data is the 
transactional data.
▪ Goes in the DW and is used, even 
if it isn’t the right measurement.
The inconvenient data is 
observational data.
▪ It’s not neat, clean, or designed 
into most systems of operation.
The difficult and misleading data 
is declarative data.
▪ What people say and what they 
do require ground truth.
We need an architecture that 
supports all three categories.
Transactions vs “big data”
The classic example of “structured data”
Transaction data includes:
▪ quantification details (date, value, count)
▪ reference data for explanation (product, 
customer, account)
▪ Lots of meaningful information
Reference data is usually shared across the 
organization, hence its importance. There 
are two parts:
▪ identifier to uniquely identify the subject
▪ descriptive attributes with common or 
standardized value domains
Transaction details
Reference data
Today it’s different data: observations, not transactions
Sensor data doesn’t fit well with current methods of collection and
storage, or with the technology to process and analyze it.
Big data as a type of data: Transactions vs. Events
Transactions:
▪ Each one is valuable
▪ Mutable
▪ The elements of a transaction can be aggregated easily
▪ A set of transactions does not usually have important ordering 
or dependency
Events:
▪ A single event often has no value, e.g. what is the value of one 
click in a series? Some events are extremely valuable, but this 
is only detectable within the context of other events.
▪ Elements of events are often not easily aggregated
▪ A set of events usually has a natural order and dependencies
▪ Immutable
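The contrast above can be sketched in a few lines. The amounts, event names and the `abandoned_cart` rule are hypothetical, chosen only to illustrate that transactions aggregate without regard to order while the meaning of events depends on sequence.

```python
# Hypothetical data, for illustration only.
transaction_amounts = [19.99, 5.00, 42.50, 7.25]


def total(amounts):
    """Transactions aggregate easily, and the order of the set does not matter."""
    return round(sum(amounts), 2)


def abandoned_cart(events):
    """True if an add_to_cart event is not followed by a later purchase.

    A single click says little; its value only shows up in the ordered
    context of the other events in the session.
    """
    carted = False
    for event in events:
        if event == "add_to_cart":
            carted = True
        elif event == "purchase":
            carted = False
    return carted


print(total(transaction_amounts), total(list(reversed(transaction_amounts))))
print(abandoned_cart(["view_product", "add_to_cart", "purchase"]))
print(abandoned_cart(["purchase", "view_product", "add_to_cart"]))
```

The two event lists contain the same three events; only the order differs, and so does the answer.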
Example “big data”: Web tracking data
USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL
https://www.phisherking.com/gifts/store/LogonForm?mmc=
link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐
1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Google.com
REFERRAL_URL
http://www.google.com/search?sourceid=navclient&aq=0h&
oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italia
n+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP‐BY‐HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME
MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS 
NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
Web tracking data has a nested structure
USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL
https://www.phisherking.com/gifts/store/LogonForm?mmc=
link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐
1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Direct
REFERRAL_URL ‐
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP‐BY‐HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME
MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS 
NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
“unstructured” data
embedded in the
logged message:
complex strings
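As a rough illustration of the nested structure, the DESTINATION_URL field from the record above can be unpacked with Python's standard `urllib.parse`. Two assumptions are made here: the hyphens in the slide text are treated as plain ASCII hyphens, and `-_-` is treated as the delimiter inside the `mmc` campaign code, which the slide does not state.

```python
from urllib.parse import parse_qsl, urlsplit

# DESTINATION_URL from the record above, with ASCII hyphens assumed.
destination_url = (
    "https://www.phisherking.com/gifts/store/LogonForm"
    "?mmc=link-src-email-_-m100109-_-44IOJ1-_-shop"
    "&langId=-1&storeId=1055&URL=BECGiftListItemDisplay"
)

parts = urlsplit(destination_url)
params = dict(parse_qsl(parts.query))  # first level of nesting: query parameters

# Second level: the campaign code is itself a delimited string (assumed "-_-").
campaign_fields = params["mmc"].split("-_-")

print(parts.netloc, parts.path)
print(params)
print(campaign_fields)
```

One logged column yields a host, a path, four parameters and a four-part campaign code, none of which appear as typed columns in the original record.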
The missing ingredient from most big data
The creation, flow and use of data is different for 
transactions and machine‐generated events
[Diagram: two data flows]
▪ Transactions (with MDM): data entry, store, extract, cleanse, load, use. This runs at human speed.
▪ Machine‐generated events: program, generate, capture, store, cleanse, use. This runs at machine speed, with higher latency feedback cycles.
We collect large volumes of text, a rare practice 
ten years ago. Today we can turn text into data.
Categories,
taxonomies
Topics, genres,
relationships,
abstracts
Sentiment, tone,
opinion
Words & counts,
keywords, tags
Entities
people, places,
things, events, IDs
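The simplest of these transformations, words and counts, can be sketched with the standard library. The sample text and the stopword list are made up for illustration; real keyword extraction would use a fuller stopword list and proper tokenization.

```python
import re
from collections import Counter

# Hypothetical customer comment, for illustration only.
text = "The delivery was late. Late delivery again, and the flowers were wilted."

STOPWORDS = {"the", "was", "and", "were", "again"}  # assumed, minimal list

words = re.findall(r"[a-z']+", text.lower())
keyword_counts = Counter(word for word in words if word not in STOPWORDS)

print(keyword_counts)
```

The free text becomes a small table of terms and counts that can be stored and queried like any other data.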
You can store this data in an RDBMS, but…
Example data: Twitter Message API Payload
Looks like:
This is really just a record format
much like a DB row.
Datetime, userID, name, location,
description, message, message
metadata, etc.
But it’s in JSON or XML.
@markmadsen Check out: From #MongoDB to #Cassandra: 
Why The Atlas Platform Is Migrating http://owl.li/cvxFK
A tweet has lots of fields, but one important one
The payload is free text but has other elements:
From these things you likely want to generate or link to 
reference data.
[Callouts on the tweet: ‘To’ username, hashtag, hashtag, URL]
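A minimal sketch of pulling those elements out of the message text with regular expressions. The patterns are simplified assumptions rather than Twitter's own entity rules, and the text is the example tweet from the slide.

```python
import re

tweet_text = (
    "@markmadsen Check out: From #MongoDB to #Cassandra: "
    "Why The Atlas Platform Is Migrating http://owl.li/cvxFK"
)

# Simplified patterns; real usernames, hashtags and URLs have more rules.
mentions = re.findall(r"@(\w+)", tweet_text)
hashtags = re.findall(r"#(\w+)", tweet_text)
urls = re.findall(r"https?://\S+", tweet_text)

print(mentions)  # ['markmadsen']
print(hashtags)  # ['MongoDB', 'Cassandra']
print(urls)      # ['http://owl.li/cvxFK']
```

Each extracted value is a candidate key into reference data: a user, a topic, or a linked page.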
Internal payload elements form a new graph
The @elements point to 
other records and create a 
deeply linked structure.
You have to assemble the 
linked structure to see 
what’s really there, which 
means repeatedly scanning 
some or all of the data.
The derived pattern is 
interesting data, 
sometimes more than the 
individual messages.
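A sketch of assembling that linked structure, assuming a hypothetical list of (author, text) messages. Each @mention becomes a directed edge, and the edge counts form a new dataset derived from the messages.

```python
import re
from collections import Counter

# Hypothetical messages as (author, text) pairs.
messages = [
    ("alice", "@bob have you seen this? #hadoop"),
    ("bob", "@alice yes, and @carol should too"),
    ("carol", "@alice @bob thanks"),
    ("alice", "@bob one more thing"),
]

# One directed edge per mention: (author, mentioned user) -> count.
edges = Counter()
for author, text in messages:
    for mentioned in re.findall(r"@(\w+)", text):
        edges[(author, mentioned)] += 1

for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target}: {weight}")
```

Building the graph requires a pass over every message, and the result (who talks to whom, and how often) is not visible in any single record.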
There are many patterns in the data
Follower / following networks are easy – they are explicit 
and independent of the events.
Community detection requires looking at patterns of @ 
communication in addition to follow relationships.
What do you do with these after discovery?
Follower network Conversational communities
More data: patterns emerge from lots of event data
Patterns emerge from 
the underlying structure 
of the entire dataset.
The patterns are more 
interesting than sums 
and counts of the events.
Web paths: clicks in a 
session as network node 
traversal.
Email: traffic analysis 
producing a network
The event stream is a source for analysis, generating
another set of data that is the source for different analysis.
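The web path case can be sketched the same way. The sessions and page names below are hypothetical; the point is that the derived set of page-to-page transitions is a different dataset from the click stream it came from.

```python
from collections import Counter

# Hypothetical click streams: session ID -> pages in the order visited.
sessions = {
    "s1": ["home", "search", "product", "cart"],
    "s2": ["home", "product", "cart", "checkout"],
    "s3": ["search", "product", "home"],
}

# Treat each consecutive pair of pages as one traversal of a network edge.
transitions = Counter()
for pages in sessions.values():
    transitions.update(zip(pages, pages[1:]))

for (source, target), count in transitions.most_common():
    print(f"{source} -> {target}: {count}")
```

The transition counts can then feed a second analysis, such as finding the most common paths that end without a checkout.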
Big changes for data warehousing workloads
The results of analytic 
processing can, and often do, 
feed back into the 
system from which they 
originate.
Much of the data is being 
read, written and 
processed in real time.
Our design point was not 
changing tables and 
ephemeral patterns.
Unstructured is Not Really Unstructured
Unstructured data isn’t 
really unstructured: 
language has structure.   
Text can contain traditional 
structured data elements. 
The problem is that the 
content is unmodeled.
THE BIG CHANGE ISN’T
TECHNOLOGY, IT’S ARCHITECTURE
There are really three workloads to consider, not two
1. Operational: OLTP systems
2. Analytic: OLAP systems
3. Processing: Computational systems
Unit of focus:
1. Transaction
2. Query
3. Computation
Different problems require different platforms
Workloads
                 OLTP                  BI               Analytics
Access           Read‐write            Read‐only        Read‐mostly
Predictability   Predictable           Unpredictable    Fixed path
Selectivity      High                  Low              Low
Retrieval        Low                   Low              High
Latency          Milliseconds          < seconds        msecs to days
Concurrency      Huge                  Moderate         1 to huge
Model            3NF, nested object    Dim, denorm      BWT
Task size        Small                 Large            Small to huge
These do exactly the same thing:
One is a set of technologies. One is an architecture.
An idea promoted by big data vendors
Data
Warehouse
Reality: Hadoop disaggregates the database
One of the key things Hadoop does is to separate the 
storage, execution and API layers of a database. This 
allows for processing flexibility, but it does not permit 
one to build a reliable, high performance database 
across the layers.
Hadoop distributed filesystem (HDFS)
General-purpose data engines
Abstraction layers
Storage management
A more specific look at layers and engines
[Diagram: the Hadoop stack by layer, from base storage up to the abstraction layer / API]
▪ Base storage: Hadoop distributed filesystem (HDFS)
▪ Storage mgmt: storage (filetypes in HDFS, Hbase, etc)
▪ Engines named: MapReduce, Tez, Spark, Giraph, Impala, Drill, Presto, Kylin, Hbase
▪ Languages / APIs named: SQL, MDX, Cascading, Crunch, Pig, Hive, SparkSQL, Phoenix, native APIs
You can program to any layer you
choose. Some projects already build on
top of multiple others.
An important Hadoop + cloud computing benefit
Scalability is free – if your task requires 10 units of 
work, you can decide when you want results:
▪ 1 server, 10 units of time
▪ 10 servers, 1 unit of time
Cost is the same. Not true of the conventional IT model.
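The arithmetic behind that claim, as a sketch. It assumes the work divides perfectly across servers and that servers are billed per unit of time used; the price is a made-up figure.

```python
def cost(servers: int, time_units: float, price_per_server_unit: float) -> float:
    """Total cost when billing is purely per server per unit of time."""
    return servers * time_units * price_per_server_unit


PRICE = 0.50  # hypothetical price per server per unit of time

one_server = cost(servers=1, time_units=10, price_per_server_unit=PRICE)
ten_servers = cost(servers=10, time_units=1, price_per_server_unit=PRICE)

print(one_server, ten_servers)  # same spend, ten times sooner
```

In the conventional model the ten servers are bought up front and sit idle afterwards, so the two options do not cost the same.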
Hadoop: a summary of the magic
1. Provides both storage and complex processing as part 
of the same platform
2. Makes parallel programming more accessible
3. Schemaless (just files) therefore flexible
4. Inexpensive, reliable scale‐out
5. Potential for fast, scalable ingest
6. Cheaper than a database (for non‐database work)
The bad stuff:
▪ Not great for mutable data
▪ Mostly file‐based sequential processing, or you store data 
many times in different datastores (locality is important)
▪ Minimal data management (today)
The geography has been redefined
The box we created:
• not any data, rigidly typed data
• not any form, tabular rows and 
columns of typed data
• not any latency, persist what the 
DB can keep up with
• not any process, only queries
The digital world was diminished 
to only what’s inside the box until 
we forgot the box was there.
Layered data architecture
The DW assumed a single flat 
model of data, DB in the center. 
New technology enables new 
ways to organize data:
▪ Raw – straight from the source
▪ Enhanced – cleaned, standardized
▪ Integrated – modeled, 
augmented, ~semi‐persistent
▪ Derived – analytic output, 
pattern based sets, ephemeral
Implies a new technology architecture 
and data modeling approaches.
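One way to picture the four layers is as successive transformations, each kept as its own dataset. Everything in this sketch is hypothetical (the record layout, the customer lookup, the function names); it only illustrates raw, enhanced, integrated and derived data as separate stages.

```python
# Raw: lines exactly as they arrived from a hypothetical source.
raw = [" 2015-09-01,ACME ,100", "2015-09-01,acme,50", "2015-09-02,Globex,75"]


def enhance(lines):
    """Enhanced: cleaned and standardized, still one row per source line."""
    rows = []
    for line in lines:
        day, customer, amount = (field.strip() for field in line.split(","))
        rows.append({"day": day, "customer": customer.lower(), "amount": int(amount)})
    return rows


def integrate(rows, customers):
    """Integrated: modeled and augmented with shared reference data."""
    return [dict(row, customer_name=customers[row["customer"]]) for row in rows]


def derive(rows):
    """Derived: analytic output, here a simple total per customer."""
    totals = {}
    for row in rows:
        name = row["customer_name"]
        totals[name] = totals.get(name, 0) + row["amount"]
    return totals


customers = {"acme": "Acme Corp", "globex": "Globex Inc"}  # hypothetical reference data
enhanced = enhance(raw)
integrated = integrate(enhanced, customers)
derived = derive(integrated)
print(derived)
```

Because the raw lines are kept unchanged, any later stage can be rebuilt or redefined without going back to the source system.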
Decouple the Data Architecture
The core of the data warehouse isn’t the 
database, it’s the data architecture that the 
database and tools implement.
We need a data architecture that is not limiting:
▪ Deals with change more easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from 
streaming to one‐time bulk
Deconstructing the data warehouse
There are three 
things happening 
in a DW:
▪ Data acquisition
▪ Data management
▪ Data delivery
Isolate them from 
one another.
Data
Warehouse
Decouple the data architecture by stage
[Diagram: stages labeled Collect (transactions, observations, declarations), Integrate, Manage and Use]
In reality, you are building three systems, not one. Treat them that way.
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
Data infrastructure is a platform
▪ Any data – structures, forms
▪ Any latency – in motion, at rest
▪ Any process – query, algorithm, transformation
▪ Any access – SQL, API, queue, file movement
The evolution of DW is to a data platform, which means 
separating application from infrastructure.
Derived data
Raw data
Infrastructure layer:
Process and analyze
Store and manage
Application layer:
Deliver and use
The new model also encompasses data at rest and data in motion
Multiple access methods
Enhanced
data
Multiple ingest methods
BI, data extracts, 
analytics, applications
The platform has to do more than serve queries; it has to be read-write.
Away from “one throat to choke”, back to best of breed
“The extremely specialized 
nature of mass production 
raises the costs of product 
change and therefore slows 
down innovation.”
‐ Abernathy, 1978
Tight coupling leads to slow 
changes.
In a rapidly evolving market, 
componentized architectures, 
modularity and loose coupling 
are preferable to monolithic 
stacks, single‐vendor 
architectures and tight 
coupling.
Staff and skills are a problem in a build market
@BigDataBorat: Give man Hadoop
cluster he gain insight for a day. Teach
man build Hadoop cluster he soon
leave for better job #bigdata
Technology Adoption
Some people can’t resist 
getting the next new thing 
because it’s new and new is 
always better.
Many IT organizations are like 
this, promoting a solution and 
hunting for the problem that 
matches it.
Better to ask “What is the 
problem for which this 
technology is the answer?”
Four core capabilities big data technologies add
1. Unlimited scale of storage, processing
▪ Agility, faster turnaround for new data requests (but not a
replacement for BI)
▪ Fewer staff to accomplish same goals
2. New data accessibility
▪ More data retained for longer period
▪ Access to data unused due to cost or processing limits
▪ Any digital information becomes usable data
3. Scalable realtime processing
▪ Brings ability to monitor and act on data as events occur
4. Arbitrary analytics
▪ Faster analysis
▪ Deeper analysis
▪ More broadly accessible analytics
As a technology moves from emerging to commodity the 
nature of acquiring, using and managing it changes
[Chart: three phases along the axis Innovation, Maturation, Saturation]
▪ Innovation: generate options; innovation; novel practice; maximize value
▪ Maturation: constrain choices; adaptation; good practice; optimize
▪ Saturation: standardize / minimize choice; acquisition; best practice; minimize costs
Agile & open source* methods
6 Sigma & process methods
Today: repeating the experience of the 80s & 90s
This is the turbulent
phase of the market
as it goes through
rapid development,
then product and
service changes.
The Internet combined with commodity computing is forcing a new
business and IT structural evolution, already underway.
[Chart axis: Innovation, Maturation, Saturation]
How we develop best practices: survival bias
We don’t need best practices, we need worst failures.
Welcome to the big data revolution, more of an evolution
Be pragmatic, not dogmatic
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
acorn_blue.jpg ‐ http://www.flickr.com/photos/rogersmith/314324893/
wheat_field.jpg ‐ http://www.flickr.com/photos/ecstaticist/1120119742/
Phone dump ‐ Richard Barnes
ponies in field.jpg ‐ http://www.flickr.com/photos/bulle_de/352732514/
straw men.jpg ‐ http://www.flickr.com/photos/robinellis/6034919721/
text composition ‐ http://flickr.com/photos/candiedwomanire/60224567/
girl on cell tokyo .jpg ‐ http://flickr.com/photos/8024992@N06/986538717/
hamadan people mosaic.jpg ‐ http://flickr.com/photos/hamed/225868856/
twitter_network_bw.jpg ‐ http://www.flickr.com/photos/dr/2048034334/
klein_bottle_red.jpg ‐ http://flickr.com/photos/sveinhal/2081201200/
donuts_4_views.jpg ‐ http://www.flickr.com/photos/le_hibou/76718773/
subway dc metro  ‐ http://flickr.com/photos/musaeum/509899161/
About the Presenter
Mark Madsen is president of Third 
Nature, a consulting and advisory firm 
focused on analytics, business 
intelligence and data management. 
Mark is an award‐winning author, 
architect and CTO. Over the past ten 
years Mark received awards for his work 
from the American Productivity & 
Quality Center, TDWI, and the 
Smithsonian Institution. He is an 
international speaker, a contributor to 
Forbes, and a member of the O’Reilly Strata 
program committee. For more 
information or to contact Mark, follow 
@markmadsen on Twitter or visit  
http://ThirdNature.net 
About Third Nature
Third Nature is a consulting and advisory firm focused on new and
emerging technology and practices in information strategy, analytics,
business intelligence and data management. If your question is related to
data, analytics, information strategy and technology infrastructure then
you’re at the right place.
Our goal is to help organizations solve problems using data. We offer
education, consulting and research services to support business and IT
organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in strategy and architecture, so we look at emerging
technologies and markets, evaluating how technologies are applied to
solve problems rather than evaluating product features.

Contenu connexe

Tendances

Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018mark madsen
 
How to understand trends in the data & software market
How to understand trends in the data & software marketHow to understand trends in the data & software market
How to understand trends in the data & software marketmark madsen
 
Innovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerInnovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerMicrosoft
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the TrenchesChris Dagdigian
 
2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZChris Dagdigian
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Managementmark madsen
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)mark madsen
 
BioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the TrenchesBioIT World 2016 - HPC Trends from the Trenches
BioIT World 2016 - HPC Trends from the TrenchesChris Dagdigian
 
BioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology ExchangeBioIT Trends - 2014 Internet2 Technology Exchange
BioIT Trends - 2014 Internet2 Technology ExchangeChris Dagdigian
 
Taming Big Science Data Growth with Converged Infrastructure
Taming Big Science Data Growth with Converged InfrastructureTaming Big Science Data Growth with Converged Infrastructure
Taming Big Science Data Growth with Converged InfrastructureThe BioTeam Inc.
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprisemark madsen
 
Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...Pay no attention to the man behind the curtain - the unseen work behind data ...
Pay no attention to the man behind the curtain - the unseen work behind data ...mark madsen
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Chris Dagdigian
 
Cloud Security for Life Science R&D
Cloud Security for Life Science R&DCloud Security for Life Science R&D
Cloud Security for Life Science R&DChris Dagdigian
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science TeamsEMC
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Josh Patterson
 
Lean approach to IT development
Lean approach to IT developmentLean approach to IT development
Lean approach to IT developmentMark Krebs
 
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome Meeting
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome MeetingBio-IT & Cloud Sobriety: 2013 Beyond The Genome Meeting
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome MeetingChris Dagdigian
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance Qubole
 
2014 BioIT World - Trends from the trenches - Annual presentation
2014 BioIT World - Trends from the trenches - Annual presentation2014 BioIT World - Trends from the trenches - Annual presentation
2014 BioIT World - Trends from the trenches - Annual presentationChris Dagdigian
 

Tendances (20)

Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
 
How to understand trends in the data & software market
How to understand trends in the data & software marketHow to understand trends in the data & software market
How to understand trends in the data & software market
 
Innovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerInnovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringer
 
2013: Trends from the Trenches
2013: Trends from the Trenches2013: Trends from the Trenches
2013: Trends from the Trenches
 
2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ2015 CDC Workshop on ScienceDMZ
2015 CDC Workshop on ScienceDMZ
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Management
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Bi isn't big data and big data isn't BI (updated)

  • 2. Summary: Common uses and commodity technology lead to novel practices, which lead to different data and different technology needs, which lead to new architectures, which lead to common uses and commodity technology.
  • 3. Our ideas about information and how it’s used are outdated.
  • 4. How We Think of Users: Our design point is the passive consumer of information. Proof: methodology. ▪ IT role is requirements, design, build, deploy, administer ▪ User role is run reports. Self-serve BI is not like picking the right doughnut from a box.
  • 5. How We Think of Users: Our design point is the passive consumer of information. Proof: methodology. ▪ IT role is requirements, design, build, deploy, administer ▪ User role is run reports. Self-serve BI is not like picking the right doughnut from a box. How We Want Users to Think of Us.
  • 6. How We Think of Users. What Users Really Think.
  • 7. We think of BI as publishing, an old metaphor. Publishing has value, but may not be actionable.
  • 8. © Third Nature Inc. Planning data strategy means understanding the  context of data use so we can build infrastructure Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing We need to focus on what people do with information as the primary task, not on the data or the technology.
  • 9. © Third Nature Inc. General model for organizational use of data Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing Act within the process Usually real-time to daily
  • 10. © Third Nature Inc. Origin of BI and data warehouse concepts The general concept of a  separate architecture for BI  has been around longer, but  this paper by Devlin and  Murphy is the first formal  data warehouse architecture  and definition published. 10 “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol.27, No. 1, (1988) Slide 10Copyright Third Nature, Inc.
  • 11. Origins: in 1988 there was only big hair. ▪ No real commercial email, public internet barely started ▪ Storage state of the art: 100MB, cost $10,000/GB ▪ Oracle Applications v1 GL released; SAP goes public, enters US market ▪ Unix is mostly run by long‐haired freaks ▪ Mobile was this. This is the context: scarcity of data, of system resources, of automated systems outside core financials, of money to pay for storage.
  • 12. General model for organizational use of data. Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing. Act on the process: usually days/longer timeframe.
  • 13. You need to be able to support both paths. Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; Act on the process; Act within the process; Conventional BI, addition of EDM; Causal analysis, “data science”.
  • 14. The usage models for conventional BI. Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing. Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP.
  • 15. The usage models for analytics and “big data”. Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing. Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. Analytics and big data is focused on new use cases: deeper analysis, causes, prediction, optimizing decisions. This isn’t ad-hoc, reporting, or OLAP.
  • 16. When you first give people access to information that was unavailable… OH GOD I can see into forever
  • 17. After a while it becomes the new normal
  • 18. As practices evolve based on new capabilities… A new level of complexity develops over top of the older, now better understood processes, leading to new data and analysis needs.
  • 19. I never said the “E” in EDW meant “everything”… What do you mean, “Just doughnuts?”
  • 20. The data warehouse vs business agility. Diagram labels: All the data; Common, typed, tabular data. The bottleneck is you.
  • 21. It’s going to get a lot worse. Diagram labels: Not E; E. Conclusion: any methodology built on the premise that you must know and model all the data first is untenable.
  • 22. Old market says: There’s nothing wrong with what you have, just keep buying new products from us
  • 23. The emerging big data market has an answer…
  • 24. The data lake
  • 25. The data lake after a little while
  • 26. TANSTAAFL: When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
  • 27. “Big data is unprecedented.” ‐ Anyone involved with big data in even the most barely perceptible way
  • 28. We’ve been here before. Source: Bill Schmarzo, EMC
  • 29. “Big” is well supported by databases now. Source: Noumenal, Inc.
  • 30. Orders of magnitude: 20 years ago TB, today PB. Shifts in data availability by orders of magnitude necessitate new means of managing and using it.
  • 31. Analytics embiggens the data volume problem. Many of the processing problems are O(n²) or worse, so moderate data can be a problem for DB‐based platforms.
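To make the quadratic-growth point concrete, here is a minimal sketch (not from the slides) that counts the comparisons an all-pairs job performs as the record count grows. The helper name `pair_count` and the record counts are illustrative assumptions.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical record counts, chosen only to show the growth rate.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} records -> {pair_count(n):>13,} pairwise comparisons")
```

A 100-fold increase in records produces roughly a 10,000-fold increase in comparisons, which is why a data set that is modest for retrieval can still be heavy for analytics.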
  • 32. Much of the big data value comes from analytics. BI is a retrieval problem, not a computational problem. Five basic things you can do with analytics: ▪ Prediction – what is most likely to happen? ▪ Estimation – what’s the future value of a variable? ▪ Description – what relationships exist in the data? ▪ Simulation – what could happen? ▪ Prescription – what should you do?
  • 33. Most people do not need special technology. Chart axes: Number of people; Bigness of data. The distribution of data size is about normal, yet these guys set the tone of the market today.
  • 34. Analytics: This is really raw data under storage. Chart axes: Number of jobs; Bigness of data. Microsoft study of 174,000 analytic jobs in their cluster: median size ???
  • 35. Working data for analytics most often not big. Chart axes: Number of jobs; Smallness of data. Chart label: 14 GB.
  • 36. An (overly) Simple Division of the Problem Space. Axes: Computation (little to lots); Data volume (little to lots). ▪ Big analytics, little data: specialized computing, modeling problems: supercomputing, GPUs ▪ Big analytics, big data: complex math over large data volumes requires shared nothing architectures ▪ Little analytics, little data: the entry point; SAS, SMP databases, even OLAP cubes can work ▪ Little analytics, big data: the BI/DW space, for the most part, with work done in databases
  • 37. What makes data “big”? ▪ Very large amounts ▪ Hierarchical structures ▪ Nested structures ▪ Linked structures ▪ Encoded values ▪ Non‐standard (for a database) types ▪ Deep structure ▪ Human authored text. “Big” is better off being defined as “complex” or “hard to manage”.
  • 38. Categorizing the measurement data we collect. The convenient data is the transactional data. ▪ Goes in the DW and is used, even if it isn’t the right measurement. The inconvenient data is observational data. ▪ It’s not neat, clean, or designed into most systems of operation. The difficult and misleading data is declarative data. ▪ What people say and what they do require ground truth. We need an architecture that supports all three categories.
  • 39. Transactions vs “big data”: the classic example of “structured data”. Transaction data includes: ▪ quantification details (date, value, count) ▪ reference data for explanation (product, customer, account) ▪ lots of meaningful information. Reference data is usually shared across the organization, hence its importance. There are two parts: ▪ identifier to uniquely identify the subject ▪ descriptive attributes with common or standardized value domains. Diagram labels: Transaction details; Reference data.
  • 40. Today it’s different data: observations, not transactions. Sensor data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
  • 41. Big data as a type of data: Transactions vs. Events. Transactions: ▪ Each one is valuable ▪ Mutable ▪ The elements of a transaction can be aggregated easily ▪ A set of transactions does not usually have important ordering or dependency. Events: ▪ A single event often has no value, e.g. what is the value of one click in a series? Some events are extremely valuable, but this is only detectable within the context of other events. ▪ Elements of events are often not easily aggregated ▪ A set of events usually has a natural order and dependencies ▪ Immutable
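The ordering contrast between transactions and events can be sketched in a few lines. This is an illustration under stated assumptions, not the author's code: the amounts, the click events, and the helper `click_path` are all hypothetical.

```python
import random

# Hypothetical transaction amounts in cents: the total is the same in any order.
transactions = [1999, 500, 4250, 725]
shuffled = transactions[:]
random.Random(0).shuffle(shuffled)
assert sum(transactions) == sum(shuffled)

# Hypothetical click events for one session, as (sequence number, page),
# arriving out of order.
events = [(3, "checkout"), (1, "home"), (2, "product")]


def click_path(session_events):
    """Restore the natural order of the events and return the page path."""
    return [page for _, page in sorted(session_events)]


print(click_path(events))
```

The transaction total survives shuffling, while the click path only means something once the events are put back in sequence.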
  • 42. Example “big data”: Web tracking data
      USER_ID: 301212631165031
      SESSION_ID: 590387153892659
      VISIT_DATE: 1/10/2010 0:00
      SESSION_START_DATE: 1:41:44 AM
      PAGE_VIEW_DATE: 1/10/2010 9:59
      DESTINATION_URL: https://www.phisherking.com/gifts/store/LogonForm?mmc=link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐1&storeId=1055&URL=BECGiftListItemDisplay
      REFERRAL_NAME: Google.com
      REFERRAL_URL: http://www.google.com/search?sourceid=navclient&aq=0h&oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
      PAGE_ID: PROD_24259_CARD
      REL_PRODUCTS: PROD_24654_CARD, PROD_3648_FLOWERS
      SITE_LOCATION_NAME: VALENTINE'S DAY MICROSITE
      SITE_LOCATION_ID: SHOP‐BY‐HOLIDAY VALENTINES DAY
      IP_ADDRESS: 67.189.110.179
      BROWSER_OS_NAME: MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
  • 43. Web tracking data has a nested structure
      USER_ID: 301212631165031
      SESSION_ID: 590387153892659
      VISIT_DATE: 1/10/2010 0:00
      SESSION_START_DATE: 1:41:44 AM
      PAGE_VIEW_DATE: 1/10/2010 9:59
      DESTINATION_URL: https://www.phisherking.com/gifts/store/LogonForm?mmc=link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐1&storeId=1055&URL=BECGiftListItemDisplay
      REFERRAL_NAME: Direct
      REFERRAL_URL: ‐
      PAGE_ID: PROD_24259_CARD
      REL_PRODUCTS: PROD_24654_CARD, PROD_3648_FLOWERS
      SITE_LOCATION_NAME: VALENTINE'S DAY MICROSITE
      SITE_LOCATION_ID: SHOP‐BY‐HOLIDAY VALENTINES DAY
      IP_ADDRESS: 67.189.110.179
      BROWSER_OS_NAME: MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
      “Unstructured” data embedded in the logged message: complex strings.
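One way to see the nested structure is to unpack one of the complex strings. The sketch below uses Python's standard `urllib.parse` on the Google REFERRAL_URL value from the web tracking example. The variable names are illustrative, and the URL is typed with the line-wrap spaces of the transcript removed, which is an assumption about the original value.

```python
from urllib.parse import parse_qs, urlsplit

# REFERRAL_URL from the web tracking example (wrap spaces removed).
referral_url = (
    "http://www.google.com/search?sourceid=navclient&aq=0h"
    "&oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387"
    "&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s"
)

parts = urlsplit(referral_url)
params = parse_qs(parts.query)  # each key maps to a list of decoded values

referrer_host = parts.netloc
search_terms = params.get("q", [""])[0]

print(referrer_host, "|", search_terms)
```

A single logged column therefore carries several fields of its own (referring host, search terms, client identifiers) that a flat table schema does not expose.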
  • 44. The missing ingredient from most big data
  • 45. The creation, flow and use of data is different for transactions and machine‐generated events. Diagram labels: Data entry, Extract, Cleanse, Load, Use, Store, Transactions, MDM, Generate, Store, Use, Use, Cleanse, Program, Capture. Captions: This runs at human speed. This runs at machine speed, with higher latency feedback cycles.
  • 47. You can store this data in an RDBMS, but…
  • 48. Example data: Twitter Message API Payload. Looks like: This is really just a record format much like a DB row. Datetime, userID, name, location, description, message, message metadata, etc. But it’s in JSON or XML.
  • 49. A tweet has lots of fields, but one important one. Example: “@markmadsen Check out: From #MongoDB to #Cassandra: Why The Atlas Platform Is Migrating http://owl.li/cvxFK”. The payload is free text but has other elements: ‘To’ username, hashtag, hashtag, URL. From these things you likely want to generate or link to reference data.
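As a rough sketch of pulling those elements out of the free-text payload, the snippet below applies simple regular expressions to the example tweet. The patterns are simplifying assumptions and do not follow Twitter's actual entity-parsing rules.

```python
import re

tweet = (
    "@markmadsen Check out: From #MongoDB to #Cassandra: "
    "Why The Atlas Platform Is Migrating http://owl.li/cvxFK"
)

# Simplified patterns (assumptions): word characters after @ or #, and
# anything up to the next whitespace after http:// or https://.
mentions = re.findall(r"@(\w+)", tweet)
hashtags = re.findall(r"#(\w+)", tweet)
urls = re.findall(r"https?://\S+", tweet)

print(mentions, hashtags, urls)
```

Each extracted value is a candidate key into reference data: a user record, a topic, or a resolved link.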
  • 50. Internal payload elements form a new graph. The @elements point to other records and create a deeply linked structure. You have to assemble the linked structure to see what’s really there, which means repeatedly scanning some or all of the data. The derived pattern is interesting data, sometimes more than the individual messages.
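A minimal sketch of assembling that linked structure follows: it scans every message once and records who mentions whom. The messages, user names, and the helper `mention_graph` are hypothetical, and the @ pattern is a simplification.

```python
import re
from collections import defaultdict

# Hypothetical (author, text) messages, not real data.
messages = [
    ("alice", "@bob have you seen this? cc @carol"),
    ("bob", "@alice yes, thanks"),
    ("carol", "@alice @bob interesting"),
]


def mention_graph(msgs):
    """Build a directed graph: author -> set of users the author mentioned."""
    graph = defaultdict(set)
    for author, text in msgs:
        for target in re.findall(r"@(\w+)", text):
            graph[author].add(target)
    return dict(graph)


graph = mention_graph(messages)
print(graph)
```

The graph is derived data: it exists in no single message and has to be rebuilt or updated as new messages arrive.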
  • 51. There are many patterns in the data. Follower / following networks are easy – they are explicit and independent of the events. Community detection requires looking at patterns of @ communication in addition to follow relationships. What do you do with these after discovery? Diagram labels: Follower network; Conversational communities.
  • 52. More data: patterns emerge from lots of event data. Patterns emerge from the underlying structure of the entire dataset. The patterns are more interesting than sums and counts of the events. Web paths: clicks in a session as network node traversal. Email: traffic analysis producing a network. The event stream is a source for analysis, generating another set of data that is the source for different analysis.
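The web-path idea can be sketched by treating each page as a node and each consecutive pair of clicks in a session as an edge. The sessions and the helper `transition_counts` below are hypothetical illustrations.

```python
from collections import Counter

# Hypothetical sessions: ordered page views per session id.
sessions = {
    "s1": ["home", "search", "product", "cart"],
    "s2": ["home", "product", "cart"],
    "s3": ["search", "product"],
}


def transition_counts(paths):
    """Count page-to-page transitions (graph edges) across all sessions."""
    edges = Counter()
    for pages in paths.values():
        edges.update(zip(pages, pages[1:]))  # consecutive click pairs
    return edges


edges = transition_counts(sessions)
for (src, dst), count in edges.most_common():
    print(f"{src} -> {dst}: {count}")
```

The edge counts form a second data set derived from the event stream, which can then feed a different analysis such as finding common paths to the cart.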
  • 53. Big changes for data warehousing workloads. The results of analytic processing can, and often do, feed back into the system from which they originate. Much of the data is being read, written and processed in real time. Our design point was not changing tables and ephemeral patterns.
  • 55. THE BIG CHANGE ISN’T TECHNOLOGY, IT’S ARCHITECTURE
  • 56. There are really three workloads to consider, not two: 1. Operational: OLTP systems 2. Analytic: OLAP systems 3. Processing: Computational systems. Unit of focus: 1. Transaction 2. Query 3. Computation. Different problems require different platforms.
  • 57. Workloads (OLTP | BI | Analytics)
      Access: Read‐Write | Read‐only | Read‐mostly
      Predictability: Predictable | Unpredictable | Fixed path
      Selectivity: High | Low | Low
      Retrieval: Low | Low | High
      Latency: Milliseconds | < seconds | msecs to days
      Concurrency: Huge | Moderate | 1 to huge
      Model: 3NF, nested object | Dim, denorm | BWT
      Task size: Small | Large | Small to huge
  • 58. An idea promoted by big data vendors: these do exactly the same thing. One is a set of technologies. One is an architecture. Diagram label: Data Warehouse.
  • 59. Reality: Hadoop disaggregates the database. One of the key things Hadoop does is to separate the storage, execution and API layers of a database. This allows for processing flexibility, but it does not permit one to build a reliable, high performance database across the layers. Diagram labels: Hadoop distributed filesystem (HDFS); Storage management; General-purpose data engines; Abstraction layers.
  • 60. A more specific look at layers and engines. Layer labels: Base storage; Storage mgmt; Engine; Abstraction layer / API; Language/API. Component labels: Hadoop distributed filesystem (HDFS); Storage (filetypes in HDFS, HBase, etc.); MapReduce, Tez, Spark, Cascading, Crunch, Pig, Hive, SparkSQL, Giraph, Impala, Drill, Presto, HBase, Phoenix, Kylin, native APIs; SQL, MDX. You can program to any layer you choose. Some projects already build on top of multiple others.
  • 61. An important Hadoop + cloud computing benefit: scalability is free – if your task requires 10 units of work, you can decide when you want results: 1 server for 10 units of time, or 10 servers for 1 unit of time. Cost is the same. Not true of the conventional IT model.
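The arithmetic behind "cost is the same" can be written out directly. This sketch assumes a perfectly parallel task and a flat hypothetical price per server per unit of time. Both are simplifications: real jobs carry coordination overhead and real pricing varies.

```python
WORK_UNITS = 10          # total work the task requires (from the slide)
RATE_PER_SERVER = 1.0    # hypothetical price per server per unit of time


def elapsed_time(servers: int) -> float:
    """Time to finish, assuming the work splits evenly across servers."""
    return WORK_UNITS / servers


def job_cost(servers: int) -> float:
    """Total cost = servers x elapsed time x rate."""
    return servers * elapsed_time(servers) * RATE_PER_SERVER


for servers in (1, 10):
    print(f"{servers:>2} server(s): time={elapsed_time(servers):g}, "
          f"cost={job_cost(servers):g}")
```

Under these assumptions only the waiting time changes with the number of servers. With owned, fixed-capacity hardware the ten servers are paid for whether or not they are in use.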
  • 62. Hadoop: a summary of the magic. 1. Provides both storage and complex processing as part of the same platform 2. Makes parallel programming more accessible 3. Schemaless (just files) therefore flexible 4. Inexpensive, reliable scale‐out 5. Potential for fast, scalable ingest 6. Cheaper than a database (for non‐database work). The bad stuff: ▪ Not great for mutable data ▪ Mostly file‐based sequential processing, or you store data many times in different datastores (locality is important) ▪ Minimal data management (today)
  • 63. The geography has been redefined. The box we created: ▪ not any data, rigidly typed data ▪ not any form, tabular rows and columns of typed data ▪ not any latency, persist what the DB can keep up with ▪ not any process, only queries. The digital world was diminished to only what’s inside the box until we forgot the box was there.
  • 64. Layered data architecture. The DW assumed a single flat model of data, DB in the center. New technology enables new ways to organize data: ▪ Raw – straight from the source ▪ Enhanced – cleaned, standardized ▪ Integrated – modeled, augmented, ~semi‐persistent ▪ Derived – analytic output, pattern based sets, ephemeral. Implies a new technology architecture and data modeling approaches.
  • 65. Decouple the Data Architecture. The core of the data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a data architecture that is not limiting: ▪ Deals with change more easily and at scale ▪ Does not enforce requirements and models up front ▪ Does not limit the format or structure of data ▪ Assumes the range of data latencies in and out, from streaming to one‐time bulk
  • 66. Deconstructing the data warehouse. There are three things happening in a DW: ▪ Data acquisition ▪ Data management ▪ Data delivery. Isolate them from one another. Diagram label: Data Warehouse.
  • 67. Decouple the data architecture by stage. Diagram labels: Collect, Integrate, Manage, Use; Transactions, Observations, Declarations. In reality, you are building three systems, not one. Treat them that way.
  • 68. Food supply chain: an analogy for data. Multiple contexts of use, differing quality levels.
  • 69. Data infrastructure is a platform ▪ Any data – structures, forms ▪ Any latency – in motion, at rest ▪ Any process – query, algorithm, transformation ▪ Any access – SQL, API, queue, file movement
  • 70. The evolution of DW is to a data platform, which means separating application from infrastructure. Diagram labels: Application layer: Deliver and use (BI, data extracts, analytics, applications); Infrastructure layer: Process and analyze, Store and manage; Raw data, Enhanced data, Derived data; Multiple ingest methods; Multiple access methods. The new model also encompasses data at rest and data in motion. The platform has to do more than serve queries; it has to be read-write.
  • 71. Away from “one throat to choke”, back to best of breed. “The extremely specialized nature of mass production raises the costs of product change and therefore slows down innovation.” ‐ Abernathy, 1978. Tight coupling leads to slow changes. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favored over monolithic stacks, single‐vendor architectures and tight coupling.
  • 72. Staff and skills are a problem in a build market. @BigDataBorat: Give man Hadoop cluster he gain insight for a day. Teach man build Hadoop cluster he soon leave for better job #bigdata
  • 73. Technology Adoption. Some people can’t resist getting the next new thing because it’s new and new is always better. Many IT organizations are like this, promoting a solution and hunting for the problem that matches it. Better to ask “What is the problem for which this technology is the answer?”
  • 74. Four core capabilities big data technologies add. 1. Unlimited scale of storage, processing ▪ Agility, faster turnaround for new data requests (but not a replacement for BI) ▪ Fewer staff to accomplish same goals 2. New data accessibility ▪ More data retained for longer period ▪ Access to data unused due to cost or processing limits ▪ Any digital information becomes usable data 3. Scalable realtime processing ▪ Brings ability to monitor and act on data as events occur 4. Arbitrary analytics ▪ Faster analysis ▪ Deeper analysis ▪ More broadly accessible analytics
  • 75. As a technology moves from emerging to commodity the nature of acquiring, using and managing it changes. Diagram phases: Innovation, Maturation, Saturation. Diagram columns: ▪ Generate options; Innovation; Novel practice; Maximize value ▪ Constrain choices; Adaptation; Good practice; Optimize ▪ Standardize / minimize choice; Acquisition; Best practice; Minimize costs. Also labeled: Agile & open source* methods; 6 Sigma & process methods.
  • 76. Today: repeating the experience of the 80s & 90s. This is the turbulent phase of the market as it goes through rapid development, then product and service changes. The Internet combined with commodity computing is forcing a new business and IT structural evolution, already underway. Diagram phases: Innovation, Maturation, Saturation.
  • 77. How we develop best practices: survival bias. We don’t need best practices, we need worst failures.
  • 78. Welcome to the big data revolution, more of an evolution. Be pragmatic, not dogmatic.
  • 79. CC Image Attributions. Thanks to the people who supplied the creative commons licensed images used in this presentation: acorn_blue.jpg ‐ http://www.flickr.com/photos/rogersmith/314324893/; wheat_field.jpg ‐ http://www.flickr.com/photos/ecstaticist/1120119742/; Phone dump ‐ Richard Barnes; ponies in field.jpg ‐ http://www.flickr.com/photos/bulle_de/352732514/; straw men.jpg ‐ http://www.flickr.com/photos/robinellis/6034919721/; text composition ‐ http://flickr.com/photos/candiedwomanire/60224567/; girl on cell tokyo .jpg ‐ http://flickr.com/photos/8024992@N06/986538717/; hamadan people mosaic.jpg ‐ http://flickr.com/photos/hamed/225868856/; twitter_network_bw.jpg ‐ http://www.flickr.com/photos/dr/2048034334/; klein_bottle_red.jpg ‐ http://flickr.com/photos/sveinhal/2081201200/; donuts_4_views.jpg ‐ http://www.flickr.com/photos/le_hibou/76718773/; subway dc metro ‐ http://flickr.com/photos/musaeum/509899161/
  • 81. About Third Nature. Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.