Making the leap to BI on Hadoop
Predictive Analytics & Business Insights 2014
November 19, 2014
David P. Mariani
CEO
AtScale, Inc.
THE TRUTH ABOUT DATA
“We think only 3% of the potentially useful data is tagged, and even less is analyzed.”
Source: IDC Predictions 2013: Big Data, IDC
“90% of the data in the world today has been created in the last two years”
Source: IBM
The Broken Promise
What We Wanted
Centralized Data Warehouse
What We Got
Data Marts
WHAT WE GOT
ETL + STAR SCHEMAS
Traditional Data Architecture
[Diagram components: input data, ETL, data warehouse, data marts (x3), query engine, analysis tools]
What’s Wrong with this Picture
• Highly complex
• Lots of people & skillsets
• Multiple copies of data
• Stale data
• Rigid schema
• Tough to change
Write Many | Structured | Early Transformation
It Takes an Army
BI Engineer: Design Reports/Dashboards
ETL Engineer: Automate Cube Load
BI Engineer: Design Cube
DBA: Automate Data Load
ETL Engineer: Write ETL Code
DBA: Create Tables
Data Warehouse Architect: Design Star Schema
SAN/NAS Engineer: Define Storage Architecture
Star Schema = Unnatural!
WHAT WE WANTED
SCHEMA ON DEMAND
Data Management Approaches
Traditional Approach: [Diagram components: input data, ETL, data warehouse, data marts (x3), query engine, analysis tools]
New Approach: [Diagram components: input data, Hadoop, analysis tools]
Time for a New Approach
VS
Write Once | Semi-Structured | Late Transformation
✔ ✔ ✔
Not This, That
BI Engineer: Run Queries/Create Reports
Hadoop Engineer: Create EXTERNAL Tables
Hadoop Engineer: Define location to store files
VS
BI Engineer: Design Reports/Dashboards
ETL Engineer: Automate Cube Load
BI Engineer: Design Cube
DBA: Automate Data Load
ETL Engineer: Write ETL Code
DBA: Create Tables
Data Warehouse Architect: Design Star Schema
SAN/NAS Engineer: Define Storage Architecture
Example: Key-Values
Example: JSON
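The two raw formats named on these slides can be given a schema when they are read rather than when they are loaded. A minimal sketch, assuming a hypothetical `key=value` log line and a hypothetical JSON event; the field names (`user`, `action`, `gold`, `items`) and the helper names are illustrative and do not come from the slides.

```python
import json


def parse_key_values(line, pair_sep="&", kv_sep="="):
    """Turn 'a=1&b=2' into {'a': '1', 'b': '2'}; keys are discovered at read time."""
    record = {}
    for pair in line.strip().split(pair_sep):
        if not pair:
            continue
        key, _, value = pair.partition(kv_sep)
        record[key] = value
    return record


def project(record, fields):
    """Apply a schema on read: pick the requested fields, None if absent."""
    return {field: record.get(field) for field in fields}


# Hypothetical raw lines, stored exactly as they arrived.
kv_line = "user=alice&action=buy_item&gold=450"
json_line = '{"user": "bob", "action": "buy_item", "gold": 300, "items": ["boots", "blade"]}'

kv_record = parse_key_values(kv_line)
json_record = json.loads(json_line)

# The same projection works for both formats; nothing was reshaped at load time.
rows = [project(r, ["user", "gold"]) for r in (kv_record, json_record)]
print(rows)
```

Note that key-value fields arrive as strings, while JSON keeps its own types, so any casting would also happen at read time.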
DEMO
MOBA Game Analytics
Demo: DOTA 2 – What the User Sees
Key Data Points: 5 vs. 5 players per match. Players choose ‘Heroes’, use ‘Items’ & earn ‘Gold’.
FOR THE DATA SCIENTISTS!
Demo: DOTA 2 – Raw Data (JSON)
Match Details | Player Details | Player Profile
As Easy As 1,2,3
BI Engineer: Run Queries/Create Reports
Hadoop Engineer: Create EXTERNAL Tables
Hadoop Engineer: Define location to store files
Demo: DOTA 2 – Use Case 1
Question: Who are the most popular heroes?
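One way to answer this question directly over nested match records is to count hero picks across every player in every match. This is a sketch only: the record layout (`players`, `hero`) and the sample data are assumptions for illustration, not the demo's actual JSON or query.

```python
from collections import Counter

# Hypothetical nested match records; a real match has 5 players per side.
matches = [
    {"match_id": 1, "players": [{"hero": "Pudge"}, {"hero": "Invoker"}, {"hero": "Sniper"}]},
    {"match_id": 2, "players": [{"hero": "Pudge"}, {"hero": "Sniper"}, {"hero": "Lina"}]},
    {"match_id": 3, "players": [{"hero": "Pudge"}, {"hero": "Lina"}, {"hero": "Axe"}]},
]

# Flatten the nested player arrays and count picks per hero.
hero_picks = Counter(
    player["hero"] for match in matches for player in match["players"]
)

for hero, picks in hero_picks.most_common(3):
    print(hero, picks)
```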
Demo: DOTA 2 – Use Case 2
Question: Which heroes have the highest win rate?
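Win rate per hero can be sketched as wins divided by games played, where a player wins if their side matches the match winner. The field names (`radiant_win`, `players`, `hero`, `team`), the sample data, and the `min_games` parameter are assumptions for illustration and are not taken from the demo.

```python
from collections import defaultdict

# Hypothetical nested match records, trimmed to one player per side.
matches = [
    {"radiant_win": True, "players": [
        {"hero": "Pudge", "team": "radiant"}, {"hero": "Lina", "team": "dire"}]},
    {"radiant_win": False, "players": [
        {"hero": "Pudge", "team": "radiant"}, {"hero": "Axe", "team": "dire"}]},
    {"radiant_win": True, "players": [
        {"hero": "Axe", "team": "radiant"}, {"hero": "Lina", "team": "dire"}]},
]


def hero_win_rates(matches, min_games=1):
    """Return {hero: wins / games} for heroes with at least min_games games."""
    games = defaultdict(int)
    wins = defaultdict(int)
    for match in matches:
        winner = "radiant" if match["radiant_win"] else "dire"
        for player in match["players"]:
            games[player["hero"]] += 1
            if player["team"] == winner:
                wins[player["hero"]] += 1
    return {hero: wins[hero] / n for hero, n in games.items() if n >= min_games}


rates = hero_win_rates(matches)
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

A `min_games` threshold keeps heroes with only a handful of games from dominating the ranking.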
Demo: DOTA 2 – Use Case 3
Question: What are the top 3 items associated with the best win rate?
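Because each player record can carry its own list of items, this question can be sketched without a separate item table: walk the nested lists, count games and wins per item, then rank. The field names (`radiant_win`, `team`, `items`), the sample data, and the tie-breaking rule are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical nested match records; each player holds a list of items.
matches = [
    {"radiant_win": True, "players": [
        {"team": "radiant", "items": ["Blink Dagger", "Power Treads"]},
        {"team": "dire", "items": ["Power Treads", "Bottle"]}]},
    {"radiant_win": False, "players": [
        {"team": "radiant", "items": ["Bottle"]},
        {"team": "dire", "items": ["Blink Dagger", "Black King Bar"]}]},
    {"radiant_win": True, "players": [
        {"team": "radiant", "items": ["Black King Bar", "Power Treads"]},
        {"team": "dire", "items": ["Bottle", "Blink Dagger"]}]},
]


def top_items(matches, n=3):
    """Rank items by win rate; ties broken by games played, then by name."""
    games = defaultdict(int)
    wins = defaultdict(int)
    for match in matches:
        winner = "radiant" if match["radiant_win"] else "dire"
        for player in match["players"]:
            # set() avoids counting an item twice for the same player.
            for item in set(player["items"]):
                games[item] += 1
                if player["team"] == winner:
                    wins[item] += 1
    rates = [(item, wins[item] / games[item], games[item]) for item in games]
    rates.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return [(item, rate) for item, rate, _ in rates[:n]]


best = top_items(matches, 3)
print(best)
```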
Practical Applications
Time Server Analysis (session data)
Affinity Analysis
Segmentation Analysis
Many to Many
NO JOINS = HORIZONTAL SCALE
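One possible reading of this slide: when a many-to-many relationship is stored inline as a nested list, each partition of the data can be aggregated on its own and the partial results merged afterwards, with no cross-partition join. A minimal sketch under that assumption; the partition layout, field names, and data are hypothetical.

```python
from collections import Counter

# Hypothetical partitions, e.g. files on different nodes. Each record carries
# its many-to-many relationship (player <-> items) inline as a nested list.
partitions = [
    [{"player": "a", "items": ["boots", "blade"]},
     {"player": "b", "items": ["boots"]}],
    [{"player": "c", "items": ["blade", "wand"]}],
]


def count_partition(records):
    """Aggregate one partition independently of all the others."""
    counts = Counter()
    for record in records:
        counts.update(record["items"])
    return counts


# Each partial count could be computed on a separate machine.
partials = [count_partition(records) for records in partitions]

# Merging partial counts is a simple sum, so adding partitions adds capacity.
total = sum(partials, Counter())
print(total)
```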
FOR THE ORDINARY HUMAN!
DEMO
Summary: The Do’s & Don’ts
Do | Don’t
Capture data “as is” | Pre-aggregate data
Apply schema on read | Force schema on load
Land new data on Hadoop | Land new data on relational DBs
Create a data warehouse | Create data marts
Leverage open source engines | Invest in proprietary databases
Business Intelligence Redefined

The Universal GTM - how we design GTM and dataLayer
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 

Making the Leap to BI on Hadoop, by Dave Mariani @ AtScale

  • 1. Making the leap to BI on Hadoop Predictive Analytics & Business Insights 2014 November 19, 2014 David P. Mariani CEO AtScale, Inc.
  • 2. THE TRUTH ABOUT DATA. “We think only 3% of the potentially useful data is tagged, and even less is analyzed.” Source: IDC Predictions 2013: Big Data, IDC. “90% of the data in the world today has been created in the last two years.” Source: IBM.
  • 3. The Broken Promise What We Wanted Centralized Data Warehouse
  • 5. WHAT WE GOT ETL + STAR SCHEMAS
  • 6. Traditional Data Architecture. Diagram labels: INPUT DATA, ETL, MART, MART, MART, QUERY ENGINE, ANALYSIS TOOLS, DATA WAREHOUSE.
  • 7. What’s Wrong with this Picture: Highly complex; Lots of people & skillsets; Multiple copies of data; Stale data; Rigid schema; Tough to change. Write Many, Structured, Early Transformation. Diagram labels: INPUT DATA, ETL, MART, MART, MART, QUERY ENGINE, ANALYSIS TOOLS, DATA WAREHOUSE.
  • 8. It Takes an Army. BI Engineer: Design Reports/Dashboards; ETL Engineer: Automate Cube Load; BI Engineer: Design Cube; DBA: Automate Data Load; ETL Engineer: Write ETL Code; DBA: Create Tables; Data Warehouse Architect: Design Star Schema; SAN/NAS Engineer: Define Storage Architecture.
  • 9. Star Schema = Unnatural!
  • 11. Data Management Approaches. Traditional Approach: INPUT DATA, ETL, MART, MART, MART, QUERY ENGINE, ANALYSIS TOOLS, DATA WAREHOUSE. New Approach: INPUT DATA, ANALYSIS TOOLS, HADOOP.
  • 12. Time for a New Approach. VS: Write Once ✔, Semi-Structured ✔, Late Transformation ✔
  • 13. Not This, That. BI Engineer: Run Queries/Create Reports; Hadoop Engineer: Create EXTERNAL Tables; Hadoop Engineer: Define location to store files. VS. BI Engineer: Design Reports/Dashboards; ETL Engineer: Automate Cube Load; BI Engineer: Design Cube; DBA: Automate Data Load; ETL Engineer: Write ETL Code; DBA: Create Tables; Data Warehouse Architect: Design Star Schema; SAN/NAS Engineer: Define Storage Architecture.
  • 17. Demo: DOTA 2 – What the User Sees. Key Data Points: 5 vs. 5 players per match. Players choose ‘Heroes’, use ‘Items’ & earn ‘Gold’.
  • 19. Demo: DOTA 2 – Raw Data (JSON): Match Details, Player Details, Player Profile
  • 20. As Easy As 1, 2, 3. BI Engineer: Run Queries/Create Reports; Hadoop Engineer: Create EXTERNAL Tables; Hadoop Engineer: Define location to store files.
  • 21. Demo: DOTA 2 – Use Case 1. Question: Who are the most popular heroes?
  • 22. Demo: DOTA 2 – Use Case 2. Question: Which heroes have the highest win rate?
  • 23. Demo: DOTA 2 – Use Case 3. Question: What are the top 3 items associated with the best win rate?
  • 24. Practical Applications: Time Server Analysis (session data), Affinity Analysis, Segmentation Analysis, Many to Many
  • 25. NO JOINS = HORIZONTAL SCALE
  • 28. DEMO
  • 29. Summary: The Do’s & Don’ts. Do: Capture data “as is”; Apply schema on read; Land new data on Hadoop; Create a data warehouse; Leverage open source engines. Don’t: Pre-aggregate data; Force schema on load; Land new data on relational DBs; Create data marts; Invest in proprietary databases.
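
The first two demo questions (most popular heroes, highest win rate) can be illustrated with a small schema-on-read sketch. This is not the code from the talk: the record layout and the field names (`winner`, `players`, `hero`, `team`) are hypothetical, and the sample uses one player per side rather than five to stay short. The raw JSON is kept "as is", and structure is applied only when the question is asked, with no joins across tables.

```python
import json
from collections import Counter

# Hypothetical raw match records, one JSON document per line, stored "as is".
# Field names are assumptions made for illustration only.
RAW_MATCHES = """\
{"match_id": 1, "winner": "radiant", "players": [{"hero": "Axe", "team": "radiant"}, {"hero": "Lina", "team": "dire"}]}
{"match_id": 2, "winner": "dire", "players": [{"hero": "Axe", "team": "radiant"}, {"hero": "Pudge", "team": "dire"}]}
{"match_id": 3, "winner": "radiant", "players": [{"hero": "Lina", "team": "radiant"}, {"hero": "Axe", "team": "dire"}]}
"""


def hero_stats(lines):
    """Count picks and wins per hero by walking each nested match document."""
    picks = Counter()
    wins = Counter()
    for line in lines:
        if not line.strip():
            continue
        match = json.loads(line)  # structure is interpreted at read time
        for player in match["players"]:
            picks[player["hero"]] += 1
            if player["team"] == match["winner"]:
                wins[player["hero"]] += 1
    return picks, wins


picks, wins = hero_stats(RAW_MATCHES.splitlines())

# Use case 1: most popular hero (highest pick count).
most_popular = picks.most_common(1)[0][0]

# Use case 2: win rate per hero (wins divided by picks).
win_rate = {hero: wins[hero] / picks[hero] for hero in picks}

print(most_popular)
print(sorted(win_rate.items(), key=lambda kv: kv[1], reverse=True))
```

Because each match document already nests its players, every record can be processed independently, which is the property the "NO JOINS = HORIZONTAL SCALE" slide points to.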