Big Data Warehousing
Meetup - April 8, 2014
Building a Big Data Warehouse
on the Cloud in 30 Minutes
Sponsored By:
Agenda
7:00 – 7:15: Networking (15 min)
Grab some food and drink... Make some friends.
7:15 – 7:35: Bob Eilbacher (20 min), VP Sales, Caserta Concepts
Welcome + Intro
About the Meetup, about Caserta Concepts + Swag
7:35 – 8:20: Elliott Cordo (45 min), Chief Architect, Caserta Concepts
Building a Big Data Warehouse on the Cloud
Live demo of AWS S3, EMR, and Redshift
8:20 – 8:40: Ben Sgro (20 min), Sr. Software Engineer, Simulmedia
Implementing Redis on the Cloud
An ultra-low-latency customer segmentation tool with AWS ElastiCache
8:40 – 9:00: Q&A (10 min), More Networking (10 min)
Tell us what you’re up to…
Gathering music brought to you by…
BIG DATA
a paranoid electronic music
project from the Internet,
formed out of a general
distrust for technology and
The Cloud (despite a
growing dependence on
them).
bigdata.fm
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like-minded
data nerds
• Opportunities to collaborate on exciting
projects
• Founded by Caserta Concepts
• Big Data Analytics, DW, BI Consulting
About the BDW Meetup
A BDW Meetup Milestone
Real-world Data Science
w/Claudia Perlich
• Date:
• Tuesday May 27, 2014, 7:00 PM
• Location:
• New Work City, Broadway & Canal
• Sponsor:
• Revolution Analytics
Next BDW Meetup
Caserta Concepts
• Technology innovation company with expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Digital Media
• Established in 2001:
• Increased growth year-over-year
• Industry-recognized workforce
• Consulting, Writing, Education
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
Innovation & Implementation
Listed as a Top 20 Most Promising
Data Analytics Consulting Companies
CIOReview looked at hundreds of data analytics consulting companies and shortlisted
the ones who are at the forefront of tackling the real analytics challenges.
A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial
board of CIOReview selected the Final 20.
Expertise & Offerings
Strategic Roadmap /
Assessment / Education /
Implementation
Data Warehousing/
ETL/Data Integration
BI/Visualization/
Analytics
Big Data
Analytics
Hadoop Distributions
Platforms/ETL
Analytics & BI
Caserta Partners
Client Portfolio
Finance, Healthcare
& Insurance
Retail/eCommerce
& Manufacturing
Education
& Services
Does this word cloud excite you?
Speak with us about our open positions: leslie@casertaconcepts.com
Join Our Network
Storm
Big Data Architect HBase
Cassandra
SWAG
Big Data is like water.
There is little point in debating how much there is.
It’s the flow and use that matters.
#gigaomlive
@dominiek
3/20/2014
Gigaom Structure Data
BUILDING A BIG DATA WAREHOUSE IN THE
CLOUD IN 30 MIN
Elliott Cordo
Chief Architect, Caserta Concepts
What is a Big Data Warehouse?
• An enterprise system providing reliable ad hoc analytics,
reporting, and decision support
• Large Scale – Big Data
• Not confined to the traditional dimensional model
Big Data Warehouse
• Data governance is still important!
• Data Quality
• Metadata: Naming, Lineage, etc.
Data cannot be governed until it is structured
Big Data Warehouse
Data Science Workspace
Data Lake – Integrated Sandbox
Landing – Source Data in “Full Fidelity”
Cloud
• Infrastructure is not fun
• Server procurement takes months
• Inability to handle growth
• Servers idling all day doing nothing
• Cloud to the rescue
• Unlimited cheap storage
• Provision new servers in minutes
• Use of elastic services, e.g. EMR!
• AWESOME for prototypes and POCs
About our sample data
• Consumer Yelp Ratings
• Generated based on a Kaggle dataset: 100 million rows
• Model looks something like this:
f_reviews (fact), joined to d_date, d_business, and d_user (dimensions)
So let’s get cooking
1. Create an EMR cluster: on-demand Hadoop
2. Provision a Redshift cluster: data warehouse
Redshift
• Massively Parallel Processing (MPP)
• Columnar DBs that present themselves as relational
• MPPs grew up in parallel with Hadoop
• Impala and HAWQ are MPPs themselves!
• OEM of Actian Matrix (formerly ParAccel)
• A modern MPP, clean, reliable, SCHEMA AGNOSTIC
Redshift is inexpensive
Enterprise-grade EDW at $1,000/TB per year
MPP Design Considerations
• JOINS
• Shuffle – data is large and distributed by key to servers
• Broadcast – data is small and gets distributed to all servers
• Collocated – all data needed for join is on same server
• Design Considerations for MPP
• Distribution Key
• Collocated joins
• Even distribution of work across the cluster
• Customer usually works well as a distribution key
• Sort Key
• Fastest scan operations
• Primary date field is usually best
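
The distribution-key and sort-key advice above can be sketched as Redshift DDL for the sample star schema. This is only an illustration, not the DDL used in the demo: every column name (`user_id`, `business_id`, `review_date`, `stars`, and so on) is an assumption, and the SQL is kept in Python strings so it can be handed to whatever database client you prefer.

```python
# Hypothetical DDL for the sample star schema; all column names are assumed.

# Fact table: distribute on the customer (user) key, sort on the primary date.
F_REVIEWS_DDL = """
CREATE TABLE f_reviews (
    review_id    BIGINT   NOT NULL,
    user_id      BIGINT   NOT NULL,
    business_id  BIGINT   NOT NULL,
    review_date  DATE     NOT NULL,
    stars        SMALLINT
)
DISTSTYLE KEY
DISTKEY (user_id)       -- rows for one user are stored on the same slice
SORTKEY (review_date);  -- date-range scans can skip unneeded blocks
"""

# Dimension sharing the fact table's distribution key: the join on user_id
# can then be collocated, with no shuffle or broadcast.
D_USER_DDL = """
CREATE TABLE d_user (
    user_id    BIGINT NOT NULL,
    user_name  VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (user_id);
"""

# Small dimension: keep a full copy on every node so joins never move data.
D_DATE_DDL = """
CREATE TABLE d_date (
    date_key        DATE NOT NULL,
    calendar_year   SMALLINT,
    calendar_month  SMALLINT
)
DISTSTYLE ALL;
"""


def all_ddl():
    """Return the statements in the order they would be applied."""
    return [D_DATE_DDL, D_USER_DDL, F_REVIEWS_DDL]
```

Choosing `user_id` as the distribution key only pays off if reviews are spread fairly evenly across users; a heavily skewed key would concentrate work on a few slices.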
ETL – Transform your data
• S3 is the ultimate staging ground
• Use EMR for the heavy lifting:
• Run your ETL Program and kill it when done!
• Pay just for processing.
• Pig, native MapReduce, streaming
• For the right use case Hive or Impala can be used for
ETL too (mainly for aggregates, summaries)
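
"Run your ETL program and kill it when done" can be sketched as a request for a transient EMR cluster. The function below only builds a dictionary and makes no AWS call. The key names follow boto3's EMR `run_job_flow` parameters, which is an assumption about tooling (the talk predates boto3). The release label, instance types, role names, step name, and the bare mapper invocation are all placeholders to adjust for your account.

```python
import posixpath


def transient_emr_request(name, mapper_uri, input_uri, output_uri, node_count=3):
    """Build keyword arguments describing a run-once EMR cluster."""
    mapper_name = posixpath.basename(mapper_uri)
    streaming_step = {
        "Name": "etl-transform",                 # hypothetical step name
        "ActionOnFailure": "TERMINATE_CLUSTER",  # do not keep paying after a failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", mapper_uri,
                "-mapper", mapper_name,
                "-numReduceTasks", "0",          # map-only job
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",            # assumed release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",   # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,
            # The cluster shuts itself down once the last step has finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [streaming_step],
        "JobFlowRole": "EMR_EC2_DefaultRole",    # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. Because the cluster does not stay alive without steps, you pay only while the ETL step is running.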
Smaller data – don’t need EMR?
• Python ETL on EC2 (On-Demand)
• Can later “graduate” to big data using Hadoop streaming
• Your favorite ETL tool is just fine too
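
One way to keep the "graduate later" option open is to write the Python ETL as a filter from standard input to standard output, which is the contract Hadoop streaming expects of a mapper. The sketch below assumes each input line is a JSON review record with `user_id`, `business_id`, `date`, and `stars` fields. Those field names are assumptions about the sample data, not something shown in the slides.

```python
import json
import sys


def transform(line):
    """Turn one JSON review record into a tab-separated row, or None to skip it."""
    try:
        rec = json.loads(line)
    except ValueError:
        return None  # drop malformed input rather than fail the whole job
    if not isinstance(rec, dict):
        return None
    fields = (
        rec.get("user_id"),
        rec.get("business_id"),
        rec.get("date"),
        rec.get("stars"),
    )
    if any(f is None for f in fields):
        return None  # incomplete record
    return "\t".join(str(f) for f in fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream records from stdin to stdout, one output row per valid record."""
    for line in stdin:
        row = transform(line)
        if row is not None:
            stdout.write(row + "\n")
```

On a single EC2 instance, a script that calls `main()` can be run as `python etl.py < reviews.json > reviews.tsv`. The same script could later serve as the mapper of a streaming job without changes to `transform`.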
Presentation Layer – Data Warehouse
How do you get your ETL data in?
• Hadoop distcp - High performance transfer of data from
S3 to HDFS
• Distributed COPY from S3 to Redshift
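
The S3-to-Redshift load can be sketched as a small helper that builds the `COPY` statement. `COPY ... FROM 's3://...'`, `IAM_ROLE`, `DELIMITER`, and `GZIP` are real Redshift options, but the table name, bucket, and role ARN used with it are placeholders, and the demo may well have used key-based credentials instead of a role. The helper interpolates its arguments directly, so it should only receive trusted values.

```python
def build_copy_sql(table, s3_prefix, iam_role_arn, delimiter="\t", gzip=False):
    """Return a Redshift COPY statement that loads every file under s3_prefix."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    # Redshift expects the two-character escape \t for a tab delimiter.
    escaped = "\\t" if delimiter == "\t" else delimiter
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"DELIMITER '{escaped}'",
    ]
    if gzip:
        parts.append("GZIP")  # input files are gzip-compressed
    return "\n".join(parts) + ";"
```

Splitting the ETL output into several files under one prefix lets Redshift load them in parallel across the cluster, which is what makes the copy "distributed".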
And how to orchestrate all of this?
• AWS Data Pipeline
• AWS CLI
• Build a driver program using modules like Boto (Python)
• Cron or external scheduler
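
A driver program can be as small as a loop that runs the pipeline steps in order and stops at the first failure. The sketch below is deliberately generic: in practice the callables would wrap Boto or AWS CLI calls, and the step names in the usage note are hypothetical.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Later steps depend on earlier ones, so do not continue.
            print(f"step {name!r} failed: {exc}")
            break
        completed.append(name)
    return completed
```

For example, `run_pipeline([("transform on EMR", run_emr_job), ("copy to Redshift", load_tables)])`, where `run_emr_job` and `load_tables` are functions you would write, gives a single entry point that cron or an external scheduler can invoke.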
Back to AWS
1. Apply Redshift DDL and load tables
2. Run some queries
elliott@casertaconcepts.com
