HADOOP DATA
RESERVOIR
REQUIREMENTS AND
SOLUTIONS
Peter Schlampp, VP Products
Outline
•  What is the Hadoop Data Reservoir (HDR)?

•  Requirements and Solutions

•  Hadoop Data Reservoir in Practice

•  Demo

•  Q&A
What is the Hadoop Data Reservoir (HDR)?
•  Central Hadoop cluster for the enterprise
•  Serves as the Storage and the Source of data for
   self-service business analytics
•  Provides Processing for data preparation and
   advanced analytics
    The Hadoop Data Reservoir
eliminates data silos, reduces costs,
and makes business analytics agile.
HDR is Not a Replacement for the EDW

 •  EDWs require upfront planning
 •  EDWs require major ongoing IT
    maintenance and staffing
 •  EDWs are not self-service
HDR Origin: Interviews with Enterprise IT
•  Platfora interviewed over 200
   enterprise IT professionals working
   with Hadoop
•  Summer 2011 through early 2012
•  Topic of interview: challenges using
   Hadoop for business intelligence &
   analytics
What is Your Vision for Hadoop? 
•  “I want Hadoop to be the central repository of all the data people
   need.”

•  “We shouldn’t have to plan too much before we store data.”

•  “Cost should only be a minor factor in how long we kept data around.”

•  “I want to give everyone access to the data and break down the existing
   silos. But it needs to be secure.”

•  “IT would not have to be involved in day-to-day management.”
Out on a Limb
“I’m a bit out on a limb here. I pushed to use Hadoop to collect data that we
were dropping before. But now it’s taking way more time to make use of it
than I expected.”
The Missing Link to HDR
[Diagram: flexible, “software defined” data marts (automatic, fast, iterative, unbounded) shown between the Hadoop Data Reservoir and web-based Business Intelligence]
Performance, Self-Service, and Security
REQUIREMENT 1:
 PERFORMANCE
Queries must be consistently fast
•  Modern BI applications are driving more and more queries all the time.
•  A single HDR user should not be able to impact other users simply because
   they asked the wrong question.
[Figure: Modern Data Discovery BI. Each move results in a new query.]
“We’re addicted to sub-second. If it takes longer
than that for any reason, something is wrong.”
Most Queries are Straightforward, but Big
“What’s the trend of female visitors clicking on ads on the autos channel over time?”
•  Sources: Traffic Logs, Advertising Logs, User Demographics
•  Big Hadoop cluster
   •  2.4 PB total
   •  700M records/day
   •  400 GB/day
   •  2B user records
•  Processing the answer could touch 10s of billions of records.
[Chart: Clicks by Months]
Solution: Aggregate Tables Stored In-Memory
•  Pre-calculated summary
   tables, summarizing data to a
   coarser grain
  •  Dramatically reduces data
     required to answer a question
  •  Keeps redundant processing
     off the batch system (Hadoop)
•  Keep summary data in
   memory to provide sub-
   second access
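The summary-table idea above can be sketched in Python. This is only an illustration: the record layout (day, channel, gender, clicks) and the function names are assumptions, not something shown in the slides. Raw records are rolled up to a coarser (month, channel, gender) grain once, and trend questions are then answered from the small in-memory table.

```python
from collections import defaultdict

# Hypothetical raw click records: (day, channel, gender, clicks).
RAW = [
    ("2013-01-03", "autos", "F", 2),
    ("2013-01-17", "autos", "F", 1),
    ("2013-01-20", "autos", "M", 4),
    ("2013-02-02", "autos", "F", 5),
    ("2013-02-09", "news", "F", 3),
]

def build_aggregate(records):
    """Summarize daily records to a coarser (month, channel, gender) grain."""
    table = defaultdict(int)
    for day, channel, gender, clicks in records:
        month = day[:7]  # "YYYY-MM"
        table[(month, channel, gender)] += clicks
    return dict(table)

def monthly_trend(aggregate, channel, gender):
    """Answer a trend question from the in-memory summary only."""
    rows = [(m, n) for (m, c, g), n in aggregate.items()
            if c == channel and g == gender]
    return sorted(rows)

# Built once by the batch side, then kept in memory for interactive queries.
AGG = build_aggregate(RAW)
print(monthly_trend(AGG, "autos", "F"))
```

The query touches only the rows of the summary table, so its cost depends on the number of distinct (month, channel, gender) combinations rather than on the raw record count.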
REQUIREMENT 2:
 SELF-SERVICE

Finding Data in the Reservoir
[Diagram: datasets in the reservoir: Sales, Shipments, Sentiment Info, Web Logs, Customer Demographics, Interactions]
•  Hadoop Distributed File System (HDFS) is organized like other common
   file systems: a directory structure
•  Datasets in HDFS could be a single file or 10,000+ files, commonly
   organized by directory
Business users must be able to find data to answer their questions
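As a rough illustration of treating directories as datasets, the Python sketch below walks a directory tree and lists every directory that directly holds files as one dataset. It runs against a temporary local directory as a stand-in for HDFS, and the layout and names (sales, weblogs, part-00000) are hypothetical.

```python
import os
import tempfile

def discover_datasets(root):
    """Treat each directory that directly contains files as one dataset."""
    catalog = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            name = os.path.relpath(dirpath, root).replace(os.sep, "/")
            catalog[name] = len(filenames)
    return catalog

def make_demo_tree(root):
    """Create a tiny stand-in directory tree (hypothetical layout)."""
    layout = {"sales": 2, "weblogs/2013-01": 3, "weblogs/2013-02": 1}
    for rel, count in layout.items():
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(path)
        for i in range(count):
            with open(os.path.join(path, f"part-{i:05d}"), "w") as fh:
                fh.write("")

with tempfile.TemporaryDirectory() as tmp:
    make_demo_tree(tmp)
    CATALOG = discover_datasets(tmp)

# Dataset name -> number of files behind it.
print(sorted(CATALOG.items()))
```

A catalog like this gives business users dataset names to browse instead of raw file paths.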
Aggregations Must Be Fully Automatic
•  Building aggregate tables requires planning and up-
   front decisions
   •  Must choose the metrics, dimensions, granularity
   •  In practice, this is an iterative process, and the first
      attempt is usually wrong
•  Aggregate tables must be maintained
   •  Each time new data arrives
   •  Sliding window tables (e.g., last 30 days): data in, data out
For HDR to be self-service, this must be automatic.
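The "data in, data out" maintenance of a sliding window table can be sketched as follows. This is a minimal Python illustration, assuming that days arrive in order and that each record is just a number to be summed; the class and method names are invented for the example.

```python
from datetime import date, timedelta

class SlidingWindowAggregate:
    """Keeps per-day totals for the most recent `window_days` days."""

    def __init__(self, window_days):
        self.window_days = window_days
        self.daily = {}  # day -> total for that day

    def add_day(self, day, records):
        """Data in: fold in a new day's records. Data out: drop expired days."""
        self.daily[day] = self.daily.get(day, 0) + sum(records)
        cutoff = day - timedelta(days=self.window_days)
        for old in [d for d in self.daily if d <= cutoff]:
            del self.daily[old]

    def total(self):
        """Total over the days currently inside the window."""
        return sum(self.daily.values())

# A 3-day window fed with five consecutive days (assumed to arrive in order).
WINDOW = SlidingWindowAggregate(window_days=3)
for offset, value in enumerate([1, 2, 3, 4, 5]):
    WINDOW.add_day(date(2013, 1, 1) + timedelta(days=offset), [value])

print(sorted(WINDOW.daily), WINDOW.total())
```

Because each new day triggers both the update and the expiry, nobody has to rebuild the table by hand when data arrives.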
Drilling Through the Aggregation
Netflow Example
•  Raw Data in Hadoop (slow)
   •  Milliseconds
   •  Source IP Address
   •  Destination IP Address
   •  Application
   •  Packets
   •  Bytes
   •  26B records/month
   •  400GB compressed
•  Aggregate Tables (fast)
   •  Hours, Days
   •  # of Machines
   •  # of Flows
   •  Total Flow Size (KB)
   •  Application
   •  100MB compressed
“What happened between 10:03-10:04am?”
Need to “drill through the aggregation” to get more detail,
or add dimensionality. And, it needs to be self-service.
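One way to picture drilling through is a router that answers from the aggregate when the question fits its grain and falls back to the raw data otherwise. The Python sketch below assumes an hourly aggregate and a hypothetical flow record of (timestamp, application, bytes); it is not a description of any particular product's behavior.

```python
from datetime import datetime

# Hypothetical raw flow records: (timestamp, application, bytes).
RAW_FLOWS = [
    (datetime(2013, 3, 1, 10, 3, 12), "http", 500),
    (datetime(2013, 3, 1, 10, 3, 48), "dns", 80),
    (datetime(2013, 3, 1, 10, 27, 5), "http", 1200),
    (datetime(2013, 3, 1, 11, 2, 0), "http", 300),
]

def build_hourly(flows):
    """Aggregate table at hour grain: hour -> total bytes."""
    hourly = {}
    for ts, _app, size in flows:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        hourly[hour] = hourly.get(hour, 0) + size
    return hourly

HOURLY = build_hourly(RAW_FLOWS)

def total_bytes(start, end):
    """Use the aggregate for whole-hour ranges; otherwise drill through to raw."""
    aligned = all(t.minute == 0 and t.second == 0 and t.microsecond == 0
                  for t in (start, end))
    if aligned:
        return "aggregate", sum(v for h, v in HOURLY.items() if start <= h < end)
    return "raw", sum(size for ts, _app, size in RAW_FLOWS if start <= ts < end)

# Whole hour: served from the aggregate table.
print(total_bytes(datetime(2013, 3, 1, 10), datetime(2013, 3, 1, 11)))
# One minute: finer than the aggregate grain, so the raw data is scanned.
print(total_bytes(datetime(2013, 3, 1, 10, 3), datetime(2013, 3, 1, 10, 4)))
```

The user asks the same kind of question in both cases; only the routing differs, which is what keeps the drill-through self-service.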
Augmenting Datasets
•  Users must be able to augment data with
   sources outside of the HDR
  •  E.g., market research or demographics


•  Commonly needs to be combined at the raw
   level, before data is aggregated
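To illustrate why the combination usually happens at the raw level, the Python sketch below joins an outside lookup onto raw events first and aggregates afterwards, so the new attribute is available as a dimension. The event layout, the segment lookup, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical raw events inside the reservoir: (user_id, clicks).
EVENTS = [("u1", 2), ("u2", 1), ("u1", 3), ("u3", 4)]

# Hypothetical outside source (for example, demographics): user_id -> segment.
SEGMENTS = {"u1": "urban", "u2": "rural"}

def augment(events, lookup, default="unknown"):
    """Attach the external attribute to every raw record."""
    return [(uid, clicks, lookup.get(uid, default)) for uid, clicks in events]

def clicks_by_segment(augmented):
    """Aggregate only after the join, so the new dimension is available."""
    totals = Counter()
    for _uid, clicks, segment in augmented:
        totals[segment] += clicks
    return dict(totals)

RESULT = clicks_by_segment(augment(EVENTS, SEGMENTS))
print(RESULT)
```

If the events had already been summed without the user id, the segment could no longer be attached, which is why the join comes first.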
REQUIREMENT 3:
   SECURITY

Modern Data Security Requirements
•  Hadoop provides:
   •  File and directory based permissions (like Unix)
   •  Secure authentication (via Kerberos)
•  However, enterprises require a finer level of data
   security control
   •  Datasets – could be one or many files, spanning directories
   •  Columns – datasets likely have many columns, with
      different security permissions
   •  Rows – can span many files, and directories
•  Solution must abstract file-level security and
   enforce a finer level of control
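A layer that abstracts file-level permissions might look roughly like the Python sketch below, which applies a per-group policy for columns and rows when a dataset is read. The policy format, group names, and sample rows are assumptions made up for this example.

```python
# Hypothetical policy: group -> allowed columns and a row predicate.
POLICIES = {
    "analysts": {
        "columns": {"region", "revenue"},
        "rows": lambda r: r["region"] == "west",
    },
    "admins": {
        "columns": {"region", "revenue", "customer"},
        "rows": lambda r: True,
    },
}

# Hypothetical dataset rows, however many files they span underneath.
ROWS = [
    {"region": "west", "revenue": 10, "customer": "acme"},
    {"region": "east", "revenue": 7, "customer": "globex"},
]

def read_dataset(rows, group):
    """Return only the rows and columns the group's policy allows."""
    policy = POLICIES.get(group)
    if policy is None:
        raise PermissionError(f"no access for group {group!r}")
    return [
        {k: v for k, v in row.items() if k in policy["columns"]}
        for row in rows
        if policy["rows"](row)
    ]

print(read_dataset(ROWS, "analysts"))
```

Here the caller never sees files or directories; access is decided per dataset, column, and row, with groups that could come from a directory service.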
Strong and Secure; Collaborative Sharing
•  In a self-service model, security must be strong
   and clear
   •  End-users will need to understand what they can
      access and what they can’t
   •  Security administrators must be able to enforce
      security centrally, down to the raw data
•  As a centralized system, HDR must integrate
   with directory services for authentication and
   group membership
HADOOP DATA RESERVOIR
     IN PRACTICE

Platfora: Interest-Driven Pipeline™
[Diagram: flexible, “software defined” data marts (automatic, fast, iterative, unbounded) shown between the Hadoop Data Reservoir and web-based Business Intelligence]
Performance, Self-Service, and Security
Edmunds.com
Founded in 1966: “For the purpose of publishing new and used automotive pricing
guides to assist automobile buyers”
Online Innovators:
•  First auto information website
•  True Market Value®, True Cost to Own®, and My Car Match
•  Beta participant since January 2013
•  Moved to Hadoop because of explosive data growth and promise of agility
   •  Web, mobile, visitor demographic data
•  Use Case: optimize the matching of visitors with the cars they are looking for
   •  Correlating browsers with the cars they are actually buying
•  Platfora has made big data accessible to the business
   •  Increased access from 5 to 50 users
   •  Decreased time to value from months to hours
“Before, if we wanted access to Hadoop data, we wouldn’t even try.
With Platfora our analysts can access anything they need.”
DEMO


Introducing Platfora’s Integrated Platform
•  Vizboard: Web-based Business Intelligence Application
   +
•  Lens: Scale-out, In-Memory Data Mart & Processing Engine
   +
•  Dataset: Automated Hadoop Data Refinery
Powerful Closed-loop Analysis of Big Data
Summary
•  The Hadoop Data Reservoir vision is driven from
   requirements of enterprise Hadoop users
•  HDR eliminates data silos, reduces costs, and
   makes business analytics agile
•  To make HDR a reality, it needs to provide:
  •  Performance
  •  Self-service
  •  Security
Hadoop Data Reservoir Webinar

 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 

Dernier (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Hadoop Data Reservoir Webinar

  • 3. Outline •  What is the Hadoop Data Reservoir (HDR)? •  Requirements and Solutions •  Hadoop Data Reservoir in Practice •  Demo •  Q&A
  • 4. What is the Hadoop Data Reservoir (HDR)? •  Central Hadoop cluster for the enterprise •  Serves as the Storage and the Source of data for self-service business analytics •  Provides Processing for data preparation and advanced analytics The Hadoop Data Reservoir eliminates data silos, reduces costs, and makes business analytics agile.
  • 5. HDR is Not a Replacement for the EDW
  • 6. HDR is Not a Replacement for the EDW •  EDWs require upfront planning •  EDWs require major ongoing IT maintenance and staffing •  EDWs are not self-service
  • 7. HDR Origin: Interviews with Enterprise IT •  Platfora interviewed over 200 enterprise IT professionals working with Hadoop •  Summer 2011 through early 2012 •  Topic of interview: challenges using Hadoop for business intelligence & analytics
  • 8. What is Your Vision for Hadoop? •  “I want Hadoop to be the central repository of all the data people need.” •  “We shouldn’t have to plan too much before we store data.” •  “Cost should only be a minor factor in how long we kept data around.” •  “I want to give everyone access to the data and break down the existing silos. But it needs to be secure.” •  “IT would not have to be involved in day-to-day management.”
  • 9. Out on a Limb “I’m a bit out on a limb here. I pushed to use Hadoop to collect data that we were dropping before. But now it’s taking way more time to make use of it than I expected.”
  • 10. The Missing Link to HDR [Diagram: Hadoop Data Reservoir; flexible, “software defined” data marts; web-based business intelligence. Labels: Automatic / Fast / Iterative; Unbounded.] Performance, Self-Service, and Security
  • 12. Queries must be consistently fast •  Modern BI applications are driving more and more queries all the time. •  A single HDR user should not be able to impact other users simply because they asked the wrong question. •  Modern data discovery BI: each move results in a new query. •  “We’re addicted to sub-second. If it takes longer than that for any reason, something is wrong.”
  • 13. Most Queries are Straightforward, but Big •  “What’s the trend of female visitors clicking on ads on the autos channel over time?” •  Sources: Traffic Logs, Advertising Logs, Clicks, User Demographics •  Big Hadoop cluster; figures shown: months of data, 2.4 PB total, 700M records/day, 400 GB/day, 2B user records •  Processing the answer could touch 10s of billions of records.
  • 14. Solution: Aggregate Tables Stored In-Memory •  Pre-calculated summary tables, summarizing data to a coarser grain •  Dramatically reduces data required to answer a question •  Keeps redundant processing off the batch system (Hadoop) •  Keep summary data in memory to provide sub-second access
  • 16. Finding Data in the Reservoir •  Hadoop Distributed File System (HDFS) is organized like other common file systems: a directory structure •  Datasets in HDFS could be a single file or 10,000+ files, commonly organized by directory •  Business users must be able to find data to answer their questions [Diagram labels: Sales; Shipments; Sentiment; Web Logs; Info; Customer Interactions; Demographics]
  • 17. Aggregations Must Be Fully Automatic •  Building aggregate tables requires planning and up-front decisions •  Must choose the metrics, dimensions, granularity •  In practice, this is an iterative process, and the first attempt is usually wrong •  Aggregate tables must be maintained •  Each time new data arrives •  Sliding window tables (e.g. last 30 days): data in, data out For HDR to be self-service, this must be automatic.
  • 18. Drilling Through the Aggregation •  Netflow example •  Raw data in Hadoop: millisecond grain; Source IP Address, Destination IP Address, Application, Packets, Bytes; 26B records/month; 400GB compressed; slow •  Aggregate tables: hour and day grain; # of Machines, # of Flows, Total Flow Size (KB), Application; 100MB compressed; fast •  “What happened between 10:03-10:04am?” •  Need to “drill through the aggregation” to get more detail, or add dimensionality. And, it needs to be self-service.
  • 19. Augmenting Datasets •  Users must be able to augment data with sources outside of the HDR •  E.g. market research or demographics •  Commonly needs to be combined at the raw level, before data is aggregated
  • 20. REQUIREMENT 3: SECURITY
  • 21. Modern Data Security Requirements •  Hadoop provides: •  File and directory based permissions (like Unix) •  Secure authentication (via Kerberos) •  However, enterprises require a finer level of data security control •  Datasets – could be one or many files, spanning directories •  Columns – datasets likely have many columns, with different security permissions •  Rows – can span many files, and directories •  Solution must abstract file-level security and enforce a finer level of control
  • 22. Strong and Secure; Collaborative Sharing •  In a self-service model, security must be strong and clear •  End-users will need to understand what they can access and what they can’t •  Security administrators must be able to enforce security centrally, down to the raw data •  As a centralized system, HDR must integrate with directory services for authentication and group membership
  • 23. HADOOP DATA RESERVOIR IN PRACTICE
  • 24. Platfora: Interest-Driven Pipeline™ [Diagram: Hadoop Data Reservoir; flexible, “software defined” data marts; web-based business intelligence. Labels: Automatic / Fast / Iterative; Unbounded.] Performance, Self-Service, and Security
  • 25. Edmunds.com •  Beta participant since January 2013 •  Moved to Hadoop because of explosive data growth and promise of agility •  Web, mobile, visitor demographic data •  Use Case: optimize the matching of visitors with the cars they are looking for •  Correlating browsers with the cars they are actually buying •  Platfora has made big data accessible to the business •  Increased access from 5 to 50 users •  Decreased time to value from months to hours [Sidebar] Founded in 1966: “For the purpose of publishing new and used automotive pricing guides to assist automobile buyers” Online Innovators: •  First auto information website •  True Market Value®, True Cost to Own®, and My Car Match “Before, if we wanted access to Hadoop data, we wouldn’t even try. With Platfora our analysts can access anything they need.”
  • 26. DEMO
  • 27. Introducing Platfora’s Integrated Platform •  Vizboard: Web-based Business Intelligence Application + Lens: Scale-out, In-Memory Data Mart & Processing Engine + Dataset: Automated Hadoop Data Refinery •  Powerful Closed-loop Analysis of Big Data
  • 28. Summary •  The Hadoop Data Reservoir vision is driven by the requirements of enterprise Hadoop users •  HDR eliminates data silos, reduces costs, and makes business analytics agile •  To make HDR a reality, it needs to provide: •  Performance •  Self-service •  Security
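
To make the aggregate-table idea from slides 14 and 17 concrete, here is a minimal Python sketch. It is not Platfora's implementation: the record layout, the `SlidingAggregate` class, and the sample data are all hypothetical. It shows raw click records being summarized to a coarser (day, channel, gender) grain, a trend question being answered from the in-memory summary alone, and a sliding window being maintained as data arrives and ages out.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical raw click records: (day, channel, gender, clicks).
RAW_CLICKS = [
    (date(2013, 3, 1), "autos", "F", 3),
    (date(2013, 3, 1), "autos", "M", 5),
    (date(2013, 3, 1), "autos", "F", 2),
    (date(2013, 3, 2), "autos", "F", 4),
    (date(2013, 3, 2), "sports", "F", 7),
    (date(2013, 3, 3), "autos", "F", 6),
]


class SlidingAggregate:
    """In-memory summary keyed by (day, channel, gender) for the last N days."""

    def __init__(self, window_days):
        self.window_days = window_days
        self.totals = defaultdict(int)

    def add(self, records):
        """Data in: fold new raw records into the coarser-grained summary."""
        for day, channel, gender, clicks in records:
            self.totals[(day, channel, gender)] += clicks

    def evict(self, today):
        """Data out: drop summary rows that have slid out of the window."""
        cutoff = today - timedelta(days=self.window_days)
        for key in [k for k in self.totals if k[0] <= cutoff]:
            del self.totals[key]

    def trend(self, channel, gender):
        """Answer a trend question from the summary, without touching raw data."""
        rows = [
            (key[0], total)
            for key, total in self.totals.items()
            if key[1] == channel and key[2] == gender
        ]
        return sorted(rows)


if __name__ == "__main__":
    agg = SlidingAggregate(window_days=2)
    agg.add(RAW_CLICKS)
    print(agg.trend("autos", "F"))   # trend over every day currently held
    agg.evict(today=date(2013, 3, 3))
    print(agg.trend("autos", "F"))   # only days still inside the window
```

The summary holds one row per (day, channel, gender) combination regardless of how many raw records arrive, which is why it can stay in memory while the raw data stays in Hadoop. The manual choice of grain and dimensions here is exactly the up-front decision that slide 17 argues must be automated.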
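
The finer-grained security requirement from slide 21 (dataset, column, and row level rather than file level) can be sketched in a similar way. This is an illustration under stated assumptions, not a description of Hadoop's or Platfora's access control: the `POLICIES` table, the group names, and `read_dataset` are hypothetical. In practice, group membership would come from the directory service mentioned on slide 22.

```python
# Hypothetical policy table: group -> (visible columns, row predicate).
POLICIES = {
    "marketing": ({"region", "channel", "clicks"}, lambda row: True),
    "emea_sales": (
        {"region", "customer", "revenue"},
        lambda row: row["region"] == "EMEA",
    ),
}

# Hypothetical logical dataset; in HDFS it could span many files and directories.
DATASET = [
    {"region": "EMEA", "channel": "autos", "customer": "c1", "clicks": 10, "revenue": 120.0},
    {"region": "APAC", "channel": "autos", "customer": "c2", "clicks": 7, "revenue": 80.0},
]


def read_dataset(rows, group):
    """Return only the rows and columns the given group is allowed to see."""
    if group not in POLICIES:
        raise PermissionError(f"group {group!r} has no access to this dataset")
    columns, row_allowed = POLICIES[group]
    return [
        {name: value for name, value in row.items() if name in columns}
        for row in rows
        if row_allowed(row)
    ]


if __name__ == "__main__":
    print(read_dataset(DATASET, "marketing"))
    print(read_dataset(DATASET, "emea_sales"))
```

Because the policy is attached to the logical dataset rather than to individual files, the same rule applies no matter how many files or directories hold the underlying data.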

Editor's Notes

  1. Introduction to me. What I do.
  2. First, let me explain what Hadoop is: Apache Hadoop is an open source software project inspired by Google's published work on MapReduce and the Google File System. It enables the distributed processing of large data sets across clusters of commodity servers. Hadoop provides an inexpensive, massively scalable solution to storing structured and unstructured raw data. The Hadoop Data Reservoir is a vision of what Hadoop can be for your enterprise.
  3. Before I go any further, I’d like to make sure I describe what the HDR is not. And sometimes this gets confused.
  4. Upfront planning: What data will we collect? How will the data be modeled to answer our business questions? How will we make access to the data fast for all of our users? (The questions are almost endless.) Ongoing maintenance: When will we refresh the data in the EDW? When datasets change, do we start over? Self-service: It should be obvious that EDWs are the domain of the IT team. But the vision of the HDR implies this is self-service. When we see what is required, we’ll see that this is no easy task.
  5. How did we come to the concept of the HDR? THE VISION CAME OUT THROUGH THE INTERVIEWS. Story: Developed a script of questions. People were at different places in their cycle. These were not data scientists. These were not people that had built their application on Hadoop (LinkedIn “People I know”). Cross section of industries: online media, financial services (banks and credit cards), federal government, retail, ecommerce, etc.
  6. But, reality was that none of these interviewees had reached the vision of the HDR. In fact, this is my image of the folks we were talking to. Talk about the enlightened IT user.
  7. What is the thing that goes in between the HDR and the end user? The challenges with the Hadoop Data Reservoir: Missing link between the massive amount of raw data stored in HDR and access for business users. Access has been self-limited to expert users who know data modeling and SQL. IT teams must perform expensive ad-hoc data extractions into existing infrastructure. Access to the data in HDR must be high performance, self-service, and secure.
  8. Should be about 1:40pm
  9. Despite data size, queries must be fast. It’s not that queries just needed to be fast, they needed to be consistently fast. Modern tools require the ability to ask successive questions. As the centralized resource, you have many, many questions being asked at once. The problem is when someone asks the wrong question in Hadoop, it impacts everyone.
  10. Explain the media company data. Desire to get a 360 view of the customer on their site. A straightforward question such as the one posed here potentially requires touching 10s of billions of records to process the answer.
  11. Highly scalable architecture. Merv Adrian a few months ago: “One of the biggest technical challenges for BI in the Big Data era is deciding what is in memory. Fractal Cache does that efficiently and automatically.” “The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.” – Ralph Kimball
  12. You’ve heard of “drilling-down” on something. Or even drilling up. Use example of Region -> States -> Metro -> City -> Stores. Back to the Netflow example of our interviewee. He had 26B rows of raw data in Hadoop, per month. We built aggregate tables which reduced the grain and removed dimensionality, and made our work really fast. But what happens if, in our self-service Data Reservoir, the end user wants to get more detail from the raw data in Hadoop? We can’t just query it directly, because it will take too long, and I won’t have a rich set of metrics or dimensions to use to answer questions. I need to be able to drill through the aggregation. And since HDR is self-service, I need to be able to do this without involving my colleagues in IT.
  13. Example of making sure data doesn’t get away.
  14. Platfora addressed the challenges of HDR with the interest-driven pipeline. Platfora software instantly transforms raw data in Hadoop into interactive, in-memory business intelligence. No ETL or data warehouse required. Platfora is a full stack of technology that spans from raw data in the Hadoop Data Reservoir all the way to BI and analytics for the end user. In the past this would require at least three separate products. Platfora is the first product to completely rebuild the traditional business analytics stack from the ground up.
  15. Platfora is made of three components, and none of these is more important than another; they all work together seamlessly. Platfora puts a very pretty face on Hadoop. Stunningly beautiful web-based BI interface. MAKES HADOOP DATA BEAUTIFUL. A scale-out in-memory data processing engine. MAKES HADOOP DATA FAST. Platfora drives Hadoop like a work engine. Automatically generating and pushing jobs to Hadoop to do the heavy lifting without needing experts. MAKES HADOOP USABLE. These components work together. Based on what the user needs in the BI layer, the Lenses are automatically refined, and the Hadoop data refinery does the heavy lifting without needing programming. Story: as we were working on the early designs for the product we thought about the old world that users were complaining about. Three separate layers, each with heavy expert intervention in between. It reminded us of the way phones used to work. Remember managing contacts? iPhone analogy. Vertically integrated.