Building a Hadoop Data Warehouse: Hadoop 101 for Enterprise Data Warehouse Professionals

Dr. Ralph Kimball will describe how Apache Hadoop complements and integrates effectively with the existing enterprise data warehouse. The Hadoop environment's revolutionary architectural advantages open the door to more data, and more kinds of data, than conventional RDBMSs can analyze, and also offer a whole series of new forms of integrated analysis.

Dr. Kimball will explain how Hadoop can be both:

- A destination data warehouse, and also
- An efficient staging and ETL source for an existing data warehouse

You will also learn how enterprise conformed dimensions can be used as the basis for integrating Hadoop and conventional data warehouses.

On-demand recording: http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/building-a-hadoop-data-warehouse-video.html

Published in: Technology


1. Building a Hadoop Data Warehouse
   Hadoop 101 for enterprise data warehouse professionals
   Ralph Kimball, April 2014
   © Ralph Kimball, Cloudera, 2014
2. The Data Warehouse Mission
   • Identify all possible enterprise data assets
   • Select those assets that have actionable content and can be accessed
   • Bring the data assets into a logically centralized “enterprise data warehouse”
   • Expose those data assets most effectively for decision making
3. Enormous RDBMS Legacy
   • Legacy RDBMSs have been spectacularly successful, and we will continue to use them.
   • Too successful… If all you have is a hammer, everything looks like a nail.
   • RDBMS dilemma: a new ocean of data types that are being monetized for strategic advantage
     - Unstructured, semi-structured, and machine data
     - Evolving schemas, just-in-time schemas
     - Links, images, genomes, geo-positions, log data …
4. Houston: we have a problem
   • Traditional RDBMSs cannot handle
     - The new data types
     - Extended analytic processing
     - Terabytes/hour loading with immediate query access
   • We want to use SQL and SQL-like languages, but we don’t want the RDBMS storage constraints…
   • The disruptive solution: Hadoop
5. The Data Warehouse Stack in Hadoop
   • Hadoop is an open source distributed storage and processing framework
   • To understand how data warehousing is different in Hadoop, start with this powerful architecture difference:
6. The Data Warehouse Stack in Hadoop
   • Hadoop is an open source distributed storage and processing framework
   • To understand how data warehousing is different in Hadoop, start with this powerful architecture difference:
7. Hadoop for Exploratory DW/BI
   • Query engines can access HDFS files before ETL
   • BI tools are the ultimate glue integrating the EDW
   [Architecture diagram: sources (transactions, free text, images, machines/sensors, links/networks) feed HDFS files (industry standard HW; fault tolerant; replicated; write once(!); agnostic content; scalable to “infinity”; purpose built for EXTREME I/O speeds; EDW overflow loaded with an ETL tool or Sqoop). HCatalog (metadata system table) lets all clients read the files. Hive SQL, Impala SQL, and other query engines (not databases!) serve the BI tools: Tableau, Bus Obj, Cognos, QlikView, others.]
8. Data Load to Query in One Step
   • Copy into HDFS with an ETL tool, Sqoop, or Flume into standard HDFS files (write once), registering metadata with HCatalog
   • Declare a query schema in Hive or Impala (no data copying or reloading)
   • Immediately launch familiar SQL queries: “exploratory BI” (see the sketch below)
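A minimal HiveQL sketch of the load-to-query pattern above. The table, columns, path, and delimiter are hypothetical; the point is that the schema is declared over files already sitting in HDFS, so no data moves:

    -- Files were copied into HDFS as-is (by an ETL tool, Sqoop, or Flume).
    -- This external table is only a schema declaration over those files;
    -- nothing is copied or reloaded. All names here are hypothetical.
    CREATE EXTERNAL TABLE web_clicks (
      click_time  TIMESTAMP,
      user_id     BIGINT,
      url         STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/landing/web_clicks';

    -- Exploratory BI: runnable the moment the declaration finishes.
    SELECT url, COUNT(*) AS hits
    FROM web_clicks
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;

The same declaration is visible to Impala as well, since both engines read the shared metastore schema.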
9. Typical Large Hadoop Cluster
   • 100 nodes (5 racks)
   • Each node
     - Dual hex core CPU running at 3 GHz
     - 64-378 GB of RAM
     - 24-36 TB disk storage (6-10 TB effective storage with default redundancy of 3x)
   • Overall cluster (!)
     - 6.4-37.8 TB of RAM (wow, think about this…)
     - Up to a PB of effective storage
     - Approximate fully loaded cost per TB: $1,000 +/-
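The cluster totals follow directly from the per-node numbers: 100 nodes x 64-378 GB gives 6.4-37.8 TB of cluster RAM, and 100 nodes x 6-10 TB of effective disk gives 0.6-1.0 PB, i.e., “up to a PB” of effective storage after the 3x redundancy.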
10. Committing to High Performance HDFS Files with Embedded Schemas
    [Architecture diagram: the same stack as slide 7 (sources feeding HDFS raw files: commodity HW; fault tolerant; replicated; append only(!); agnostic content; scalable to “infinity”; HCatalog metadata readable by all clients; Hive SQL and Impala SQL query engines, not databases; BI tools: Tableau, Bus Obj, Cognos, QlikView, others), now adding Parquet columnar files: a read optimized, schema defined column store purpose built for EXTREME I/O speeds; loaded with an ETL tool or Sqoop; EDW overflow.]
11. High Performance Data Warehouse Thread in Hadoop
    • Copy data from the raw HDFS file into a Parquet columnar file (see the sketch below)
      - Parquet is not a database: it’s a file accessible to multiple query and analysis apps
      - Parquet data can be updated and the schema modified
    • Query Parquet data with Hive or Impala
      - At least 10x performance gain over the simple raw file
    • Hive launches MapReduce jobs: relation scan
      - Ideal for ETL and transfer to a conventional EDW
    • Impala launches in-memory individual queries
      - Ideal for interactive query in a Hadoop destination DW
      - Impala: at least 10x additional performance gain over Hive
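A minimal sketch of that raw-to-Parquet hop, continuing the hypothetical web_clicks example and assuming a Hive/Impala version with native Parquet support:

    -- Materialize a read-optimized columnar copy of the raw table.
    -- The schema travels embedded in the Parquet files themselves.
    CREATE TABLE web_clicks_parquet STORED AS PARQUET
    AS SELECT * FROM web_clicks;

    -- Identical SQL now scans the columnar file instead of raw text:
    -- run it from Hive for batch ETL, or from Impala for interactive BI.
    SELECT url, COUNT(*) AS hits
    FROM web_clicks_parquet
    GROUP BY url;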
12. Use Hadoop as Platform for Direct Analysis or ETL to Text/Number DB
    • Huge array of special analysis apps for
      - Unstructured text
      - Hyper structured text/numbers (machine data)
      - Positional data from GPS
      - Images
      - Audio, video
    • Consume results with increasing SQL support from these individual apps
    • Or, write text/number data into Hadoop from an unstructured source or an external EDW relational DBMS
13. The Larger Picture: Why Use Hadoop as Part of Your EDW?
    • Strategic:
      - Open the floodgates to new kinds of data
      - New kinds of analysis impossible in an RDBMS
      - “Schema on read” for exploratory BI
      - Attack the same data from multiple perspectives: choose SQL and non-SQL approaches at query time
      - Keep hyper granular data in an “active archive” forever: no compromise data analysis; compliance
      - Simultaneous incompatible analysis modes on the same data files
      - Enterprise data hub: one location for all data resources
    • Tactical:
      - Dramatically lowered operational costs
      - Linear scaling across response time, concurrency, and data size well beyond petabytes
      - Highly reliable write-once, redundantly stored data
      - Meet ETL SLAs
14. It’s Not That Difficult
    • Important existing tools already work in Hadoop
      - ETL tool suites: familiar data flows and user interfaces
      - BI query tools: identical user interfaces, integration
      - Standard job schedulers, sort packages (e.g., SyncSort)
    • Skills you need anyway:
      - Java, Python or Ruby, C, SQL, Sqoop data transfer
      - Linux admin
      - …but MapReduce programming is no longer needed
    • Investigate and add incrementally:
      - Analytic tools: MADlib extensions to RDBMS, SAS, R
      - Specialty data tools, e.g., Splunk (machine data)
15. Integration is Crucial
    • Integration is MORE than bringing separate data sources onto a common platform.
    • Suppose you have two customer-facing data sources in your DW producing the following results. Is this integration?
16. Doing Integration the Right Way
    • Teaspoon sip of EDW 101 for Hadoop Professionals!
    • Build a conformed dimension library
      - Plan to download dimensions from the EDW
    • Attach conformed dimensions to every possible source
      - Join dimensions at query time to fact tables in SQL-capable files (see the sketch below)
      - Embed dimension content as columns in NoSQL structures, and also in HBase
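A minimal sketch of the query-time join, assuming a conformed customer dimension has been downloaded from the EDW into Hadoop; all table and column names are hypothetical:

    -- customer_dim is the conformed dimension exported from the EDW;
    -- web_clicks_parquet plays the role of a Hadoop fact table.
    SELECT d.customer_category,
           COUNT(*) AS clicks
    FROM web_clicks_parquet f
    JOIN customer_dim d
      ON f.user_id = d.user_id
    GROUP BY d.customer_category;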
17. Integrating Big Data
    • Remember: Data warehouse integration is drilling across (see the sketch below):
      - Establish conformed attributes (e.g., Customer Category) in each database
      - Fetch separate answer sets from the different platforms, grouped on the same conformed attributes
      - Sort-merge the answer sets at the BI layer
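A sketch of drilling across, with hypothetical table names. Each platform is asked the same question grouped on the conformed attribute, and only the small answer sets travel to the BI layer:

    -- Answer set 1: run in Impala against the Hadoop DW.
    SELECT customer_category, SUM(page_views) AS web_activity
    FROM web_facts
    GROUP BY customer_category;

    -- Answer set 2: run in the conventional EDW RDBMS.
    SELECT customer_category, SUM(sales_amount) AS revenue
    FROM sales_facts
    GROUP BY customer_category;

The BI tool then sort-merges the two answer sets on customer_category, yielding one integrated row per category: (customer_category, web_activity, revenue).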
18. Out of the Box Possibility: Billions of Rows, Millions of Columns
    • Tough problem for all current relational platforms: huge name-value data sources (e.g., customer observations)
    • Think about HBase (!)
      - Intended for “impossibly wide schemas”
      - Fully general binary data content
      - Fire hose SCD1 and SCD2 updates of individual records
      - Continuously growing rows and columns
      - Only simple SQL direct access possible now: no joins (see the sketch below)
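One concrete way to get that simple SQL access is Hive's HBase storage handler, which can map an entire HBase column family (the “millions of columns”) into a single MAP column. A sketch with hypothetical names, assuming the HBase table already exists:

    -- ':key' maps the HBase row key; 'o:' maps the whole 'o' column
    -- family, name-value style, into one Hive MAP column.
    CREATE EXTERNAL TABLE customer_observations (
      customer_id STRING,
      obs         MAP<STRING, STRING>
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,o:')
    TBLPROPERTIES ('hbase.table.name' = 'customer_observations');

    -- Simple direct access, consistent with the slide: no joins.
    SELECT obs['favorite_color']
    FROM customer_observations
    WHERE customer_id = '12345';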
19. Summing Up: The Data Warehouse Renaissance
    • Hadoop DW becomes an equal partner with the Enterprise DW
    • Hadoop will be the strategic environment of choice for new data types and new analysis modes
    • Hadoop:
      - Extreme data type diversity
      - Huge library of specialty analysis tools with SQL extensions
      - Starting point for exploratory BI and ETL-to-EDW processing
      - Destination point for serious BI
      - Permanent active archive of hyper granular data
    • BI tools implement Hadoop-to-EDW integration
20. The Kimball Group Resource
    • www.kimballgroup.com
    • Best selling data warehouse books (NEW BOOK! The classic “Toolkit,” 3rd ed.)
    • In depth data warehouse classes taught by the primary authors
      - Dimensional modeling (Ralph/Margy)
      - ETL architecture (Ralph/Bob)
    • Dimensional design reviews and consulting by Kimball Group principals
    • White Papers on Integration, Data Quality, and Big Data Analytics
21. Data Warehousing, Meet Hadoop
    “A data warehouse DBMS is now expected to coordinate data virtualization strategies, and distributed file and/or processing approaches, to address changes in data management and access requirements.”
    - 2014 Gartner Magic Quadrant for Data Warehouse DBMS
22. What inhibits “Big Data” initiatives?
    • No compelling business need
    • Not enough staff to support
    • Lack of “data science” expertise
    • Missing enterprise-grade features
    • Complexity of DIY open source
23. What inhibits “Big Data” initiatives?
    • No compelling business need
    • Not enough staff to support
    • Lack of “data science” expertise
    • Missing enterprise-grade features
    • Complexity of DIY open source
24. From Apache Hadoop to an enterprise data hub
    [Diagram: batch processing (MapReduce) on storage for any type of data: a unified, elastic, resilient, secure filesystem (HDFS). Scorecard: Open Source, Scalable, Flexible, Cost-Effective ✔; Managed ✖; Open Architecture ✖; Secure and Governed ✖.]
25. From Apache Hadoop to an enterprise data hub
    [Diagram: adds system management (Cloudera Manager) to the MapReduce-on-HDFS stack. Scorecard: Open Source, Scalable, Flexible, Cost-Effective ✔; Managed ✔; Open Architecture ✖; Secure and Governed ✖.]
26. From Apache Hadoop to an enterprise data hub
    [Diagram: the full open architecture: batch processing (MapReduce), analytic SQL (Impala), search engine (Solr), machine learning (Spark), stream processing (Spark Streaming), online NoSQL (HBase), and 3rd party apps, under workload management (YARN), on HDFS, with system management (Cloudera Manager). Scorecard: Open Source, Scalable, Flexible, Cost-Effective ✔; Managed ✔; Open Architecture ✔; Secure and Governed ✖.]
27. From Apache Hadoop to an enterprise data hub
    [Diagram: the slide 26 stack plus data management and security: Cloudera Navigator and Sentry. Scorecard: Open Source, Scalable, Flexible, Cost-Effective ✔; Managed ✔; Open Architecture ✔; Secure and Governed ✔.]
28. Cloudera: Your Trusted Advisor for Big Data
    • Partners
    • Proactive & Predictive Support
    • Professional Services
    • Training
    Advance from Strategy to ROI with Best Practices and Peak Performance
29. What inhibits “Big Data” initiatives?
    • No compelling business need
    • Not enough staff to support
    • Lack of “data science” expertise
    • Missing enterprise-grade features
    • Complexity of DIY open source
30. Disrupt the Industry, Not Your Business
    [Diagram: “Your Journey to Gaining Value from All Your Data,” a progression from IT to Business and from operational efficiency (faster, bigger, cheaper) to transformative applications (new business value); stops include data science, agile exploration, ETL acceleration, cheap storage, EDW optimization, and customer 360.]
31. Thank you for attending!
    • Submit questions in the Q&A panel
    • For a comprehensive set of data warehouse resources (books, in depth classes, overall design consulting): http://www.kimballgroup.com
    • Follow: @cloudera, @mattbrandwein
    Register now for our next webinar with Dr. Ralph Kimball:
    Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
    Online Webinar | May 29, 2014 | 10AM PT / 1PM ET
    http://tinyurl.com/kimballwebinar
