
Hadoop and the Data Warehouse: When to Use Which


In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate, since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use an MPP relational database in a data warehouse, discovery platform, or other workload-specific application.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing at Teradata, and Jim Walker of Hortonworks, will examine how big data technologies are being used today by practitioners.


Hadoop and the Data Warehouse: When to Use Which

  1. 1. HADOOP & THE DATA WAREHOUSE: WHEN TO USE WHICH. Steve Wooledge (Teradata Labs), Jim Walker (Hortonworks)
  2. 2. Topics • Trends in enterprise data architectures • The value of an integrated data warehouse • The value of Hadoop • Bringing it all together and next steps
  3. 3. Big Data Comes with BIG HEADACHES. “Even free software like Hadoop is causing companies to spend more money… Many CIOs believe data is inexpensive because storage has become inexpensive. But data is inherently messy (it can be wrong, it can be duplicative, and it can be irrelevant), which means it requires handling, which is where the real expenses come in.” Source: The Wall Street Journal, “CIOs’ Big Problem with Big Data”, Aug 2012. “Through 2015, 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage.” Source: Gartner, “Information Innovation: Innovation Key Initiative Overview”, April 2012.
  4. 4. Organizations Face Several Obstacles with Big Data • Difficulty managing multiple systems and new types of data • Hard to find the right skills; lack of supportability for new systems & “data scientists” • Difficulty deploying and integrating new systems • Difficulty providing accessibility to fast insights on big data. Source: Big Analytics 2012 Survey, Teradata
  5. 5. Shift from a Single Platform to an Ecosystem “Big Data requirements are solved by a range of platforms including analytical databases, discovery platforms, and NoSQL solutions beyond Hadoop.” “We will abandon the old models based on the desire to implement for high-value analytic applications.” "Logical" Data Warehouse Source: “Big Data Comes of Age”. EMA and 9sight Consulting. Nov 2012.
  6. 6. Teradata Unified Data Architecture (diagram). Data sources: audio & video, images, text, web & social, machine logs, CRM, SCM, ERP. Platforms: Capture | Store | Refine; Discovery Platform; Integrated Data Warehouse. Tools: languages, math & stats, data mining, business intelligence, applications. Users: engineers, data scientists, business analysts, front-line workers, customers / partners, marketing, operational systems, executives.
  7. 7. Topics • Trends in enterprise data architectures • The value of an integrated data warehouse • The value of Hadoop • Bringing it all together and next steps
  8. 8. The Value of the Data Warehouse (diagram). Platforms: integrated data warehouse, dual systems, data marts, independent data mart, analytical archive, test/dev, data lab. Integrated analytics: advanced analytics, temporal, OLAP, optimization, geospatial, big data integration, application development, agile analytics, data exploration. Tools: business intelligence, data mining, applications. Users: business analysts, knowledge workers, customers/partners, marketing, executives, front-line workers, operational systems. Benefits • Easy to consume data • Rationalization of data from multiple sources into single enterprise view • Clean, safe, secure data • Cross-functional analysis • Transform once, use many • Fast response times
  9. 9. SQL Advantages with an MPP RDBMS • Full ANSI SQL: the lingua franca of business users when accessing data; decades of standardization (stable, feature rich, portable) • Mature third-party SQL-based tools that provide business users with self-service direct access to the data: BI tools, in-database statistical packages, analytic applications (CRM, SCM, MDM) • Easily parallelized • Scalable when manipulating large data sets (6/27/2013)
  10. 10. ACID Advantages in an MPP RDBMS • Guarantees database actions are processed reliably • Ensures 100% query result accuracy • Supports updates and deletes • Needed for applications that require 100% consistency. Atomicity: all of the pieces are committed or none are committed. Consistency: creates a new and valid state of data or, if any failure occurs, returns all data to its original state. Isolation: processed and not yet committed transactions must remain isolated from any other transactions. Durability: committed data is saved such that, in the event of a failure and system restart, the data is available in its correct state.
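The atomicity definition on slide 10 can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in, not an MPP RDBMS, and the `accounts` table and the `transfer`, `make_demo_db` and `balances` helpers are hypothetical names chosen for illustration. The point is only that both updates are committed together or neither is.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single atomic unit of work."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {account_id: balance} for inspection."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A transfer that would overdraw the source account leaves both balances untouched, even though both `UPDATE` statements had already run inside the transaction.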
  11. 11. Tight Vertical Integration • End-to-end management of resources • Efficient utilization of resources • Engineered extremely well for known data • Fine-grained parallelism and resource management • Consistency of service level delivery Best Practices Management: • Workload functions • Workload groups • Exceptions • Priorities • Time periods
  12. 12. Low Latency Advantages of MPP RDBMS. Multi-temperature storage with automated distribution of data based on access patterns: • In-memory • Solid-state drives • Fast hard drives • Fat hard drives. Also: • Indexes • Statistics • Advanced partitioning
  13. 13. Cost Based Optimizer Advantages in an MPP RDBMS • Best practices optimizer determines how the query will be processed most efficiently, with no “hints” or degrees of parallelism necessary. • In chess, you can look out a few moves to decide your best next move, but you can’t envision all move and countermove sequences for the entire game; the Grand Master has the knowledge, experience, and intelligence to identify and use the right strategy. • With Hadoop, the user takes a heavy role in optimizing the execution of queries. • With an MPP RDBMS, the software is the optimizer. Many ways to process a complex query: Query Rewrite (semantic optimization; different types of vendor tools). Fast/Efficient Data Access (access path and indexing; partitioning (CP & PPI); advanced partitioning schemes (range & case based, multilevel, dynamic); I/O and scan optimizations (efficient scans/sync scan)). Query Complexity (join costing & planning; aggregation).
  14. 14. Granular Security Advantages in an MPP RDBMS • Row level security • Column level security • An MPP RDBMS tightly integrates mature security features • User-level security controls • Increased user authentication options • Support for security roles • Enterprise directory integration • Auditing and monitoring controls • Encryption
  15. 15. MPP RDBMS Customer Examples
  16. 16. Topics • Trends in enterprise data architectures • The value of an integrated data warehouse • The value of Hadoop • Bringing it all together and next steps
  17. 17. © Hortonworks Inc. 2012. By the year 2015, we believe half the world's data will be processed by Apache Hadoop. Key Hadoop Features for the EDW • Storage/Processing • Metadata
  18. 18. The World of Data is Changing: Data Explosion. “By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent.” (Gartner, Mark Beyer, “Information Management in the 21st Century”). 1 Zettabyte (ZB) = 1 billion TBs. 15x growth rate of machine generated data by 2020. Source: IDC
  19. 19. Apache Hadoop: Center of Big Data Strategy. Open source data management with scale-out storage & distributed processing. Storage (HDFS) • Distributed across “nodes” • Natively redundant • Name node tracks locations. Processing (MapReduce) • Splits a task across processors “near” the data & assembles results • Self-healing, high-bandwidth clustered storage. Key Characteristics • Scalable: efficiently store and process petabytes of data; linear scale driven by additional processing and storage • Reliable: redundant storage; failover across nodes and racks • Flexible: store all types of data in any format; apply schema on analysis and sharing of the data • Economical: use commodity hardware; open source software guards against vendor lock-in
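Slide 19's description of MapReduce ("splits a task across processors near the data & assembles results") can be sketched as a single-process word count. This is plain Python, not the Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative assumptions, and each inner list stands in for a block of data that a real cluster would process on the node holding it.

```python
from collections import defaultdict


def map_phase(split):
    """Map step: emit a (word, 1) pair for every word in one data split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: assemble the final result by summing each key's values."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    """Run map over every split independently, then shuffle and reduce."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(mapped))
```

Because `map_phase` only ever looks at its own split, the map work could run in parallel on separate machines. Only the shuffle needs to move data between them.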
  20. 20. Apache HCatalog provides flexible metadata services across tools and external access. Diagram labels: HCatalog (table access, aligned metadata, REST API); raw Hadoop data (inconsistent, unknown, tool-specific access). Metadata Services • Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive) • Accessibility: share data as tables in and out of HDFS • Availability: enables flexible, thin-client access via REST API. Shared table and schema management opens the platform
  21. 21. “How to” deliver an open source enterprise product • Identify requirements • Open community delivery • Enterprise rigor. Cycle: Design & Develop, Test & Patch, Release. An open Apache community: Apache Hadoop, Apache Pig, Apache HCatalog, Apache HBase, Apache Hive, Apache Ambari, other Apache projects. Fastest path to innovation is an open community
  22. 22. Big Data: It’s About Scale & Structure. Comparison along a spectrum from RDBMS / EDW / MPP (first value) to NoSQL / Hadoop (second value) • Schema: required on write | required on read • Speed: reads are fast | writes are fast • Governance: standards and structured | loosely structured • Processing: limited, no data processing | processing coupled with data • Data types: structured | multi and unstructured • Best fit use: interactive OLAP analytics, complex ACID transactions, operational data store | data discovery, processing unstructured data, massive storage/processing • Cost: software license | support only • Resources: known entity | growing, complexities, wide
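The "schema required on write" versus "required on read" row on slide 22 can be made concrete with a small sketch. Everything here is an illustrative assumption: the two-field `SCHEMA`, the helper names, and the choice to skip unreadable lines on read. Neither Teradata nor Hadoop is involved, and the code only contrasts where validation happens.

```python
import json

# Hypothetical two-column schema used by both approaches.
SCHEMA = {"user_id": int, "amount": float}


def write_with_schema(table, record):
    """Schema on write: reject a record before it is stored."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"expected fields {sorted(SCHEMA)}, got {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema on read: raw text was stored as-is; structure is applied now."""
    for line in raw_lines:
        try:
            rec = json.loads(line)
            yield {"user_id": int(rec["user_id"]), "amount": float(rec["amount"])}
        except (ValueError, KeyError, TypeError):
            # Lines that do not fit the schema are skipped at read time.
            continue
```

The write path pays the validation cost once and keeps the stored data clean, which matches the slide's "reads are fast". The read path accepts anything at load time and defers the cost to each query, which matches "writes are fast".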
  23. 23. An Emerging Data Architecture (diagram). Applications: business analytics, custom applications, enterprise applications. Data systems: traditional repos (RDBMS, EDW, MPP); Hortonworks Data Platform. Data sources: mobile data; OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media). Operational tools: manage & monitor. Dev & data tools: build & test.
  24. 24. Interoperating With Your Tools (diagram). Layers: applications, data systems (Hadoop), data sources, dev & data tools, operational tools. Named tools: Viewpoint, Microsoft Applications. Data sources: mobile data; OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media).
  25. 25. Teradata Unified Data Architecture (diagram). Data sources: audio & video, images, text, web & social, machine logs, CRM, SCM, ERP. Platforms: Capture | Store | Refine; Discovery Platform; Integrated Data Warehouse. Tools: languages, math & stats, data mining, business intelligence, applications. Users: engineers, data scientists, business analysts, front-line workers, customers / partners, marketing, operational systems, executives.
  26. 26. By the year 2015, we believe half the world's data will be processed by Apache Hadoop. Key Hadoop Features for the EDW • Storage/Processing • Metadata
  27. 27. By the year 2015, we believe half the world's data will be processed by Apache Hadoop. Key Hadoop Features for the EDW • Storage/Processing • Metadata • FAMILIARITY
  28. 28. Organizations Face Several Obstacles with Big Data • Difficulty managing multiple systems and new types of data • Hard to find the right skills; lack of supportability for new systems & “data scientists” • Difficulty deploying and integrating new systems • Difficulty providing accessibility to fast insights on big data. Source: Big Analytics 2012 Survey, Teradata
  29. 29. Topics • Trends in enterprise data architectures • The value of an integrated data warehouse • The value of Hadoop • Bringing it all together and next steps
  30. 30. Confidential and proprietary. Copyright © 2013 Teradata Corporation. Teradata Unified Data Architecture • Hadoop: collect ALL interaction data • Teradata Aster: discover customer behavioral patterns • Teradata: operationalize insights. The right technology on the right analytical problems using best of breed technologies
  31. 31. Improved Customer Service and Retention (diagram). Data sources: multi-structured raw data (call center voice records, check images, social feeds, clickstream data); traditional data flow via ETL tools. Capture, store and refine layer: Hadoop captures, stores and transforms social, images and call records. Discovery platform: path, pattern & graph analysis; sentiment scores. Integrated DW: call data, dimensional data, analytic results. Output: analysis + marketing automation (customer campaign).
  32. 32. Teradata Workload-Specific Platforms • Data Mart Appliance (670): up to 12TB; test/development or smaller data marts • Extreme Data Appliance (1650): up to 186PB; analytical archive, deep dive analytics • Data Warehouse Appliance (2700): up to 1.6PB; strategic intelligence, decision support system, fast scan • Active Enterprise Data Warehouse (6700): up to 61PB; strategic & operational intelligence, real time update, active workloads • Appliance for Hadoop: up to 10PB; appliance for storing, capturing and refining data; Hortonworks HDP 1.1 • Aster Big Analytics Appliance: up to 5PB; discovery platform for big data analytics with embedded SQL MapReduce for new data types & sources • SAS High Performance Analytics (700): up to 52TB; dedicated appliance for SAS high-performance analytic model development
  33. 33. Teradata Unified Data Architecture • Hadoop: collect ALL interaction data • Teradata Aster: discover customer behavioral patterns • Teradata: operationalize insights. The right technology on the right analytical problems using best of breed technologies. Connectors: SQL-H, Aster-Teradata Connector, Aster Connector for Hadoop, Teradata Connector for Hadoop
  34. 34. Teradata SQL-H™: A Business User’s Bridge to Access Hadoop Data. Teradata SQL-H gives business users a better way to access data stored in Hadoop • Trusted: use existing tools/skills and enable self-service BI with granular security • Allow standard ANSI SQL access to Hadoop data • Fast: queries run on Teradata, data accessed from Hadoop • Efficient: intelligent data access leveraging the Hadoop HCatalog. Diagram labels: Hadoop layer (HDFS, Pig, Hive, Hadoop MR, HCatalog); Teradata (SQL-H); data; data filtering
  35. 35. The App Store of Big Data. Aster Discovery Portfolio: Accelerate Time to Insights. Some of the 80+ out-of-the-box analytical apps • Path analysis: discover patterns in rows of sequential data • Text analysis: derive patterns and extract features in textual data • Statistical analysis: high-performance processing of common statistical calculations • Segmentation: discover natural groupings of data points • Marketing analytics: analyze customer interactions to optimize marketing decisions • Data transformation: transform data for more advanced analysis • Graph analysis: graph analytics processing and visualization • SQL-MapReduce visualization: graphing and visualization tools linked to key functions of the MapReduce analytics library
  36. 36. Big Data Analytics & Discovery Example Customers: Teradata Aster Big Analytics Appliance. XL Axiata
  37. 37. Discovering Deep Insights in Retail: Transforming Web Walks into DNA Sequences. Situation: large retailer with 700M visits/year; 2M customers/day look at 1M products online. Problem: increase ability of web content owners to self-serve insights. Solution: treat web walks like DNA sequences of simple patterns. Impact • Data: loaded logs into Hortonworks; loaded 2 months of raw data in 1 hour, vs. 1 day on old system; can load a day’s log data in 60 sec • Sessionize: creates a sequence for each visit, e.g., boils 20 customer clicks down to 1 line: <Home - Search - Look at Product - Add to Basket - Pay - Exit> • Analyze: business analysts can now do path analysis • Act: segmentations by behavior can increase conversion rates by 5-10%; web design changes can drive another 10-20% more visitors into the sales funnel
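The "Sessionize" step on slide 37, which boils a visitor's clicks down to one path line, can be sketched in a few lines of Python. This is not the Aster SQL-MapReduce function. The `(visitor_id, timestamp, page)` tuple layout, the `sessionize` name and the 30-minute inactivity timeout are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(clicks, timeout=1800):
    """Collapse raw clicks into one path string per visit.

    clicks:  iterable of (visitor_id, timestamp_seconds, page) tuples
    timeout: assumed gap in seconds after which a new visit starts
    Returns a list of (visitor_id, "page > page > ...") tuples.
    """
    sessions = []
    # Sort by visitor, then by time, so each visitor's clicks are contiguous.
    ordered = sorted(clicks, key=itemgetter(0, 1))
    for visitor, events in groupby(ordered, key=itemgetter(0)):
        path, last_ts = [], None
        for _, ts, page in events:
            # A long pause closes the current visit and starts a new one.
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((visitor, " > ".join(path)))
                path = []
            path.append(page)
            last_ts = ts
        sessions.append((visitor, " > ".join(path)))
    return sessions
```

Each output row is the kind of single-line sequence the slide describes, which is the input that path analysis then works on.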
  38. 38. Example: Online Checkout Flow Analysis • Customers who have reached the checkout process follow an “ideal path”: deliveryslots > deliveryinformation > coupons > substitutions > paymentinfo > orderconfirmation • Determine how and when (and ultimately, why) customers deviate from this path. • Discover obstacles preventing purchase and optimize visitor flow through the web site. • The Aster SQL-MapReduce Framework enables a variety of different path visualizations.
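Finding where a visit leaves the "ideal path" on slide 38 reduces to comparing two sequences step by step. The sketch below uses the page names from the slide. The `first_deviation` helper and its return shape are hypothetical, and it is plain Python, not the Aster path-analysis functions.

```python
# Ideal checkout path, copied from the slide.
IDEAL_PATH = [
    "deliveryslots",
    "deliveryinformation",
    "coupons",
    "substitutions",
    "paymentinfo",
    "orderconfirmation",
]


def first_deviation(visited, ideal=IDEAL_PATH):
    """Return (step_index, expected_page, actual_page) for the first step
    where a visit departs from the ideal path, or None if the visit follows
    the ideal path through to its last step. actual_page is None when the
    visit simply stopped before reaching that step.
    """
    for index, expected in enumerate(ideal):
        actual = visited[index] if index < len(visited) else None
        if actual != expected:
            return index, expected, actual
    return None
```

Counting the returned `(expected, actual)` pairs over many visits would show which checkout step loses the most customers and where they go instead.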
  39. 39. What We Announced Today: Teradata Portfolio for Hadoop. “Taking Hadoop from Silicon Valley to Main Street”. Most trusted & flexible Hadoop platforms for your next-generation Unified Data Architecture™: 1. Teradata Aster Big Analytics Appliance 2. Teradata Appliance for Hadoop 3. Teradata Commodity Offering for Hadoop (Dell) 4. Teradata Software-only for Hadoop (Hortonworks Data Platform). Complete consulting and training capability • Big Analytics Services across the UDA • Data Integration Optimization: ETL, ELT across the UDA • Hadoop deployment & mentoring • Teradata delivering Hortonworks training • Hadoop Managed Services: operations & administration. Customer support for Hadoop • World-class Teradata customer support, backed by Hortonworks
  40. 40. Teradata Appliance for Hadoop: Value-Added Software Bringing Hadoop to Enterprise • Access: SQL-H™ • Management: Viewpoint, TVI • Administration: Hadoop Builder, intelligent start/stop, DataNode swap, deferred drive replace • High availability: NameNode HA, master machine failover. Diagram labels: Refining, Metadata, Entity Resolution; Security & Data Access; HCatalog; Kerberos
  41. 41. Complete Consulting and Training Capability. Post-sale services and areas of focus • Teradata Analytic Architecture Services: services to scope, design, build, operate and maintain an optimal UDA approach for Teradata, Aster, and Hadoop • Teradata DI Optimization: assess structured/non-structured data, discuss data loading techniques, determine best platform, optimize load scripts/processes • Teradata Big Analytics: assess data value/cost of capture, identify source of “exhaust” data, create conceptual architecture, refine and enrich the data, implement initial analytics in Aster or best-fit tool • Teradata Workshop for Hadoop: introduction workshop (across all of UDA) • Teradata Data Staging for Hadoop: load data into landing area; set up data exploration/refining area; scope architecture and analytics; set up Hadoop repository; load sample data • Teradata Platform for Hadoop: installation guidance and mentoring for Hadoop platform, D-I-Y after installation • Teradata Managed Services for Hadoop: operations, management, administration, backup, security, process control for Hadoop • Teradata Training Courses for Hadoop: two comprehensive, multi-day training offerings: 1) Administration of Apache Hadoop and 2) Developing Solutions Using Apache Hadoop
  42. 42. When to Use Which? The best approach by workload and data type. Processing as a function of schema requirements and stage of data pipeline. Pipeline stages (columns): (1) low cost storage and fast loading; (2) data pre-processing, refining, cleansing; (3) “simple math at scale” (score, filter, sort, avg., count...); (4) joins, unions, aggregates; (5) analytics (iterative and data mining); (6) reporting. • Stable schema: Teradata/Hadoop for stage 1; Teradata for stages 2-6. Examples: financial analysis, ad-hoc/OLAP; enterprise-wide BI and reporting; spatial/temporal; active execution • Evolving schema: Hadoop for stage 1; Aster/Hadoop for stages 2-3; Aster (SQL + MapReduce analytics) for stages 4-6. Examples: interactive data discovery; web clickstream, set-top box analysis; CDRs, sensor logs, JSON • Format, no schema: Hadoop for stages 1-3; Aster (MapReduce analytics) for stages 4-6. Examples: social feeds, text, image processing; audio/video storage and refining; storage and batch transformations
  43. 43. When to Use Which? The best approach by workload and data type. Processing as a function of schema requirements and stage of data pipeline. Pipeline stages (columns): (1) low cost storage and fast loading; (2) data pre-processing, refining, cleansing; (3) “simple math at scale” (score, filter, sort, avg., count...); (4) joins, unions, aggregates; (5) analytics (iterative and data mining); (6) reporting. • Stable schema: Teradata/Hadoop for stage 1; Teradata for stages 2-6 • Evolving schema: Hadoop for stage 1; Aster/Hadoop for stages 2-3; Aster (SQL + MapReduce analytics) for stages 4-6 • Format, no schema: Hadoop for stages 1-3; Aster (MapReduce analytics) for stages 4-6
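The "When to Use Which?" matrix on slides 42 and 43 is essentially a lookup from (schema type, pipeline stage) to a platform, so it can be written down as a small table in code. The cell values below follow the order in which they appear in the flattened slide text, read left to right. That column assignment is an assumption, as are the short stage and schema keys and the `recommend` helper.

```python
# Pipeline stages, in the left-to-right column order given on the slide.
STAGES = [
    "storage_and_loading",
    "preprocessing",
    "simple_math_at_scale",
    "joins_unions_aggregates",
    "analytics",
    "reporting",
]

# One row per schema requirement; one cell per stage (assumed ordering).
MATRIX = {
    "stable": ["Teradata/Hadoop"] + ["Teradata"] * 5,
    "evolving": ["Hadoop", "Aster/Hadoop", "Aster/Hadoop"] + ["Aster"] * 3,
    "no_schema": ["Hadoop"] * 3 + ["Aster"] * 3,
}


def recommend(schema, stage):
    """Look up the platform the slide suggests for a schema type and stage."""
    if schema not in MATRIX:
        raise KeyError(f"unknown schema type: {schema!r}")
    if stage not in STAGES:
        raise KeyError(f"unknown pipeline stage: {stage!r}")
    return MATRIX[schema][STAGES.index(stage)]
```

For example, `recommend("no_schema", "storage_and_loading")` gives `"Hadoop"` and `recommend("stable", "reporting")` gives `"Teradata"`, matching the corners of the matrix.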
  44. 44. Questions and Answers. Thank You!
