Hadoop crash course workshop at Hadoop Summit

10,345 views

Hadoop Summit 2015

Published in: Technology

  1. Hadoop Crash Course, Winter 2015, Version 1.0. Hortonworks. We do Hadoop. Rafael Coss, rafael@hortonworks.com, @racoss. © Hortonworks Inc. 2011 – 2015. All Rights Reserved.
  2. Hadoop Crash Course: • Why Hadoop? • Hadoop Ecosystem & Distribution • Store Data (HDFS) • Process Data in Hadoop 1 (MapReduce) • Process Data in Hadoop 2 (YARN + MapReduce/Tez) • Access Data • Lab
  3. What disrupted the data center? Data?
  4. Traditional World of Applications and Data Silos (ERP, CRM, SCM, Web): • Constrains data to specific apps • No insight across ALL data • Built for structured data • Does not scale (cost and tech)
  5. New Data Paradigm Opens Up New Opportunity: 2.8 zettabytes in 2012, 44 zettabytes in 2020 (1 zettabyte (ZB) = 1 million petabytes (PB); sources: IDC, IDG Enterprise, and AMR Research). Traditional sources: ERP, CRM, SCM. New sources: clickstream, web & social, geolocation, Internet of Things, server logs, files, emails. Opportunity: transform every industry via full fidelity of data and analytics. [Chart: ability to consume data, laggards vs. leaders, with the enterprise blind spot marked.]
  6. Hadoop YARN-based Architecture Unlocks Opportunity. Turn all of your data into value: • Consolidates all data sets • Delivers real-time insights • Integrates with data center • Scalable and affordable. Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
  7. Two Paths in a Customer’s Journey to a Data Lake (axes: scale and scope). Journey to the data lake with Hadoop systems of insight. The journey begins with either: 1. Cost Optimization (Data Architecture Optimization) 2. Advanced Analytic Applications. Goal: • Centralized Architecture • Data-driven Business. Leaders are data driven.
  8. Common Drivers of Hadoop Adoption (industry Hadoop adoption journey): • Data Architecture Optimization: keep 100% of data at up to 1/100 the cost and enrich DW analytics • Single View: customer, product, supply chain • Predictive Analytics: behavioral insight, preventive maintenance, resource optimization • Data Discovery: explore datasets, uncover new findings, operationalize insights
  9. Hadoop Ecosystem [project logos not captured; labels only]: distributed storage & processing framework; RDBMS import/export; ETL; secure NoSQL DB; SQL on HBase; NoSQL DB; workflow management; SQL; streaming data ingestion; cluster system operations; secure gateway; distributed registry; search & indexing; even faster data processing; data management; machine learning
  10. Hadoop Architecture: • Distributed reliable storage • Distributed compute framework (resource management, data locality) • Data operating system • Data access engines: batch, interactive, streaming • Governance • Security • Apps
  11. Hadoop Key Services. Hortonworks Data Platform: a multi-tenant data platform built on a centralized architecture of shared enterprise services. Layers: existing applications, new analytics, partner applications; data access (batch, interactive, real-time); YARN (data operating system, resource management); storage; governance, security, operations. Key services: • Resource and workload management • Scalable tiered storage • Consistent operations • Comprehensive security • Trusted data governance
  12. Hortonworks Development Investment for the Enterprise. • Vertical integration with YARN and HDFS: ensure engines can run reliably and respectfully in a YARN-based cluster • Horizontal integration for enterprise services: ensure consistent enterprise services are applied across the Hadoop stack • Governance: load data and manage according to policy • Security: provide a layered approach to security through authentication, authorization, accounting, and data protection • Operations: deploy and effectively manage the platform (provision, manage & monitor: Ambari, ZooKeeper; scheduling: Oozie) • Batch, interactive & real-time data access: script (Pig), SQL (Hive), Java/Scala (Cascading), stream (Storm), search (Solr), NoSQL (HBase, Accumulo), in-memory (Spark), others (ISV engines), with Tez and Slider, on YARN: Data Operating System (cluster resource management) over HDFS (Hadoop Distributed File System)
  13. [Diagram: core Hadoop daemons and user applications.] Core: NameNode (NN, HDFS), holding the directory structure in memory (/directory/structure/in/memory.txt), and ResourceManager (RM, YARN), handling resource management and scheduling. Worker node: DataNode (HDFS) and NodeManager (YARN), providing disk, CPU, and memory.
  14. Joys of Real Hardware (Jeff Dean). Typical first year for a new cluster: • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover) • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back) • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours) • ~1 network rewiring (rolling ~5% of machines down over 2-day span) • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back) • ~5 racks go wonky (40-80 machines see 50% packet loss) • ~8 network maintenances (4 might cause ~30-minute random connectivity losses) • ~12 router reloads (takes out DNS and external VIPs for a couple minutes) • ~3 router failures (have to immediately pull traffic for an hour) • ~dozens of minor 30-second blips for DNS • ~1000 individual machine failures • ~thousands of hard drive failures • slow disks, bad memory, misconfigured machines, flaky machines, etc.
  15. Hadoop Distributed File System (HDFS): fault-tolerant distributed storage. • Divide files into big blocks and distribute 3 copies randomly across the cluster • Processing: data locality • Not just storage but computation. [Diagram: a logical file split into blocks 1 to 4, with three copies of each block spread across the cluster.]
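
The block splitting and replica placement described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, node names, and sizes are made up, and placement is purely random as the slide says, whereas the default placement policy in real HDFS also takes racks into account.

```python
import random

def split_into_blocks(file_size, block_size):
    """Split a logical file of `file_size` units into (block_id, length) pairs."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks) + 1, length))
        offset += length
    return blocks

def place_replicas(blocks, nodes, replication=3, seed=0):
    """Pick `replication` distinct nodes at random for every block."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return {block_id: rng.sample(nodes, replication) for block_id, _ in blocks}

# Hypothetical cluster of five DataNodes and a 400-unit file with 128-unit blocks.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
blocks = split_into_blocks(file_size=400, block_size=128)
placement = place_replicas(blocks, nodes)

for block_id, holders in placement.items():
    print(block_id, holders)
```

With these numbers the file yields four blocks, the last one shorter than the rest, and each block lands on three different nodes, so losing any single node still leaves two copies of every block.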
  16. The DataNodes. DataNodes to the NameNode: “I’m still here! This is my latest heartbeat.” “I’m here too! And here is my latest heartbeat.” NameNode: “Hey DataNode 1, replicate block 123 to DataNode 3.” [Diagram: the NameNode with DataNode 1, DataNode 3, and DataNode 4, and the copies of block 123.]
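
The heartbeat exchange on this slide can be sketched roughly as follows: a coordinator tracks the last heartbeat per node, treats silent nodes as dead, and asks a surviving holder to copy under-replicated blocks elsewhere. The function names, timestamps, and timeout are hypothetical and chosen for illustration; they are not HDFS's actual values or API.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Nodes whose latest heartbeat is older than `timeout` are considered dead."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}

def plan_replication(block_locations, dead, live_nodes, replication=3):
    """Return (block, source, target) copy commands for under-replicated blocks."""
    commands = []
    for block, holders in sorted(block_locations.items()):
        alive = [node for node in holders if node not in dead]
        if not alive:
            continue  # no surviving copy to replicate from
        candidates = [node for node in live_nodes if node not in alive]
        missing = max(replication - len(alive), 0)
        for target in candidates[:missing]:
            commands.append((block, alive[0], target))
    return commands

# Hypothetical state: dn2 stopped sending heartbeats a while ago.
last_heartbeat = {"dn1": 100, "dn2": 40, "dn3": 99, "dn4": 98}
dead = find_dead_nodes(last_heartbeat, now=100, timeout=30)
live = sorted(set(last_heartbeat) - dead)
commands = plan_replication({123: ["dn1", "dn2", "dn4"]}, dead, live)
print(dead, commands)
```

Here the only command produced tells dn1 to copy block 123 to dn3, which mirrors the message the NameNode sends on the slide.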
  17. Batch Processing in Hadoop: MapReduce, batch access to data and the original data access mechanism for Hadoop. • Framework: made for developing distributed applications that process vast amounts of data in parallel on large clusters • Proven: a reliable interface to Hadoop that works from GB to PB, but batch oriented; speed is not its strong point • Ecosystem: ported to Hadoop 2 to run on YARN; supports original investments in Hadoop by customers and the partner ecosystem. MapReduce job lifecycle: map phase (a mapper on each of DataNode 1 to 3), shuffle/sort (data is shuffled across the network and sorted), reduce phase (a reducer on each of DataNode 1 to 3). Saying that MapReduce is dead is preposterous: • It would limit us to only new workloads • ALL Hadoop clusters use MapReduce • Why rewrite everything immediately?
  18. What is MapReduce? Break a large problem into sub-solutions. • Map (read & ETL): iterate over a large number of records; extract something of interest from each record • Shuffle (shuffle & sort): sort intermediate results • Reduce (aggregation): aggregate, summarize, filter, or transform intermediate results; generate final output. [Diagram: data flows through several map processes, then shuffle and sort, then reduce processes.]
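
The three phases can be mimicked on a single machine with a small Python sketch. It is an in-memory analogy rather than Hadoop's API: `map_reduce`, `parse_reading`, `hottest`, and the sample readings are all hypothetical names and data introduced for illustration.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: extract (key, value) pairs of interest from every record.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    # Shuffle: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: summarize each key's values, visiting keys in sorted order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

def parse_reading(record):
    """Mapper: turn 'city,temperature' into a single (city, temperature) pair."""
    city, temp = record.split(",")
    return [(city, int(temp))]

def hottest(_city, temps):
    """Reducer: keep the highest temperature seen for a city."""
    return max(temps)

readings = ["paris,12", "oslo,3", "paris,18", "oslo,-1"]
result = map_reduce(readings, parse_reading, hottest)
print(result)
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the sketch only preserves the order of the phases.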
  19. WordCount in MapReduce (input: constitution.txt in HDFS). 1. The mappers read the file’s blocks from HDFS line by line, e.g. “We the people, in order to form a...” 2. The lines of text are split into words and output to the reducers: <We,1> <the,1> <people,1> <in,1> <order,1> <to,1> <form,1> <a,1> 3. The shuffle/sort phase combines pairs with the same key: <We,(1,1,1,1)> <the,(1,1,1,1,1,1,1,...)> <people,(1,1,1,1,1)> <form,(1)> 4. The reducers add up the “1’s” and output the word and its count to HDFS: <We,4> <the,265> <people,5> <form,1>
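
The four WordCount steps can be traced on a tiny input with plain Python. This is a single-process sketch, not a Hadoop job: the function names are hypothetical, the first input line is the fragment quoted on the slide, and the second line is made up so that some words repeat.

```python
import re
from itertools import groupby

def map_line(line):
    """Step 2: split one line into words and emit a (word, 1) pair per word."""
    return [(word, 1) for word in re.findall(r"[A-Za-z']+", line)]

def shuffle_sort(pairs):
    """Step 3: sort by key and collect the values that share a key."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return [(word, [n for _, n in group])
            for word, group in groupby(ordered, key=lambda pair: pair[0])]

def reduce_word(word, ones):
    """Step 4: add up the 1s for a word."""
    return word, sum(ones)

# Step 1: the "mappers" read the input line by line.
lines = ["We the people, in order to form a", "the people"]
pairs = [pair for line in lines for pair in map_line(line)]
counts = dict(reduce_word(word, ones) for word, ones in shuffle_sort(pairs))
print(counts)
```

Keys are case-sensitive here, so "We" and "we" would be counted separately, just as the slide keeps "We" with a capital letter.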
  20. 1st Gen Hadoop: Cost-Effective Batch at Scale. Hadoop 1.0 was built for web-scale batch apps. Silos were created for distinct use cases: separate single-app clusters (batch, interactive, online), each with its own HDFS.
  21. Hadoop emerged as the foundation of a new data architecture. Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data. • Built by Yahoo! to be the heartbeat of its ad & search business • Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises • Incredibly disruptive to current platform economics. Traditional Hadoop advantages: • Manages new data paradigm • Handles data at scale • Cost effective • Open source. Traditional Hadoop had limitations: • Batch-only architecture • Single-purpose clusters, specific data sets • Difficult to integrate with existing investments • Not enterprise-grade. [Diagram: an application on batch processing (MapReduce) over storage (HDFS).]
  22. What do iOS 6 and Windows 3.1 have in common?
  23. Hadoop Beyond Batch with YARN: a shift from the old to the new. Hadoop 1 (MapReduce as the base; single-use system for batch apps): HDFS as the system, MapReduce as the engine, and Pig (data flow), Hive (SQL), and others as the APIs. Hadoop 2 (Apache YARN as the base; multi-use data platform for batch, interactive, online, streaming, …): HDFS (redundant, reliable storage) and YARN (data operating system: resource management, etc.) as the system, Tez (modern execution engine) and MapReduce (batch) as engines, and Pig (data flow), Hive (SQL), and Cascading (Java apps) as the APIs.
  24. Why is Tez important? Apache Tez is a critical innovation of the Stinger Initiative. • Along with YARN, Tez not only improves Hive but improves all things batch and interactive for Hadoop: Pig, Cascading… • More efficient processing than MapReduce • Reduces operations and complexity of back-end processing • Allows for Map-Reduce-Reduce, which saves hard disk operations • Implements a “service” which is always on, decreasing start times of jobs • Allows caching of data in memory. [Diagram: scripting (Pig), SQL (Hive), and dev (Cascading/Scalding) running on Tez, over YARN: Data Operating System and HDFS.]
  25. Tez: Hive on MapReduce vs. Hive on Tez, for the query SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. [Diagram: on MapReduce, the joins and the GROUP BY run as a chain of separate map (M) and reduce (R) jobs with intermediate results written to HDFS between them; on Tez, the same steps run as one graph of M and R tasks.] Tez avoids unneeded writes to HDFS.
  26. HDP delivers a centralized architecture. Other pure-play vendors: a siloed “with” YARN architecture of disjoint, siloed clusters (Cluster 1 … Cluster N, each with its own application, security, storage, governance, and operations, and dedicated resource management for batch, interactive, and real-time). • Inefficient use of resources, single tenant, duplicate storage & processing • Multiple implementations of governance, security, and operations • New applications require new clusters. Hortonworks Data Platform: a centralized architecture built on YARN, with a single cluster and multiple applications (existing applications, new analytics, partner applications, e.g. SAS). • Efficient storage, processing • Centralized security, operations, governance • Run a variety of applications simultaneously
  27. {Processing + Storage} = {MapReduce/YARN + HDFS} = {Core Hadoop}
  28. Modern Data Architecture emerges to unify data & processing. • Enable applications to have access to all your enterprise data through an efficient centralized platform • Supported with a centralized approach to governance, security, and operations • Versatile enough to handle any applications and datasets, no matter the size or type. [Diagram: sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured; existing systems: ERP, CRM, SCM) feed HDFS and YARN: Data Operating System (batch, interactive, real-time, partner ISV) alongside MPP and EDW systems, which serve analytics (data marts, applications, business analytics, visualization & dashboards).]
  29. What is Data Access? Data Access defines ALL the channels through which data can be accessed, analyzed, cleansed, and consumed within Hadoop. Each channel can be categorized into THREE core patterns: batch, interactive, and real-time. Multiple engines provide optimized access to your mission-critical data.
  30. Access patterns enabled by YARN: • Batch: needs to happen, but with no timeframe limitations • Interactive: needs to happen at human time • Real-time: needs to happen at machine execution time. [Diagram: batch, interactive, and real-time on YARN: Data Operating System over HDFS.]
  31. Apache Projects Enable Access Patterns. • Various open source projects have incubated in order to meet these access pattern needs • Today, they can all run on a single cluster on a single set of data because of YARN! • ALL powered by a BROAD open community. [Diagram, in extraction order: batch (MapReduce, Pig, Hive); interactive (Solr, Spark, Hive, Kafka); real-time (HBase, Accumulo, Storm); all on YARN: Data Operating System over HDFS.]
  32. Scripting: Data Flow & ETL. Apache Pig: • Data flow engine and scripting language (Pig Latin) • Allows you to transform data and datasets. Advantages over MapReduce: • Reduces time to write jobs • Community support • Piggybank has a significant number of UDFs to help adoption • There are a large number of existing shops using Pig
  33. Pig Latin. Pig executes in a unique fashion: • During execution, each statement is processed by the Pig interpreter • If a statement is valid, it gets added to a logical plan built by the interpreter • The steps in the logical plan do not actually execute until a DUMP or STORE command is used
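
The deferred execution described above can be illustrated with a small Python analogy. This is not Pig itself: `LazyPlan` and its methods are hypothetical, and the point is only that declaring steps adds them to a plan while nothing runs until the `dump()` call.

```python
class LazyPlan:
    """Hypothetical stand-in for a logical plan: records steps, runs them on dump()."""

    def __init__(self, rows):
        self.rows = rows
        self.steps = []      # (name, function) pairs in declaration order
        self.steps_run = 0   # how many steps have actually executed

    def filter(self, predicate):
        self.steps.append(("FILTER", lambda rows: [r for r in rows if predicate(r)]))
        return self

    def order_by(self, key):
        self.steps.append(("ORDER", lambda rows: sorted(rows, key=key)))
        return self

    def dump(self):
        """Only here do the recorded steps execute, in the order they were added."""
        rows = list(self.rows)
        for _name, step in self.steps:
            rows = step(rows)
            self.steps_run += 1
        return rows

plan = LazyPlan([5, 3, 8, 1]).filter(lambda x: x > 2).order_by(lambda x: x)
pending = [name for name, _ in plan.steps]
ran_before_dump = plan.steps_run
result = plan.dump()
print(pending, ran_before_dump, result)
```

Building the plan first gives an engine the chance to inspect and optimize the whole sequence before any data is read, which is one common motivation for this style of execution.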
  34. Why use Pig? Suppose we want to join two datasets from different sources on a common value, then filter, sort, and get the top 5 sites.
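
To make the scenario concrete, here is the same join, filter, sort, and top-5 pipeline written as a plain Python sketch. The datasets, the age-range filter, and the function name `top_sites` are assumptions introduced for illustration; the slide does not specify them, and in practice this is the kind of job one would express in a few Pig Latin statements.

```python
from collections import Counter

# Hypothetical inputs from two different sources, sharing the user name as key.
users = [("amy", 22), ("bob", 41), ("cat", 19), ("dan", 24)]
visits = [("amy", "a.com"), ("amy", "b.com"), ("bob", "a.com"), ("cat", "a.com"),
          ("cat", "c.com"), ("dan", "b.com"), ("dan", "a.com")]

def top_sites(users, visits, min_age, max_age, n=5):
    # Filter: keep only users in the requested age range.
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # Join: keep visits whose user matches an eligible user.
    joined = [(user, site) for user, site in visits if user in eligible]
    # Group and count visits per site.
    counts = Counter(site for _, site in joined)
    # Sort by descending count (ties broken by name) and take the top n.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

ranking = top_sites(users, visits, min_age=18, max_age=25)
print(ranking)
```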
  35. Apache Hive: THE de facto standard for SQL in Hadoop. What? • Treat your data in Hadoop as tables • Provides a standard SQL-92 interface to data in Hadoop. Why? • Shipped in every distribution… you already have it (although some do not ship complete versions) • Quickly find value in raw data files • Proven at petabyte scale for both batch and interactive queries • Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.
  36. Hive Architecture. 1. User issues SQL query (web UI, JDBC/ODBC, or CLI, via HiveServer2) 2. Hive parses and plans query (compiler, optimizer, executor; Hive MetaStore on MySQL, PostgreSQL, or Oracle) 3. Query converted to a MapReduce or Tez job and executed on Hadoop with data-local processing
  37. Using Tez for Hive Queries. Set the following property in either hive-site.xml or in your script: set hive.execution.engine=tez;
  38. SQL Compliance: evolution of SQL compliance in Hive. SQL datatypes: INT/TINYINT/SMALLINT/BIGINT; FLOAT/DOUBLE; BOOLEAN; ARRAY, MAP, STRUCT, UNION; STRING; BINARY; TIMESTAMP; DECIMAL; DATE; VARCHAR; CHAR; interval types. SQL semantics: SELECT, INSERT; GROUP BY, ORDER BY, HAVING; JOIN on explicit join key; inner, outer, cross, and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; standard aggregations (sum, avg, etc.); custom Java UDFs; windowing functions (OVER, RANK, etc.); advanced UDFs (ngram, XPath, URL); sub-queries for IN/NOT IN, HAVING; JOINs in WHERE clause; INSERT/UPDATE/DELETE. [The color legend mapping each feature to Hive 10 or earlier, Hive 11, Hive 12, Hive 13, or the roadmap was not captured.]
  39. Overview of Stinger: 100X+ faster time to insight plus deeper analytical capabilities. Performance optimizations: • Base optimizations: generate simplified DAGs, in-memory hash joins • Vector query engine: optimized for modern processor architectures • Tez: express tasks more simply, eliminate disk writes, pre-warmed containers • ORCFile: column store, high compression, predicate/filter pushdowns • YARN: next-gen Hadoop data processing framework • Query planner: intelligent cost-based optimizer
  40. YARN: Resource Manager for Hadoop 2.0. Data processing engines run natively IN Hadoop. • Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming • Efficient: double processing IN Hadoop on the same hardware while providing predictable performance & quality of service • Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads. [Diagram (HDP 2.2): API, engine, and system layers. Batch (MapReduce), scripting (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), stream (Storm), Spark, direct Java/.NET via Slider, and other ISV applications, several on Tez, over YARN: Data Operating System and HDFS.]
  41. Hive & Pig. Hive and Pig work well together, and many customers use both. Hive is a good choice: • if you are familiar with SQL • when you want to query data • when you need an answer to specific questions. Pig is a good choice: • for ETL (Extract, Transform, Load) • for preparing data for analysis • when you have a long series of steps to perform
  42. Pig and Hive Sample Scenario (raw data and structured data in the Hadoop Distributed File System). 1. Put the data into HDFS in its raw format 2. Use Pig to explore and transform 3. Data analysts use Hive to query the data (answers to questions = $$) 4. Data scientists use MapReduce, R, and Mahout to mine the data (hidden gems = $$)
  43. Big Data ETL Life Cycle. Sources: mobile apps; transactions, OLTP, OLAP; social media, web logs; machine device, scientific; documents and emails. Steps: 1. Load or archive batch data 2. Replicate changed data & schemas 3. Stream real-time data 4. Mask sensitive data 5. Access customer “golden record” (MDM) 6. Transform & refine data 7. Move results to EDW 8. Explore & validate data 9. Govern & enrich with metadata 10. Correlate real-time events with historical patterns & trends 11. Subscribe to datasets. Consumers: visualization & analytics, data mart, data access & query.
  44. HDP: Any Data, Any Application, Anywhere. • Any application: deep integration with ecosystem partners to extend existing investments and skills; broadest set of applications through the stable of YARN-Ready applications (over 70 Hortonworks Certified YARN apps) • Any data: deploy applications fueled by clickstream, sensor, social, mobile, geolocation, server log, and other new-paradigm datasets together with existing legacy datasets (ERP, CRM, SCM) • Anywhere: implement HDP naturally across the complete range of deployment options (commodity, appliance, cloud, hybrid)
  45. What next? -> developer.hortonworks.com
  46. Thank you! rafael@hortonworks.com @racoss
  47. IoT Data Discovery Lab. • A trucking company has over 100 trucks. • The geolocation data collected from the trucks contains events generated while the truck drivers are driving. • The company’s goal with Hadoop is to mitigate risk: understand correlations between miles driven and events; compute the risk factor for each driver based on mileage & events. • Lab env: Sandbox 2.3 TP • Lab doc: URL • Lab steps: load data, query data, process data
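
As a rough idea of what "risk factor per driver based on mileage & events" could look like, here is a small Python sketch. The formula (non-normal events per 1,000 miles), the event labels, the driver IDs, and the mileage figures are all assumptions made for illustration; the slide does not define how the lab computes the risk factor.

```python
from collections import Counter

def risk_factors(events, mileage, per_miles=1000):
    """Assumed metric: non-normal events per `per_miles` miles driven, per driver."""
    risky = Counter(driver for driver, event in events if event != "normal")
    return {driver: per_miles * risky[driver] / miles
            for driver, miles in mileage.items() if miles > 0}

# Hypothetical geolocation events and total miles per driver.
events = [("A1", "normal"), ("A1", "overspeed"), ("A2", "lane departure"),
          ("A2", "overspeed"), ("A3", "normal")]
mileage = {"A1": 10000, "A2": 4000, "A3": 8000}

factors = risk_factors(events, mileage)
for driver, factor in sorted(factors.items(), key=lambda item: -item[1]):
    print(driver, round(factor, 3))
```

Normalizing by mileage keeps a driver who covers long distances from looking riskier merely because they generate more events in absolute terms.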
  48. Move Data Into Hadoop. [Diagram: Geolocation.csv and trucks.csv are moved into the CSV staging tables Geolocation_stage and Trucks_stage, then loaded with SQL into the ORC tables Geolocation and Trucks.]
  49. Trucking Risk Analysis – Hadoop ELT. [Diagram: the Geolocation and Trucks ORC tables feed SQL steps and a Pig risk calculation, producing the ORC tables Truck_mileage, Avg_mileage, DriverMileage, Events, and RiskFactor.]
  50. Calculate Risk
  51. Cautionary Statement Regarding Forward-Looking Statements. This presentation contains forward-looking statements involving risks and uncertainties. Such forward-looking statements in this presentation generally relate to future events, our ability to increase the number of support subscription customers, the growth in usage of the Hadoop framework, our ability to innovate and develop the various open source projects that will enhance the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general business outlook. In some cases, you can identify forward-looking statements because they contain words such as “may,” “will,” “should,” “expects,” “plans,” “anticipates,” “could,” “intends,” “target,” “projects,” “contemplates,” “believes,” “estimates,” “predicts,” “potential” or “continue” or similar terms or expressions that concern our expectations, strategy, plans or intentions. You should not rely upon forward-looking statements as predictions of future events. We have based the forward-looking statements contained in this presentation primarily on our current expectations and projections about future events and trends that we believe may affect our business, financial condition and prospects. We cannot assure you that the results, events and circumstances reflected in the forward-looking statements will be achieved or occur, and actual results, events, or circumstances could differ materially from those described in the forward-looking statements. The forward-looking statements made in this prospectus relate only to events as of the date on which the statements are made and we undertake no obligation to update any of the information in this presentation. Trademarks: Hortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other names used herein may be trademarks of their respective owners.
  52. A Definition of Open Enterprise Hadoop. • Governance: load data and manage according to policy • Security: provide a layered approach to security through authentication, authorization, accounting, and data protection • Operations: deploy and effectively manage the platform (provision, manage & monitor: Ambari, ZooKeeper; scheduling: Oozie) • Batch, interactive & real-time data access on YARN: Data Operating System (cluster resource management) over HDFS (Hadoop Distributed File System)
  53. Big Data ETL Life Cycle (same content as slide 43).
  54. [Diagram: many data sources feed an EDW schema through chains of ETL jobs.] Fragile workflows make supporting the analytical models you want expensive and time-consuming.
  55. Options for Data Input: MapReduce; WebHDFS; hadoop fs -put; vendor connectors; Hadoop NFS gateway; Hue Explorer
  56. Risk Factors Viewed in a Graph
  57. Risk Factors Viewed on a Map
