
Overview of Hadoop and HDFS


  1. Introduction, Background to Hadoop and HDFS. Brendan Tierney. www.oralytics.com t: @brendantierney e: brendan.tierney@oralytics.com
  2. What is Big Data?
     • O'Reilly Radar definition: Big data is when the size of the data itself becomes part of the problem
     • EMC/IDC definition: Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis
     • McKinsey definition: Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse
     http://www.oreilly.com/data/free/big-data-now-2012.csp
     http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
     http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
     http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf
  3. Big Data
     Some companies continue to generate large amounts of data:
     • Facebook ~ 6 billion messages per day
     • eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage
     • Satellite images by Skybox Imaging ~ 1 Terabyte per day
     • These numbers are probably out of date before I finish writing this slide
     Important: this applies to some companies, not all. Hadoop is part of their data management architecture; it will not replace existing DBs, etc.
  4. Basic idea
     • The basic idea behind the phrase Big Data is that everything we do increasingly leaves a digital trace (data) which we can use and analyse
     • Big Data therefore refers to our ability to make use of ever-increasing volumes of data
     Traditional data storage methods can be a challenge! Why?
  5. Big Data
  6. 2013
  7. 2014 Where is Predictive Analytics?
  8. 2015
  9. Hadoop
     • Existing tools were not designed to handle such large amounts of data
     • "The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing." http://hadoop.apache.org
     • Process Big Data on clusters of commodity hardware
     • Vibrant open-source community
     • Many products and tools reside on top of Hadoop
  10. Who is using Hadoop in Ireland? Big websites, Big telcos, Big Banks, Big Financial, CERN, Big ....
  11. Access Speeds?
     1990: typical drive ~ 1370 MB, transfer speed ~ 4.4 MB/s, read the drive in 5 mins
     2010: typical drive ~ 1 TB, transfer speed ~ 100 MB/s, read the drive in 2.5 hrs
     Hadoop: 100 drives working at the same time can read 1 TB of data in 2 minutes
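A quick back-of-envelope check of these figures (not from the original deck; it assumes 1 TB ≈ 1,000,000 MB and ignores seek, scheduling and network overhead):

    1990: 1,370 MB / 4.4 MB/s ≈ 310 s, roughly 5 minutes
    2010: 1,000,000 MB / 100 MB/s = 10,000 s, roughly 2.8 hours
    100 drives in parallel: ~10,000 MB each / 100 MB/s = 100 s, so the full 1 TB is read in well under 2 minutes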
  12. Scaling issue $ $ $ $ ?
  13. Scaling issue
     Scale-Up (also known as scaling vertically)
     • Add additional resources to an existing node (CPU, RAM)
     • It is harder and more expensive to scale up ("It Depends" needs to be applied)
     • Moore's Law can't keep up with data growth
     • New units must be purchased if the required resources cannot be added
     Scale-Out
     • Add more nodes/machines to an existing distributed application
     • The software layer is designed for node additions or removals
     • Hadoop takes this approach: a set of nodes is bonded together as a single distributed system
     • Very easy to scale down as well
  14. Hadoop Principles
     • Scale-Out rather than Scale-Up
     • Bring code to data rather than data to code
     • Deal with failures: they are common
     • Abstract complexity of distributed and concurrent applications
     • Self managing
     • Automatic parallel processing
  15. Big Data: Example Applications. Not all of these are using Hadoop or require Hadoop!
  16. Hadoop Cluster
     • A set of "cheap" commodity hardware
     • Networked together
     • Resides in the same location: a set of servers in a set of racks in a data center
     • "Cheap" commodity server hardware: no need for super-computers, use commodity unreliable hardware
     • Not desktops
     Yes, you can build a Hadoop cluster using Raspberry Pis
  17. Abstracting Complexity
     • Distributed computing is HARD WORK
     • Hadoop abstracts many complexities in distributed and concurrent applications
     • Defines a small number of components
     • Provides simple and well-defined interfaces for interactions between these components
     • Frees the developer from worrying about system-level challenges: race conditions, data starvation, processing pipelines, data partitioning, code distribution, etc.
     • Allows developers to focus on application development and business logic
  18. Hadoop vs RDBMS
     • Always keep the phrase "It Depends" in mind when discussing Big Data
     • Hadoop != RDBMS
     • Hadoop will not replace the RDBMS
     • Hadoop is part of your data management architecture, and only if it is needed!
  19. RDBMS vs Hadoop
     • Data size: Gigabytes (RDBMS) vs Petabytes (Hadoop)
     • Access: Interactive & batch (RDBMS) vs Batch (Hadoop)
     • Updates: Read & write many times (RDBMS) vs Write once, read many times (Hadoop)
     • Integrity: High (RDBMS) vs Low (Hadoop)
     • Scaling: Non-linear (RDBMS) vs Linear (Hadoop)
     • Data representation: Structured (RDBMS) vs Unstructured, semi-structured (Hadoop)
  20. Current trends for Hadoop
  21. Current trends for Hadoop
  22. Current trends for Hadoop
  23. Current trends for Hadoop
  24.
  25. Current trends for Hadoop
  26. Working together
     • Hadoop and RDBMS frequently complement each other within an architecture
     • For example, a website that has a small number of users but produces a large amount of audit logs
  27. Hadoop Ecosystem
  28. Hadoop Ecosystems
  29. Hadoop Ecosystems
  30. Hadoop Distributions
     • Large number of independent products (Apache projects)
     • Can be challenging to get all/some of these to work together
     • We will be working with Hadoop, installing and using some products
     • Hadoop distributions aim to resolve version incompatibilities
     • A distribution vendor will: integration test a set of Hadoop products; package Hadoop products in various installation formats (Linux packages, tarballs, etc.)
     • Distributions may provide additional scripts to execute Hadoop
     • Some vendors may choose to backport features and bug fixes made by Apache
     • Typically vendors will employ Hadoop committers, so the bugs they find will make it into Apache's repository
  31. Hadoop Distributions
     • Cloudera Distribution for Hadoop (CDH): check out the pre-built VM with most of the Cloudera products (Hadoop, etc.) http://www.cloudera.com/downloads/quickstart_vms/5-8.html
     • MapR Distribution: check out the MapR Sandbox VM https://www.mapr.com/products/mapr-sandbox-hadoop
     • Hortonworks Data Platform (HDP): check out the Hortonworks Sandbox VM http://hortonworks.com/products/sandbox/
     • Oracle Big Data Appliance: check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed and configured for you to use http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
  32. Hadoop - the "move-code-to-data" approach
     • Data is distributed among the nodes as it is initially stored in the system
     • Data is replicated multiple times on the system for increased reliability and availability
     • The master allocates work to nodes
     • Computation happens on the nodes where the data is stored: data locality
     • Nodes work in parallel, each on their own part of the overall dataset
     • Nodes are independent and self-sufficient: shared-nothing architecture
     • If a node fails, the master detects the failure and re-assigns its work to other nodes
     • If a failed node restarts, it is automatically added back into the system and assigned new tasks
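To make the "move-code-to-data" idea concrete, below is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce). It is not taken from the deck; the class name and input/output paths are illustrative. When such a job runs, the framework ships the compiled map/reduce code to the nodes holding the input blocks and runs the map tasks where the data already lives.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: runs on the nodes that hold the input blocks (data locality).
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts emitted for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }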
  33. HDFS
     • A distributed file system modelled on the Google File System (GFS) [http://research.google.com/archive/gfs.html]
     • Data is split into blocks, typically 64MB or 128MB in size, spread across many nodes
     • Works better on large files, >= 1 HDFS block in size
     • Each block is replicated to a number of nodes (typically 3): ensures reliability and availability
     • Files in HDFS are write once: no random writes to files allowed
     • HDFS is optimised for large streaming reads of files: no random access to files allowed
     • See HIVE later on for more DBMS-type access to HDFS files....
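As a small illustration of blocks and replication (not from the original slides), the sketch below uses Hadoop's Java FileSystem API to write a file into HDFS and then ask the NameNode for its block size, replication factor and block locations. The path is an assumed example; the configuration is read from the usual cluster config files on the classpath.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up the cluster settings (core-site.xml, etc.)
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");  // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {   // write once...
          out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }                                               // ...no random re-writes afterwards

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size : " + status.getBlockSize());    // e.g. 128 MB by default
        System.out.println("Replication: " + status.getReplication());  // typically 3

        // Each block is stored on several DataNodes; the NameNode knows which ones.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("Block at offset " + loc.getOffset()
              + " on hosts " + String.join(", ", loc.getHosts()));
        }
      }
    }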
  34. HDFS is good for
     • Storing large files: Terabytes, Petabytes, etc.; millions rather than billions of files; 100MB or more per file
     • Streaming data
     • Unstructured data (in practice, really mixed structured data)
     • Write-once, read-many-times patterns
     • Schema on read (an RDBMS is schema on write): a huge time saving at data write time; BUT optimised for streaming reads rather than random reads
     • "Cheap" commodity hardware: no need for super-computers, use less reliable commodity hardware
  35. HDFS is not so good at
     • Low-latency reads: high throughput rather than low latency for small chunks of data; HBase and other DBs can address this issue (?)
     • Large numbers of small files: better millions of large files than billions of small files; with a block size of 128M or 256M, each file should be, for example, 100MB or more
     • Multiple writers: a single writer per file; writes only at the end of a file, no support for writes at arbitrary offsets
     • Time needed for replication
  36. HDFS
     • Two types of nodes in an HDFS cluster: the NameNode (the master node) and DataNodes (slave or worker nodes)
     • The NameNode manages the file system: it keeps track of the metadata, i.e. which blocks make up a file (using two files, the namespace image and the edit log), and knows on which DataNodes the blocks are stored
     • The DataNodes do the work: they store the blocks, retrieve blocks when requested to (by the client or the NameNode), and poll and report back to the NameNode periodically with the list of blocks that they are storing
  37. HDFS
     • When a client application wants to read a file, it communicates with the NameNode to determine which blocks make up the file and on which DataNodes the blocks reside; it then communicates directly with the DataNodes
     • The NameNode is the single point of failure of a Hadoop system: back it up periodically to remote NFS (set up as part of the Hadoop configuration), or use a Secondary NameNode
     • The Secondary NameNode is not the same as the NameNode: it periodically merges the namespace image with the edit log and maintains a copy
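A minimal client-side read sketch (assumed, not from the deck): opening the file goes through the NameNode for the block metadata, and the returned stream then pulls the bytes directly from the DataNodes. The default path is illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args.length > 0 ? args[0] : "/user/demo/sample.txt");  // illustrative default

        try (FSDataInputStream in = fs.open(file);  // block metadata lookup via the NameNode
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {  // streaming read from the DataNodes
            System.out.println(line);
          }
        }
      }
    }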
  38. HDFS Architecture [from Hadoop in Practice, Alex Holmes]
  39. Files and Blocks
     • Files are split into blocks (the single unit of storage)
     • Managed by the NameNode, stored by the DataNodes
     • Transparent to the user
     • Replicated across machines at load time: the same block is stored on multiple machines
     • Good for fault tolerance and access
     • Can lead to inconsistent reads
     • Default replication is 3
     Have you ever experienced inconsistent reads?
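Replication is a per-file attribute, so the cluster-wide default of 3 can be overridden for individual files. Below is a brief, assumed sketch (not from the deck) using the FileSystem API; the path is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt");  // illustrative path

        System.out.println("Current replication: " + fs.getFileStatus(file).getReplication());

        // Ask the NameNode to keep only 2 copies of this file's blocks;
        // adding or removing replicas happens in the background.
        fs.setReplication(file, (short) 2);

        System.out.println("New replication    : " + fs.getFileStatus(file).getReplication());
      }
    }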
  40. HDFS File Writes
  41. HDFS File Reads
  42. Who is using Hadoop in Ireland? A list of Cloudera customers in Ireland: Citi, Allianz, Deutsche Bank, Ulster Bank, dun & bradstreet, Ryanair, BT, Vodafone, Novartis, airbnb, Dell, Intel, Rockwell Automation, Revenue, Adecco, Experian, M&S
  43. Discuss: Hadoop is not FREE :-) vs Hadoop is not FREE :-(
  44. Something to think about
