4. Hadoop (Guy Loewenberg)

  1. The Good, The Bad and the Ugly: How to tame the Big Data Beast (Guy Loewenberg, May 2013)
  2. Overview
     • Data Explosion
  3. Overview
     • Big Data: a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications
     • Hadoop: a framework that allows distributed processing of large data sets across clusters of computers using a simple programming model
     • 1000 Kilobytes = 1 Megabyte
     • 1000 Megabytes = 1 Gigabyte
     • 1000 Gigabytes = 1 Terabyte
     • 1000 Terabytes = 1 Petabyte
     • 1000 Petabytes = 1 Exabyte
     • 1000 Exabytes = 1 Zettabyte
     • 1000 Zettabytes = 1 Yottabyte
     • 1000 Yottabytes = 1 Brontobyte
     • 1000 Brontobytes = 1 Geopbyte
     (Scale labels from the slide: most US SME corporations; most US large corporations; leaders like Facebook & Google)
  4. Hadoop Basics
     • Designed to scale
     • Uses commodity hardware
     • Processes data in batches
     • Can process very large volumes of data (petabytes)
  5. Core Hadoop
     • Core Hadoop is built from two main systems:
       – The Hadoop clustered file system, HDFS
       – The MapReduce programming framework
  6. Hadoop Architecture
     • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
       – The NameNode controls HDFS, whereas the DataNodes handle block replication and read/write operations and drive the workloads for HDFS
       – They work in a master/slave mode (see the client sketch below)
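To make the master/slave split concrete, here is a minimal, illustrative sketch (not from the deck) of reading a file through the HDFS Java client API. The client resolves metadata through the NameNode, while the block contents are streamed from the DataNodes; the NameNode address and file path below are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; substitute your cluster's.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            // FileSystem.get() talks to the NameNode for metadata;
            // the actual block reads are served by the DataNodes.
            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         fs.open(new Path("/data/sample.txt"))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }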
  7. Hadoop Architecture
     • MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction
       – The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the requested map and reduce processes
       – They work in master/slave mode (a minimal job sketch follows)
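As an illustration of the map and reduce phases that the JobTracker and TaskTrackers coordinate, here is the classic word-count job written against the Hadoop 2.x org.apache.hadoop.mapreduce API; a sketch, with the input and output paths taken from the command line.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: split each input line into tokens and emit (word, 1).
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts for each word and emit (word, total).
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // local pre-aggregation
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a job like this would typically be submitted with the stock launcher, e.g. hadoop jar wordcount.jar WordCount /input /output.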
  8. Hadoop Software Architecture
     • MapReduce: parallel data processing framework for large data sets
     • HDFS: Hadoop Distributed File System
     • Oozie: MapReduce job scheduler
     • HBase: key-value database
     • Pig: language for analyzing large data sets
     • Hive: high-level language for analyzing large data sets
     • ZooKeeper: distributed coordination system
     • Solr/Lucene: search engine and query engine library
  9. What Hadoop Can't Do
     • Hadoop lets you perform batch analysis on whatever data you have stored within Hadoop; that data does not have to be structured
       – Many solutions take advantage of Hadoop's low storage cost to store structured data there instead of in an RDBMS, but shifting data back and forth between Hadoop and an RDBMS would be overkill
       – Transactional data is highly complex: a single transaction on an e-commerce site can generate many steps that all have to be executed quickly, a scenario that is not ideal for Hadoop
       – Structured data sets that require very low latency are likewise a poor fit
  10. Comparing RDBMS to MapReduce

                  RDBMS                   MapReduce
      Data size   Gigabytes               Petabytes
      Access      Interactive and batch   Batch
      Structure   Fixed schema            Unstructured schema
      Language    SQL                     Procedural (Java, C++, Ruby, etc.)
      Integrity   High                    Low
      Scaling     Nonlinear               Linear
      Updates     Read and write          Write once, read many times
      Latency     Low                     High
  11. What Hadoop Can Do
      • Store high volumes of data in Hadoop and query them at length later using MapReduce functions, for example:
        – index building
        – pattern recognition
        – building recommendation engines
        – sentiment analysis
      • Hadoop should be integrated within your existing IT infrastructure in order to capitalize on the countless pieces of data that flow into your organization
  12. Hadoop Maturity?!
      • Inaccessible to analysts without programming ability
      • Clusters keep no record of who changed which record and when it was changed
      • Storage functionality that operators have always depended on (snapshots, mirroring) is lacking in HDFS
      • Incompatible with existing tools
      • Data without structure has limited value, and applying structure at query time requires a lot of Java code
      • Limited documentation
      • Limited troubleshooting capabilities
  13. Choosing Your Infrastructure
      • Define what you want to achieve
        – POC
        – Scale (a few, tens, hundreds of nodes)
        – One-time, periodic, or continuous workloads
      • Infrastructure design
        – Servers, storage, network, rack space
        – Form a joint team of Hadoop app/dev and infrastructure specialists (facilities/server/network) when building a solution
        – Virtual machines vs. physical machines (I/O performance, high CPU, network)
  14. Choosing Your Infrastructure
      • Network infrastructure
        – Data movement between nodes (rack awareness, replication factor)
        – Data movement between sites (hosting/service)
      • Storage (architecture, disks)
        – Local disks, JBOD
        – Increase the default block size (see the configuration sketch below)
      • Operations
        – Monitoring
        – Backup (configuration files, journal, checkpoint …)
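As a concrete companion to the block-size advice above, here is a minimal sketch (assumed, not from the deck) that raises the block size and replication factor for files written by one client. The property keys are the stock HDFS names (dfs.blocksize in Hadoop 2.x; older releases used dfs.block.size), and the output path is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TunedWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Larger blocks mean fewer blocks (and fewer map tasks) per
            // very large file: 256 MB instead of the stock default.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
            // The replication factor trades durability against inter-node
            // and inter-rack network traffic.
            conf.setInt("dfs.replication", 3);

            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/data/big-output.dat"))) {
                out.writeBytes("example payload\n");
            }
        }
    }

Cluster-wide defaults would normally be set in hdfs-site.xml rather than in per-client code; the per-client form is shown here only to keep the sketch self-contained.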
  15. Performance & Scale Considerations
      • Consider running each of the following on a dedicated, standalone server, not shared with other Hadoop processes on the same machine:
        – the NameNode, Secondary NameNode, and/or Checkpoint Node
        – the JobTracker and the HBase (or any database) master
      • Consider a dedicated physical environment
  16. Thank you!
      Hadoop - The Good, The Bad and the Ugly
      Guy Loewenberg
  17. SUPPORTING SLIDES
  18. HDFS Architecture
  19. Improving RDBMS with Hadoop
      • Accelerating nightly batch business processes
      • Storing extremely high volumes of enterprise data
      • Creating automatic, redundant backups
      • Improving the scalability of applications
      • Using Java for data processing instead of SQL
      • Producing just-in-time feeds for dashboards and business intelligence
      • Handling urgent, ad hoc requests for data
      • Turning unstructured data into relational data
      • Taking on tasks that require massive parallelism
      • Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment

Editor's notes

  • NameNode and DataNode are HDFS components that work in a master/slave mode. The NameNode is the major component that controls HDFS, whereas the DataNodes handle block replication and read/write operations and drive the workloads for HDFS.
  • JobTracker and TaskTracker are also components that work in master/slave mode: the JobTracker controls the mapping and reducing tasks at individual nodes, among other duties, while the TaskTrackers run at the node level and maintain communication with the JobTracker for all nodes within the cluster.
  • The main components include:
    – Hadoop: a Java software framework to support data-intensive distributed applications
    – ZooKeeper: a highly reliable distributed coordination system
    – MapReduce: a flexible parallel data processing framework for large data sets
    – HDFS: the Hadoop Distributed File System
    – Oozie: a MapReduce job scheduler
    – HBase: a key-value database
    – Hive: a high-level language built on top of MapReduce for analyzing large data sets
    – Pig: enables the analysis of large data sets using Pig Latin, a high-level language compiled into MapReduce for parallel data processing
