Hadoop INTRO.pptx

  1. Introduction to Hadoop and Spark. Antonino Virgillito, Eurostat. THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
  2. Large-scale Computation
     • Traditional solutions for computing over large quantities of data relied mainly on the processor
     • Complex processing is done on data moved into memory
     • Scaling is possible only by adding power (more memory, a faster processor)
     • This works for relatively small-to-medium amounts of data but cannot keep up with larger datasets
     • How to cope with today's indefinitely growing production of data, terabytes per day?
  3. Distributed Computing
     • Multiple machines connected to each other, cooperating on a common job: a "cluster"
     • Challenges: complexity of coordination (all processes and data have to be kept synchronized with respect to the global system state), failures, and data distribution
  4. Hadoop
     • Open-source platform for distributed processing of large datasets
     • Based on work published by Google (the MapReduce and Google File System papers)
     • Functions: distribution of data and processing across machines; management of the cluster
     • Simplified programming model: easy to write distributed algorithms
  5. Hadoop Scalability
     • Hadoop achieves massive scalability by exploiting a simple distribution architecture and coordination model
     • Huge clusters can be built from (cheap) commodity hardware
     • A cluster can easily scale up with little or no modification to the programs
  6. Hadoop Concepts
     • Applications are written in common high-level languages
     • Inter-node communication is kept to a minimum
     • Data is distributed in advance: bring the computation close to the data
     • Data is replicated for availability and reliability
     • The result: scalability and fault tolerance
  7. Scalability and Fault Tolerance
     • Scalability principle: capacity can be increased by adding nodes to the cluster; increasing load does not cause failures but, in the worst case, only a graceful degradation of performance
     • Fault tolerance: node failures are considered inevitable and are dealt with in the architecture of the platform; the system continues to function when a node fails (its tasks are re-scheduled), data replication guarantees that no data is lost, and the cluster reconfigures itself dynamically as nodes join and leave
  8. Benefits of Hadoop
     • Analyses that were previously impossible or impractical become possible
     • Lower cost of hardware
     • Less time
     • Ask bigger questions
  9. Hadoop Components (figure: the core components surrounded by ecosystem tools such as Hive, Pig, Sqoop, HBase, Flume, Mahout and Oozie)
  10. Hadoop Core Components
     • HDFS (Hadoop Distributed File System): the abstraction of a file system over a cluster; stores large amounts of data by transparently spreading it over different machines
     • MapReduce: a simple programming model that enables parallel execution of data processing programs; executes the work near the data
     • In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work
  11. Structure of a Hadoop Cluster
     • A Hadoop cluster is a group of machines working together to store and process data
     • Any number of "worker" nodes, each running both the HDFS and MapReduce components
     • Two "master" nodes: the Name Node, which manages HDFS, and the Job Tracker, which manages MapReduce
  12. Hadoop Principle (figure: a dataset saying "I'm one big data set" being split across the cluster)
     • Hadoop is basically a middleware platform that manages a cluster of machines
     • The core component is a distributed file system (HDFS)
     • Files in HDFS are split into blocks that are scattered over the cluster
     • The cluster can grow indefinitely simply by adding new nodes
  13. The MapReduce Paradigm
     • A parallel processing paradigm: the programmer is unaware of the parallelism
     • Programs are structured into a two-phase execution
     • Map: data elements are classified into categories
     • Reduce: an algorithm is applied to all the elements of the same category
     (figure: elements sorted into three categories of sizes x4, x5 and x3)
  14. MapReduce Concepts
     • Automatic parallelization and distribution
     • Fault tolerance
     • A clean abstraction for programmers
     • MapReduce programs are usually written in Java, but can be written in any language using Hadoop Streaming (all of Hadoop itself is written in Java)
     • MapReduce abstracts all the "housekeeping" away from the developer, who can simply concentrate on writing the Map and Reduce functions
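For illustration, here is a minimal sketch of such a Streaming word count in Python (not from the slides; it assumes Streaming pipes input lines to the mapper on stdin and sorts the mapper output by key before the reducer runs; the script names are hypothetical):

    #!/usr/bin/env python
    # mapper.py: emit a (word, 1) pair for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by key, so counts for the same word
    # are adjacent and can be summed in a single pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))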
  15. MapReduce and Hadoop (figure: the MapReduce layer sits on top of the HDFS layer)
     • MapReduce is logically placed on top of HDFS
  16. MapReduce and Hadoop (figure: every node in the cluster runs both HDFS and MR)
     • MR works on (big) files loaded into HDFS
     • Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
     • Output is written back to HDFS
     • Scalability principle: perform the computation where the data is
  17. Hive
     • Apache Hive is a high-level abstraction on top of MapReduce
     • Uses an SQL-like language called HiveQL
     • Generates MapReduce jobs that run on the Hadoop cluster
     • Originally developed by Facebook for data warehousing
     • Now an open-source Apache project
  18. Overview
     • HiveQL queries are transparently mapped into MapReduce jobs at runtime by the Hive execution engine, which also applies optimizations
     • The jobs are submitted to the Hadoop cluster
  19. Hive Tables
     • Hive works with the abstraction of a table, similar to a table in a relational database
     • Main difference: a Hive table is simply a directory in HDFS containing one or more files
     • By default the files are in text format, but different formats can be specified
     • The structure and location of the tables are stored in a backing SQL database called the metastore; this is transparent to the user, and it can be any RDBMS, specified at configuration time
  20. Hive Tables
     • At query time, the metastore is consulted to check that the query is consistent with the tables it invokes
     • The query itself operates on the actual data files stored in HDFS
  21. Hive Tables
     • By default, tables are stored in a warehouse directory on HDFS, at /user/hive/warehouse/<db>/<table>
     • Each subdirectory of the warehouse directory is considered a database
     • Each subdirectory of a database directory is a table
     • All files in a table directory are considered part of the table when querying, so they must all have the same structure
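For example, in a hypothetical database "sales" with a table "orders" backed by two data files, the layout under the convention above would be:

    /user/hive/warehouse/sales/orders/part-00000
    /user/hive/warehouse/sales/orders/part-00001

Both files are read as rows of the orders table whenever it is queried.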
  22. Pig
     • A tool for querying data on Hadoop clusters, widely used in the Hadoop world
     • Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
     • Lets you write data manipulation scripts in a high-level language called Pig Latin
     • An interpreted language: scripts are translated into MapReduce jobs
     • Mainly targeted at joins and aggregations
  23. Overview of Pig
     • Pig Latin: a language for the definition of data flows
     • Grunt: an interactive shell for typing and executing Pig Latin statements
     • An interpreter and execution engine
  24. RHadoop
     • A collection of packages that integrates R with HDFS and MapReduce: Hadoop provides the storage while R brings the processing
     • Just a library: not a special runtime, not a different language, not a special-purpose language
     • Lets you port your code incrementally and use all R packages
     • Requires R installed and configured on all nodes of the cluster
  25. RHadoop Packages
     • rhdfs: an interface for reading and writing files from/to an HDFS cluster
     • rmr2: an interface to MapReduce through R
     • rhbase: an interface to HBase
  26. rhdfs
     • Since Hadoop MapReduce programs take their input from HDFS and write their output to it, it is necessary to access HDFS from the R console
     • The R programmer can easily perform read and write operations on distributed data files
     • Basically, the rhdfs package calls the HDFS API in the backend to operate on data sources stored on HDFS
  27. rmr2
     • rmr2 is an R interface that provides the Hadoop MapReduce facility inside the R environment
     • The R programmer only needs to divide the application logic into the map and reduce phases and submit it with the rmr2 methods
     • rmr2 then calls the Hadoop Streaming MapReduce API, with job parameters such as the input directory, output directory, mapper and reducer, to run the R MapReduce job on the Hadoop cluster
  28. mapreduce
     • The mapreduce function takes as input a set of named parameters: input (input path or variable), input.format (specification of the input format), output (output path or variable), map (the map function) and reduce (the reduce function)
     • The map and reduce functions present the usual interface
     • A call to keyval(k, v) inside the map and reduce functions is used to emit intermediate and output key-value pairs, respectively
  29. WordCount in R

    # Word count with rmr2: split lines into words in the map phase,
    # sum the per-word counts in the reduce phase
    wordcount = function(input, output = NULL, pattern = " ") {
      wc.map = function(., lines) {
        keyval(unlist(strsplit(x = lines, split = pattern)), 1)
      }
      wc.reduce = function(word, counts) {
        keyval(word, sum(counts))
      }
      mapreduce(input = input, output = output,
                input.format = "text",
                map = wc.map, reduce = wc.reduce,
                combine = T)
    }
  30. Reading delimited data

    # Custom reader for tab-separated records: the first field becomes
    # the key, the remaining fields the value
    tsv.reader = function(con, nrecs) {
      lines = readLines(con, 1)
      if (length(lines) == 0) NULL
      else {
        delim = strsplit(lines, split = "\t")
        keyval(sapply(delim, function(x) x[1]),
               sapply(delim, function(x) x[-1]))
      }
    }

    freq.counts = mapreduce(
      input = tsv.data,
      input.format = tsv.format,
      map = function(k, v) keyval(v[1,], 1),
      reduce = function(k, vv) keyval(k, sum(vv)))
  31. Reading named columns

    # As before, but the value fields are wrapped in a data frame
    # with named columns
    tsv.reader = function(con, nrecs) {
      lines = readLines(con, 1)
      if (length(lines) == 0) NULL
      else {
        delim = strsplit(lines, split = "\t")
        keyval(sapply(delim, function(x) x[1]),
               data.frame(
                 location = sapply(delim, function(x) x[2]),
                 name = sapply(delim, function(x) x[3]),
                 value = sapply(delim, function(x) x[4])))
      }
    }

    freq.counts = mapreduce(
      input = tsv.data,
      input.format = tsv.format,
      map = function(k, v) {
        filter = (v$name == "blarg")
        keyval(k[filter], log(as.numeric(v$value[filter])))
      },
      reduce = function(k, vv) keyval(k, mean(vv)))
  32. Apache Spark
     • A general-purpose framework for big data processing
     • It interfaces with many distributed file systems, such as HDFS (the Hadoop Distributed File System), Amazon S3, Apache Cassandra and many others
     • Up to 100 times faster than Hadoop for in-memory computation
  33. Multilanguage API
     • You can write applications in various languages: Java, Python, Scala and R
     • In the context of this course we will consider Python
  34. Built-in Libraries (figure only in the original slide)
  35. RDD: Resilient Distributed Dataset
     • The RDD is the core abstraction used by Spark to work on data
     • An RDD is a collection of elements partitioned across the cluster nodes; Spark operates on them in parallel
     • An RDD can be created from a file on the Hadoop file system
     • RDDs can be made persistent in memory
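A minimal PySpark sketch of these ideas (a sketch only; the file path and application name are placeholders, assuming pyspark is installed and an HDFS cluster is reachable):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")

    # Create an RDD from a file on the Hadoop file system (placeholder path)
    lines = sc.textFile("hdfs:///data/sample.txt")

    # Make the RDD persistent in memory for reuse across operations
    lines.cache()

    print(lines.count())  # action: triggers the distributed computation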
  36. Transformations
     • For example, map is a transformation that takes all elements of the dataset, passes each of them to a function, and returns another RDD with the results:
       resultRDD = originalRDD.map(myFunction)
  37. Actions
     • For example, reduce is an action: it aggregates all elements of the RDD using a function and returns the result to the driver program:
       result = rdd.reduce(function)
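Putting a transformation and an action together, a hedged sketch (reusing the SparkContext sc from the earlier sketch; the data values are made up):

    # Transformation: lazily builds a new RDD, nothing is computed yet
    nums = sc.parallelize([1, 2, 3, 4, 5])
    squares = nums.map(lambda x: x * x)

    # Action: triggers the computation and returns a value to the driver
    total = squares.reduce(lambda a, b: a + b)
    print(total)  # 55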
  38. SparkSQL and DataFrames
     • SparkSQL is the Spark module for structured data processing
     • The DataFrame API is one of the ways to interact with SparkSQL
  39. DataFrames
     • A DataFrame is a collection of data organized into columns
     • Similar to tables in relational databases
     • Can be created from various sources: structured data files, Hive tables, external databases, CSV files, etc.
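A hedged sketch of creating a DataFrame from a CSV file in PySpark (the file path and column names are illustrative placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    # Build a DataFrame from a CSV file, inferring column types from the data
    df = spark.read.csv("hdfs:///data/people.csv", header=True, inferSchema=True)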
  40. Example operations on DataFrames
     • To show the content of the DataFrame: df.show()
     • To print the schema of the DataFrame: df.printSchema()
     • To select a column: df.select('columnName').show()
     • To filter by some parameter: df.filter(df['columnName'] > N).show()
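The same DataFrame can also be queried through SparkSQL, e.g. (a sketch continuing the hypothetical people.csv example above; the view name and columns are assumptions):

    # Register the DataFrame as a temporary view and query it with SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()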
