There is a lot more to Hadoop than Map-Reduce. An increasing number of engineers and researchers involved in processing and analyzing large amounts of data regard Hadoop as an ever-expanding ecosystem of open-source libraries, including NoSQL, scripting, and analytics tools.
Hadoop Ecosystem
1. Hadoop Ecosystem
ACM Bay Area Data Mining Camp 2011
Patrick Nicolas
September 19, 2011
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas
Copyright 2011 Patrick Nicolas - All rights reserved
2. Overview
Besides providing developers and analysts with an open-source
implementation of the map-reduce functional model, the Hadoop
ecosystem incorporates analytical algorithms, task/workflow
managers, and NoSQL stores.
The layered stack, bottom to top:
- Java Virtual Machine
- HDFS
- Map/Reduce framework
- Workflow (Hive, Pig, Cascading) and configuration (Zookeeper)
- NoSQL (key-value stores, document stores, multi-column stores, graph databases) and analytics (Mahout)
- Client code, scripts
3. Key Components
The Hadoop ecosystem can be described as a data-centric
taxonomy of tools to analyze, aggregate, store, and report data.
- File system: GFS, HDFS
- MapReduce: Hadoop
- Admin./configuration: Zookeeper
- NoSQL
  - Key-value stores: Redis, Memcache, Kyoto Cabinet
  - Document stores: MongoDB, CouchDB
  - Multi-column stores: HBase, Hypertable, BigTable, Cassandra, BerkeleyDB
  - Graph databases: Neo4j, GraphDB, InfiniteGraph
- Script/workflow: Pig, Cascading
- SQL: Hive
- Analytics/API: Mahout, Chukwa
4. NoSQL: Overview
Non-relational data stores allow large amounts of data to be
collected very efficiently. In contrast to an RDBMS, NoSQL
schemas are optimized for sequential writes and are therefore
not appropriate for querying and reporting.
NoSQL stores share the same basic key-value schema but provide
different methods to describe values: a key maps to a value that may
be a plain string, a column family, or a nested structure.
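The shared schema can be sketched in plain Java (a hypothetical illustration, not any store's actual API; the key and field names are invented): the same key lookup returns an opaque string, a nested document, or a sorted column family, depending on the store.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one <key, value> lookup, three value shapes.
public class ValueShapes {
    // Key-value store: the value is an opaque string.
    public static String fromKv() {
        Map<String, String> kv = new TreeMap<>();
        kv.put("user:42", "Patrick");
        return kv.get("user:42");
    }

    // Document store: the value is a nested, JSON-like map.
    public static Object fromDoc() {
        Map<String, Map<String, Object>> docs = new TreeMap<>();
        docs.put("user:42", Map.of("name", "Patrick", "talks", 3));
        return docs.get("user:42").get("name");
    }

    // Multi-column store: the value is a sorted column family.
    public static String firstColumn() {
        TreeMap<String, String> family = new TreeMap<>();
        family.put("name", "Patrick");
        family.put("city", "Bay Area");
        Map<String, TreeMap<String, String>> cols = new TreeMap<>();
        cols.put("user:42", family);
        return cols.get("user:42").firstKey(); // columns kept in sorted order
    }
}
```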
5. NoSQL: Document Stores
Key-value files (HDFS)
<key, value>
Distributed, replicable blocks of sequential key-value string pairs.
Key-value stores (Redis, Memcache)
<key*, value>
Language-independent, distributed, sorted key-value pairs (values
are lists, sets, or hashes) with in-memory caching and support for
atomic operations.
Document stores (MongoDB, CouchDB)
{ "k1": val1, "k2": val2 }
Fault-tolerant, document-centric stores using a dynamic schema of
sorted JavaScript (JSON) objects, with support for a limited SQL-like syntax.
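The "limited SQL-like syntax" idea can be sketched with plain Java collections (a hypothetical illustration; `DocQuery` and its field names are invented, not MongoDB's or CouchDB's API): documents need not share a schema, and a filter plays the role of a WHERE clause.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: a SQL-like WHERE over schema-less documents.
public class DocQuery {
    public static List<Map<String, Object>> whereCityIs(
            List<Map<String, Object>> docs, String city) {
        // Documents without a "city" field simply do not match.
        return docs.stream()
                   .filter(d -> city.equals(d.get("city")))
                   .collect(Collectors.toList());
    }
}
```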
6. NoSQL: Tuples & Graphs
Sorted, ordered tuples (Cassandra, HBase, ...)
{ name: x, value: { key1: { name: key1, value: v1, tstamp: x }, key2: x } }
Fault-tolerant, distributed, sorted, ordered, grouped (by family)
'super-columns': maps of an unbounded number of columns.
Graph databases (Neo4j, GraphDB, InfiniteGraph, ...)
Efficient transactional traversal and storage of entities (vertices),
attributes, and relationships (edges).
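The super-column layout can be sketched as nested sorted maps (a plain-Java illustration, not Cassandra's or HBase's API; the row and family names are invented): a sorted row key maps to a family of columns, each holding a value and a timestamp.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of a 'super-column' store as nested sorted maps.
public class SuperColumn {
    record Column(String value, long tstamp) {}

    // row key -> family name -> column name -> (value, timestamp)
    static SortedMap<String, SortedMap<String, SortedMap<String, Column>>> rows =
        new TreeMap<>();

    public static void put(String row, String family, String col,
                           String value, long ts) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(col, new Column(value, ts));
    }

    public static String get(String row, String family, String col) {
        return rows.get(row).get(family).get(col).value();
    }
}
```

Every level is sorted, which mirrors how such stores keep rows and columns ordered on disk for range scans.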
7. Data Flow Managers
Map and reduce tasks can be abstracted away by task or workflow
managers using high-level languages such as scripts, SQL, or a
UNIX-pipe-like API. These data-flow tools hide the functional
complexity of map-reduce from domain experts.
- Scripting: Pig
- SQL: Hive
- API (pipes & flows): Cascading
(Diagram: each tool compiles into a pipeline of map, combine, and reduce tasks.)
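The pipeline these tools generate can be sketched in plain Java, without Hadoop (an illustrative word count; all names are invented): each input split is mapped to (word, 1) pairs, pre-aggregated by a combiner, then merged in the reduce phase.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (plain Java, no Hadoop) of map -> combine -> reduce.
public class MiniMapReduce {
    // Map phase: emit one (word, 1) pair per token of one input split,
    // pre-aggregated locally (the combine step).
    static Map<String, Integer> map(String split) {
        Map<String, Integer> pairs = new HashMap<>();
        for (String word : split.split("\\s+"))
            pairs.merge(word, 1, Integer::sum);
        return pairs;
    }

    // Reduce phase: merge the partial counts from every mapper.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> totals.merge(w, c, Integer::sum));
        return totals;
    }

    public static Map<String, Integer> wordCount(List<String> splits) {
        return reduce(splits.stream().map(MiniMapReduce::map).toList());
    }
}
```

In a real cluster the map and reduce calls run on different machines and the framework shuffles the pairs between them; the data flow itself is the same.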
8. Data Flow Code Samples
Pig Latin
A = LOAD 'mydata' USING PigStorage() AS (f1:int, name:chararray);
B = GROUP A BY f1;
C = FOREACH B GENERATE group, COUNT(A);
Hive
LOAD DATA LOCAL INPATH 'xxx' OVERWRITE INTO TABLE z;
INSERT OVERWRITE TABLE z SELECT f1, count(*) FROM y GROUP BY f1;
Cascading
Scheme srcScheme = new TextLine( new Fields("line"));
Tap src = new Hfs(srcScheme, inpath);
Tap sink = new Hfs(srcScheme, outpath);   // outpath: the output location
Pipe counter = new Pipe("count");
counter = new GroupBy(counter, new Fields("f1"));
counter = new Every(counter, new Count()); // count the rows in each group
FlowConnector connector = new FlowConnector(props);
Flow flow = connector.connect("count", src, sink, counter);
flow.complete();