Big Data Components
Flume, Pig and Sqoop
Data Management Without Hadoop
Data Management with Hadoop
Components in Hadoop Architecture
• The gray components are pure open source; the blue components are also open source but were contributed by other companies
HDFS Components
• Node – A computer (commodity hardware)
• Rack – A collection of nodes (30 to 40) on the same network; bandwidth within a rack and between racks varies
• Cluster – A collection of racks
• Distributed File System
• Hadoop Distributed FileSystem
• Map Reduce Engine
• Built in Resource Manager and Scheduler
Hadoop Cluster
Flume and Sqoop
• Both are frameworks for transferring data to and from the Hadoop Distributed File System (HDFS)
• The main difference is that Flume captures a stream of moving data, whereas Sqoop loads data from relational databases into HDFS
Flume
• This is an event driven framework used to capture data that continuously flows into the system
• Flume runs as one or more agents and each agent has three different components
• Source
• Channels
• Sinks
Flume Agent
• Source – Retrieves data from a particular application, e.g. a web server
• Channel – Acts as a pipe that temporarily stores the data when the output rate is lower than the input rate
• Sink – Processes the data and stores it in a specific destination, most often HDFS
[Diagram: Web Server → Source → Channel → Sink → HDFS; the source, channel and sink sit inside one AGENT]
A single agent can have multiple sources, channels and sinks
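The source → channel → sink hand-off can be sketched in plain Python. This is a toy model, not Flume's API: `run_agent`, the deque used as the channel, the list standing in for HDFS and the `sink_batch_size` parameter are all assumptions made for illustration.

```python
from collections import deque


def run_agent(incoming_events, sink_batch_size=2):
    """Toy model of one agent: the source fills the channel, the sink drains it."""
    channel = deque()      # the channel buffers events between source and sink
    destination = []       # stands in for HDFS (assumption)

    for event in incoming_events:
        channel.append(event)          # source: write the event into the channel
        # sink: drains only once enough events are waiting, so the channel
        # temporarily holds whatever the sink has not consumed yet
        if len(channel) >= sink_batch_size:
            while channel:
                destination.append(channel.popleft())

    # flush whatever is still buffered when the input ends
    while channel:
        destination.append(channel.popleft())
    return destination


stored = run_agent(["GET /index.html", "GET /about.html", "POST /login"])
```

Events reach the destination in the order the source received them; the channel only changes when they arrive, not what arrives.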
Use of a Channel
• The source writes events into a channel
• The channel holds each event and removes it only after the sink has finished processing it
• There are two types of channel
• In-Memory – Processes events faster, but is volatile
• File Based – Processes events more slowly, but is persistent
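The rule that an event leaves the channel only once the sink has finished with it can be sketched as follows. The class and function names (`MemoryChannel`, `drain`, `flaky_sink`) are hypothetical and do not correspond to Flume classes; the sketch models only the in-memory variant.

```python
class MemoryChannel:
    """Toy in-memory channel: fast, but its contents are lost if the process dies."""

    def __init__(self):
        self._events = []

    def put(self, event):
        self._events.append(event)        # called by the source

    def peek(self):
        # the sink reads the oldest event without removing it
        return self._events[0] if self._events else None

    def commit(self):
        self._events.pop(0)               # remove only after the sink succeeded

    def __len__(self):
        return len(self._events)


def drain(channel, deliver):
    """Deliver events one by one; an event stays in the channel if delivery fails."""
    delivered = 0
    while len(channel):
        event = channel.peek()
        try:
            deliver(event)
        except IOError:
            break                         # leave the event in place for a retry
        channel.commit()
        delivered += 1
    return delivered


channel = MemoryChannel()
for line in ["a", "b", "c"]:
    channel.put(line)

written = []


def flaky_sink(event):
    """Hypothetical sink that fails on one particular event."""
    if event == "c":
        raise IOError("destination unavailable")
    written.append(event)


drain(channel, flaky_sink)
```

After the run, "a" and "b" have been written and "c" is still waiting in the channel. A file-based channel would follow the same protocol but keep its events on disk.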
Multiplexing and Serialization
• Output from one agent can serve as input to the other agent
• Avro, a remote procedure call and serialization framework from Apache, is used to do this effectively
Fan out flow
• If the events from a single source are distributed to multiple channels, this is called fanning out the flow
[Diagram: a single source feeding Channel 1, Channel 2 and Channel 3]
Replicating Fan Out
[Diagram: a single source feeding Channel 1, Channel 2 and Channel 3]
Multiplexing Fan Out
[Diagram: a single source feeding Channel 1, Channel 2 and Channel 3]
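The two fan-out styles differ in which channels receive an event: a replicating flow copies every event to all channels, while a multiplexing flow picks a channel based on an attribute of the event. A minimal Python sketch of that difference, assuming events are dicts with a `headers` map; the `type` header, the `routing` table and the default channel are hypothetical choices, not Flume configuration:

```python
def replicate(event, channels):
    """Replicating fan-out: every channel receives a copy of the event."""
    for channel in channels.values():
        channel.append(event)


def multiplex(event, channels, mapping, header="type", default="channel1"):
    """Multiplexing fan-out: one channel is chosen from a header value."""
    target = mapping.get(event["headers"].get(header), default)
    channels[target].append(event)


def new_channels():
    return {"channel1": [], "channel2": [], "channel3": []}


events = [
    {"headers": {"type": "error"}, "body": "disk full"},
    {"headers": {"type": "access"}, "body": "GET /"},
    {"headers": {}, "body": "no type header"},
]

replicated = new_channels()
multiplexed = new_channels()
routing = {"error": "channel2", "access": "channel3"}   # hypothetical mapping

for e in events:
    replicate(e, replicated)
    multiplex(e, multiplexed, routing)
```

With replication each of the three channels ends up holding all three events; with multiplexing each channel holds exactly one, and the event without a `type` header falls through to the default channel.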
Flume Commands
• These are the commands as listed in the terminal
Why the name Pig?
• According to the Apache Pig philosophy, pigs eat anything, live anywhere and are domesticated
• In Hadoop, Pig is used for processing any kind of data (structured, unstructured and semi-structured)
What’s so great about Pig
• Java is a low-level language (users must be aware of what the program does and how the program does it)
• Pig, by contrast, is a high-level language (users need to be aware only of what the program does, not of how it is done)
• It's extensible – Java classes can be defined separately and called within a Pig program
Components of Pig
• Pig consists of two components
• Language – Pig Latin
• Compiler
Data Flow Language
• Pig is called a data flow language
• Users define a data stream
• Throughout the stream, several transformations are applied to the data
• Transformations include mathematical operations, grouping, filtering etc.
Languages like C are called control flow languages because they have loops and if statements
Steps involved in Data Flow
• Load – Users can specify a single file or an entire directory
• Transform – Filter, Join, Group, Order etc.
• Dump/Save – Dump the results somewhere or save them in a file
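The load, transform, dump sequence can be mirrored step by step in Python. This is an analogy for the shape of a data flow program, not Pig Latin itself; the flight rows are made-up sample data, and the Pig operator named in each comment is only the rough counterpart of that step.

```python
from collections import defaultdict

# Load: Pig would LOAD a file or directory; here the rows are inline.
# Each row is (carrier, origin, delay_minutes), made-up sample data.
flights = [
    ("AA", "JFK", 12),
    ("AA", "LAX", 0),
    ("DL", "JFK", 45),
    ("DL", "ATL", 30),
    ("UA", "ORD", 0),
]

# Transform 1: filter (rough counterpart of FILTER ... BY)
delayed = [row for row in flights if row[2] > 0]

# Transform 2: group (rough counterpart of GROUP ... BY)
by_carrier = defaultdict(list)
for carrier, origin, delay in delayed:
    by_carrier[carrier].append(delay)

# Transform 3: aggregate and order (FOREACH ... GENERATE, ORDER ... BY)
avg_delay = sorted(
    ((carrier, sum(d) / len(d)) for carrier, d in by_carrier.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

# Dump: print the result; saving would write it to a file instead
for carrier, delay in avg_delay:
    print(carrier, delay)
```

Each step consumes the output of the previous one and there are no loops or branches around the pipeline as a whole, which is what makes it a data flow rather than a control flow program.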
Pig – Data Types
Pig has four different data types
• Atom – A single value, either a string or a number. This is similar to int, long or char in other programming languages
• Tuple – A record that consists of a series of fields. Each field can contain a string or a number
• Bag – A collection of tuples, which need not be unique. Each tuple can have a different number of fields
• Map – A collection of key-value pairs. The value can be of any type and each key has to be unique
If a value is unknown, the keyword “null” can be used as a placeholder in the program
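For readers who know Python, the four types map loosely onto built-in Python structures. The correspondence below is an analogy chosen for illustration (in particular, a Python list is used for the bag and `None` for null); the names and values are made up.

```python
# Atom: a single value such as a string or a number
name = "alice"
age = 30

# Tuple: an ordered record of fields
record = ("alice", 30)

# Bag: a collection of tuples; duplicates are allowed and tuples may
# have different numbers of fields
bag = [("alice", 30), ("alice", 30), ("bob",)]

# Map: key/value pairs with unique keys; values can be of any type.
# None plays the role of null for a value that is not known.
profile = {"name": "alice", "visits": bag, "age": None}

unknown_fields = [key for key, value in profile.items() if value is None]
```

Note that the map's `visits` value is itself a bag, which shows how the types nest.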
Pig - Operators
These are all the operators used at various levels
Pig – Debug and Troubleshoot
• There are a few commands which can be used for debugging
Modes of Execution
Pig scripts can be executed in two different environments
Local Mode:
Pig runs on a single node (a Linux machine) and does not require Hadoop or HDFS.
This mode is used for testing Pig logic.
pig -x local programname.pig
MapReduce Mode:
This is an actual Hadoop environment deployed along with HDFS.
pig -x mapreduce programname.pig
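When scripts are launched from a wrapper program, the choice of mode becomes a single argument. A small sketch, assuming a helper named `pig_command` (hypothetical) that only builds the argument list for the two modes shown above and does not start Pig:

```python
import shlex

VALID_MODES = ("local", "mapreduce")   # only the two modes named in this section


def pig_command(script, mode="local"):
    """Return the argv list for running a Pig script in the given execution mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown execution mode: {mode!r}")
    return ["pig", "-x", mode, script]


local_cmd = pig_command("programname.pig", "local")
cluster_cmd = pig_command("programname.pig", "mapreduce")
print(shlex.join(local_cmd))
```

The list form can be handed to `subprocess.run` on a machine where `pig` is installed; developing against the local command and switching to the MapReduce one later requires no change to the script.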
Packaging Pigs
Pig scripts can be packaged in three different ways
Script: This method is nothing more than a file containing Pig Latin commands, identified by the .pig suffix (FlightData.pig, for example). Ending your Pig program with the .pig extension is a convention but not required.
Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin at the Grunt command line and immediately see the response. This method is helpful for prototyping during initial development and with what-if scenarios.
Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript programs.
User Defined Functions
• There are a lot of User Defined Functions (UDFs) available for Pig
• UDFs can be written in several languages, such as Java, Python and JavaScript, and used with Pig
• Members of the open source community have already posted several useful UDFs online
• Pig can be embedded in host languages like Java, Python and JavaScript to integrate existing applications with Pig
• Pig can even be given control flow by placing a Pig Latin script inside an “if” statement or a loop in the host language, so that it runs MapReduce jobs until the condition is met
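The last point, wrapping a script in host-language control flow, can be sketched without any Pig installation. In the sketch below `run_pig_job` is a hypothetical callable standing in for one execution of an embedded Pig Latin script, and `fake_job` is a made-up numeric stand-in; no real Pig embedding API is used.

```python
def run_until_converged(run_pig_job, initial, tolerance=0.01, max_iterations=20):
    """Re-run a job, feeding each result back in, until the result stops changing."""
    value = initial
    for iteration in range(1, max_iterations + 1):
        new_value = run_pig_job(value)       # one script run per loop iteration
        if abs(new_value - value) < tolerance:
            return new_value, iteration      # condition met: stop looping
        value = new_value
    return value, max_iterations


def fake_job(x):
    """Hypothetical stand-in for a Pig script run; halves the distance to 10."""
    return x + (10 - x) / 2


result, runs = run_until_converged(fake_job, initial=0.0)
```

The loop and the stopping test live in the host language, while each iteration would hand the actual data processing to Pig. Iterative algorithms that need an unknown number of passes over the data follow this pattern.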
Sqoop
• The name Sqoop is short for SQL-to-Hadoop
• The main use of Sqoop is to load data from external data sources into the Hadoop Distributed File System (HDFS)
• Other data sources can be structured, semi-structured or even unstructured
Need for Sqoop
• Organizations have been storing data for many years in Relational Databases
• There are several types of RDBMS
• This data has to be fed into HDFS for distributed processing
• Sqoop is a command-line (and now also web-based) tool for performing import/export operations to and from HDFS
• Similar to agents in Flume, Sqoop consists of different connectors
Sqoop Architecture
• User/Administrator can control Sqoop
Sqoop job types
• Sqoop performs two important operations
• Import: Other data source (RDBMS, Cassandra etc.) → Sqoop → Hadoop Distributed File System
• Export: Hadoop Distributed File System → Sqoop → Other data source (RDBMS, Cassandra etc.)
• Data processing and analysis are performed while the data is in Hadoop
• Because it moves data in both directions, Sqoop is called a bidirectional tool
How Sqoop Works?
• Sqoop communicates with the MapReduce engine and uses it to copy data from other data sources into HDFS
• MapReduce allocates mappers, which perform the copy operation
• Types of operations
• Import one table
• Import complete database
• Import selected tables
• Import selected columns from a particular table
• Filter out certain rows from certain table etc
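The import variants listed above correspond to different command lines. The sketch below builds those command lines as Python lists without running them. The option names (`--connect`, `--table`, `--columns`, `--where`, `--target-dir`, `import-all-tables`, `--exclude-tables`) are given as I understand the Sqoop 1 command line and should be checked against the installed version; the connection string, table names and column names are hypothetical.

```python
def sqoop_import(connect, table, columns=None, where=None, target_dir=None):
    """Build the argv for importing one table, optionally with column and row filters."""
    cmd = ["sqoop", "import", "--connect", connect, "--table", table]
    if columns:
        cmd += ["--columns", ",".join(columns)]    # import selected columns only
    if where:
        cmd += ["--where", where]                  # keep only the matching rows
    if target_dir:
        cmd += ["--target-dir", target_dir]        # HDFS directory for the output
    return cmd


def sqoop_import_all(connect, exclude=None):
    """Build the argv for importing a whole database, optionally skipping tables."""
    cmd = ["sqoop", "import-all-tables", "--connect", connect]
    if exclude:
        cmd += ["--exclude-tables", ",".join(exclude)]
    return cmd


# Hypothetical connection string and table/column names
url = "jdbc:mysql://dbhost/sales"
one_table = sqoop_import(url, "orders")
some_columns = sqoop_import(url, "orders", columns=["id", "total"], where="total > 100")
whole_db = sqoop_import_all(url, exclude=["audit_log"])
```

Importing a selection of tables can be done either by excluding the unwanted ones from a whole-database import, as in `whole_db`, or by issuing one single-table import per table.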
2 important features
Import Data in Compressed Format
When Sqoop imports data and stores it on HDFS, it can be set to compress the data to reduce overall disk utilization. Well-known compressed file formats are GZIP, BZ2 etc.
Parallelism
By default, four mappers are allocated to copy data from the other database into HDFS. Users can increase the number of mappers, for example to 8 or 16
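One way to picture the parallel copy is that the range of a numeric key column is cut into one slice per mapper, and each mapper copies only the rows in its slice. The following is a simplified sketch of that idea, assuming an integer key with known minimum and maximum; `split_ranges` is a hypothetical helper, and Sqoop's actual splitting logic depends on the column type and connector.

```python
def split_ranges(lo, hi, num_mappers=4):
    """Divide the inclusive key range [lo, hi] into contiguous slices, one per mapper."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first slices
        if size == 0:
            continue                            # fewer keys than mappers
        ranges.append((start, start + size - 1))
        start += size
    return ranges


default_split = split_ranges(1, 100)          # four mappers by default
wider_split = split_ranges(1, 100, num_mappers=8)
```

With the default of four mappers, keys 1 to 100 become four slices of 25 keys each. Raising the mapper count makes the slices smaller, which helps only as long as the source database can serve that many concurrent readers.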
JDBC Drivers
• JDBC acts as an interface between an application and its database
• An application can send data to the database or retrieve data from it whenever it wants
• Sqoop connectors work along with the JDBC drivers
Sqoop latest version
• This is what is inside Sqoop
• REST – Representational State Transfer, a software architecture style
• UI – User Interface
• Connectors – Interfaces that communicate with other data sources
JDBC drivers
• MySQL – http://www.mysql.com/downloads/connector/j/5.1.html
• Oracle – http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
• Microsoft SQL Server – http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Difference between Flume and Sqoop
Sqoop | Flume
Sqoop is used for importing data from structured data sources such as RDBMS. | Flume is used for moving bulk streaming data into HDFS.
Sqoop has a connector-based architecture. Connectors know how to connect to the respective data source and fetch the data. | Flume has an agent-based architecture. Here, code is written (called an 'agent') which takes care of fetching data.
HDFS is a destination for data import using Sqoop. | Data flows to HDFS through zero or more channels.
Sqoop data load is not event driven. | Flume data load can be event driven.
In order to import data from structured data sources, one has to use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them. | In order to load streaming data such as tweets generated on Twitter or log files of a web server, Flume should be used. Flume agents are built for fetching streaming data.

Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Big data components - Introduction to Flume, Pig and Sqoop

  • 4. Components in Hadoop Architecture • The gray components are pure open source; the blue components are also open source but were contributed by other companies
  • 5. HDFS Components • Node – A computer (commodity hardware) • Rack – A collection of nodes (30 to 40 on the same network); bandwidth within a rack differs from bandwidth between racks • Cluster – A collection of racks • Distributed File System • Hadoop Distributed File System • MapReduce Engine • Built-in Resource Manager and Scheduler
  • 7. Flume and Sqoop • Both are frameworks for transferring data to and from the Hadoop Distributed File System (HDFS) • The main difference is that Flume captures a stream of moving data, whereas Sqoop loads data from relational databases into HDFS
  • 8. Flume • Flume is an event-driven framework used to capture data that flows continuously into the system • Flume runs as one or more agents, and each agent has three components • Source • Channel • Sink
  • 9. Flume Agent • Source – This component retrieves data from a particular application, e.g. a web server • Channel – This acts as a pipe that temporarily stores the data when the output rate is lower than the input rate • Sink – This component processes the data and stores it in a specific destination, most often HDFS • (Diagram: Web Server → Source → Channel → Sink → HDFS, with source, channel and sink inside one agent) • A single agent can have multiple sources, channels and sinks
  • 10. Use of a Channel • The source writes events to a channel • The channel holds each event and removes it only after the sink has finished processing it • There are two types of channel • In-memory – Processes events faster, but is volatile (queued events are lost if the agent fails) • File-based – Processes events more slowly, but is persistent
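The channel behaviour described on this slide (an event is removed only once the sink has finished with it) can be sketched in a few lines of Python. This is a toy model, not Flume's implementation: the `MemoryChannel` class, the `take` method and the `flaky_sink` function are hypothetical names chosen for illustration.

```python
from collections import deque


class MemoryChannel:
    """Toy in-memory channel: an event stays queued until a sink confirms it."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        """Called by the source to write an event into the channel."""
        self._events.append(event)

    def take(self, sink):
        """Offer the oldest event to the sink; remove it only if the sink succeeds."""
        if not self._events:
            return False
        event = self._events[0]  # peek, do not remove yet
        try:
            sink(event)
        except Exception:
            return False  # the event stays in the channel for a later retry
        self._events.popleft()
        return True

    def __len__(self):
        return len(self._events)


def demo():
    channel = MemoryChannel()
    for line in ["GET /index.html", "GET /about.html"]:
        channel.put(line)

    stored = []
    attempts = {"count": 0}

    def flaky_sink(event):
        # Hypothetical sink that fails on its first call, then works.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise IOError("destination unavailable")
        stored.append(event)

    first_try = channel.take(flaky_sink)      # fails, nothing is removed
    left_after_failure = len(channel)
    while channel.take(flaky_sink):           # drain the channel
        pass
    return first_try, left_after_failure, stored


if __name__ == "__main__":
    print(demo())
```

Because everything lives in process memory, this sketch corresponds to the volatile in-memory type; a file-based channel would write each event to disk before acknowledging the source.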
  • 11. Multiplexing and Serialization • Output from one agent can serve as input to another agent • Avro, a remote procedure call and serialization framework from Apache, is used to do this effectively
  • 12. Fan out flow • If the events from a single source are distributed to multiple channels, the flow is said to fan out • (Diagrams: Replicating Fan Out and Multiplexing Fan Out, each showing one source feeding Channel 1, Channel 2 and Channel 3)
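The slide names the two fan-out styles without spelling out the difference. In Flume, a replicating flow copies every event to all channels, while a multiplexing flow routes each event to a channel chosen from a header value. The Python below is a minimal sketch of that idea only; the event layout (a dict with `headers` and `body`), the function names and the `type` header are assumptions made for illustration.

```python
def replicating_fan_out(events, channel_names):
    """Copy every event into every channel."""
    channels = {name: [] for name in channel_names}
    for event in events:
        for queue in channels.values():
            queue.append(event)
    return channels


def multiplexing_fan_out(events, header, mapping, default):
    """Route each event to one channel, selected by the value of a header."""
    channels = {name: [] for name in set(mapping.values()) | {default}}
    for event in events:
        target = mapping.get(event["headers"].get(header), default)
        channels[target].append(event)
    return channels


# Hypothetical events: a header dict plus a body.
EVENTS = [
    {"headers": {"type": "error"}, "body": "disk full"},
    {"headers": {"type": "access"}, "body": "GET /index.html"},
    {"headers": {}, "body": "no type header"},
]

replicated = replicating_fan_out(EVENTS, ["channel1", "channel2", "channel3"])
multiplexed = multiplexing_fan_out(
    EVENTS,
    header="type",
    mapping={"error": "channel1", "access": "channel2"},
    default="channel3",
)
```

With these inputs every replicated channel holds all three events, while the multiplexed channels hold one event each.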
  • 13. Flume Commands • These are the commands as listed in the terminal
  • 14. Why the name Pig? • According to the Apache Pig philosophy, pigs eat anything, live anywhere and are domesticated • In Hadoop, Pig is used for processing any kind of data (structured, unstructured and semi-structured)
  • 15. What’s so great about Pig • Compared with Pig, Java is a low-level language (users must specify both what the program does and how it does it) • Pig is a high-level language (users specify only what the program does and need not worry about how it is done) • It is extensible – Java classes can be defined separately and called from within a Pig program
  • 16. Components of Pig • Pig consists of two components: the language (Pig Latin) and the compiler
  • 17. Data Flow Language • Pig is called a data flow language • Users define a data stream • Throughout the stream, several transformations are applied to the data • Transformations include mathematical operations, grouping, filtering etc. • Languages like C are called control flow languages because they have loops and if statements
  • 18. Steps involved in Data Flow • Load – Users can specify a single file or an entire directory • Transform – Filter, Join, Group, Order etc. • Dump/Save – Dump the results to the screen or save them in a file
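The three steps can be illustrated with a plain Python analogy. This is not Pig Latin and not how Pig executes a script; it only mirrors the load, transform, dump shape of a data flow. The sample data, the column names `carrier` and `delay`, and the function names are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flight-delay data standing in for a file on disk.
SAMPLE = "carrier,delay\nAA,10\nUA,-3\nAA,25\nDL,7\nUA,12\n"


def load(text):
    """Load: read the records and give each field a type."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"carrier": row["carrier"], "delay": int(row["delay"])} for row in reader]


def transform(rows):
    """Transform: filter late flights, group by carrier, order by total delay."""
    late = [row for row in rows if row["delay"] > 0]          # filter
    groups = defaultdict(list)
    for row in late:                                           # group
        groups[row["carrier"]].append(row["delay"])
    totals = [(carrier, sum(delays)) for carrier, delays in groups.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)  # order


def dump(result):
    """Dump: print each result tuple to the screen."""
    for carrier, total in result:
        print(f"({carrier},{total})")


if __name__ == "__main__":
    dump(transform(load(SAMPLE)))
```

Each stage takes the previous stage's output as its only input, which is the sense in which the program describes a flow of data rather than a sequence of control decisions.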
  • 19. Pig – Data Types • Pig has four different data types • Atom – A string or a number, similar to int, long or char in other programming languages • Tuple – A record that consists of a series of fields. Each field can contain a string or a number • Bag – A collection of non-unique tuples. Each tuple can have a different number of fields • Map – A collection of key-value pairs. Any type can be stored in the value, and each key has to be unique • If a value is unknown, the keyword “null” can be used as a placeholder in the program
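As a rough analogy only, the four types line up with familiar Python values. The snippet below is not Pig syntax, and the variable names and sample values are invented for illustration; Python's `None` stands in for Pig's null.

```python
# Atom: a single string or number.
atom_number = 42
atom_text = "Chennai"

# Tuple: a record made of a series of fields.
record = ("Chennai", 42)

# Bag: a collection of tuples; duplicates are allowed and the tuples
# need not all have the same number of fields.
bag = [("Chennai", 42), ("Chennai", 42), ("Delhi",)]

# Map: key-value pairs with unique keys; values can be of any type.
# None plays the role of a null placeholder for an unknown value.
mapping = {"city": "Chennai", "visits": 42, "last_seen": None}
```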
  • 20. Pig – Operators • These are the operators used at various levels
  • 21. Pig – Debug and Troubleshoot • There are a few commands that can be used for debugging
  • 22. Modes of Execution • Pig scripts can be executed in two different environments • Local Mode: Pig is executed on a single node (a Linux machine) and does not require Hadoop or HDFS. This mode is used for testing Pig logic. Command: pig -x local programname.pig • MapReduce Mode: This is an actual Hadoop environment deployed along with HDFS. Command: pig -x mapreduce programname.pig
  • 23. Packaging Pigs • Pig scripts can be packaged in three different ways • Script: This method is nothing more than a file containing Pig Latin commands, identified by the .pig suffix (FlightData.pig, for example). Ending your Pig program with the .pig extension is a convention but not required. • Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin at the Grunt command line and immediately see the response. This method is helpful for prototyping during initial development and with what-if scenarios. • Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript programs.
  • 24. User Defined Functions • There are a lot of User Defined Functions (UDFs) available for Pig • UDFs can be written in several languages, such as Java or Python, and used with Pig • Members of the open source community have already published many useful UDFs online • Pig can be embedded in host languages like Java, Python and JavaScript to integrate existing applications with Pig • Control flow can be added by placing a Pig Latin script inside a conditional or loop in the host language, so that the MapReduce job is rerun until a condition is met
  • 25. Sqoop • Sqoop (short for “SQL-to-Hadoop”) connects SQL data stores with Hadoop • The main use of Sqoop is to load data from external data sources into the Hadoop Distributed File System (HDFS) • The other data sources can be structured, semi-structured or even unstructured
  • 26. Need for Sqoop • Organizations have been storing data for many years in Relational Databases • There are several types of RDBMS such as
  • 27. Need for Sqoop • That data has to be fed into HDFS for distributed processing • Sqoop is a command-line (and now also web-based) tool for performing import/export operations to and from HDFS • Just as Flume is built around agents, Sqoop is built around connectors
  • 29. Sqoop job types • Sqoop performs two important operations • Sqoop Import: Other data source (RDBMS, Cassandra etc.) → Hadoop Distributed File System • Sqoop Export: Hadoop Distributed File System → other data source (RDBMS, Cassandra etc.), once data processing and analysis have been performed • Because it works in both directions, Sqoop is called a bidirectional tool
  • 30. How Sqoop Works? • Sqoop relies on the MapReduce engine to copy data from other data sources into HDFS • MapReduce allocates mappers and performs the copy operation • Types of operations • Import one table • Import a complete database • Import selected tables • Import selected columns from a particular table • Filter out certain rows from a certain table etc.
  • 31. 2 important features • Import Data in Compressed Format: When Sqoop imports data into HDFS, it can be set to compress the data to reduce overall disk utilization. Well-known compressed file formats are GZIP, BZ2 etc. • Parallelism: By default, four mappers are allocated to copy data from the other database into HDFS. Users can increase the number of mappers, for example to 8 or 16
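One way to picture the parallelism is to divide the key range of a table evenly among the mappers, so that each mapper copies one slice. The slides do not describe Sqoop's actual splitting logic, so the function below is only a sketch under stated assumptions: the table has an integer key column whose values run from `min_id` to `max_id`, and `split_ranges` is a hypothetical helper name.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Divide the inclusive range [min_id, max_id] into contiguous slices,
    one per mapper, with sizes that differ by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for index in range(num_mappers):
        size = base + (1 if index < extra else 0)
        if size == 0:
            continue  # more mappers than rows: leave this mapper idle
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Default of four mappers, then the same table with eight.
    print(split_ranges(1, 1000))
    print(split_ranges(1, 1000, num_mappers=8))
```

Raising the mapper count shrinks each slice, which helps only as long as the source database can serve that many concurrent readers.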
  • 32. JDBC Drivers • JDBC acts as an interface between an application and its database • An application can send data to the database or retrieve data from it whenever it wants • Sqoop connectors work along with the JDBC drivers
  • 33. Sqoop latest version • This is what is inside Sqoop
  • 34. Sqoop Latest version • REST – Representational State Transfer, a software architecture style • UI – User Interface • Connectors – Interfaces that communicate with other data sources
  • 36. Difference between Flume and Sqoop • Purpose – Sqoop: used for importing data from structured data sources such as RDBMS. Flume: used for moving bulk streaming data into HDFS. • Architecture – Sqoop: connector-based; connectors know how to connect to the respective data source and fetch the data. Flume: agent-based; code called an agent takes care of fetching the data. • Path to HDFS – Sqoop: HDFS is the destination for a data import. Flume: data flows to HDFS through zero or more channels. • Event handling – Sqoop: the data load is not event-driven. Flume: the data load can be driven by events. • When to use – Sqoop: for importing data from structured data sources, because its connectors know how to interact with those sources and fetch data from them. Flume: for loading streaming data such as tweets generated on Twitter or the log files of a web server, because Flume agents are built for fetching streaming data.