Big Data processing using Hadoop infrastructure
Use case
Intrum Justitia SDC
• 20 countries / different applications to process, store and analyze data
• Non-unified data storage formats
• High number of data objects (Records, Transactions, Entities)
• May have complex strong or loose relation rules
• Often: involving time-stamped events, made of incomplete data
Possible solutions
• Custom solution built from the ground up
• Semi-clustered approach
‒ Tools from Oracle
‒ MySQL/PostgreSQL nodes
‒ Document-oriented tools like MongoDB
• “Big Data” approach (Map-Reduce)
Map-Reduce
• Simple programming model that applies to many large-scale computing problems
• Available in MongoDB for sharded data
• MapReduce tools usually offer:
‒ automatic parallelization
‒ load balancing
‒ network/disk transfer optimization
‒ handling of machine failures
‒ robustness
• Introduced by Google, open-source implementation by Apache (Hadoop), enterprise support by Cloudera
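A minimal sketch of the programming model, assuming nothing beyond the Java standard library: one map call per input split, a shuffle that groups values by key, and one reduce call per key. It runs in a single process and uses no Hadoop classes; the class and method names (MiniMapReduce, wordCount) are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process illustration of map -> shuffle -> reduce (word count).
class MiniMapReduce {

    // MAP: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : split.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // SHUFFLE: group all emitted values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> p : pairs) {
            List<Integer> values = grouped.get(p.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(p.getKey(), values);
            }
            values.add(p.getValue());
        }
        return grouped;
    }

    // REDUCE: fold all values of one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<Map.Entry<String, Integer>>();
        for (String split : splits) {
            emitted.addAll(map(split)); // one map task per split
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("big data", "big cluster")));
    }
}
```

In a real cluster the framework runs the map and reduce calls on different machines and handles the parallelization, load balancing and failure recovery listed above; only the map and reduce functions are user code.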
CDH ecosystem
• Cloudera Manager
• YARN (resource manager)
• HDFS
• MapReduce
• HBase (NoSQL)
• Hive (HQL)
• Sqoop (import/export)
• Parquet
• Impala (SQL / ODBC)
• Pig (Pig Latin)
• ZooKeeper (coordination)
• Hue
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
• Mountable (FUSE)
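HDFS file read
Code to data, not data to code
[Figure: the client application calls the HDFS client, which asks the NameNode where the blocks of /bob/file.txt are stored (Block A on DataNodes 2 and 3, Block B on DataNodes 1 and 3) and then reads the blocks from those DataNodes. DataNode 1 holds blocks C, B, D; DataNode 2 holds C, A, D; DataNode 3 holds C, B, A. Steps are numbered 1 to 4.]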
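The read path in the diagram can be sketched as a toy model: the NameNode only answers "which blocks, on which DataNodes", and the client then picks a live replica for each block. This is not the real HDFS client API; the class name ToyHdfsRead and the method planRead are hypothetical, and the block placement is an assumption read off the slide's diagram.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of an HDFS read; it only plans which DataNode serves each block.
class ToyHdfsRead {

    // NameNode metadata: file -> ordered blocks, block -> DataNodes holding a replica.
    static final Map<String, List<String>> FILE_BLOCKS = new LinkedHashMap<String, List<String>>();
    static final Map<String, List<String>> BLOCK_LOCATIONS = new LinkedHashMap<String, List<String>>();

    static {
        FILE_BLOCKS.put("/bob/file.txt", Arrays.asList("A", "B"));
        BLOCK_LOCATIONS.put("A", Arrays.asList("DataNode 2", "DataNode 3"));
        BLOCK_LOCATIONS.put("B", Arrays.asList("DataNode 1", "DataNode 3"));
    }

    // Returns, for each block of the file, the first replica that is not on a dead node.
    static Map<String, String> planRead(String path, List<String> deadNodes) {
        List<String> blocks = FILE_BLOCKS.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        Map<String, String> plan = new LinkedHashMap<String, String>();
        for (String block : blocks) {
            String chosen = null;
            for (String node : BLOCK_LOCATIONS.get(block)) {
                if (!deadNodes.contains(node)) {
                    chosen = node;
                    break;
                }
            }
            if (chosen == null) {
                throw new IllegalStateException("All replicas lost for block " + block);
            }
            plan.put(block, chosen);
        }
        return plan;
    }

    public static void main(String[] args) {
        // Healthy cluster, then the same read with DataNode 2 down.
        System.out.println(planRead("/bob/file.txt", Collections.<String>emptyList()));
        System.out.println(planRead("/bob/file.txt", Arrays.asList("DataNode 2")));
    }
}
```

Because every block has more than one replica, losing DataNode 2 only changes which node serves Block A; the read itself still succeeds.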
Map-Reduce workflow and redundancy
[Figure: the user program forks the master and the workers (1); the master assigns map and reduce tasks (2); map workers read the input splits 0 to 4 (3) and write intermediate files locally (4); reduce workers write output files 0 and 1 (6). Columns, left to right: input files, MAP phase, intermediate files, REDUCE phase, output files.]
Hive
• High-level data access language
• Data Warehouse System for Hadoop
• Data Aggregation
• Ad-Hoc Queries
• SQL-like Language (HiveQL)
• SQL → Hive → MapReduce
Pig
• High-level data access language (zero Java knowledge required)
• Data preparation (ETL)
• Pig Latin scripting language
• Pig Latin → Pig → MapReduce
HiveQL vs. Pig Latin
HiveQL:
insert into ValClickPerDMA
select dma, count(*) from geoinfo
join (
    select name, ipaddr from users
    join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr group by dma;
Pig Latin:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Pig Latin is procedural, whereas HiveQL is declarative.
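To make the "procedural" point concrete, here is a hedged sketch of the same dataflow as plain Java over in-memory rows, one statement per Pig Latin step. The row layouts follow the script (users: name, age, ipaddr; clicks: user, url, value; geoinfo: ipaddr, dma). The class name, the helper methods and the sample data are made up, and the hash joins assume one row per user name and per ipaddr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Step-by-step equivalent of the Pig Latin script, on small in-memory tables.
class ValuableClicksPerDma {

    static String[] row(String... fields) { return fields; }

    static List<String[]> rows(String[]... r) { return Arrays.asList(r); }

    static Map<String, Long> run(List<String[]> users, List<String[]> clicks, List<String[]> geoinfo) {
        // ValuableClicks = filter Clicks by value > 0
        List<String[]> valuableClicks = new ArrayList<String[]>();
        for (String[] c : clicks) {
            if (Double.parseDouble(c[2]) > 0) {
                valuableClicks.add(c);
            }
        }
        // UserClicks = join Users by name, ValuableClicks by user (keeps the user's ipaddr)
        Map<String, String> ipByName = new HashMap<String, String>();
        for (String[] u : users) {
            ipByName.put(u[0], u[2]);
        }
        List<String> clickIps = new ArrayList<String>();
        for (String[] c : valuableClicks) {
            String ip = ipByName.get(c[0]);
            if (ip != null) {
                clickIps.add(ip);
            }
        }
        // UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr
        Map<String, String> dmaByIp = new HashMap<String, String>();
        for (String[] g : geoinfo) {
            dmaByIp.put(g[0], g[1]);
        }
        // ByDMA = group UserGeo by dma, then COUNT per group
        Map<String, Long> countPerDma = new TreeMap<String, Long>();
        for (String ip : clickIps) {
            String dma = dmaByIp.get(ip);
            if (dma != null) {
                Long old = countPerDma.get(dma);
                countPerDma.put(dma, old == null ? 1L : old + 1L);
            }
        }
        return countPerDma;
    }

    public static void main(String[] args) {
        System.out.println(run(
                rows(row("alice", "30", "10.0.0.1"), row("bob", "25", "10.0.0.2")),
                rows(row("alice", "/a", "1"), row("alice", "/b", "0"), row("bob", "/c", "2")),
                rows(row("10.0.0.1", "501"), row("10.0.0.2", "501"))));
    }
}
```

Each intermediate collection corresponds to one named relation in the Pig script, which is what makes a Pig pipeline easy to inspect step by step; the HiveQL version states only the final result and leaves the order of operations to the query planner.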
Impala
• Real-time queries, ~100x faster than Hive
• Direct data access
• Query data on HDFS or HBase
• Allows table joins and aggregation
[Figure: clients connect over ODBC to Impala, which reads data from HDFS or HBase]
• Row Groups: A group of rows in columnar format
‒ One (or more) per split while reading
‒ Max size buffered in memory while writing
‒ About 50MB < row group < 1GB
• Column Chunk: Data for one column in row group
‒ Column chunks can be read independently for efficient scans
• Page: Unit of access in a column chunk
‒ Should be big enough for efficient compression
‒ Min size to read while accessing a single record
‒ About 8KB < page < 1MB
Lars George, Cloudera. Data I/O, 2013
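A toy illustration of why column chunks allow efficient scans, assuming a fixed schema and in-memory lists: rows are split into one list per column on write, so a scan over one column never touches the others. This is not the Parquet file format or its API; ToyRowGroup and its methods are hypothetical names, and pages, encodings and compression are left out.

```java
import java.util.ArrayList;
import java.util.List;

// One row group held in memory: a separate "column chunk" (list) per column.
class ToyRowGroup {
    private final String[] columns;
    private final List<List<Object>> chunks = new ArrayList<List<Object>>();

    ToyRowGroup(String... columns) {
        this.columns = columns;
        for (int i = 0; i < columns.length; i++) {
            chunks.add(new ArrayList<Object>());
        }
    }

    // Writing: each incoming row is split across the column chunks.
    void addRow(Object... values) {
        if (values.length != columns.length) {
            throw new IllegalArgumentException("Row does not match schema");
        }
        for (int i = 0; i < values.length; i++) {
            chunks.get(i).add(values[i]);
        }
    }

    // Reading: a scan over one column returns only that column's chunk.
    List<Object> readColumn(String name) {
        for (int i = 0; i < columns.length; i++) {
            if (columns[i].equals(name)) {
                return chunks.get(i);
            }
        }
        throw new IllegalArgumentException("Unknown column: " + name);
    }

    // Example aggregate over a numeric column chunk.
    static double sum(List<Object> chunk) {
        double total = 0;
        for (Object v : chunk) {
            total += ((Number) v).doubleValue();
        }
        return total;
    }

    public static void main(String[] args) {
        ToyRowGroup group = new ToyRowGroup("caseId", "remainingCapital");
        group.addRow("c1", 100.0);
        group.addRow("c2", 50.5);
        System.out.println(sum(group.readColumn("remainingCapital")));
    }
}
```

Storing values of the same column next to each other is also what makes the per-page compression mentioned above effective, since neighbouring values tend to be similar.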
HBase
Column-oriented data storage. Very large tables – billions
of rows X millions of columns.
• Low Latency
• Random Reads And Writes (by PK)
• Distributed Key/Value Store; automatic region sharding
• Simple API
‒ PUT
‒ GET
‒ DELETE
‒ SCAN
HBase building blocks
• The most basic unit in HBase is a column
‒ Each column may have multiple versions, with each distinct value contained in a separate cell
‒ One or more columns form a row, which is addressed uniquely by a row key
‒ A row can have millions of columns
‒ Columns can be compressed or tagged to stay in memory
• A table is a collection of rows
‒ All rows are always sorted lexicographically by their row key
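The building blocks above can be modelled with nested sorted maps: row key → column → timestamp → value. The sketch below is a toy in-memory stand-in, not the HBase client API; ToyHTable and its methods are hypothetical names, and Java string order is used in place of HBase's byte order (the two agree for ASCII keys).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy table: rows sorted by key, each cell keeping several timestamped versions.
class ToyHTable {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    // PUT: adds a new version of a cell; older versions are kept.
    void put(String rowKey, String column, long timestamp, String value) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            row = new TreeMap<String, NavigableMap<Long, String>>();
            rows.put(rowKey, row);
        }
        NavigableMap<Long, String> versions = row.get(column);
        if (versions == null) {
            // Reverse order so the newest timestamp comes first.
            versions = new TreeMap<Long, String>(Collections.<Long>reverseOrder());
            row.put(column, versions);
        }
        versions.put(timestamp, value);
    }

    // GET: newest version of one cell, or null if the cell does not exist.
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).firstEntry().getValue();
    }

    // DELETE: removes a whole row.
    void delete(String rowKey) {
        rows.remove(rowKey);
    }

    // SCAN: row keys in [startRow, stopRow), in lexicographic order.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<String>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        ToyHTable table = new ToyHTable();
        table.put("row2", "cf:status", 1L, "OPEN");
        table.put("row10", "cf:status", 1L, "NEW");
        table.put("row2", "cf:status", 2L, "CLOSED");
        System.out.println(table.get("row2", "cf:status"));
        System.out.println(table.scan("row", "rox"));
    }
}
```

Note that "row10" sorts before "row2" under lexicographic order, which is why row keys that carry numbers are usually zero-padded or otherwise designed with the sort order in mind.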
HBase read
[Figure: the client, ZooKeeper, the HMaster and three RegionServers, all running on top of Hadoop]
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs.
• Workflow is a collection of actions
• Arranged in Directed Acyclic Graph
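[Figure: after job submission, the Oozie server schedules each action (MapReduce, Pig, ...) and receives its result; on failure the action is re-scheduled; when the workflow finishes the server reports "All done"]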
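The scheduling idea (run each action once its prerequisites are done, re-schedule on failure) can be sketched in a few lines. This is a toy model only: real Oozie workflows are defined in XML and run on the cluster, and the names ToyWorkflow, Action and execute are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy workflow: actions arranged in a directed acyclic graph, with retries.
class ToyWorkflow {

    interface Action {
        boolean run(); // true on success, false on failure
    }

    private final Map<String, Action> actions = new LinkedHashMap<String, Action>();
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<String, List<String>>();

    void add(String name, Action action, String... prerequisites) {
        actions.put(name, action);
        dependsOn.put(name, Arrays.asList(prerequisites));
    }

    // Runs every action after its prerequisites; a failed action is retried
    // until it succeeds or maxAttempts is reached. Returns the execution order.
    List<String> execute(int maxAttempts) {
        List<String> done = new ArrayList<String>();
        Set<String> remaining = new LinkedHashSet<String>(actions.keySet());
        while (!remaining.isEmpty()) {
            String next = null;
            for (String name : remaining) {
                if (done.containsAll(dependsOn.get(name))) {
                    next = name;
                    break;
                }
            }
            if (next == null) {
                throw new IllegalStateException("Cycle or unknown prerequisite: not a DAG");
            }
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                ok = actions.get(next).run(); // schedule, then wait for result or failure
            }
            if (!ok) {
                throw new IllegalStateException("Action failed after retries: " + next);
            }
            done.add(next);
            remaining.remove(next);
        }
        return done; // "All done"
    }

    public static void main(String[] args) {
        ToyWorkflow wf = new ToyWorkflow();
        wf.add("pig-report", new Action() {
            public boolean run() { return true; }
        }, "mapreduce-parse");
        wf.add("mapreduce-parse", new Action() {
            public boolean run() { return true; }
        });
        System.out.println(wf.execute(2));
    }
}
```

The acyclic requirement is what guarantees that the loop always finds a runnable action; a cycle is reported as an error instead of hanging.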
Hadoop v1 architecture
• JobTracker
‒ Manage Cluster Resources
‒ Job Scheduling
• TaskTracker
‒ Per-node agent
‒ Task management
• Single purpose – batch processing
Hadoop v2 - YARN architecture
• ResourceManager – Allocates cluster resources
• NodeManager – Enforces node resource allocations
• ApplicationMaster – Application lifecycle and task scheduler
• Multi-purpose system: batch, interactive querying, streaming, aggregation
[Figure: the client submits a job to the ResourceManager; NodeManagers host containers and the application master (App Mstr); arrows show MapReduce status, job submission, node status and resource requests]
Hadoop infrastructure integration
[Figure: Data In: TxB systems and IW extract programs deliver raw data over (S)FTP, SCP, HTTP(S) and JDBC into HDFS (RAW). The Hadoop cluster then runs parsing & validation, conversion & compression (HDFS binary), data quality analysis, business analytics, data transformation and data delivery (HDFS results), with monitoring and management across the cluster. Data Out: Intrum Web, PAM, GSS, Catalyst, Dashboard.]
Development environment integration
Generic enterprise stack
• Maven
• Spring, Spring-Hadoop
• Hibernate
• H2, MySQL cluster
• LDAP, Kerberos
• CI (Jenkins ...)
Java example: Hadoop task
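import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.util.Tool;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import javax.persistence.EntityManagerFactory;

@Component
public class HBaseEventLogToMySQL extends Configured implements Tool {
    @Autowired private EntityManagerFactory entityManagerFactory;
    @Override public int run(String[] args) throws Exception {
        // Resume after the last event already copied to MySQL, if there is one.
        LogAbstractEvent lastEvent = getLastMySQLEvent();
        Scan scan;
        if (lastEvent == null) {
            scan = new Scan();
        } else {
            // Appending MAX_VALUE makes the start row sort after the last copied key.
            scan = new Scan(Bytes.toBytes(lastEvent.getEventKey() + Character.MAX_VALUE));
        }
        final Configuration conf = HBaseConfiguration.create(getConf());
        HTable table = new HTable(conf, tableName);
        try (ResultScanner resultScanner = table.getScanner(scan)) {
            readRowsToMySQL(resultScanner);
        } finally {
            table.close();
        }
        return 0; // exit code reported to the caller of the Tool
    }
    // getLastMySQLEvent(), readRowsToMySQL(), tableName and LogAbstractEvent are not shown on the slide.
}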
Java example: Map part (Table)
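import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class BasicProdStatusHbaseMapper extends TableMapper<Text, MapWritableComparable> {
    @Override
    public void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // Decode the serialized case column into a field map (Map<String, Object> is an assumption).
        byte[] caseByteArr = value.getValue(FAMILY, CASE_QUALIFIER);
        Map<String, Object> caseMap = HbaseUtils.convertCaseColumnByteArrToMap(caseByteArr);
        MapWritableComparable map = new MapWritableComparable();
        map.put(new Text("originalCapital"),
                new DoubleWritable((Double) caseMap.get("OriginalCapital")));
        map.put(new Text("remainingCapital"),
                new DoubleWritable((Double) caseMap.get("RemainingCapital")));
        context.getCounter(COUNTERS_COMMON_GROUP, COUNTER_CASES_MAPPED).increment(1);
        // mainStatusCode, FAMILY, CASE_QUALIFIER and the counter names are defined elsewhere (not shown on the slide).
        context.write(new Text(mainStatusCode), map);
    }
}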
Java example: Reduce part
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BasicProdStatusHbaseReducer
        extends Reducer<Text, MapWritableComparable, BasicProdStatusWritable, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritableComparable> values, Context context)
            throws IOException, InterruptedException {
        String mainStatusCode = key.toString();
        // Aggregate the capital figures of all cases that share this status code.
        AggregationBean ab = new AggregationBean();
        for (MapWritableComparable map : values) {
            double originalCapital = ((DoubleWritable) map.get(new Text("originalCapital"))).get();
            double remainingCapital = ((DoubleWritable) map.get(new Text("remainingCapital"))).get();
            ab.add(originalCapital, remainingCapital);
        }
        // pid, AggregationBean and the counter names are defined elsewhere (not shown on the slide).
        context.write(ab.getDBObject(mainStatusCode, pid), NullWritable.get());
        context.getCounter(COMMON_GROUP, CASES_PROCESSED).increment(1);
    }
}
Q&A
We are hiring!
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 

Dernier (20)

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 

Big Data Processing Using Hadoop Infrastructure

  • 1. Big Data processing using Hadoop infrastructure
  • 2. Use case: Intrum Justitia SDC
      • 20 countries / different applications to process, store and analyze data
      • Non-unified data storage formats
      • High number of data objects (Records, Transactions, Entities)
      • May have complex strong or loose relation rules
      • Often: involving time-stamped events, made of incomplete data
  • 3. Possible solutions
      • Custom solution built from the ground up
      • Semi-clustered approach
          ‒ Tools from Oracle
          ‒ MySQL/PostgreSQL nodes
          ‒ Document-oriented tools like MongoDB
      • “Big Data” approach (Map-Reduce)
  • 4. Map-Reduce
      • Simple programming model that applies to many large-scale computing problems
      • Available in MongoDB for sharded data
      • MapReduce tools usually offer:
          ‒ automatic parallelization
          ‒ load balancing
          ‒ network/disk transfer optimization
          ‒ handling of machine failures
          ‒ robustness
      • Introduced by Google, open-source implementation by Apache (Hadoop), enterprise support by Cloudera
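The programming model named on this slide can be illustrated with a word count. The sketch below is a minimal, in-memory illustration in plain Java and does not use the Hadoop API; the class name `MapReduceSketch` and its methods are assumptions made for the example. A real framework would run the map and reduce calls in parallel on different machines and handle the grouping step (the shuffle) itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce model (hypothetical names, not Hadoop).
class MapReduceSketch {

    // Map phase: turn one input record (a line) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: combine all values that were emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every record, group the pairs by key (shuffle), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big cluster")));
    }
}
```

Because `map` looks at one record at a time and `reduce` at one key at a time, neither needs shared state, which is what makes automatic parallelization and re-execution after a machine failure possible.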
  • 5. CDH ecosystem
      [Diagram: components of the CDH stack]
      • Cloudera Manager
      • YARN (resource manager)
      • HDFS
      • MapReduce
      • HBase (NoSQL)
      • Hive (HQL)
      • Sqoop (import / export)
      • Parquet
      • Impala (SQL / ODBC)
      • Pig (Pig Latin)
      • ZooKeeper (coordination)
      • Hue
  • 6. HDFS
      • Hadoop Distributed File System
      • Redundancy
      • Fault tolerant
      • Scalable
      • Self-healing
      • Write once, read many times
      • Java API
      • Command-line tool
      • Mountable (FUSE)
  • 7. HDFS file read
      Code to data, not data to code
      [Diagram: a client application reads /bob/file.txt through the HDFS client; the name node maps the file to Block A and Block B and to the data nodes that hold them; DataNode 1 stores blocks C, B, D; DataNode 2 stores C, A, D; DataNode 3 stores C, B, A; the steps are numbered 1 to 4]
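The read path in the diagram can be sketched as a lookup followed by a replica choice. This is a simplified in-memory illustration in plain Java, not the HDFS client API; the class `HdfsReadSketch`, the method `planRead` and the "first live replica" rule are assumptions made for the example, and the block placement is read off the diagram labels.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of an HDFS read: ask the name node where blocks live, then pick a replica.
class HdfsReadSketch {

    // Name node metadata: file path -> ordered list of blocks.
    static final Map<String, List<String>> FILE_BLOCKS =
            Map.of("/bob/file.txt", List.of("A", "B"));

    // Name node metadata: block -> data nodes that hold a replica (taken from the diagram).
    static final Map<String, List<String>> BLOCK_LOCATIONS = Map.of(
            "A", List.of("DataNode 2", "DataNode 3"),
            "B", List.of("DataNode 1", "DataNode 3"));

    // For each block of the file, choose the first replica that is not on a dead node.
    static List<String> planRead(String path, Set<String> deadNodes) {
        List<String> blocks = FILE_BLOCKS.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        List<String> plan = new ArrayList<>();
        for (String block : blocks) {
            String chosen = null;
            for (String node : BLOCK_LOCATIONS.get(block)) {
                if (!deadNodes.contains(node)) {
                    chosen = node;
                    break;
                }
            }
            if (chosen == null) {
                throw new IllegalStateException("All replicas lost for block " + block);
            }
            plan.add(block + "@" + chosen);
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(planRead("/bob/file.txt", Set.of()));
        System.out.println(planRead("/bob/file.txt", Set.of("DataNode 2")));
    }
}
```

With every node alive the plan reads block A from DataNode 2 and block B from DataNode 1; if DataNode 2 is down, block A is served from DataNode 3 instead, which is the redundancy the previous slide lists.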
  • 8. Map-Reduce workflow and redundancy
      [Diagram: the user program forks a master and several workers (1); the master assigns map and reduce tasks (2); map workers read the input splits, Split 0 to Split 4 (3), and write intermediate files locally (4); reduce workers write Output File 0 and Output File 1 (6). Columns: Input files, MAP phase, Intermediate files, REDUCE phase, Output files]
  • 9. Hive and Pig
      • Hive
          ‒ High-level data access language
          ‒ Data Warehouse System for Hadoop
          ‒ Data Aggregation
          ‒ Ad-Hoc Queries
          ‒ SQL-like Language (HiveQL)
      • Pig
          ‒ High-level data access language (zero Java knowledge required)
          ‒ Data preparation (ETL)
          ‒ Pig Latin scripting language
      [Diagram: SQL → Hive → MapReduce; Pig Latin → Pig → MapReduce]
  • 10. HiveQL vs. Pig Latin
      Pig Latin is procedural, whereas HiveQL is declarative.

      HiveQL:

          insert into ValClickPerDMA
          select dma, count(*)
          from geoinfo join (
              select name, ipaddr
              from users join clicks on (users.name = clicks.user)
              where value > 0
          ) using ipaddr
          group by dma;

      Pig Latin:

          Users = load 'users' as (name, age, ipaddr);
          Clicks = load 'clicks' as (user, url, value);
          ValuableClicks = filter Clicks by value > 0;
          UserClicks = join Users by name, ValuableClicks by user;
          Geoinfo = load 'geoinfo' as (ipaddr, dma);
          UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
          ByDMA = group UserGeo by dma;
          ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
          store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
  • 11. Impala
      • Real-time queries, ~100x faster compared to Hive
      • Direct data access
      • Query data on HDFS or HBase
      • Allows table joins and aggregation
      [Diagram: ODBC → Impala → HDFS / HBase]
  • 12. Parquet
      • Row group: a group of rows in columnar format
          ‒ One (or more) per split while reading
          ‒ Max size buffered in memory while writing
          ‒ About 50 MB < row group < 1 GB
      • Column chunk: data for one column in a row group
          ‒ Column chunks can be read independently for efficient scans
      • Page: unit of access in a column chunk
          ‒ Should be big enough for efficient compression
          ‒ Min size to read while accessing a single record
          ‒ About 8 KB < page < 1 MB
      Source: Lars George, Cloudera. Data I/O, 2013
  • 13. HBase
      Column-oriented data storage. Very large tables: billions of rows × millions of columns.
      • Low latency
      • Random reads and writes (by PK)
      • Distributed key/value store; automatic region sharding
      • Simple API
          ‒ PUT
          ‒ GET
          ‒ DELETE
          ‒ SCAN
  • 14. HBase building blocks
      • The most basic unit in HBase is a column
          ‒ Each column may have multiple versions, with each distinct value contained in a separate cell
          ‒ One or more columns form a row, which is addressed uniquely by a row key
          ‒ A row can have millions of columns
          ‒ Can be compressed or tagged to stay in memory
      • A table is a collection of rows
          ‒ All rows are always sorted lexicographically by their row key
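The building blocks above (row key, column, versioned cell, lexicographic ordering) can be pictured as nested sorted maps. The sketch below is a plain-Java illustration of that data model only, not the HBase client API; the class `SortedTableSketch` and its method names are assumptions made for the example, and Java `String` ordering stands in for HBase's byte-wise ordering, which matches for ASCII keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model: row key -> column -> timestamp (newest first) -> value.
class SortedTableSketch {

    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // PUT: store one version of a cell; older versions of the same column are kept.
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // GET: return the newest version of a cell, or null if the cell does not exist.
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).firstEntry().getValue();
    }

    // SCAN: row keys in [startRow, stopRow), returned in lexicographic order.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        SortedTableSketch table = new SortedTableSketch();
        table.put("row-2", "status", 1L, "open");
        table.put("row-10", "status", 1L, "open");
        table.put("row-1", "status", 1L, "open");
        table.put("row-1", "status", 2L, "closed");
        System.out.println(table.get("row-1", "status"));
        System.out.println(table.scan("row-1", "row-2"));
    }
}
```

Note that "row-10" sorts before "row-2" because the comparison is character by character, not numeric; this is why row keys that embed numbers are often zero-padded.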
  • 16. Hadoop Oozie
      • Oozie is a workflow scheduler system to manage Hadoop jobs
      • A workflow is a collection of actions
      • Actions are arranged in a Directed Acyclic Graph
      [Diagram: a job is submitted to the Oozie server, which schedules actions (MapReduce, Pig, ...), receives each result, re-schedules on failure, and reports "All done"]
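Arranging actions in a directed acyclic graph means an action may start only after all of its prerequisites have finished. The sketch below shows one way to derive a valid execution order (Kahn's topological sort) in plain Java. It is not how Oozie is configured or implemented; the class `WorkflowDagSketch` and the action names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical workflow model: each action lists the actions that must finish first.
class WorkflowDagSketch {

    private final Map<String, Set<String>> dependsOn = new LinkedHashMap<>();

    void addAction(String name, String... prerequisites) {
        dependsOn.put(name, new LinkedHashSet<>(Arrays.asList(prerequisites)));
        for (String p : prerequisites) {
            dependsOn.putIfAbsent(p, new LinkedHashSet<>());
        }
    }

    // Kahn's algorithm: repeatedly run an action whose prerequisites are all done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Set<String>> e : dependsOn.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            if (e.getValue().isEmpty()) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String done = ready.poll();
            order.add(done);
            for (Map.Entry<String, Set<String>> e : dependsOn.entrySet()) {
                if (e.getValue().contains(done)
                        && remaining.merge(e.getKey(), -1, Integer::sum) == 0) {
                    ready.add(e.getKey());
                }
            }
        }
        if (order.size() != dependsOn.size()) {
            throw new IllegalStateException("Workflow contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowDagSketch wf = new WorkflowDagSketch();
        wf.addAction("import");
        wf.addAction("parse", "import");
        wf.addAction("analyze", "parse");
        wf.addAction("report", "analyze", "parse");
        System.out.println(wf.executionOrder());
    }
}
```

The cycle check is what the "acyclic" requirement buys: if two actions waited on each other, no order would exist and the workflow could never complete.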
  • 17. Hadoop v1 architecture
      • JobTracker
          ‒ Manages cluster resources
          ‒ Job scheduling
      • TaskTracker
          ‒ Per-node agent
          ‒ Task management
      • Single purpose: batch processing
  • 18. Hadoop v2 - YARN architecture
      • ResourceManager: allocates cluster resources
      • NodeManager: enforces node resource allocations
      • ApplicationMaster: application lifecycle and task scheduler
      • Multi-purpose system: batch, interactive querying, streaming, aggregation
      [Diagram: a client submits a job to the Resource Manager; Node Managers host containers and the application master (App Mstr); arrows labelled MapReduce Status, Job Submission, Node Status, Resource Request]
  • 19. Hadoop infrastructure integration
      [Diagram: data flow through the Hadoop cluster]
      • Data In: TxB and IW extract programs, delivered over (S)FTP, SCP, HTTP(S), JDBC
      • Storage stages: HDFS RAW → HDFS Binary → HDFS Results
      • Processing stages: Parsing & Validation, Conversion & Compression, Data Quality Analysis, Business Analytics, Data Transformation, Data Delivery
      • Data Out: Intrum Web, PAM, GSS, Catalyst, Dashboard
      • Monitoring and Management across the Hadoop Cluster
  • 20. Development environment integration
      Generic enterprise stack
      • Maven
      • Spring, Spring-Hadoop
      • Hibernate
      • H2, MySQL cluster
      • LDAP, Kerberos
      • CI (Jenkins ...)
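  • 21. Java example: Hadoop task

          import javax.persistence.EntityManagerFactory;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.conf.Configured;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.ResultScanner;
          import org.apache.hadoop.hbase.client.Scan;
          import org.apache.hadoop.hbase.util.Bytes;
          import org.apache.hadoop.util.Tool;
          import org.springframework.beans.factory.annotation.Autowired;
          import org.springframework.stereotype.Component;

          @Component
          public class HBaseEventLogToMySQL extends Configured implements Tool {

              @Autowired
              private EntityManagerFactory entityManagerFactory;

              @Override
              public int run(String[] args) throws Exception {
                  // Last event already copied to MySQL (helper not shown on the slide).
                  LogAbstractEvent lastEvent = getLastMySQLEvent();
                  Scan scan;
                  String lastEventKey = "";
                  if (lastEvent == null) {
                      scan = new Scan(); // nothing copied yet: scan the whole table
                  } else {
                      lastEventKey = lastEvent.getEventKey();
                      // Start the scan just after the last copied row key.
                      scan = new Scan(Bytes.toBytes(lastEventKey + Character.MAX_VALUE));
                  }
                  final Configuration conf = HBaseConfiguration.create(getConf());
                  HTable table = new HTable(conf, tableName); // tableName is defined outside this excerpt
                  ResultScanner resultScanner = table.getScanner(scan);
                  readRowsToMySQL(resultScanner); // helper not shown on the slide
                  return 0; // exit code required by Tool.run(); missing on the slide
              }
          }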
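  • 22. Java example: Map part (Table)

          import java.io.IOException;
          import java.util.Map;
          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
          import org.apache.hadoop.hbase.mapreduce.TableMapper;
          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.Text;

          public class BasicProdStatusHbaseMapper extends TableMapper<Text, MapWritableComparable> {

              @Override
              public void map(ImmutableBytesWritable key, Result value, Context context)
                      throws IOException, InterruptedException {
                  byte[] caseByteArr = value.getValue(FAMILY, CASE_QUALIFIER);
                  // The slide elides the type arguments; <String, Object> is an assumption.
                  Map<String, Object> caseMap = HbaseUtils.convertCaseColumnByteArrToMap(caseByteArr);
                  MapWritableComparable map = new MapWritableComparable();
                  map.put(new Text("originalCapital"), new DoubleWritable((Double) caseMap.get("OriginalCapital")));
                  map.put(new Text("remainingCapital"), new DoubleWritable((Double) caseMap.get("RemainingCapital")));
                  context.getCounter(COUNTERS_COMMON_GROUP, COUNTER_CASES_MAPPED).increment(1);
                  // mainStatusCode is not defined on the slide; its derivation is elided.
                  context.write(new Text(mainStatusCode), map);
              }
          }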
  • 23. Java example: Reduce part

          import java.io.IOException;
          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.NullWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Reducer;

          public class BasicProdStatusHbaseReducer
                  extends Reducer<Text, MapWritableComparable, BasicProdStatusWritable, NullWritable> {

              @Override
              protected void reduce(Text key, Iterable<MapWritableComparable> values, Context context)
                      throws IOException, InterruptedException {
                  String mainStatusCode = key.toString();
                  AggregationBean ab = new AggregationBean();
                  // Accumulate the capital figures of every case mapped to this status code.
                  for (MapWritableComparable map : values) {
                      double originalCapital = ((DoubleWritable) map.get(new Text("originalCapital"))).get();
                      double remainingCapital = ((DoubleWritable) map.get(new Text("remainingCapital"))).get();
                      ab.add(originalCapital, remainingCapital);
                  }
                  // pid is defined outside this excerpt.
                  context.write(ab.getDBObject(mainStatusCode, pid), NullWritable.get());
                  context.getCounter(COMMON_GROUP, CASES_PROCESSED).increment(1);
              }
          }
  • 24. Q&A. We are hiring!