SlideShare une entreprise Scribd logo
1  sur  32
Spring for Apache
Hadoop
By Zenyk Matchyshyn
Agenda
• Goals of the project
• Hadoop Introduction
• High level support
• Workflows
• Scripting & Migration
• Alternatives
• Testing & Related
BigData–Why?
Because of Terabytes and Petabytes:
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection
Goals
• Provide programmatic model to work with
Hadoop ecosystem
• Simplify client libraries usage
• Provide Spring friendly wrappers
• Enable real-world usage as a part of
Spring Batch & Spring Integration
• Leverage Spring features
Supporteddistros
• Apache Hadoop 1.2.1/2.0.6/2.2.0
• Cloudera CDH4
• Hortonworks HDP 1.3
• Pivotal HD 1.0/1.1
HADOOP INTRODUCTION
Hadoop
Hadoop
Map/Reduce
HDFS
HBase
Pig Hive
Hadoopbasics
Split Map Shuffle Reduce
Dog ate the bone
Cat ate the fish
Dog, 1
Ate, 1
The, 1
Bone, 1
Cat, 1
Ate, 1
The, 1
Fish,1
Dog, 1
Ate, {1, 1}
The, {1, 1}
Bone, 1
Cat, 1
Fish,1
Dog, 1
Ate, 2
The, 2
Bone, 1
Cat, 1
Fish,1
Configuration
< … XML …>
<context:property-placeholder
location="hadoop.properties"/>
<hdp:configuration>
fs.default.name=${hd.fs}
mapred.job.tracker=${hd.jt}
</hdp:configuration>
<… XML … >
Job definition
<hdp:job id=“hadoopJob"
input-path="${wordcount.input.path}"
output-path="${wordcount.output.path}"
libs="file:${app.repo}/supporting-lib-*.jar"
mapper="org.company.Mapper"
reducer="org.company.Reducer"/>
Configuration conf = new Configuration();
Job job = new Job(conf, “hadoopJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Maper.class);
job.setReducerClass(Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Job Execution
<hdp:job-runner id="runner" run-at-startup="true"
pre-action=“someScript“
post-action=“someOtherScript“
job-ref=“hadoopJob" />
• Basic:
• Scheduled
– TaskScheduler
– Quartz
• Custom
HIGH LEVEL TOOLS
Solutions
• HBase
• Hive
• Pig
• Cascading
Simplifies
• Thread safety
• DAO friendliness, wrappers and basic
mappers
• Simple connection interfaces
• Runners, Template and callback
methods
• Common scenarios simplifications
• Scripting support
Example- Template
template.execute("MyTable", new TableCallback<Object>() {
@Override
public Object doInTable(HTable table) throws Throwable {
Put p = new Put(Bytes.toBytes("SomeRow"));
p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
table.put(p);
return null;
}
});
<hdp:hbase-configuration/>
<bean id="hbaseTemplate"
class="org.springframework.data.hadoop.hbase.HbaseTemplate"
p:configuration-ref="hbaseConfiguration"/>
Example–ScriptRunner
<hdp:hive-server host=“hivehost" port="10001" />
<hdp:hive-template />
<hdp:hive-client-factory host="some-host" port="some-port" >
<hdp:script location="classpath:org/company/hive/script.q">
<arguments>ignore-case=true</arguments>
</hdp:script>
</hdp:hive-client-factory>
<hdp:hive-runner id="hiveRunner" run-at-startup="true">
<hdp:script>
DROP TABLE IF EXITS testHiveBatchTable;
CREATE TABLE testHiveBatchTable (key int, value string);
</hdp:script>
<hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>
WORKFLOWS
TypicalBig Data ProcessingFlow
Capture Pre-Process Insert Process Extract Present
SpringBatch &Spring Integration
• Big Data Flows are based on Spring
Integration & Spring Batch
• Spring for Hadoop provides:
– Spring Batch tasklets
– Spring Integration support
Tasklets
• Job runners
• Script runners
• Hive
• Pig
• Cascading
Example
<hdp:job-tasklet id="hadoop-tasklet" job-ref="mr-job" wait-for-completion="true" />
<batch:job id="job1">
<batch:step id="import" next=“ht">
<batch:tasklet ref="script-tasklet"/>
</batch:step>
<batch:step id=“ht">
<batch:tasklet ref=" hadoop-tasklet" />
</batch:step>
</batch:job>
SCRIPTING & MIGRATION
Details
• Supports JVM languages from JSR-223
(Groovy, JRuby, Jython, Rhino)
• Exposes SimplerFileSystem
• Provides implicit variables
• Exposes FsShell to mimic HDFS shell
• Exposes DistCp to mimic distcp from
Hadoop
Example
<hdp:script-tasklet id="script-tasklet">
<hdp:script language="groovy">
inputPath = "/user/gutenberg/input/word/"
outputPath = "/user/gutenberg/output/word/"
if (fsh.test(inputPath)) {
fsh.rmr(inputPath) }
if (fsh.test(outputPath)) {
fsh.rmr(outputPath) }
inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
fsh.put(inputFile, inputPath)
</hdp:script>
</hdp:script-tasklet>
Migration
Hadoop Streaming:
Hadoop Tool Executor:
<hdp:streaming id="streaming"
input-path="/input/" output-path="/ouput/"
mapper="${path.cat}" reducer="${path.wc}"/>
<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
<hdp:arg value="data/in.txt"/>
<hdp:arg value="data/out.txt"/>
property=value
</hdp:tool-runner>
Alternatives
• Apache Flume – distributed data collection
• Apache Oozie – workflow scheduler
• Apache Sqoop – SQL bulk import/export
TESTING & RELATED TOOLS
Testing
• JUnit/Mocks + MRUnit
• Mini-HDFS and Mini-MapReduce
cluster
• LocalJobRunner
SpringYARN
HDFS
storage
Map/Reduce
cluster / data process
YARN
cluster
HDFS
storage
Map/Reduce
data process
Other
like Spark - data
Hadoop 1.x Hadoop 2.x
SpringeXtremeData (XD)
• Ultimate data processing solution
• Implements most common approach,
business logic up to you
• On top of Spring Batch and Spring
Integration
• Has DSL
• Scalable
More speedups
• Use provider quick start VM for initial
development
• Use cloud based images for production
(start/stop)
• Don’t use Map/Reduce without real need.
Start with higher abstraction.
• Don’t migrate without real need!
• Invest in DevOps (Chef / Puppet /
Vagrant…)
Q/A
?

Contenu connexe

Tendances

HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
Edureka!
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
Reza Ameri
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
Mitsuharu Hamba
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
DataWorks Summit
 

Tendances (20)

HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Faster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at YahooFaster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at Yahoo
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Hive
HiveHive
Hive
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Hadoop sqoop
Hadoop sqoop Hadoop sqoop
Hadoop sqoop
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
알쓸신잡
알쓸신잡알쓸신잡
알쓸신잡
 

Similaire à Rapid Development of Big Data applications using Spring for Apache Hadoop

Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
Jesus Rodriguez
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
DataWorks Summit
 
KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)
ch koti
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
Rajesh Nadipalli
 

Similaire à Rapid Development of Big Data applications using Spring for Apache Hadoop (20)

Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
5. pivotal hd 2013
5. pivotal hd 20135. pivotal hd 2013
5. pivotal hd 2013
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 
Debugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-CloudDebugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-Cloud
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
 
KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)KOTI_RESUME_(1) (2)
KOTI_RESUME_(1) (2)
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platform
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Srikanth hadoop 3.6yrs_hyd
Srikanth hadoop 3.6yrs_hydSrikanth hadoop 3.6yrs_hyd
Srikanth hadoop 3.6yrs_hyd
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 

Plus de zenyk

SEMASEARCH - Високі технології у боротьбі з корупцією та на захисті держави
SEMASEARCH - Високі технології у боротьбі з корупцією та на захисті державиSEMASEARCH - Високі технології у боротьбі з корупцією та на захисті держави
SEMASEARCH - Високі технології у боротьбі з корупцією та на захисті держави
zenyk
 

Plus de zenyk (12)

Semasearch Spring - 2015
Semasearch   Spring - 2015Semasearch   Spring - 2015
Semasearch Spring - 2015
 
Проект Каскад
Проект КаскадПроект Каскад
Проект Каскад
 
Ecois.me and uMuni
Ecois.me and uMuniEcois.me and uMuni
Ecois.me and uMuni
 
Semasearch Intro
Semasearch IntroSemasearch Intro
Semasearch Intro
 
SEMASEARCH - Високі технології у боротьбі з корупцією та на захисті держави
SEMASEARCH - Високі технології у боротьбі з корупцією та на захисті державиSEMASEARCH - Високі технології у боротьбі з корупцією та на захисті держави
SEMASEARCH - Високі технології у боротьбі з корупцією та на захисті держави
 
Introduction to Clojure - EDGE Lviv
Introduction to Clojure - EDGE LvivIntroduction to Clojure - EDGE Lviv
Introduction to Clojure - EDGE Lviv
 
Puppet / DevOps - EDGE Lviv
Puppet / DevOps - EDGE LvivPuppet / DevOps - EDGE Lviv
Puppet / DevOps - EDGE Lviv
 
Hadoop Solutions
Hadoop SolutionsHadoop Solutions
Hadoop Solutions
 
Emotional Intelligence
Emotional IntelligenceEmotional Intelligence
Emotional Intelligence
 
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLLviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
 
Amazon Clouds in Action
Amazon Clouds in ActionAmazon Clouds in Action
Amazon Clouds in Action
 
Modern Java Web Development
Modern Java Web DevelopmentModern Java Web Development
Modern Java Web Development
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Rapid Development of Big Data applications using Spring for Apache Hadoop