Faster Data Flows with Hive, Spring and Hadoop
Alex Silva

Principal Data Engineer
DATA ADVENTURES AT RACKSPACE
• Datasets
• Data pipeline: flows and systems
• Creating a generic Hadoop ETL framework
• Integrating Hadoop with Spring
• Spring Hadoop, Spring Batch and Spring Boot
• Hive
• File formats
• Queries and performance
MAAS Dataset
• System and platform monitoring
• Pings, SSH, HTTP, HTTPS checks
• Remote monitoring
• CPU, file system, load average, disk, memory
• MySQL, Apache
THE BUSINESS DOMAIN | 3
The Dataset
• Processing around 1.5B records/day
• Stored in Cassandra
• Exported to HDFS in batches
• TBs of uncompressed JSON (“raw data”) daily
• First dataset piped through ETL platform
DATA ENGINEERING STATS | 4
DATA PIPELINE
• Data flow
• Stages
• ETL
• Input formats
• Generic Transformation Layer
• Outputs
Data Flow Diagram
DATA FLOW | 6
Diagram summary: monitoring data is exported as JSON into HDFS. If the export is unavailable or malformed, the flow stops. Otherwise the extract-and-transform stage reads the JSON, logs bad rows and errors (via Flume), and writes a CSV staging file back to HDFS. The load stage then moves the data through a Hive staging table into the production table, applying partitioning, bucketing, and indexing.
Systems Diagram
SYSTEMS | 7
Diagram summary: monitoring events are exported as JSON into HDFS. The extract step runs as MapReduce 1.2.0 (HDP 1.3.2.0) jobs; the load step targets Hive 0.12.0. Bad records are routed to a sink through a Flume 1.5.0 Log4J appender. End users access the loaded data through Hive.
ETL Summary
• Extract
• JSON files in HDFS
• Transform
• Generic Java based ETL framework
• MapReduce jobs extract features
• Quality checks
• Load
• Load data into partitioned ORC Hive tables
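The load step can be sketched in HiveQL. This is only an illustration: the table and column names (`metrics_prod`, `metrics_staging`, etc.) are hypothetical, not the actual schema used in the pipeline.

```sql
-- Sketch: move quality-checked rows from the staging table into a
-- partitioned ORC production table using dynamic partitioning.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE metrics_prod PARTITION (dt)
SELECT entity_id, name, value, dt
FROM metrics_staging;
```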
DATA FLOW | 8
HADOOP
Hadoop: Pros
• Dataset volume
• Data grows at a very rapid rate
• Integrates with existing ecosystem
• HiveQL
• Experimentation and exploration
• No expensive software or hardware to buy
TOOLS AND TECHNOLOGIES | 10
Hadoop: Cons
• Job monitoring and scheduling
• Data quality
• Error handling and notification
• Programming model
• A generic framework mitigates some of these issues
TOOLS AND TECHNOLOGIES | 11
CAN WE OVERCOME SOME OF THOSE?
Keeping the Elephant “Lean”
• Job control without the complexity of external tools
• Checks and validations
• Unified configuration model
• Integration with scripts
• Automation
• Job restartability
DATA ENGINEERING | 13
HEY! WHAT ABOUT SPRING?
SPRING DATA HADOOP
What is it about?
• Part of the Spring Framework
• Run Hadoop apps as standard Java apps using DI
• Unified declarative configuration model
• APIs to run MapReduce, Hive, and Pig jobs.
• Script HDFS operations using any JVM-based language.
• Supports both classic MR and YARN
TOOLS AND TECHNOLOGIES | 16
The Apache Hadoop Namespace
TOOLS AND TECHNOLOGIES | 17
Also supports annotation based configuration via the
@EnableHadoop annotation.
Job Configuration: Standard Hadoop APIs
TOOLS AND TECHNOLOGIES | 18
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Configuring Hadoop with Spring
SPRING HADOOP | 19
<context:property-placeholder location="hadoop-dev.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<hdp:job id="word-count-job"
    input-path="${input.path}"
    output-path="${output.path}"
    jar="hadoop-examples.jar"
    mapper="examples.WordCount.WordMapper"
    reducer="examples.WordCount.IntSumReducer"/>

<hdp:job-runner id="runner" job-ref="word-count-job"
    run-at-startup="true"/>

hadoop-dev.properties:
input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000
SPRING HADOOP | 20
Configuration Attributes
Creating a Job
SPRING HADOOP | 21
Injecting Jobs
• Use DI to obtain reference to Spring managed Hadoop
job
• Perform additional validation and configuration before
submitting
TOOLS AND TECHNOLOGIES | 22
public class WordService {

  @Autowired
  private Job mapReduceJob;

  public void processWords() {
    mapReduceJob.submit();
  }
}
Running a Job
TOOLS AND TECHNOLOGIES | 23
Distributed Cache
TOOLS AND TECHNOLOGIES | 24
Using Scripts
TOOLS AND TECHNOLOGIES | 25
Scripting Implicit Variables
TOOLS AND TECHNOLOGIES | 26
Scripting Support in HDFS
• FSShell is designed to support scripting languages
• Use these for housekeeping tasks:
• Check for files, prepare input data, clean output
directories, set flags, etc.
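A housekeeping script of this kind might look like the following Spring Hadoop configuration. This is a sketch: the paths are hypothetical, and `fsh` is the implicit FsShell variable Spring Hadoop exposes to scripts.

```xml
<!-- Sketch: clear the output directory and verify input before a job runs. -->
<hdp:script id="setup-script" language="groovy" run-at-startup="true">
  if (fsh.test("/wc/word"))
    fsh.rmr("/wc/word")
  if (!fsh.test("/wc/input"))
    throw new IllegalStateException("input data missing")
</hdp:script>
```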
TOOLS AND TECHNOLOGIES | 27
SPRING BATCH
What is it about?
• Born out of collaboration with Accenture in 2007
• Fully automated processing of large volumes of data.
• Logging, txn management, listeners, job statistics,
restart, skipping, and resource management.
• Automatic retries after failure
• Sync, async and parallel processing
• Data partitioning
TOOLS AND TECHNOLOGIES | 29
Hadoop Workflow Orchestration
• Complex data flows
• Reuses batch infrastructure to manage Hadoop workflows.
• Steps can be any Hadoop job type or HDFS script
• Jobs can be invoked by events or scheduled.
• Steps can be sequential, conditional, split, concurrent, or programmatically determined.
• Works with flat files, XML, or databases.
TOOLS AND TECHNOLOGIES | 30
Spring Batch Configuration
• Jobs are composed of steps
TOOLS AND TECHNOLOGIES | 31
<job id="job1">
    <step id="import" next="wordcount">
        <tasklet ref="import-tasklet"/>
    </step>
    <step id="wordcount" next="pig">
        <tasklet ref="wordcount-tasklet"/>
    </step>
    <step id="pig" next="parallel">
        <tasklet ref="pig-tasklet"/>
    </step>
    <split id="parallel" next="hdfs">
        <flow>
            <step id="mrStep">
                <tasklet ref="mr-tasklet"/>
            </step>
        </flow>
        <flow>
            <step id="hive">
                <tasklet ref="hive-tasklet"/>
            </step>
        </flow>
    </split>
    <step id="hdfs">
        <tasklet ref="hdfs-tasklet"/>
    </step>
</job>
Spring Data Hadoop Integration
TOOLS AND TECHNOLOGIES | 32
SPRING BOOT
What is it about?
• Builds production-ready Spring applications.
• Creates a “runnable” jar with dependencies and classpath settings.
• Can embed Tomcat or Jetty within the JAR
• Automatic configuration
• Out of the box features:
• statistics, metrics, health checks and externalized configuration
• No code generation and no requirement for XML configuration.
TOOLS AND TECHNOLOGIES | 34
PUTTING IT ALL TOGETHER
Spring Data Flow Components
TOOLS AND TECHNOLOGIES | 36
Diagram summary: a Spring Boot runnable jar wraps the pipeline. Spring Batch 2.0 orchestrates the extract and load steps, and Spring Hadoop drives the MapReduce (HDP 1.3) and Hive 0.12.0 jobs against HDFS.
Hierarchical View
TOOLS AND TECHNOLOGIES | 37
• Spring Boot
• Spring Batch: job control
• Spring Hadoop: notifications, validation, scheduling, data flow, callbacks
HADOOP DATA FLOWS, SPRINGFIED
Spring Hadoop Configuration
• Job parameters configured by Spring
• Sensible defaults used
• Parameters can be overridden:
• External properties file.
• At runtime via system properties: -Dproperty.name=property.value
TOOLS AND TECHNOLOGIES | 39
<configuration>
fs.default.name=${hd.fs}
io.sort.mb=${io.sort.mb:640mb}
mapred.reduce.tasks=${mapred.reduce.tasks:1}
mapred.job.tracker=${hd.jt:local}
mapred.child.java.opts=${mapred.child.java.opts}
</configuration>
MapReduce Jobs
• Configured via Spring Hadoop
• One job per entity
TOOLS AND TECHNOLOGIES | 40
<job id="metricsMR"
    input-path="${mapred.input.path}"
    output-path="${mapred.output.path}"
    mapper="GenericETLMapper"
    reducer="GenericETLReducer"
    input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
    output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
    key="TextArrayWritable"
    value="org.apache.hadoop.io.NullWritable"
    map-key="org.apache.hadoop.io.Text"
    map-value="org.apache.hadoop.io.Text"
    jar-by-class="GenericETLMapper">
    volga.etl.dto.class=Metric
</job>
MapReduce Jobs
• Jobs are wrapped into Tasklet definitions
TOOLS AND TECHNOLOGIES | 41
<job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
Hive Configuration
• Hive steps also defined as tasklets
• Parameters are passed from MapReduce phase to Hive
phase
TOOLS AND TECHNOLOGIES | 42
<hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>
<hive-tasklet id="load-notifications">
<script location="classpath:hive/ddl/notifications-load.hql"/>
</hive-tasklet>
<hive-tasklet id="load-metrics">
<script location="classpath:hive/ddl/metrics-load.hql">
<arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
</script>
</hive-tasklet>
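The referenced HQL script might consume that argument roughly as follows. This is a sketch: the table name is hypothetical, and the exact variable-substitution syntax depends on the Hive version.

```sql
-- metrics-load.hql (sketch): load MapReduce output into the staging table.
LOAD DATA INPATH '${INPUT_PATH}' INTO TABLE metrics_staging;
```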
Spring Batch Configuration
• One Spring Batch job per entity.
TOOLS AND TECHNOLOGIES | 43
<job id="metrics" restartable="false" parent="VolgaETLJob">
    <step id="cleanMetricsOutputDirectory" next="metricsMapReduce">
        <tasklet ref="setUpJobTasklet"/>
    </step>
    <step id="metricsMapReduce">
        <tasklet ref="metricsJobTasklet">
            <listeners>
                <listener ref="mapReduceErrorThresholdListener"/>
            </listeners>
        </tasklet>
        <fail on="FAILED" exit-code="Map Reduce Step Failed"/>
        <end on="COMPLETED"/>
        <!--<next on="*" to="loadMetricsIntoHive"/>-->
    </step>
    <step id="loadMetricsIntoHive">
        <tasklet ref="load-metrics"/>
    </step>
</job>
Spring Batch Listeners
• Monitor job flow
• Take action on job failure
• PagerDuty notifications
• Save job counters to the audit database
• Notify team if counters are not consistent with historical
audit data (based on thresholds)
TOOLS AND TECHNOLOGIES | 44
Spring Boot: Pulling Everything Together
• Runnable jar created during build process
• Controlled by Maven plugin
TOOLS AND TECHNOLOGIES | 45
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<finalName>maas-etl-${project.version}</finalName>
<classifier>spring</classifier>
<mainClass>com.rackspace....JobRunner</mainClass>
<excludeGroupIds>org.slf4j</excludeGroupIds>
</configuration>
</plugin>
HIVE
• Typical Use Cases
• File formats
• ORC
• Abstractions
• Hive in the monitoring pipeline
• Query performance
Overview
• Translates SQL commands into MR jobs.
• Structured and unstructured data in multiple formats
• Standard access protocols, including JDBC and Thrift
• Provides several serialization mechanisms
• Integrates seamlessly with Hadoop: HCatalog, Pig,
HBase, etc.
HIVE | 47
Hive vs. RDBMS
HIVE | 48
Hive | Traditional Databases
SQL interface | SQL interface
Focus on batch analytics | Mostly online, interactive analytics
No transactions | Transactions are their way of life
No random inserts; updates not natively supported (but possible) | Random inserts and updates
Distributed processing via MR | Distributed processing capabilities vary
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Expensive, proprietary hardware
Low cost per petabyte | What's a petabyte?
Abstraction Layers in Hive
49HIVE |
Diagram summary: a database contains tables; a table is divided into partitions, optionally organized around skewed or unskewed keys; each partition can be further subdivided into buckets. Partitioning and bucketing are both optional layers.
Schemas and File Formats
• We used the ORCFile format: built-in, easy to use and efficient.
• Efficient light-weight + generic compression
• Run length encoding for integers and strings, dictionary encoding, etc.
• Generic compression: Snappy, LZO, and ZLib (default)
• High performance
• Indexes value ranges within blocks of ORCFile data
• Predicate filter pushdown allows efficient scanning during queries.
• Flexible Data Model
• All Hive types are supported, including maps, structs and unions.
HIVE | 50
The ORC File Format
• An ORC file contains groups of row data called stripes,
along with auxiliary information in a file footer.
• Default size is 256 MB (orc.stripe.size).
• Large stripes allow for efficient reads from HDFS, configured independently from the block size.
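Stripe size can be set per table at creation time. A sketch, with a hypothetical table definition:

```sql
CREATE TABLE metrics_orc (ts BIGINT, name STRING, value DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size" = "268435456");  -- 256 MB stripes
```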
HIVE | 51
The ORC File Format: Index
• Doesn’t answer queries
• Required for skipping rows:
• Row index entries provide offsets that enable seeking
• Min and max values for each column
HIVE | 52
ORC File Index Skipping
HIVE | 53
Skipping works for number types and for string types.
Done by recording a min and max value inside the inline index
and determining if the lookup value falls outside that range.
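The skipping decision itself is simple. A minimal Java sketch of the min/max test (an illustration of the idea, not ORC's actual implementation):

```java
public class MinMaxSkip {
    // A row group can be skipped when the lookup value cannot fall
    // inside the [min, max] range recorded in the inline index.
    static boolean canSkip(long lookup, long min, long max) {
        return lookup < min || lookup > max;
    }

    public static void main(String[] args) {
        System.out.println(canSkip(5, 10, 100));   // outside range: skip the group
        System.out.println(canSkip(50, 10, 100));  // inside range: must read it
    }
}
```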
The ORC File Format: File Footer
• List of stripes in the file, the number of rows per stripe,
each column's data type.
• Column-level aggregates: count, min, max, and sum.
• ORC uses the file footer to locate each column's data streams.
HIVE | 54
Predicate Pushdowns
• “Push down” parts of the query to where the data is.
• filter/skip as much data as possible, and
• greatly reduce input size.
• Sorting a table on its secondary keys also reduces
execution time.
• Sorted columns are grouped together in one area on disk, so the other regions can be skipped quickly.
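In Hive, pushdown is controlled by configuration flags; both settings below are standard Hive properties, while the table name is hypothetical:

```sql
-- Enable predicate pushdown so filters are applied as early as possible.
SET hive.optimize.ppd=true;
-- For ORC, also push the filter into the reader so row groups are skipped.
SET hive.optimize.index.filter=true;

SELECT name, value
FROM metrics_orc   -- hypothetical table
WHERE ts BETWEEN 1400000000 AND 1400086400;
```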
HIVE | 55
ORC File (diagram)
HIVE | 56
Query Performance
• Lower latency Hive queries rely on two major factors:
• Sorting and skipping data as much as possible
• Minimizing data shuffle from mappers to reducers
HIVE | 57
Improving Query Performance
• Divide data among different files/directories
• Partitions, buckets, etc.
• Skip records using small embedded indexes.
• ORCFile format.
• Sort data ahead of time.
• Simplifies joins, making ORCFile skipping more effective.
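These techniques come together in the table definition. A sketch with hypothetical names:

```sql
CREATE TABLE metrics (
    entity_id STRING,
    name      STRING,
    value     DOUBLE
)
PARTITIONED BY (dt STRING)             -- divide data among directories
CLUSTERED BY (entity_id)               -- divide each partition into buckets
SORTED BY (entity_id) INTO 32 BUCKETS  -- pre-sort for joins and skipping
STORED AS ORC;                         -- small embedded indexes
```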
HIVE | 58
The Big Picture
DATA ENGINEERING | 59
Diagram summary: preprocessing starts with JSON in HDFS, which MapReduce transforms into Hive-ready files, again stored in HDFS. The data load stage dynamically loads those files into a Hive staging table and then into the production table, applying partitioning, bucketing, and indexing. Data access is through an API, the Hive CLI, and Apache Thrift.
THANK YOU!
Get in touch:
alexvsilva@gmail.com
@thealexsilva

Sample Devops SRE Product Companies .pdf
Vineet
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 

Dernier (20)

Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 

• Job control without the complexity of external tools
• Checks and validations
• Unified configuration model
• Integration with scripts
• Automation
• Job restartability
DATA ENGINEERING | 13
HEY! WHAT ABOUT SPRING?
Spring Hadoop: What is it about?
• Part of the Spring Framework
• Run Hadoop apps as standard Java apps using DI
• Unified declarative configuration model
• APIs to run MapReduce, Hive, and Pig jobs
• Script HDFS operations using any JVM-based language
• Supports both classic MR and YARN
TOOLS AND TECHNOLOGIES | 16
The Apache Hadoop Namespace
Also supports annotation-based configuration via the @EnableHadoop annotation.
TOOLS AND TECHNOLOGIES | 17
Job Configuration: Standard Hadoop APIs
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCountMapper.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
TOOLS AND TECHNOLOGIES | 18
Configuring Hadoop with Spring
    <context:property-placeholder location="hadoop-dev.properties"/>

    <hdp:configuration>
        fs.default.name=${hd.fs}
    </hdp:configuration>

    <hdp:job id="word-count-job"
        input-path="${input.path}"
        output-path="${output.path}"
        jar="hadoop-examples.jar"
        mapper="examples.WordCount.WordMapper"
        reducer="examples.WordCount.IntSumReducer"/>

    <hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/>

    hadoop-dev.properties:
    input.path=/wc/input/
    output.path=/wc/word/
    hd.fs=hdfs://localhost:9000
SPRING HADOOP | 19
Configuration Attributes
SPRING HADOOP | 20
Creating a Job
SPRING HADOOP | 21
Injecting Jobs
• Use DI to obtain a reference to a Spring-managed Hadoop job
• Perform additional validation and configuration before submitting
    public class WordService {

        @Autowired
        private Job mapReduceJob;

        public void processWords() {
            mapReduceJob.submit();
        }
    }
TOOLS AND TECHNOLOGIES | 22
Running a Job
TOOLS AND TECHNOLOGIES | 23
Distributed Cache
TOOLS AND TECHNOLOGIES | 24
Using Scripts
TOOLS AND TECHNOLOGIES | 25
Scripting Implicit Variables
TOOLS AND TECHNOLOGIES | 26
Scripting Support in HDFS
• FsShell is designed to support scripting languages
• Use these for housekeeping tasks:
• Check for files, prepare input data, clean output directories, set flags, etc.
TOOLS AND TECHNOLOGIES | 27
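The housekeeping tasks above can be sketched outside a cluster with plain java.nio.file. This is a hypothetical stand-in: on HDFS the same checks would go through FsShell or the FileSystem API rather than the local filesystem.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Stand-in for the HDFS housekeeping a script would run before a job:
// check that input exists, clear the previous run's (flat) output
// directory, and drop a flag file for downstream steps. Pure java.nio.file
// is used so the sketch runs anywhere.
public class Housekeeping {

    public static boolean prepare(Path input, Path output) throws IOException {
        if (!Files.exists(input)) {
            return false;                          // nothing to process yet
        }
        if (Files.exists(output)) {                // clean previous output
            try (var entries = Files.list(output)) {
                for (Path p : entries.toList()) {
                    Files.delete(p);
                }
            }
            Files.delete(output);
        }
        Files.createDirectories(output);
        Files.createFile(output.resolve("_READY")); // flag for later steps
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("housekeeping");
        Path input = Files.createFile(base.resolve("input.json"));
        Path output = base.resolve("out");
        System.out.println(prepare(input, output));                 // true
        System.out.println(Files.exists(output.resolve("_READY"))); // true
    }
}
```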
Spring Batch: What is it about?
• Born out of a collaboration with Accenture in 2007
• Fully automated processing of large volumes of data
• Logging, transaction management, listeners, job statistics, restart, skipping, and resource management
• Automatic retries after failure
• Sync, async, and parallel processing
• Data partitioning
TOOLS AND TECHNOLOGIES | 29
Hadoop Workflow Orchestration
• Complex data flows
• Reuses batch infrastructure to manage Hadoop workflows
• Steps can be any Hadoop job type or HDFS script
• Jobs can be invoked by events or scheduled
• Steps can be sequential, conditional, split, concurrent, or programmatically determined
• Works with flat files, XML, or databases
TOOLS AND TECHNOLOGIES | 30
Spring Batch Configuration
• Jobs are composed of steps
    <job id="job1">
        <step id="import" next="wc">
            <tasklet ref="import-tasklet"/>
        </step>
        <step id="wc" next="pig">
            <tasklet ref="wordcount-tasklet"/>
        </step>
        <step id="pig" next="parallel">
            <tasklet ref="pig-tasklet"/>
        </step>
        <split id="parallel" next="hdfs">
            <flow>
                <step id="mrStep">
                    <tasklet ref="mr-tasklet"/>
                </step>
            </flow>
            <flow>
                <step id="hive">
                    <tasklet ref="hive-tasklet"/>
                </step>
            </flow>
        </split>
        <step id="hdfs">
            <tasklet ref="hdfs-tasklet"/>
        </step>
    </job>
TOOLS AND TECHNOLOGIES | 31
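The step transitions in a configuration like this can be illustrated with a toy sequencer: each step names its successor, and any failure stops the flow. All names here are illustrative; this is not the Spring Batch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy job sequencer mimicking Spring Batch step transitions: each step
// registers a body and the id of the step to run on success; a failing
// step ends the job with a FAILED outcome.
public class JobFlow {

    private final Map<String, Supplier<Boolean>> steps = new LinkedHashMap<>();
    private final Map<String, String> next = new LinkedHashMap<>();

    public JobFlow step(String id, Supplier<Boolean> body, String onSuccess) {
        steps.put(id, body);
        next.put(id, onSuccess);   // null means the job ends after this step
        return this;
    }

    /** Runs from the first registered step; returns true if all steps pass. */
    public boolean run() {
        String id = steps.keySet().iterator().next();
        while (id != null) {
            if (!steps.get(id).get()) {
                return false;      // FAILED: stop the flow
            }
            id = next.get(id);
        }
        return true;               // COMPLETED
    }

    public static void main(String[] args) {
        boolean ok = new JobFlow()
            .step("import",    () -> true, "wordcount")
            .step("wordcount", () -> true, "hdfs")
            .step("hdfs",      () -> true, null)
            .run();
        System.out.println(ok);    // true
    }
}
```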
Spring Data Hadoop Integration
TOOLS AND TECHNOLOGIES | 32
Spring Boot: What is it about?
• Builds production-ready Spring applications
• Creates a “runnable” jar with dependencies and classpath settings
• Can embed Tomcat or Jetty within the JAR
• Automatic configuration
• Out-of-the-box features: statistics, metrics, health checks, and externalized configuration
• No code generation and no requirement for XML configuration
TOOLS AND TECHNOLOGIES | 34
PUTTING IT ALL TOGETHER
Spring Data Flow Components
Spring Boot · Spring Batch 2.0 · Spring Hadoop 2.0 · HDFS · Hive 0.12.0 · MapReduce · HDP 1.3
TOOLS AND TECHNOLOGIES | 36
Hierarchical View
Spring Boot → Spring Batch (job control: notifications, validation, scheduling, data flow, callbacks) → Spring Hadoop
TOOLS AND TECHNOLOGIES | 37
HADOOP DATA FLOWS, SPRINGIFIED
Spring Hadoop Configuration
• Job parameters configured by Spring
• Sensible defaults used
• Parameters can be overridden:
• In an external properties file
• At runtime via system properties: -Dproperty.name=property.value
    <configuration>
        fs.default.name=${hd.fs}
        io.sort.mb=${io.sort.mb:640mb}
        mapred.reduce.tasks=${mapred.reduce.tasks:1}
        mapred.job.tracker=${hd.jt:local}
        mapred.child.java.opts=${mapred.child.java.opts}
    </configuration>
TOOLS AND TECHNOLOGIES | 39
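The ${name:default} override syntax can be illustrated with a minimal resolver: explicit properties win, otherwise the inline default after the colon is used. This is a simplified stand-in for Spring's placeholder handling, not its actual implementation.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal resolver for ${name:default} placeholders, mimicking the
// override order described above.
public class Placeholders {

    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    public static String resolve(String template, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            String fallback = m.group(2);          // null if no default given
            String value = props.getOrDefault(name, fallback);
            if (value == null) {
                throw new IllegalArgumentException("unresolved: " + name);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("hd.fs", "hdfs://localhost:9000");
        System.out.println(resolve("fs.default.name=${hd.fs}", props));
        // -> fs.default.name=hdfs://localhost:9000
        System.out.println(resolve("io.sort.mb=${io.sort.mb:640mb}", props));
        // -> io.sort.mb=640mb  (no override supplied, so the default wins)
    }
}
```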
MapReduce Jobs
• Configured via Spring Hadoop
• One job per entity
    <job id="metricsMR"
        input-path="${mapred.input.path}"
        output-path="${mapred.output.path}"
        mapper="GenericETLMapper"
        reducer="GenericETLReducer"
        input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
        output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
        key="TextArrayWritable"
        value="org.apache.hadoop.io.NullWritable"
        map-key="org.apache.hadoop.io.Text"
        map-value="org.apache.hadoop.io.Text"
        jar-by-class="GenericETLMapper">
        volga.etl.dto.class=Metric
    </job>
TOOLS AND TECHNOLOGIES | 40
MapReduce Jobs
• Jobs are wrapped into tasklet definitions
    <job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
TOOLS AND TECHNOLOGIES | 41
Hive Configuration
• Hive steps also defined as tasklets
• Parameters are passed from the MapReduce phase to the Hive phase
    <hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>

    <hive-tasklet id="load-notifications">
        <script location="classpath:hive/ddl/notifications-load.hql"/>
    </hive-tasklet>

    <hive-tasklet id="load-metrics">
        <script location="classpath:hive/ddl/metrics-load.hql">
            <arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
        </script>
    </hive-tasklet>
TOOLS AND TECHNOLOGIES | 42
Spring Batch Configuration
• One Spring Batch job per entity
    <job id="metrics" restartable="false" parent="VolgaETLJob">
        <step id="cleanMetricsOutputDirectory" next="metricsMapReduce">
            <tasklet ref="setUpJobTasklet"/>
        </step>
        <step id="metricsMapReduce">
            <tasklet ref="metricsJobTasklet">
                <listeners>
                    <listener ref="mapReduceErrorThresholdListener"/>
                </listeners>
            </tasklet>
            <fail on="FAILED" exit-code="Map Reduce Step Failed"/>
            <end on="COMPLETED"/>
            <!--<next on="*" to="loadMetricsIntoHive"/>-->
        </step>
        <step id="loadMetricsIntoHive">
            <tasklet ref="load-notifications"/>
        </step>
    </job>
TOOLS AND TECHNOLOGIES | 43
Spring Batch Listeners
• Monitor job flow
• Take action on job failure
• PagerDuty notifications
• Save job counters to the audit database
• Notify team if counters are not consistent with historical audit data (based on thresholds)
TOOLS AND TECHNOLOGIES | 44
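The threshold check described above can be sketched as a comparison of a job's record counters against a historical baseline; runs that deviate by more than an allowed fraction would trigger a notification. Names and tolerance values here are illustrative.

```java
// Hypothetical sketch of the audit check: flag runs whose counters
// deviate from the historical baseline by more than a tolerance fraction.
public class CounterAudit {

    /** Returns true when the counter is within tolerance of the baseline. */
    public static boolean withinThreshold(long observed, long baseline,
                                          double tolerance) {
        if (baseline == 0) {
            return observed == 0;
        }
        double deviation = Math.abs(observed - baseline) / (double) baseline;
        return deviation <= tolerance;
    }

    public static void main(String[] args) {
        long historicalAvg = 1_500_000_000L;        // ~1.5B records/day
        System.out.println(withinThreshold(1_470_000_000L, historicalAvg, 0.05)); // true
        System.out.println(withinThreshold(900_000_000L, historicalAvg, 0.05));   // false: alert
    }
}
```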
Spring Boot: Pulling Everything Together
• Runnable jar created during build process
• Controlled by a Maven plugin
    <plugin>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-maven-plugin</artifactId>
        <configuration>
            <finalName>maas-etl-${project.version}</finalName>
            <classifier>spring</classifier>
            <mainClass>com.rackspace....JobRunner</mainClass>
            <excludeGroupIds>org.slf4j</excludeGroupIds>
        </configuration>
    </plugin>
TOOLS AND TECHNOLOGIES | 45
HIVE
• Typical Use Cases
• File formats
• ORC
• Abstractions
• Hive in the monitoring pipeline
• Query performance
Overview
• Translates SQL commands into MR jobs
• Structured and unstructured data in multiple formats
• Standard access protocols, including JDBC and Thrift
• Provides several serialization mechanisms
• Integrates seamlessly with Hadoop: HCatalog, Pig, HBase, etc.
HIVE | 47
Hive vs. RDBMS
Hive                                                             | Traditional Databases
SQL interface                                                    | SQL interface
Focus on batch analytics                                         | Mostly online, interactive analytics
No transactions                                                  | Transactions are their way of life
No random inserts; updates not natively supported (but possible) | Random inserts and updates
Distributed processing via MR                                    | Distributed processing capabilities vary
Scales to hundreds of nodes                                      | Seldom scales beyond 20 nodes
Built for commodity hardware                                     | Expensive, proprietary hardware
Low cost per petabyte                                            | What’s a petabyte?
HIVE | 48
Abstraction Layers in Hive
Database → Table → Partition (skewed or unskewed keys) → Bucket (optional)
HIVE | 49
Schemas and File Formats
• We used the ORCFile format: built-in, easy to use, and efficient
• Efficient lightweight and generic compression
• Run-length encoding for integers and strings, dictionary encoding, etc.
• Generic compression: Snappy, LZO, and ZLib (the default)
• High performance
• Indexes value ranges within blocks of ORCFile data
• Predicate filter pushdown allows efficient scanning during queries
• Flexible data model
• Hive types are supported, including maps, structs, and unions
HIVE | 50
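The space savings from run-length encoding can be shown in miniature. This is a toy encoder; ORC's actual integer RLE is more elaborate (it also handles deltas and literal runs), but the idea of collapsing repeated column values is the same.

```java
import java.util.ArrayList;
import java.util.List;

// Toy run-length encoder: each run of repeated column values is stored
// as a {count, value} pair instead of the values themselves.
public class RunLength {

    public static List<long[]> encode(long[] values) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int j = i;
            while (j < values.length && values[j] == values[i]) {
                j++;
            }
            runs.add(new long[] { j - i, values[i] });  // {count, value}
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        long[] column = { 5, 5, 5, 5, 7, 7, 9 };
        for (long[] run : encode(column)) {
            System.out.println(run[0] + " x " + run[1]);
        }
        // 4 x 5
        // 2 x 7
        // 1 x 9
    }
}
```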
The ORC File Format
• An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer
• Default stripe size is 256 MB (orc.stripe.size)
• Large stripes allow for efficient reads from HDFS, configured independently from the block size
HIVE | 51
The ORC File Format: Index
• Doesn’t answer queries
• Required for skipping rows:
• Row index entries provide offsets that enable seeking
• Min and max values for each column
HIVE | 52
ORC File Index Skipping
Skipping works for number types and for string types. It is done by recording a min and max value inside the inline index and determining whether the lookup value falls outside that range.
HIVE | 53
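That min/max skipping logic can be sketched as follows: each stripe records the range of a column, and a stripe whose range cannot contain the lookup value is skipped without reading its rows. Names here are illustrative, not ORC's reader API.

```java
// Sketch of index-based skipping: only stripes whose [min, max] range
// might contain the lookup value need to be scanned.
public class StripeSkipping {

    record StripeStats(long min, long max) {
        boolean mightContain(long value) {
            return value >= min && value <= max;
        }
    }

    public static int stripesToRead(StripeStats[] index, long lookup) {
        int toRead = 0;
        for (StripeStats s : index) {
            if (s.mightContain(lookup)) {
                toRead++;                // only these stripes are scanned
            }
        }
        return toRead;
    }

    public static void main(String[] args) {
        StripeStats[] index = {
            new StripeStats(0, 99),
            new StripeStats(100, 199),
            new StripeStats(150, 300),   // ranges may overlap across stripes
        };
        System.out.println(stripesToRead(index, 175));  // 2
        System.out.println(stripesToRead(index, 500));  // 0: whole file skipped
    }
}
```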
The ORC File Format: File Footer
• List of stripes in the file, the number of rows per stripe, and each column’s data type
• Column-level aggregates: count, min, max, and sum
• ORC uses the file footer to find the columns’ data streams
HIVE | 54
Predicate Pushdowns
• “Push down” parts of the query to where the data is:
• filter/skip as much data as possible, and
• greatly reduce input size
• Sorting a table on its secondary keys also reduces execution time
• Sorted columns are grouped together in one area on disk, and the other pieces are skipped very quickly
HIVE | 55
Query Performance
• Low-latency Hive queries rely on two major factors:
• Sorting and skipping data as much as possible
• Minimizing data shuffle from mappers to reducers
HIVE | 57
Improving Query Performance
• Divide data among different files/directories
• Partitions, buckets, etc.
• Skip records using small embedded indexes
• ORCFile format
• Sort data ahead of time
• Simplifies joins and makes ORCFile skipping more effective
HIVE | 58
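Partition-based layout can be illustrated with a helper that builds Hive-style name=value directory paths, so queries filtering on partition columns only touch the matching directories. The helper itself and the column names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch of Hive-style partition layout: each partition column becomes a
// name=value path segment under the table directory.
public class PartitionPath {

    public static String forRecord(String table,
                                   Map<String, String> partitionCols) {
        StringJoiner path = new StringJoiner("/");
        path.add(table);
        partitionCols.forEach((col, value) -> path.add(col + "=" + value));
        return path.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("dt", "2014-06-01");     // partition by day...
        cols.put("check_type", "http");   // ...then by monitoring check type
        System.out.println(forRecord("metrics", cols));
        // metrics/dt=2014-06-01/check_type=http
    }
}
```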
The Big Picture
JSON in HDFS → MapReduce preprocessing → Hive staging table → dynamic load (partitioning, bucketing, indexing) → production table → data access (API, Hive CLI, Apache Thrift)
DATA ENGINEERING | 59
THANK YOU!
Get in touch: alexvsilva@gmail.com · @thealexsilva