SlideShare une entreprise Scribd logo
1  sur  23
My life as a
  beekeeper
   @89clouds
Who am I?
Pedro Figueiredo (pfig@89clouds.com)

Hadoop et al

SocialFacebook games, media (TV,
publishing)

Elastic MapReduce, Cloudera

NoSQL, as in “Not a SQL guy”
The problem with
      Hive



It looks like SQL
No, seriously
SELECT
  CONCAT(vishi,vislo),
  SUM(
    CASE WHEN searchengine = 'google'
       THEN 1
       ELSE 0
    END
  ) AS google_searches
FROM omniture
WHERE
  year(hittime) = 2011 AND
  month(hittime) = 8 AND
  is_search = 'Y'
GROUP BY CONCAT(vishi,vislo);
“It’s just like
     Oracle!”
Analysts will be very happy

At least until they join with that 30
billion-record table

Pro tip: explain MapReduce and then
MAPJOIN

 set
hive.mapjoin.smalltable.filesize=xxx;
Your first interview
      question


 “Explain the difference
 between CREATE TABLE and
 CREATE EXTERNAL TABLE”
Dynamic partitions

Partitions are the poor person’s
indexes

Unstructured data is full of surprises
 set   hive.exec.dynamic.partition.mode=nonstrict;
 set   hive.exec.dynamic.partition=true;
 set   hive.exec.max.dynamic.partitions=100000;
 set   hive.exec.max.dynamic.partitions.pernode=100000;

Plan your partitions ahead
Multi-vitamins

You can minimise input scans by using
multi-table INSERTs:

FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
Persistence, do you
     speak it?
 External Hive metastore

 Avoid the pain of cluster set up

 Use an RDS metastore if on AWS, RDBMS
 otherwise.

 10GB will get you a long way, this
 thing is tiny
Now you have 2
      problems
Regular expressions are great, if
you’re using a real programming
language.

WHERE foo RLIKE ‘(a|b|c)’ will hurt

WHERE foo=‘a’ OR foo=‘b’ OR foo=‘c’

Generate these statements, if needs
be, it will pay off.
Avro

Serialisation framework (think
Thrift/Protocol Buffers).

Avro container files are
SequenceFile-like, splittable.

Support for snappy built-in.

If using the LinkedIn SerDe, the
table creation syntax changes.
Avro
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
  PARTITIONED BY (ds STRING)
  ROW FORMAT SERDE
    'com.linkedin.haivvreo.AvroSerDe'
  WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/
hadoop/avro/myschema.avsc')
  STORED AS
    INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
    OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
  LOCATION '/data/mytable'
;
MAKE! MONEY! FAST!


Use spot instances in EMR

Usually stick around until America
wakes up

Brilliant for worker nodes
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
hive.exec.compress.intermediate=true;
hive.exec.parallel=true;
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
hive.exec.compress.intermediate=true;
hive.exec.parallel=true;
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
hive.exec.compress.intermediate=true;
hive.exec.parallel=true;
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
hive.exec.compress.intermediate=true;
hive.exec.parallel=true;
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
hive.exec.compress.intermediate=true;
hive.exec.parallel=true;
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
hive.exec.compress.intermediate=true;
hive.exec.parallel=true;
To be or not to be
“Consider a traditional RDBMS”

At what size should we do this?

Hive is not an end, it’s the means

Data on HDFS/S3 is simply available,
not “available to Hive”

Hive isn’t suitable for near real
time
Hive != MapReduce

Don’t use Hive instead of Native/
Streaming

“I know, I’ll just stream this bit
through a shell script!”

Imo, Hive excels at analysis and
aggregation, so use it for that
Thank you



Fred Easey (@poppa_f)

Peter Hanlon
Questions?

 pfig@89clouds.com
 @pfig / @89clouds


http://89clouds.com/

Contenu connexe

Tendances

PuppetDB, Puppet Explorer and puppetdbquery
PuppetDB, Puppet Explorer and puppetdbqueryPuppetDB, Puppet Explorer and puppetdbquery
PuppetDB, Puppet Explorer and puppetdbqueryPuppet
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboyKenneth Geisshirt
 
HBase + Hue - LA HBase User Group
HBase + Hue - LA HBase User GroupHBase + Hue - LA HBase User Group
HBase + Hue - LA HBase User Groupgethue
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)Subhas Kumar Ghosh
 
puppet @techlifecookpad
puppet @techlifecookpadpuppet @techlifecookpad
puppet @techlifecookpadNaoya Nakazawa
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateYahoo Developer Network
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewDan Morrill
 
Docker tips & tricks
Docker  tips & tricksDocker  tips & tricks
Docker tips & tricksDharmit Shah
 
Working with databases in Perl
Working with databases in PerlWorking with databases in Perl
Working with databases in PerlLaurent Dami
 
COSCUP2012: How to write a bash script like the python?
COSCUP2012: How to write a bash script like the python?COSCUP2012: How to write a bash script like the python?
COSCUP2012: How to write a bash script like the python?Lloyd Huang
 
GoとElixir、同時開発した時の気づき
GoとElixir、同時開発した時の気づきGoとElixir、同時開発した時の気づき
GoとElixir、同時開発した時の気づきTakahiro Kobaru
 
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018Piotr Wikiel
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetupgethue
 
Value protocols and codables
Value protocols and codablesValue protocols and codables
Value protocols and codablesFlorent Vilmart
 
Parse, scale to millions
Parse, scale to millionsParse, scale to millions
Parse, scale to millionsFlorent Vilmart
 
サンプルから見るMap reduceコード
サンプルから見るMap reduceコードサンプルから見るMap reduceコード
サンプルから見るMap reduceコードShinpei Ohtani
 
Shell实现的windows回收站功能的脚本
Shell实现的windows回收站功能的脚本Shell实现的windows回收站功能的脚本
Shell实现的windows回收站功能的脚本Lingfei Kong
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in RustInfluxData
 

Tendances (20)

PuppetDB, Puppet Explorer and puppetdbquery
PuppetDB, Puppet Explorer and puppetdbqueryPuppetDB, Puppet Explorer and puppetdbquery
PuppetDB, Puppet Explorer and puppetdbquery
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboy
 
HBase + Hue - LA HBase User Group
HBase + Hue - LA HBase User GroupHBase + Hue - LA HBase User Group
HBase + Hue - LA HBase User Group
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
puppet @techlifecookpad
puppet @techlifecookpadpuppet @techlifecookpad
puppet @techlifecookpad
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overview
 
Docker tips & tricks
Docker  tips & tricksDocker  tips & tricks
Docker tips & tricks
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Working with databases in Perl
Working with databases in PerlWorking with databases in Perl
Working with databases in Perl
 
COSCUP2012: How to write a bash script like the python?
COSCUP2012: How to write a bash script like the python?COSCUP2012: How to write a bash script like the python?
COSCUP2012: How to write a bash script like the python?
 
GoとElixir、同時開発した時の気づき
GoとElixir、同時開発した時の気づきGoとElixir、同時開発した時の気づき
GoとElixir、同時開発した時の気づき
 
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop MeetupIntegrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
 
Value protocols and codables
Value protocols and codablesValue protocols and codables
Value protocols and codables
 
Parse, scale to millions
Parse, scale to millionsParse, scale to millions
Parse, scale to millions
 
サンプルから見るMap reduceコード
サンプルから見るMap reduceコードサンプルから見るMap reduceコード
サンプルから見るMap reduceコード
 
Shell实现的windows回收站功能的脚本
Shell实现的windows回收站功能的脚本Shell实现的windows回收站功能的脚本
Shell实现的windows回收站功能的脚本
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in Rust
 

En vedette

En vedette (6)

The problem with Perl
The problem with PerlThe problem with Perl
The problem with Perl
 
CPAN Training
CPAN TrainingCPAN Training
CPAN Training
 
Perl in Teh Cloud
Perl in Teh CloudPerl in Teh Cloud
Perl in Teh Cloud
 
30 Minutes To CPAN
30 Minutes To CPAN30 Minutes To CPAN
30 Minutes To CPAN
 
PERL Unit 6 regular expression
PERL Unit 6 regular expressionPERL Unit 6 regular expression
PERL Unit 6 regular expression
 
Logic Progamming in Perl
Logic Progamming in PerlLogic Progamming in Perl
Logic Progamming in Perl
 

Similaire à My life as a beekeeper

Good practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimizationGood practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimizationPrestaShop
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...Big Data Spain
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
Elasticsearch sur Azure : Make sense of your (BIG) data !
Elasticsearch sur Azure : Make sense of your (BIG) data !Elasticsearch sur Azure : Make sense of your (BIG) data !
Elasticsearch sur Azure : Make sense of your (BIG) data !Microsoft
 
Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.Peter Higgins
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04Krishna Sankar
 
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...GeeksLab Odessa
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
ClickHouse new features and development roadmap, by Aleksei Milovidov
ClickHouse new features and development roadmap, by Aleksei MilovidovClickHouse new features and development roadmap, by Aleksei Milovidov
ClickHouse new features and development roadmap, by Aleksei MilovidovAltinity Ltd
 

Similaire à My life as a beekeeper (20)

Hadoop
HadoopHadoop
Hadoop
 
Good practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimizationGood practices for PrestaShop code security and optimization
Good practices for PrestaShop code security and optimization
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
SQL -PHP Tutorial
SQL -PHP TutorialSQL -PHP Tutorial
SQL -PHP Tutorial
 
Sql user group
Sql user groupSql user group
Sql user group
 
Elasticsearch sur Azure : Make sense of your (BIG) data !
Elasticsearch sur Azure : Make sense of your (BIG) data !Elasticsearch sur Azure : Make sense of your (BIG) data !
Elasticsearch sur Azure : Make sense of your (BIG) data !
 
Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.Your Library Sucks, and why you should use it.
Your Library Sucks, and why you should use it.
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
 
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
JavaScript ES6
JavaScript ES6JavaScript ES6
JavaScript ES6
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Html5 Overview
Html5 OverviewHtml5 Overview
Html5 Overview
 
ClickHouse new features and development roadmap, by Aleksei Milovidov
ClickHouse new features and development roadmap, by Aleksei MilovidovClickHouse new features and development roadmap, by Aleksei Milovidov
ClickHouse new features and development roadmap, by Aleksei Milovidov
 
Python training for beginners
Python training for beginnersPython training for beginners
Python training for beginners
 

Dernier

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 

Dernier (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

My life as a beekeeper

  • 1. My life as a beekeeper @89clouds
  • 2. Who am I? Pedro Figueiredo (pfig@89clouds.com) Hadoop et al SocialFacebook games, media (TV, publishing) Elastic MapReduce, Cloudera NoSQL, as in “Not a SQL guy”
  • 3. The problem with Hive It looks like SQL
  • 4. No, seriously SELECT CONCAT(vishi,vislo), SUM( CASE WHEN searchengine = 'google' THEN 1 ELSE 0 END ) AS google_searches FROM omniture WHERE year(hittime) = 2011 AND month(hittime) = 8 AND is_search = 'Y' GROUP BY CONCAT(vishi,vislo);
  • 5. “It’s just like Oracle!” Analysts will be very happy At least until they join with that 30 billion-record table Pro tip: explain MapReduce and then MAPJOIN set hive.mapjoin.smalltable.filesize=xxx;
  • 6. Your first interview question “Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
  • 7. Dynamic partitions Partitions are the poor person’s indexes Unstructured data is full of surprises set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.dynamic.partition=true; set hive.exec.max.dynamic.partitions=100000; set hive.exec.max.dynamic.partitions.pernode=100000; Plan your partitions ahead
  • 8. Multi-vitamins You can minimise input scans by using multi-table INSERTs: FROM input INSERT INTO TABLE output1 SELECT foo INSERT INTO TABLE output2 SELECT bar;
  • 9. Persistence, do you speak it? External Hive metastore Avoid the pain of cluster set up Use an RDS metastore if on AWS, RDBMS otherwise. 10GB will get you a long way, this thing is tiny
  • 10. Now you have 2 problems Regular expressions are great, if you’re using a real programming language. WHERE foo RLIKE ‘(a|b|c)’ will hurt WHERE foo=‘a’ OR foo=‘b’ OR foo=‘c’ Generate these statements, if needs be, it will pay off.
  • 11. Avro Serialisation framework (think Thrift/Protocol Buffers). Avro container files are SequenceFile-like, splittable. Support for snappy built-in. If using the LinkedIn SerDe, the table creation syntax changes.
  • 12. Avro CREATE EXTERNAL TABLE IF NOT EXISTS mytable PARTITIONED BY (ds STRING) ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe' WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/ hadoop/avro/myschema.avsc') STORED AS INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat' OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat' LOCATION '/data/mytable' ;
  • 13. MAKE! MONEY! FAST! Use spot instances in EMR Usually stick around until America wakes up Brilliant for worker nodes
  • 14. Bag of tricks set hive.optimize.s3.query=true; set hive.cli.print.header=true; set hive.exec.max.created.files=xxx; set mapred.reduce.tasks=xxx; hive.exec.compress.intermediate=true; hive.exec.parallel=true;
  • 15. Bag of tricks set hive.optimize.s3.query=true; set hive.cli.print.header=true; set hive.exec.max.created.files=xxx; set mapred.reduce.tasks=xxx; hive.exec.compress.intermediate=true; hive.exec.parallel=true;
  • 16. Bag of tricks set hive.optimize.s3.query=true; set hive.cli.print.header=true; set hive.exec.max.created.files=xxx; set mapred.reduce.tasks=xxx; hive.exec.compress.intermediate=true; hive.exec.parallel=true;
  • 17. Bag of tricks set hive.optimize.s3.query=true; set hive.cli.print.header=true; set hive.exec.max.created.files=xxx; set mapred.reduce.tasks=xxx; hive.exec.compress.intermediate=true; hive.exec.parallel=true;
  • 18. Bag of tricks set hive.optimize.s3.query=true; set hive.cli.print.header=true; set hive.exec.max.created.files=xxx; set mapred.reduce.tasks=xxx; hive.exec.compress.intermediate=true; hive.exec.parallel=true;
  • 19. Bag of tricks set hive.optimize.s3.query=true; set hive.cli.print.header=true; set hive.exec.max.created.files=xxx; set mapred.reduce.tasks=xxx; hive.exec.compress.intermediate=true; hive.exec.parallel=true;
  • 20. To be or not to be “Consider a traditional RDBMS” At what size should we do this? Hive is not an end, it’s the means Data on HDFS/S3 is simply available, not “available to Hive” Hive isn’t suitable for near real time
  • 21. Hive != MapReduce Don’t use Hive instead of Native/ Streaming “I know, I’ll just stream this bit through a shell script!” Imo, Hive excels at analysis and aggregation, so use it for that
  • 22. Thank you Fred Easey (@poppa_f) Peter Hanlon
  • 23. Questions? pfig@89clouds.com @pfig / @89clouds http://89clouds.com/

Notes de l'éditeur

  1. \n
  2. \n
  3. \n
  4. \n
  5. https://www.facebook.com/note.php?note_id=470667928919\n“Currently, if the total size of small tables is larger than 25MB, then the conditional task will choose the original common join to run. 25MB is a very conservative number and you can change this number with set hive.smalltable.filesize=30000000”\nSELECT /* +mapjoin(f,b,g) */\nset hive.auto.convert.join = true;\nhive.smalltable.filesize, depending on version\nset hive.mapjoin.localtask.max.memory.usage = 0.999;\n\n
  6. \n
  7. Also, there’s no UPDATE, you can only overwrite a whole table, so use partitions\ne.g., 20 games with 40 events with 5 attrs on average, per day (date=/game=/event=/attr=): 1.46M partitions per year (4000/day)\nSET hive.exec.max.dynamic.partitions=100000;\nSET hive.exec.max.dynamic.partitions.pernode=100000;\navoid RECOVER PARTITIONS, generate a partition list and add them statically, or use a persistent metastore\n
  8. Or INSERT OVERWRITE. Append (INSERT INTO) only available from 0.8 onwards\nObviously works with partitions, static (with the value in the INSERT statement) or dynamic, but:\nThe dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause\n
  9. \n
  10. \n
  11. Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.\nNo manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.\nThe schema (defined in JSON) is included in the data files\nHive >= 0.9.1\n\n
  12. The new SerDe uses TBLPROPERTIES and avro.schema.url / literal. Another property is\norg.apache.hadoop.hive.serde2.avro.AvroSerDe\nAlso, the statement order is important!\nOne more thing: 1.6.x won’t read files created with 1.7.x. CDH3 up to u3 comes with 1.6.0, so be conservative\n
  13. Look at the historical prices, bid above it\nRegular price: $0.38, spot: $0.03\n
  14. These give you the number of slots per node, adjust the above accordingly:\nmapred.tasktracker.map.tasks.maximum\nmapred.tasktracker.reduce.tasks.maximum\nWatch the memory you give the JVM if you change these.\nmapred.output.compress.*\nhive.exec.parallel.thread.number\nhttps://cwiki.apache.org/confluence/display/Hive/Configuration+Properties\n
  15. These give you the number of slots per node, adjust the above accordingly:\nmapred.tasktracker.map.tasks.maximum\nmapred.tasktracker.reduce.tasks.maximum\nWatch the memory you give the JVM if you change these.\nmapred.output.compress.*\nhive.exec.parallel.thread.number\nhttps://cwiki.apache.org/confluence/display/Hive/Configuration+Properties\n
  16. These give you the number of slots per node, adjust the above accordingly:\nmapred.tasktracker.map.tasks.maximum\nmapred.tasktracker.reduce.tasks.maximum\nWatch the memory you give the JVM if you change these.\nmapred.output.compress.*\nhive.exec.parallel.thread.number\nhttps://cwiki.apache.org/confluence/display/Hive/Configuration+Properties\n
  17. These give you the number of slots per node, adjust the above accordingly:\nmapred.tasktracker.map.tasks.maximum\nmapred.tasktracker.reduce.tasks.maximum\nWatch the memory you give the JVM if you change these.\nmapred.output.compress.*\nhive.exec.parallel.thread.number\nhttps://cwiki.apache.org/confluence/display/Hive/Configuration+Properties\n
  18. These give you the number of slots per node, adjust the above accordingly:\nmapred.tasktracker.map.tasks.maximum\nmapred.tasktracker.reduce.tasks.maximum\nWatch the memory you give the JVM if you change these.\nmapred.output.compress.*\nhive.exec.parallel.thread.number\nhttps://cwiki.apache.org/confluence/display/Hive/Configuration+Properties\n
  19. These give you the number of slots per node, adjust the above accordingly:\nmapred.tasktracker.map.tasks.maximum\nmapred.tasktracker.reduce.tasks.maximum\nWatch the memory you give the JVM if you change these.\nmapred.output.compress.*\nhive.exec.parallel.thread.number\nhttps://cwiki.apache.org/confluence/display/Hive/Configuration+Properties\n
  20. When using an RDBMS, it’s much harder to get at your data from other tools\n
  21. Convoluted, long-winded code\nReporting is hard\n
  22. \n
  23. \n