A Perfect Hive Query for a Perfect Meeting
DataWorks Summit
1.
Adam Kawa
2.
A deal was made!
6.
Martin will invite Adam and Timbuktu, my favourite Swedish artist, for a beer or coke or whatever to drink. * by Martin
8.
Question
9.
Question Answers
10.
Data will tell the truth!
12.
Why? by Adam
14.
Introduction
20.
HiveQL
30.
A line where I may have a bug?!
31.
HiveQL
32.
Verbose and complex Java code
41.
For Each Line
42.
For Each Line
43.
track.txt
44.
user.txt track.txt
45.
stream.txt user.txt track.txt
46.
expected.txt stream.txt
52.
Bee test Be happy!
53.
HiveQL
58.
Threshold
59.
Threshold
60.
Threshold
61.
Try and see
63.
HiveQL
66.
2 MapReduce jobs in total
67.
Runs many Map joins in a Map-Only job [HIVE-3784]
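A minimal sketch of session settings that, as far as the Hive configuration documentation describes, enable this behaviour in Hive 0.11 and later. The slides do not show the settings actually used, so the values below are assumptions for illustration only.

```sql
-- Assumed settings, not taken from the slides.
-- Let Hive convert common joins to map joins automatically.
SET hive.auto.convert.join=true;
-- Merge several map joins into one map-only task without a conditional task (added with HIVE-3784).
SET hive.auto.convert.join.noconditionaltask=true;
-- Combined size limit, in bytes, of the small tables held in memory (illustrative value, tune per cluster).
SET hive.auto.convert.join.noconditionaltask.size=10000000;
```

Running `EXPLAIN` on the query before and after changing these settings is one way to check whether the joins were actually merged into a single map-only stage.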
73.
Runs as a single MR job [HIVE-3952]
74.
2 MapReduce jobs in total
75.
HiveQL
81.
My query generates a small amount of intermediate data
88.
2 months of data: 50 min 2 sec. 10th place?
89.
Changes are needed!
90.
File Format
94.
16x
95.
3.5x
96.
32x
98.
Computation
100.
1.4x 2.4x
101.
8x
105.
Time
106.
The more congested the queue or cluster, the bigger the benefits of reuse
107.
No scheduling overhead to run a new Reduce task
108.
Thinner tasks help avoid stragglers
109.
Finished within 1.5 sec. Warm!
115.
Feature
117.
1.4x
120.
Feature
123.
14 months of data: 10 min 11 sec?
124.
Results
127.
That's all!