Shark attack on SQL-on-Hadoop: talk at Berlin Buzzwords 2014

This document discusses SQL-on-Hadoop tools like Shark and SparkSQL. Shark sits on top of Apache Spark and is tightly coupled with Hive, using Hive statements and metadata. It provides faster performance than Hive due to in-memory processing. SparkSQL is a new tool that is not dependent on Hive and uses a new SchemaRDD. The document recommends using columnar file formats like Parquet for better performance and disk usage and provides a hands-on demonstration comparing file formats and query execution times in Hive, Impala and Shark.

Shark attack on SQL-on-Hadoop
Gerd König
May 27th, 2014
(Big) Data Engineer

Agenda
● SQL-on-Hadoop
○ Why the hype?
○ Tool overview and comparison
○ File format matters
● Shark
○ Facts & figures
○ What makes the difference?
○ SparkSQL enters the playground
● Hands-On (quick ‘n dirty)
○ File formats & disk usage
○ Execution times (at a rough estimate) / Benchmarking
● Summary

SQL-on-Hadoop - Why the hype?
● Hadoop is widely accepted as “new technology”
● Hadoop is becoming more and more enterprise-ready
● SQL has been a well-established language for many years
and is used by DB developers as well as Business
Analysts
=> Huge demand for SQL(-like) access to Hadoop

SQL-on-Hadoop
A whole bunch of tools (just an excerpt)

SQL-on-Hadoop
Clustering some tools

Shark - Facts & figures
● sits on top of Apache Spark
● is tightly coupled with Hive and uses a slightly modified version of it
● uses Hive statements, UDFs and the Hive metastore (HCatalog)
● can be run in the Shark shell as well as via Shark Server (connect e.g.
via the beeline JDBC client)

Shark / SparkSQL
● What makes the difference?
○ Performance increase due to in-memory processing (‘low-latency
M/R’)
○ Interaction with other “plugins” of the Spark stack, like the ML library,
e.g. call ML functions directly on your SQL result set:
// run a SQL query in the Shark shell and get the result back as an RDD
val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
println(youngUsers.count)
// extractFeatures and kmeans are not defined on this slide; they stand for your own code
val featureMatrix = youngUsers.map(extractFeatures(_))
kmeans(featureMatrix)
● SparkSQL - A new star is born?
○ no dependency on Hive, new type of RDD: “SchemaRDD”
○ runs SQL against RDDs, Parquet files and Hive (via wrappers)

File format matters
● An appropriate file format influences
○ performance, and
○ used disk space
● Use a columnar storage format for columnar data(bases)
○ RCFile, ORC, Parquet
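
To illustrate why a columnar layout can help both performance and disk usage, here is a minimal sketch in plain Scala. It is a toy in-memory model, not how RCFile, ORC or Parquet are implemented; the names ColumnarSketch, User, UserColumns and the helper functions are hypothetical and do not come from the talk. The run-length encoding at the end stands in for the kind of per-column compression that columnar formats can apply when a column contains long runs of equal values.

```scala
// Hypothetical illustration only: a toy in-memory model of row vs. columnar layout.
// None of these names come from the talk, from Hive, or from Parquet.
object ColumnarSketch {
  // Row-oriented layout: each record keeps all of its fields together.
  final case class User(id: Int, country: String, age: Int)

  // Column-oriented layout: one vector per field.
  final case class UserColumns(ids: Vector[Int], countries: Vector[String], ages: Vector[Int])

  def toColumns(rows: Seq[User]): UserColumns =
    UserColumns(rows.map(_.id).toVector, rows.map(_.country).toVector, rows.map(_.age).toVector)

  // "SELECT count(*) FROM users WHERE age < limit" on the row layout:
  // every complete record is visited, although only one field is needed.
  def countYoungRows(rows: Seq[User], limit: Int): Int =
    rows.count(_.age < limit)

  // The same query on the columnar layout: only the age column is touched.
  def countYoungColumns(cols: UserColumns, limit: Int): Int =
    cols.ages.count(_ < limit)

  // Run-length encoding of a single column: consecutive equal values
  // collapse into (value, runLength) pairs.
  def runLengthEncode[A](column: Seq[A]): List[(A, Int)] =
    column.foldLeft(List.empty[(A, Int)]) {
      case ((value, n) :: rest, x) if value == x => (value, n + 1) :: rest
      case (acc, x)                              => (x, 1) :: acc
    }.reverse
}
```

Both count functions return the same answer; the difference is how much data has to be read to get it. On disk, a query that needs one column out of many can skip the others entirely in a columnar file, which is the effect the hands-on part compares against a flat file.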

Hands-On
● Part I
○ compare Parquet based table vs. flat file
● Part II
○ execute 1 query in Hive, Impala and Shark
○ get a feeling for the runtime...
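
For a rough runtime comparison like the one in Part II, a small timing helper is often enough. The sketch below is an assumption about how one might measure this, not the procedure used in the talk; TimingSketch, timed and medianMillis are hypothetical names. It measures wall-clock time around an arbitrary block, so in practice the block would submit the same query to Hive, Impala or Shark through whatever client is available.

```scala
// Hypothetical helper for rough wall-clock measurements; not taken from the talk.
object TimingSketch {
  // Runs `body` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  // Runs `body` several times and returns the median elapsed time,
  // which is less sensitive to a single slow (e.g. cold-cache) run.
  def medianMillis(runs: Int)(body: => Any): Long = {
    require(runs > 0, "runs must be positive")
    val samples = Vector.fill(runs)(timed(body)._2).sorted
    samples(samples.length / 2)
  }
}
```

A usage example with a stand-in workload: `val (sum, ms) = TimingSketch.timed { (1 to 1000).sum }`. Numbers obtained this way include client and network overhead, so they only give a feeling for the runtime rather than a benchmark.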

Further information
● Detailed benchmarks by Berkeley AMPLab:
https://amplab.cs.berkeley.edu/benchmark/
● Shark
http://shark.cs.berkeley.edu/
● SparkSQL
https://github.com/apache/spark/tree/master/sql
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

Thanks for your attention!
Gerd König
gerd.koenig@ymc.ch
Tel. +41 (0)71 508 24 74
@gerd_koenig
ch.linkedin.com/in/gerdkoenig
Q&A
