2. ImpalaToGo is required if ...
You have more than a hundred gigabytes of data in the cloud.
You want to slice and dice this dataset and look
for anomalies.
You cannot predict the queries in advance.
You just need brute force to query raw data.
3. Elastic solution required
It is hardly profitable to do big data analytics on a non-elastic setup.
Slicing and dicing 1 TB of data interactively requires dozens of dedicated servers.
4. The gain from elasticity
A 50-server cluster with a scan rate of about 40 GB/sec costs about $12,000 a month (m3.2xlarge reserved instances), or $28 per hour.
By running the cluster only for the 1-2 hours a day when it is needed, you save about $10K a month.
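A back-of-the-envelope check of these numbers, as a minimal Python sketch (the figures come from the slide above; actual instance pricing varies by region and over time):

    # Back-of-the-envelope check of the slide's figures; pricing is
    # illustrative and changes over time.
    always_on_monthly = 12_000   # 50 x m3.2xlarge reserved instances, USD/month
    on_demand_hourly  = 28       # the same 50-node cluster, USD/hour
    hours_per_day     = 2        # cluster is up only while analysts query it
    days_per_month    = 30

    elastic_monthly = on_demand_hourly * hours_per_day * days_per_month
    savings = always_on_monthly - elastic_monthly

    print(f"elastic cluster: ${elastic_monthly:,}/month")  # $1,680/month
    print(f"savings:         ${savings:,}/month")          # ~$10,000/month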
5. What is an elastic database?
Easy to spawn and resize a cluster, in a matter of minutes.
Works efficiently with cloud data storage: we do not want an ETL step for every session.
6. Cloud storage dilemma
On one hand, object storage like S3 is a perfect place for data: no issues with size or with accessibility from other machines.
On the other hand, object store access is slow.
7. ImpalaToGo introduction
ImpalaToGo is an MPP (massively parallel processing) database built on top of Cloudera Impala.
ImpalaToGo removes the need for local HDFS, replacing it with S3 (or another remote DFS) and using local drives for caching.
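Since ImpalaToGo is built on Impala, it presumably keeps Impala's standard SQL interface; the minimal sketch below assumes the usual HiveServer2 port (21050), a hypothetical coordinator host, and a hypothetical events table, and uses the impyla Python client:

    # Query sketch, assuming ImpalaToGo exposes Impala's standard HiveServer2
    # interface; the host and table names here are hypothetical.
    from impala.dbapi import connect

    conn = connect(host="impalatogo-coordinator.example.com", port=21050)
    cur = conn.cursor()
    # The data itself stays in S3; local drives only cache it, so the query
    # looks the same whether the working set is already cached or not.
    cur.execute("SELECT action, COUNT(*) FROM events GROUP BY action")
    for row in cur.fetchall():
        print(row)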
9. S3 + open format = no ETL
You produce a file in one of the supported formats and put it into an S3 bucket. CSV is the easiest to create.
The formats are open and usable by other frameworks such as Spark.
CSV, Parquet, Avro files in an S3 bucket
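As a concrete illustration of the "no ETL" point, here is a minimal sketch that lands a CSV file in S3 along with external-table DDL that could expose it; the bucket name, key, schema, and s3a:// location style are assumptions for illustration only:

    # Sketch: put a CSV file into an S3 bucket and expose it as a table.
    # Bucket, key, and schema are made up for illustration.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file("events.csv", "my-analytics-bucket", "events/events.csv")

    # Hypothetical external-table DDL (Impala-style syntax, s3a:// location);
    # the data stays in open CSV format, so Spark and other frameworks can
    # read the same files directly.
    ddl = """
    CREATE EXTERNAL TABLE events (
      ts      TIMESTAMP,
      user_id BIGINT,
      action  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3a://my-analytics-bucket/events/'
    """
    print(ddl)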
10. Local drive = best cache
ImpalaToGo uses local SSD drives for its cache.
The local SSDs hold the hot data set.
No space is wasted on replication: it is just a cache.
SSDs are fast enough to keep the CPUs busy.
Caching layer on local SSD drives
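The caching idea itself is simple; the following is a conceptual sketch only (not ImpalaToGo's actual cache code), assuming a hypothetical local SSD mount point and boto3 for S3 access:

    # Conceptual read-through cache sketch (not ImpalaToGo's real implementation).
    # S3 stays the source of truth; the local SSD only holds hot data, so losing
    # a node loses nothing but cached copies.
    import os
    import boto3

    CACHE_DIR = "/mnt/ssd/cache"        # hypothetical local SSD mount point
    s3 = boto3.client("s3")

    def read_object(bucket: str, key: str) -> bytes:
        local_path = os.path.join(CACHE_DIR, bucket, key)
        if not os.path.exists(local_path):                # cache miss
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(bucket, key, local_path)     # warm the cache
        with open(local_path, "rb") as f:                 # serve from local SSD
            return f.read()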
11. No storage = elasticity
Since the ImpalaToGo cluster only caches data from S3, there is no possibility of data loss, and the cluster is easy to resize.
Adding a node takes 1-2 minutes; most of this time is spent waiting for the instance to start.
Removing a node is instant.
ImpalaToGo cluster
12. Why do we need resize?
It is almost impossible to predict how long an ad-hoc query will take.
Different queries on the same data can easily differ by 10-100x in computation and memory requirements.
13. Competition
The main competitors are:
- Commercial MPP databases like Vertica, ParAccel, etc.
- Redshift
- Hadoop in the form of CDH, EMR
- SparkSQL, Presto, Hive
- Snowflake
14. Commercial MPP
They store data in a proprietary format, so an ETL process is required.
They have their own storage layer, so they are not elastic.
They may be more efficient than the Impala engine on some queries.
15. Amazon Redshift
Redshift is an efficient columnar database deployed and managed by Amazon. In many cases it is faster than ImpalaToGo.
Main drawbacks compared with ImpalaToGo:
- Lock-in to Amazon
- Resizing takes hours to days
- No UDF support
16. Hadoop CDH & EMR
Today, you can deploy a Hadoop cluster and either manually cache data from S3 or wait for S3 access on every query.
Once Impala gains the ability to work efficiently with S3 this will become a viable option, but:
- it requires Hadoop skills
- it is less elastic, because of HDFS
17. SparkSQL, Hive, Presto etc
SparkSQL, Hive, Presto, and Drill are JVM-based, so they cannot match native engines like Impala, Vertica, etc. on raw speed.
- Slower than ImpalaToGo
- Hard to make use of big heaps (JVM garbage collection overhead)
18. Snowflake
Snowflake is very similar to ImpalaToGo in terms of architecture: both store columnar data in S3 and both run elastic clusters.
- Snowflake is proprietary software
- Data is stored in a proprietary format
19. Have more questions?
Please write to David Gruzman
david@bigdatacraft.com
Want to try it? Visit http://impala2go.info