2. ImpalaToGo is required if ...
You have more than a hundred gigabytes of data in the cloud.
You want to slice and dice this dataset and look
for anomalies.
You cannot predict the queries in advance.
You just need brute force to query raw data.
3. Elastic solution required
It is hardly profitable to do big data analytics on a non-elastic setup.
Slicing and dicing 1 TB of data interactively requires dozens of dedicated servers.
4. The gain from elasticity
A 50-server cluster with a scan rate of about 40 GB/sec costs about $12,000 a month (m3.2xlarge reserved instances), or $28 per hour.
By running the cluster only for the 1-2 hours a day when it is needed, you save about $10K a month.
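A back-of-the-envelope check of these numbers, as a minimal Python sketch (the figures come from the slide above; actual instance pricing varies by region and over time):

    # Back-of-the-envelope check of the slide's figures; pricing is
    # illustrative and changes over time.
    always_on_monthly = 12_000   # 50 x m3.2xlarge reserved instances, USD/month
    on_demand_hourly  = 28       # the same 50-node cluster, USD/hour
    hours_per_day     = 2        # cluster is up only while analysts query it
    days_per_month    = 30

    elastic_monthly = on_demand_hourly * hours_per_day * days_per_month
    savings = always_on_monthly - elastic_monthly

    print(f"elastic cluster: ${elastic_monthly:,}/month")  # $1,680/month
    print(f"savings:         ${savings:,}/month")          # ~$10,000/month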
5. What is an elastic database?
Easy to spawn and resize a cluster, in a matter of minutes.
Works efficiently with cloud data storage: we do not want an ETL step for every session.
6. Cloud storage dilemma
On one hand, object storage like S3 is a perfect place for data: no issues with size or with accessibility from other machines.
On the other hand, object store access is slow.
7. ImpalaToGo introduction
ImpalaToGo is an MPP (massively parallel processing) database built on top of Cloudera Impala.
ImpalaToGo removes the need for local HDFS, replacing it with S3 (or another remote DFS) and using local drives for caching.
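Since ImpalaToGo is built on Impala, it presumably keeps Impala's standard SQL interface; the minimal sketch below assumes the usual HiveServer2 port (21050), a hypothetical coordinator host, and a hypothetical events table, and uses the impyla Python client:

    # Query sketch, assuming ImpalaToGo exposes Impala's standard HiveServer2
    # interface; the host and table names here are hypothetical.
    from impala.dbapi import connect

    conn = connect(host="impalatogo-coordinator.example.com", port=21050)
    cur = conn.cursor()
    # The data itself stays in S3; local drives only cache it, so the query
    # looks the same whether the working set is already cached or not.
    cur.execute("SELECT action, COUNT(*) FROM events GROUP BY action")
    for row in cur.fetchall():
        print(row)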
9. S3 + open format = no ETL
You produce a file in one of the supported formats and put it into an S3 bucket. CSV is the easiest to create.
The formats are open and usable by other frameworks such as Spark.
CSV, Parquet, Avro files in an S3 bucket
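As a concrete illustration of the "no ETL" point, here is a minimal sketch that lands a CSV file in S3 along with external-table DDL that could expose it; the bucket name, key, schema, and s3a:// location style are assumptions for illustration only:

    # Sketch: put a CSV file into an S3 bucket and expose it as a table.
    # Bucket, key, and schema are made up for illustration.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file("events.csv", "my-analytics-bucket", "events/events.csv")

    # Hypothetical external-table DDL (Impala-style syntax, s3a:// location);
    # the data stays in open CSV format, so Spark and other frameworks can
    # read the same files directly.
    ddl = """
    CREATE EXTERNAL TABLE events (
      ts      TIMESTAMP,
      user_id BIGINT,
      action  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3a://my-analytics-bucket/events/'
    """
    print(ddl)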
10. Local drive = best cache
ImpalaToGo uses local SSD drives for its cache.
The local SSDs hold the hot data set.
No space is wasted on replication: it is just a cache.
SSDs are fast enough to keep the CPUs busy.
Caching layer on local SSD drives
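The caching idea itself is simple; the following is a conceptual sketch only (not ImpalaToGo's actual cache code), assuming a hypothetical local SSD mount point and boto3 for S3 access:

    # Conceptual read-through cache sketch (not ImpalaToGo's real implementation).
    # S3 stays the source of truth; the local SSD only holds hot data, so losing
    # a node loses nothing but cached copies.
    import os
    import boto3

    CACHE_DIR = "/mnt/ssd/cache"        # hypothetical local SSD mount point
    s3 = boto3.client("s3")

    def read_object(bucket: str, key: str) -> bytes:
        local_path = os.path.join(CACHE_DIR, bucket, key)
        if not os.path.exists(local_path):                # cache miss
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(bucket, key, local_path)     # warm the cache
        with open(local_path, "rb") as f:                 # serve from local SSD
            return f.read()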
11. No storage = elasticity
Since the ImpalaToGo cluster only caches data from S3, there is no possibility of data loss, and the cluster is easy to resize.
Adding a node takes 1-2 minutes; most of this time is spent waiting for the instance to start.
Removing a node is instant.
ImpalaToGo cluster
12. Why do we need resize?
It is almost impossible to predict how long an ad-hoc query will take.
Different queries on the same data can easily differ by 10-100x in computation and memory requirements.
13. Competition
The main competitors are:
- Commercial MPP databases like Vertica, ParAccel, etc.
- Redshift
- Hadoop in the form of CDH, EMR
- SparkSQL, Presto, Hive
- Snowflake
14. Commercial MPP
They store data in a proprietary format, so an ETL process is required.
They have their own storage layer, so they are not elastic.
They may be more efficient than the Impala engine on some queries.
15. Amazon Redshift
Redshift is an efficient columnar database deployed and managed by Amazon. In many cases it is faster than ImpalaToGo.
Main drawbacks compared with ImpalaToGo:
- Lock-in to Amazon
- Resizing takes hours to days
- No UDF support
16. Hadoop CDH & EMR
Today, you can deploy a Hadoop cluster and either manually cache data from S3 or wait for S3 access on every query.
Once Impala gains the ability to work efficiently with S3 this will become a viable option, but:
- it requires Hadoop skills
- it is less elastic, because of HDFS
17. SparkSQL, Hive, Presto etc
SparkSQL, Hive, Presto, and Drill are JVM-based, so they cannot match native engines like Impala, Vertica, etc. on raw speed.
- Slower than ImpalaToGo
- Hard to make use of big heaps (JVM garbage collection overhead)
18. Snowflake
Snowflake is very similar to ImpalaToGo in terms of architecture: both store columnar data in S3 and both run elastic clusters.
- Snowflake is proprietary software
- Data is stored in a proprietary format
19. Have more questions?
Please write to David Gruzman
david@bigdatacraft.com
Want to try it? Visit http://impala2go.info