Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)

Apache Tajo :
A Big Data Warehouse System
on Hadoop
Jaehwa Jung Research Director
Gruter TECHDAY 2014

©2014 Gruter. All rights reserved.
About me
• Bigdata Platform, Gruter Inc (http://www.gruter.com)
• Apache Tajo Committer
• jhjung@gruter.com
• http://blrunner.com
• Author of the book 시작하세요! 하둡 프로그래밍 ("Getting Started with Hadoop Programming")

Agenda
• Introduction to SQL-on-Hadoop
• Introduction to Apache Tajo
• What can you do with Tajo?
• Why should you use Tajo?
• Current Tajo Status
• Use Cases
• Demonstration

Hadoop Overview
MapReduce
(Distributed computation)
HDFS
(Distributed storage)
Source: http://www.quuxlabs.com/wp-content/uploads/2010/08/Yahoo-hadoop-cluster_OSCON_2007.jpg

SQL-on-Hadoop Overview
• Systems that process data stored in HDFS using SQL
• A move away from the MapReduce model
• Varied design goals: data warehouse vs. query engine

Apache Tajo Overview
• A big data warehouse system on Hadoop
• Apache Top-level project since March 2014
• Supports SQL standards
• Features
– Powerful distributed processing architecture (Not MapReduce)
– Advanced query optimization algorithms and techniques
– Long running queries : for many hours
– Interactive analysis queries : from 100 milliseconds
• Recent 0.9.0 release

Tajo Architecture
[Figure: Tajo architecture]
• Client (JDBC, TSql, Web UI): submits a query to the TajoMaster
• Master Server (TajoMaster): manages metadata and allocates a query to a QueryMaster
• CatalogStore: DBMS, HCatalog
• Slave Servers (three shown): each runs a TajoWorker with a QueryMaster, a Local Query Engine, and a StorageManager over the local file system and HDFS
• QueryMaster: runs and monitors a query

Commercial Data Warehouse
[Figure: commercial data warehouse pipeline]
• Source Data: OLTP, CRM, ERP, ecommerce, Other
• Data Warehouse: ODS (Operational Data Store), Data Warehouse, Data Mart, linked by ETL steps
• Front-End Analytics: OLAP, Visualization, Reports, Data Mining

Hadoop based Data Warehouse with Tajo
We can do ETL and Interactive Analytics!
[Figure: Hadoop-based data warehouse pipeline with Tajo]
• Source Data: OLTP, CRM, ERP, ecommerce, Other
• Data Warehouse: ODS (Operational Data Store), Data Warehouse, Data Mart, linked by ETL steps
• Front-End Analytics: Reports, OLAP, Visualization, Data Mining

Mature SQL Feature Set
• Fully distributed query executions
– Inner join, and left/right/full outer join
– Group by, sort, multiple distinct aggregation
– Window functions
• SQL data types
– CHAR, BOOL, INT, DOUBLE, TEXT, DATE, etc.
• Various file formats
– Text file (CSV), SequenceFile, RCFile, Parquet, Avro
• SQL Standards
– Non-standard features from PostgreSQL (PgSQL) and Oracle

Performance
• Faster than Hive 0.10 (1.5 – 10 times): http://slidesha.re/1yTBTaa
• Data Set : TPC-H Scale 100 or 1000
• H/W : 1 master + 6 data nodes
– CPU: 24 cores (Xeon 2.5GHz, HT)
– Memory: 64GB
– Disk: 3TB * 6 SATA/HDD (7200 RPM)
– Network: 10Gb
• S/W
– Hadoop: cdh-4.3.0
– Hive: 0.10.0-cdh4.3.0
– Impala: impalad_version_1.1.1_RELEASE
– Tajo: 0.2-SNAPSHOT

Performance: Q1 – filter scan
select l_returnflag, l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
  sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where l_shipdate <= '1998-09-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus
[Chart] Q1: scan using about 20 text pattern matching filters
Hive: 1445.69, Impala: 895.96, Tajo: 789.09

Performance: Q2 – unions and joins
create table nation_region as
select n_regionkey, r_regionkey, n_nationkey, n_name, r_name
from region join nation on n_regionkey = r_regionkey
where r_name = 'EUROPE';

create table r2_1 as
select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address,
  s_phone, s_comment, ps_supplycost
from nation_region
  join supplier on s_nationkey = n_nationkey
  join partsupp on s_suppkey = ps_suppkey
  join part on p_partkey = ps_partkey
where p_size = 15 and p_type like '%BRASS';

create table r2_2 as
select p_partkey, min(ps_supplycost) as min_ps_supplycost
from r2_1
group by p_partkey;

select s_acctbal, s_name, n_name, r2_1.p_partkey, p_mfgr, s_address,
  s_phone, s_comment
from r2_1 join r2_2 on r2_1.p_partkey = r2_2.p_partkey
where ps_supplycost = min_ps_supplycost
order by s_acctbal, n_name, s_name, r2_1.p_partkey;
[Chart] Q2: 7 unions with joins
Hive: 63.64, Impala: 9.11, Tajo: 38.64

Performance: Q3 - join
select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue,
  o_orderdate, o_shippriority
from customer as c
  join orders as o
    on c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey
  join lineitem as l
    on l.l_orderkey = o.o_orderkey
where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate;
[Chart] Q3: join
Hive: 101.45, Impala: 36.81, Tajo: 31.92
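
The three charts can be turned into speedup ratios with a few lines of arithmetic. The sketch below assumes the chart values map to Hive, Impala, and Tajo in the order they are listed on each slide, and that lower is better; the slides do not state the unit, so only ratios are computed. The variable names are illustrative.

```python
# Chart values from the Q1-Q3 slides.
# Assumption: values are listed in the order Hive, Impala, Tajo.
timings = {
    "Q1": {"Hive": 1445.69, "Impala": 895.96, "Tajo": 789.09},
    "Q2": {"Hive": 63.64, "Impala": 9.11, "Tajo": 38.64},
    "Q3": {"Hive": 101.45, "Impala": 36.81, "Tajo": 31.92},
}

# Ratio of Hive's value to Tajo's value for each query (unitless).
speedup_vs_hive = {q: t["Hive"] / t["Tajo"] for q, t in timings.items()}

for query, ratio in speedup_vs_hive.items():
    print(f"{query}: Hive/Tajo ratio = {ratio:.2f}")
```

Under these assumptions, all three ratios land at the lower end of the 1.5 to 10 times range quoted on the Performance slide.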

Simple Operation and Software Stack
• Simple Installation and Operation
– http://tajo.apache.org/docs/current/getting_started.html
• Simple Software Stack Requirement
– No MapReduce and no Tez
– YARN is supported but not mandatory
– Tajo + a Linux system for a single-node cluster
– Tajo + HDFS for a distributed cluster

Simple Integration
• Integration with Hadoop Ecosystem
– Hadoop 2.2.0 – 2.5.1 support
– Can connect to the Hive Metastore
– Directly processes tables managed by Hive
• YARN support (backport)
– Enables Tajo to be deployed and run on a YARN cluster
– Allows users to add nodes to or remove nodes from a Tajo cluster at runtime

Active Open Source Community
• Fully community-driven open source
• Stable development team
– 17 committers + many contributors

Join
• Join
– NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
– SEMI, ANTI Join (planned for v0.9)
• Join Predicates
– WHERE and ON predicates
– De facto standard outer join behavior with both predicates
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
WHERE t2.value = 'xxx';
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
AND t2.value = 'xxx';
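
The difference between filtering the right-hand table in WHERE and filtering it in ON is easiest to see on a tiny data set. The sketch below runs both forms against an in-memory SQLite database, used here only as a stand-in that follows the same standard outer join semantics; it is not Tajo, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine (assumption: not Tajo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (num INTEGER, name TEXT);
    CREATE TABLE t2 (num INTEGER, value TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO t2 VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz');
""")

# Filter in WHERE: applied after the join, so unmatched (NULL) rows are dropped.
where_rows = conn.execute(
    "SELECT t1.num, t2.value FROM t1 LEFT JOIN t2 ON t1.num = t2.num "
    "WHERE t2.value = 'xxx' ORDER BY t1.num"
).fetchall()

# Filter in ON: part of the match condition, so every t1 row is preserved.
on_rows = conn.execute(
    "SELECT t1.num, t2.value FROM t1 LEFT JOIN t2 "
    "ON t1.num = t2.num AND t2.value = 'xxx' ORDER BY t1.num"
).fetchall()

print(where_rows)  # [(1, 'xxx')]
print(on_rows)     # [(1, 'xxx'), (2, None), (3, None)]
```

With the filter in WHERE the left join effectively behaves like an inner join, while the ON form keeps all rows of t1 and fills the t2 columns with NULL where the condition fails.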

Window Function
• OVER clause
– row_number() and rank()
– Aggregation function support
– PARTITION and ORDER BY clause
SELECT depname, empno, salary, enroll_date
FROM (
  SELECT depname, empno, salary, enroll_date,
    rank() OVER (PARTITION BY depname
                 ORDER BY salary DESC, empno) AS pos
  FROM empsalary
) AS ss
WHERE pos < 3;
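
To see what the rank() query returns, here is a minimal sketch that runs a trimmed version of it (without enroll_date, and with an added ORDER BY for a stable result) on made-up rows. It uses Python's built-in sqlite3 as a stand-in engine, assuming the bundled SQLite is version 3.25 or later, which is when window functions were added; it does not exercise Tajo itself.

```python
import sqlite3

# In-memory SQLite as a stand-in engine (assumption: SQLite >= 3.25, not Tajo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE empsalary (depname TEXT, empno INTEGER, salary INTEGER);
    INSERT INTO empsalary VALUES
        ('develop', 7, 4200), ('develop', 8, 6000), ('develop', 9, 4500),
        ('develop', 10, 5200), ('develop', 11, 5200),
        ('personnel', 2, 3900), ('personnel', 5, 3500),
        ('sales', 1, 5000), ('sales', 3, 4800), ('sales', 4, 4800);
""")

# Rank employees within each department by salary (highest first, empno as
# tie-breaker) and keep the top two per department.
top_two = conn.execute("""
    SELECT depname, empno, salary
    FROM (
        SELECT depname, empno, salary,
               rank() OVER (PARTITION BY depname
                            ORDER BY salary DESC, empno) AS pos
        FROM empsalary
    ) AS ss
    WHERE pos < 3
    ORDER BY depname, pos
""").fetchall()

for row in top_two:
    print(row)
```

Because empno is part of the window ordering, two employees with the same salary never share a rank, so each department yields exactly two rows here.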

Table Partitions
• Column Value Partition
– Hive Compatible Partition
CREATE TABLE T1 (C1 INT, C2 TEXT)
USING PARQUET
WITH ('parquet.compression' = 'SNAPPY')
PARTITION BY COLUMN (C3 INT, C4 TEXT);
• Range Partition (planned for 1.0)
– Table will be partitioned by disjoint ranges.
– Will remove the partition granularity problem of Hive Partition
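
Hive-style column partitioning stores each distinct combination of partition column values in its own directory, named key=value per column. The sketch below illustrates that layout for the C3 and C4 partition columns; the helper `partition_path`, the warehouse root, and the sample rows are all hypothetical and only meant to show the idea, not Tajo's actual storage code.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_path(table_root, partition_columns, row):
    """Build a Hive-style directory path such as root/c3=10/c4=kr for a row."""
    parts = [f"{col}={row[col]}" for col in partition_columns]
    return PurePosixPath(table_root, *parts)


# Hypothetical rows for table T1 (c1, c2 are data columns; c3, c4 partition keys).
rows = [
    {"c1": 1, "c2": "x", "c3": 10, "c4": "kr"},
    {"c1": 2, "c2": "y", "c3": 10, "c4": "kr"},
    {"c1": 3, "c2": "z", "c3": 20, "c4": "us"},
]

# Group the data columns of each row under its partition directory.
files = defaultdict(list)
for row in rows:
    path = partition_path("/warehouse/t1", ["c3", "c4"], row)
    files[str(path)].append((row["c1"], row["c2"]))

for path, data in sorted(files.items()):
    print(path, data)
```

A query that filters on c3 or c4 only needs to read the matching directories, which is also why a column with many distinct values leads to the granularity problem the Range Partition bullet mentions.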

Comparison with other platform (1/2)
Function | Tajo | Hive | Impala | Spark
Computing | Own engine | MapReduce or Tez | Own engine | Own engine
Resource Management | Own or YARN | YARN | Own | Own or YARN
Scheduler | FIFO, Fair | FIFO, Fair, Capacity | FIFO, Fair | FIFO, Fair
Storage | HDFS, S3, HBase | HDFS, HBase, S3 | HDFS, HBase | Own RDD (HDFS, etc.)
File Format | CSV, RC, Parquet, Avro, etc. | CSV, RC, ORC, Parquet, Avro, etc. | CSV, RC, Parquet, Avro, etc. | CSV, RC, Parquet, Avro, etc.
Data Model | Relational | Relational | Relational | Relational
Query | ANSI-SQL | HiveQL | HiveQL | HiveQL

Comparison with other platform (2/2)
Function | Tajo | Hive | Impala | Spark
Implementation Language | Java | Java | C++ | Scala
Client | Java API, JDBC, CLI | CLI, JDBC, ODBC, Thrift Server API | CLI, JDBC, ODBC | Shark JDBC/ODBC, Scala, Java, Python API
Query Latency | Long run, Interactive | Long run, (Interactive-Tez) | Interactive | Interactive
Computing Characteristics | Data on disk; intermediate data uses both memory and disk | Data on disk; intermediate data uses both memory and disk | Intermediate data in memory (on-disk recently supported) | Data to be analyzed is loaded into memory
License | Apache | Apache | Apache | Apache
Main Sponsor | Gruter | Hortonworks | Cloudera | Databricks

Future Works
• 2014 4Q
– HBase integration
– In/Exists SubQuery
– User defined function
– Multi-tenant Scheduler
• 2015 1Q
– Authentication and Standard Access Control
– Scalar SubQuery
– ROLLUP, CUBE
• 2015 2Q
– Vectorized Engine(C++ Operator)
– TajoR

Replace Commercial Data Warehouse (SKT)
• ETL Processing: 120+ queries, ~4TB read/day
• OLAP Processing: 500+ queries
[Figure: SKT data warehouse layers]
• Layers: Operational Systems, Integration Layer, Data Warehouse, Data Mart
• Components: Marketing, Sales, ERP, SCM, ODS, Staging Area, Strategic Marts, Data Vault

Tajo-as-a-Service on AWS

TSql & Web UI
Watch this video for Apache Tajo:
http://www.youtube.com/watch?v=bFGjMLPEDq0

Get Involved!
• We are recruiting contributors!
• General
– http://tajo.apache.org
• Korean Tajo User Group
– https://groups.google.com/forum/?hl=ko#!forum/tajo-user-kr
• Getting Started
– http://tajo.apache.org/docs/0.9.0/getting_started.html
• Downloads
– http://tajo.apache.org/docs/0.9.0/getting_started/downloading_source.html
• Jira – Issue Tracker
– https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
– dev-subscribe@tajo.apache.org
– issues-subscribe@tajo.apache.org
