SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
Apache Tajo : A Big Data Warehouse Systemon Hadoop 
김형준 
babokim@gmail.com
©2014 Gruter. All rights reserved. 
Apache Tajo Overview 
•A big data warehouse system on Hadoop 
•Apache Top-level project since March 2014 
•Supports SQL standards 
•Features 
–Powerful distributed processing architecture (Not MapReduce) 
–Advanced query optimization algorithms and techniques 
–Long running queries : for many hours 
–Interactive analysis queries : from 100 milliseconds 
•Recent 0.9.0 release
©2014 Gruter. All rights reserved. 
Tajo ArchitectureMaster ServerTajoMasterSlaveServer 
TajoWorker 
QueryMaster 
Local Query Engine 
StorageManager 
Local 
FileSystemHDFS 
Client 
JDBC 
TSqlWeb UI 
Slave Server 
TajoWorker 
QueryMasterLocal Query EngineStorageManagerLocalFileSystem 
HDFS 
Slave Server 
TajoWorker 
QueryMaster 
Local Query EngineStorageManager 
Local 
FileSystem 
HDFS 
CatalogStoreDBMS 
HCatalog 
Submit a query 
Manage metadata 
Allocate a queryRun & monitor a queryRun & monitor a query
©2014 Gruter. All rights reserved. 
Mature SQL Feature Set 
•Fully distributed query executions 
–Inner join, and left/right/full outer join 
–Groupby, sort, multiple distinct aggregation 
–window function 
•SQL data types 
–CHAR, BOOL, INT, DOUBLE, TEXT, DATE, Etc 
•Various file formats 
–Text file (CSV), SequenceFile, RCFile, Parquet, Avro 
•SQL Standards 
–Non standard features : PgSQLand Oracle
©2014 Gruter. All rights reserved. 
Group by, Order by 
•Group by 
–Multi-level distributed Group by 
•tajo.dist-query.groupby.multi-level-aggr 
–Multiple Count Distinct 
•Select count(distinct col1), sum(distinct col2), ... 
–Hash Aggregation or Sort Aggregation 
•Order by 
–Fully distributed order by 
–Range partition 
5
©2014 Gruter. All rights reserved. 
Join 
•Join 
–NATURAL, INNER, OUTER (LEFT, RIGHT, FULL) 
–SEMI, ANTI Join (planned for v0.9) 
•Join Predicates 
–WHERE and ON predicates 
–de-factor standard outer join behavior with both predicates 
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx'; SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num and t2.value = ‘xxx’;
©2014 Gruter. All rights reserved. 
Table Partitions 
•Column Value Partition 
–Hive Compatible Partition 
•Range Partition (planned for 1.0) 
–Table will be partitioned by disjoint ranges. 
–Will remove the partition granularity problem of Hive Partition 
CREATE TABLE T1 (C1 INT, C2 TEXT) using PARQUET WITH (‘parquet.compression’ = ‘SNAPPY’) 
PARTITION BY COLUMN (C3 INT, C4 TEXT);
©2014 Gruter. All rights reserved. 
Window Function 
•OVER clause 
–row_number() and rank() 
–Aggregation function support 
–PARTITION and ORDER BY clause 
SELECT depname, empno, salary, enroll_dateFROM ( SELECT depname, empno, salary, enroll_date, rank() OVER (PARTITION BY depnameORDER BY salary DESC, empno) AS posFROM empsalary) AS ssWHERE pos < 3;
©2014 Gruter. All rights reserved. 
Currently Not Supported 
•IN/Exists SubQuery(2014, 4Q) 
–Select * from t1 where col1 in (select col1...) 
•Scalar SubQuery(2015, 1Q) 
–Select col1, (select col2 from ...) from ... 
•Some analytic function(2015, 1Q or 요청시) 
–ROLLUP, CUBE, some Window 
•Create/Drop Function(2014, 4Q) 
–실행중에UDF를추가하는기능미제공 
•Alter meta property(2014, 4Q) 
–Partition 정보등 
9
©2014 Gruter. All rights reserved. 
Comparison with other platform 
Function 
Tajo 
Hive 
Impala 
Spark 
Computing 
자체 
MapReduceorTez 
자체 
자체 
Resource Management 
자체or YARN 
YARN 
자체 
자체or YARN 
Scheduler 
FIFO, Fair 
FIFO, Fair, Capacity 
FIFO, Fair 
FIFO, Fair 
Storage 
HDFS, S3, 
HBase 
HDFS, HBase,S3 
HDFS, HBase 
자체RDD 
(HDSF 등) 
File Format 
CSV, RC, Parquet, Avro등 
CSV, RC, ORC, Parquet, Avro등 
CSV, RC, Parquet, Avro등 
CSV, RC, Parquet, Avro등 
Data Model 
Relational 
Nested Data Model 
Relational 
Nested Data Model 
Query 
ANSI-SQL 
HiveQL 
HiveQL 
HiveQL 
구현언어 
Java 
Java 
C++ 
Scala 
Client 
JavaAPI, JDBC, CLI 
CLI, JDBC, ODBC, Thrift Server API 
CLI, JDBC, ODBC 
Shark JDBC/ODBC, 
Scala, Java, Python API 
Query Latency 
Long run, 
Interactive 
Long run, 
(Interactive-Tez) 
Interactive 
Interactive 
컴퓨팅특징 
데이터는Disk, 중간데이터는Memory/Disk모두사용 
데이터는Disk, 중간데이터는Memory/Disk모두사용 
중간데이터가In- Memory 
(최근On-Disk지원) 
분석대상데이터가In- Memory에로딩 
License 
Apache 
Apache 
Apache 
Apache 
Main Sponsor 
Gruter 
Hortonworks 
Cloudera 
Databricks 
10 
* 버전업등에따라내용은틀릴수있음
©2014 Gruter. All rights reserved. 
Performance(1) 
11-테스트장비: 1 master + 6 workersCPU: Xeon 2.5GHz, E5, 24 Core, Memory: 64GB, Network:10Gb Disk: 3TB * 6 SATA/HDD (7200 RPM) -테스트데이터: TPC-H Scale 100 or 1000 
-VersionHadoop: cdh-4.3.0, Hive: 0.10.0-cdh4.3.0, Impala: impalad_version_1.1.1_RELEASETajo: 0.2-SNAPSHOT
©2014 Gruter. All rights reserved. 
Performance(2) 
•실제사용데이터및질의 
–데이터 
•1.7TB (4.1B rows, Q1) 
•8 or less GB (results of Q1, rest of Qs) 
–Query 
•Q1: scan using about 20 text pattern matching filters 
•Q2: 7 unions with joins 
•Q3: join 
•Q4: group by and order by 
•Q5: 30 text pattern matching filters with OR conditions, group by, having, and order by 4 
12 
http://www.slideshare.net/gruter/tajo-case-studybayareahugmeetupevent20131105
©2014 Gruter. All rights reserved. 
1445.69 
895.96 
789.09 
0 
200 
400 
600 
8001000 
1200 
1400 
Q1: scan using about 20 text pattern matching filtersHive 
Impala 
Tajo 
Q1 –filter scan 
• processing time (sec.) 
• 
NB: 
* Tajo showed enhanced performance due to dynamic task scheduling
©2014 Gruter. All rights reserved. 
63.64 
9.11 
38.64 
010 
20 
30 
4050 
60 
70 
Q2: 7 unions with joins 
Hive 
Impala 
Tajo 
Q2 –unions, joinsprocessing time (sec.) NB: *Tajo materializing all query results to HDFS, as is the main goal*unions are processed in sequence in Tajo now (parallel processing is coming soon) 
• 
•
©2014 Gruter. All rights reserved. 101.45 
36.81 
31.92 
020 
40 
60 
80 
100 
Q3: joinHiveImpalaTajo 
Q3 –join 
processing time (sec.) 
• 
• NB: *Tajo has an optimal selection/projection push down
©2014 Gruter. All rights reserved. 24.7 
0.45 
0.65 
0 
5 
10 
15 
20 
25 
Q4: group by and order byHiveImpalaTajo 
Q4 –group by and sort 
processing time (sec.) 
• 
•
©2014 Gruter. All rights reserved. 128.78 
17.03 
6.03 
0 
20 
40 
60 
80 
100 
120Q6: 30 Text pattern matching filters with OR conditions, group by, having, and order by resulting in smaller set of outputHiveImpala 
Tajo 
Q5 –filters, group by, having and sort 
processing time (sec.) 
• 
• Q5:
©2014 Gruter. All rights reserved. 
Replace Commercial Data Warehouse 
•ETL(Batch) Processing: 120+ queries, ~4TB read/day 
•OLAP(Interactive) Processing: 500+ queriesOperational Systems 
Integration 
Layer 
Data 
Warehouse 
Data 
Mart 
Marketing 
Sales 
ERPSCM 
ODSStaging AreaStrategicMartsDataVault 
Front-End 
Analytics 
Reports 
OLAPVisualizationDataMining
©2014 Gruter. All rights reserved. 
Tajo-as-a-Service on AWS
©2014 Gruter. All rights reserved. 
Active Open Source Community 
•Fully community-driven open source 
•Stable development team 
–17 committers + many contributors
©2014 Gruter. All rights reserved. 
Future Works 
•2014 4Q 
–HBaseintergation 
–In/Exists SubQuery 
–User defined function 
–Multi-tenant Scheduler 
•2015 1Q 
–Authentication and Standard Access Control 
–Scalar SubQuery 
–ROLLUP, CUBE 
•2015 2Q 
–VectorizedEngine(C++ Operator) 
21 
Gruter에서진행하는마일스톤으로Apache Tajo 커뮤니티방향과는다를수있음
TAJO INTERNAL 
22
©2014 Gruter. All rights reserved. 
DBMS 실행방식 
23 
SelectionGroupbyJoin 
Orderby 
Eval 
Filter 
SCAN 
... 
Basic OperatorSLECT o_custkey, l_lineitem, count(*) FROM lineitemaJOIN orders b ON a.l_orderkey= b.o_orderkeyWHERE a.l_shipdate> ‘2014-09-01’ GROUP BY o_custkey, l_lineitem 
SCAN 
(lineitem) 
SCAN 
(orders) 
Fliter 
Join 
Groupby 
Operator 조합으로 
Tree 형태의 
PhysicalPlan생성 
Call() 
Call() Return Tuple 
Return Tuple
©2014 Gruter. All rights reserved. 
Query Execution Flow 
24 
TajoClientTajoCli 
TajoJDBC 
SeqScanExec 
Non-forward query 
LogicalPlanner 
LogicalPlanner 
Task 
Scheduler 
Logical Plan 
Optimize 
ScheduleTasks dynamically PhysicalPlanner 
GlobalPlanner 
SeqScanExec 
HashJoinExec 
SortJoinExecHashAggrExec 
TaskTableStoreExec 
TajoMaster 
Result 
Run query 
QueryMaster 
Forward query 
Distributed Plan 
Optimize 
SubQuery-BSubQuery-A Make SubQuery 
Resource 
Manager 
Get available resource 
Tajo 
Worker 
Run TaskMake Physical Operator 
Run 
Operator Input 
Output 
Result 
TajoPullServer 
Fetcher 
Intermediate
©2014 Gruter. All rights reserved. 
Logical Plan/Distributed Plan 
25 
(A join-groupby-sort query plan) 
(A distributed query execution plan) 
select col1, sum(col2) as total, avg(col3) as averagefromr1, r2where r1.col1 = r2.col2 group by col1 order by average;
©2014 Gruter. All rights reserved. 
Logical Plan Optimizer 
•Basic Rewrite Rule 
–Common sub expression elimination 
–Constant folding (CF), and Null propagation 
•Projection Push Down (PPD) 
–push expressions to operators lower as possible 
–narrow read columns 
–remove duplicated expressions 
•if some expressions has common expression 
•Filter Push Down (FPD) 
–reduce rows to be processed earlier as possible 
•Extensible Rewrite Rule 
–Allow developers to write their own rewrite rules 
26
©2014 Gruter. All rights reserved. 
Logical Plan Optimizer 
27 
SELECT 
item_id, 
order_id 
sum_price* (1.2 * 0.3) as total, 
FROM ( 
SELECT 
item_id, 
order_id, 
sum(price) as sum_price 
FROM 
ITEMS 
GROUP BY item_id, order_id 
) a 
WHERE item_id= 17234 
SELECT 
item_id, 
order_id, 
sum(price) * (3.6) 
FROM 
ITEMS 
GROUP BY 
item_id, order_id 
WHERE item_id= 17234 
Original 
Rewritten 
CF + PPD 
FPD
©2014 Gruter. All rights reserved. 
Filter Push Down Rule(Outer Join) 
Preserved Row Table 
Null Supplying Table 
Join Predicate 
Not Pushed(1) 
Pushed(2) 
Where Predicate 
Pushed(3) 
Not Pushed(4) 
28 
SELECT * FROM lineitema 
OUTER JOIN orders b ON 
a.l_orderkey= o_orderkeyAND 
a.l_shipdate= ‘2013-01-01’ AND 
b.o_orderstatus= ‘O’ 
WHERE a.l_quantity>20 
AND b.o_custkey= 100 
① 
② 
③ 
④ 
SCAN 
(lineitem) 
Filter 
(a.lquantity> 20) 
SCAN 
(orders) 
Filter 
(b.o_orderstatus=‘O’) 
JOIN 
a.l_orderkey= b.o_orderkey 
a.l_shipdate= ‘2013-01-01’ 
Filter 
(b.o_custkey= 100) ExecutionBlock-A 
ExecutionBlock-B 
ExecutionBlock-C
©2014 Gruter. All rights reserved. 
Logical Plan Optimizer 
•Cost-based Join Order 
–Don’t need to guess right join orders anymore 
–Greedy heuristic algorithm 
•Resulting in a bushy join tree instead of left-deep join tree 
29 
Left-deep Join Tree 
Bush Join Tree
©2014 Gruter. All rights reserved. 
기타 
•Union 
–Filter push down 
•Repartitioner 
–tajo.dist-query.join.partition-volume-mb 
–tajo.dist-query.groupby.partition-volume-mb 
–tajo.dist-query.table-partition.task-volume-mb 
•Broadcast Join 
–tajo.dist-query.join.broadcast.threshold-bytes 
•Host/Disk aware scheduler 
30
©2014 Gruter. All rights reserved. 
GRUTER: YOUR PARTNER 
IN THE BIG DATA REVOLUTION 
Phone +82-70-8129-2950 
Fax+82-70-8129-2952 
E-mail contact@gruter.com 
Webwww.gruter.com 
Phone +1-415-841-3345

Contenu connexe

Tendances

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology LandscapeShivanandaVSeeri
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1Donghan Kim
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlKhanderao Kand
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworksIJDKP
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoopguest27e6764
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 

Tendances (20)

Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
paper
paperpaper
paper
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Sqrrl and Accumulo
Sqrrl and AccumuloSqrrl and Accumulo
Sqrrl and Accumulo
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosql
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoop
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
 

En vedette

오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개Hyoungjun Kim
 
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례Hyoungjun Kim
 
KAIST 전산학과 iDBLab 20140318 신입생 소개자료
KAIST 전산학과 iDBLab 20140318 신입생 소개자료KAIST 전산학과 iDBLab 20140318 신입생 소개자료
KAIST 전산학과 iDBLab 20140318 신입생 소개자료Taehun Kim, Ph.D
 
S04 hybrid app_and_gae_management_v1.0
S04 hybrid app_and_gae_management_v1.0S04 hybrid app_and_gae_management_v1.0
S04 hybrid app_and_gae_management_v1.0Sun-Jin Jang
 
Jco 소셜 빅데이터_20120218
Jco 소셜 빅데이터_20120218Jco 소셜 빅데이터_20120218
Jco 소셜 빅데이터_20120218Hyoungjun Kim
 
Google을 지탱하는 기술3
Google을 지탱하는 기술3Google을 지탱하는 기술3
Google을 지탱하는 기술3sid choi
 
UNUS Big Data BEANs 소개서
UNUS Big Data BEANs 소개서 UNUS Big Data BEANs 소개서
UNUS Big Data BEANs 소개서 영민 최
 
20150125 AWS BlackBelt - Amazon RDS (Korean)
20150125 AWS BlackBelt - Amazon RDS (Korean)20150125 AWS BlackBelt - Amazon RDS (Korean)
20150125 AWS BlackBelt - Amazon RDS (Korean)Amazon Web Services Korea
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkSadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupSadayuki Furuhashi
 
채팅서버의 부하 분산 사례
채팅서버의 부하 분산 사례채팅서버의 부하 분산 사례
채팅서버의 부하 분산 사례John Kim
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Ndc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABCNdc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABCHo Gyu Lee
 
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례Taejun Kim
 
Data discovery qlikview
Data discovery   qlikviewData discovery   qlikview
Data discovery qlikviewchoi3773
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?Sadayuki Furuhashi
 
[113]apache zeppelin 이문수
[113]apache zeppelin 이문수[113]apache zeppelin 이문수
[113]apache zeppelin 이문수NAVER D2
 

En vedette (20)

IRIS
IRISIRIS
IRIS
 
오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개
 
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례Presto, Zeppelin을 이용한 초간단 BI 구축 사례
Presto, Zeppelin을 이용한 초간단 BI 구축 사례
 
KAIST 전산학과 iDBLab 20140318 신입생 소개자료
KAIST 전산학과 iDBLab 20140318 신입생 소개자료KAIST 전산학과 iDBLab 20140318 신입생 소개자료
KAIST 전산학과 iDBLab 20140318 신입생 소개자료
 
S04 hybrid app_and_gae_management_v1.0
S04 hybrid app_and_gae_management_v1.0S04 hybrid app_and_gae_management_v1.0
S04 hybrid app_and_gae_management_v1.0
 
Jco 소셜 빅데이터_20120218
Jco 소셜 빅데이터_20120218Jco 소셜 빅데이터_20120218
Jco 소셜 빅데이터_20120218
 
Google을 지탱하는 기술3
Google을 지탱하는 기술3Google을 지탱하는 기술3
Google을 지탱하는 기술3
 
UNUS Big Data BEANs 소개서
UNUS Big Data BEANs 소개서 UNUS Big Data BEANs 소개서
UNUS Big Data BEANs 소개서
 
20150125 AWS BlackBelt - Amazon RDS (Korean)
20150125 AWS BlackBelt - Amazon RDS (Korean)20150125 AWS BlackBelt - Amazon RDS (Korean)
20150125 AWS BlackBelt - Amazon RDS (Korean)
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
채팅서버의 부하 분산 사례
채팅서버의 부하 분산 사례채팅서버의 부하 분산 사례
채팅서버의 부하 분산 사례
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Ndc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABCNdc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABC
 
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
Zeppelin(제플린) 서울시립대학교 데이터 마이닝연구실 활용사례
 
Data discovery qlikview
Data discovery   qlikviewData discovery   qlikview
Data discovery qlikview
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
 
[113]apache zeppelin 이문수
[113]apache zeppelin 이문수[113]apache zeppelin 이문수
[113]apache zeppelin 이문수
 

Similaire à Tajo_Meetup_20141120

Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter
 
Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Gruter
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analyticsmason_s
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiTajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiData Con LA
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfDataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfMiguel Angel Fajardo
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureJianfeng Zhang
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureRajesh Balamohan
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisFelicia Haggarty
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 

Similaire à Tajo_Meetup_20141120 (20)

Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
 
Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Apache Tajo - BWC 2014
Apache Tajo - BWC 2014
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
 
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiTajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfDataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 

Dernier

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Dernier (20)

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Tajo_Meetup_20141120

  • 1. Apache Tajo : A Big Data Warehouse Systemon Hadoop 김형준 babokim@gmail.com
  • 2. ©2014 Gruter. All rights reserved. Apache Tajo Overview •A big data warehouse system on Hadoop •Apache Top-level project since March 2014 •Supports SQL standards •Features –Powerful distributed processing architecture (Not MapReduce) –Advanced query optimization algorithms and techniques –Long running queries : for many hours –Interactive analysis queries : from 100 milliseconds •Recent 0.9.0 release
  • 3. ©2014 Gruter. All rights reserved. Tajo ArchitectureMaster ServerTajoMasterSlaveServer TajoWorker QueryMaster Local Query Engine StorageManager Local FileSystemHDFS Client JDBC TSqlWeb UI Slave Server TajoWorker QueryMasterLocal Query EngineStorageManagerLocalFileSystem HDFS Slave Server TajoWorker QueryMaster Local Query EngineStorageManager Local FileSystem HDFS CatalogStoreDBMS HCatalog Submit a query Manage metadata Allocate a queryRun & monitor a queryRun & monitor a query
  • 4. ©2014 Gruter. All rights reserved. Mature SQL Feature Set •Fully distributed query executions –Inner join, and left/right/full outer join –Groupby, sort, multiple distinct aggregation –window function •SQL data types –CHAR, BOOL, INT, DOUBLE, TEXT, DATE, Etc •Various file formats –Text file (CSV), SequenceFile, RCFile, Parquet, Avro •SQL Standards –Non standard features : PgSQLand Oracle
  • 5. ©2014 Gruter. All rights reserved. Group by, Order by •Group by –Multi-level distributed Group by •tajo.dist-query.groupby.multi-level-aggr –Multiple Count Distinct •Select count(distinct col1), sum(distinct col2), ... –Hash Aggregation or Sort Aggregation •Order by –Fully distributed order by –Range partition 5
  • 6. ©2014 Gruter. All rights reserved. Join •Join –NATURAL, INNER, OUTER (LEFT, RIGHT, FULL) –SEMI, ANTI Join (planned for v0.9) •Join Predicates –WHERE and ON predicates –de-factor standard outer join behavior with both predicates SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx'; SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num and t2.value = ‘xxx’;
  • 7. ©2014 Gruter. All rights reserved. Table Partitions •Column Value Partition –Hive Compatible Partition •Range Partition (planned for 1.0) –Table will be partitioned by disjoint ranges. –Will remove the partition granularity problem of Hive Partition CREATE TABLE T1 (C1 INT, C2 TEXT) using PARQUET WITH (‘parquet.compression’ = ‘SNAPPY’) PARTITION BY COLUMN (C3 INT, C4 TEXT);
  • 8. ©2014 Gruter. All rights reserved. Window Function •OVER clause –row_number() and rank() –Aggregation function support –PARTITION and ORDER BY clause SELECT depname, empno, salary, enroll_dateFROM ( SELECT depname, empno, salary, enroll_date, rank() OVER (PARTITION BY depnameORDER BY salary DESC, empno) AS posFROM empsalary) AS ssWHERE pos < 3;
  • 9. ©2014 Gruter. All rights reserved. Currently Not Supported •IN/Exists SubQuery(2014, 4Q) –Select * from t1 where col1 in (select col1...) •Scalar SubQuery(2015, 1Q) –Select col1, (select col2 from ...) from ... •Some analytic function(2015, 1Q or 요청시) –ROLLUP, CUBE, some Window •Create/Drop Function(2014, 4Q) –실행중에UDF를추가하는기능미제공 •Alter meta property(2014, 4Q) –Partition 정보등 9
  • 10. ©2014 Gruter. All rights reserved. Comparison with other platform Function Tajo Hive Impala Spark Computing 자체 MapReduceorTez 자체 자체 Resource Management 자체or YARN YARN 자체 자체or YARN Scheduler FIFO, Fair FIFO, Fair, Capacity FIFO, Fair FIFO, Fair Storage HDFS, S3, HBase HDFS, HBase,S3 HDFS, HBase 자체RDD (HDSF 등) File Format CSV, RC, Parquet, Avro등 CSV, RC, ORC, Parquet, Avro등 CSV, RC, Parquet, Avro등 CSV, RC, Parquet, Avro등 Data Model Relational Nested Data Model Relational Nested Data Model Query ANSI-SQL HiveQL HiveQL HiveQL 구현언어 Java Java C++ Scala Client JavaAPI, JDBC, CLI CLI, JDBC, ODBC, Thrift Server API CLI, JDBC, ODBC Shark JDBC/ODBC, Scala, Java, Python API Query Latency Long run, Interactive Long run, (Interactive-Tez) Interactive Interactive 컴퓨팅특징 데이터는Disk, 중간데이터는Memory/Disk모두사용 데이터는Disk, 중간데이터는Memory/Disk모두사용 중간데이터가In- Memory (최근On-Disk지원) 분석대상데이터가In- Memory에로딩 License Apache Apache Apache Apache Main Sponsor Gruter Hortonworks Cloudera Databricks 10 * 버전업등에따라내용은틀릴수있음
  • 11. ©2014 Gruter. All rights reserved. Performance(1) 11-테스트장비: 1 master + 6 workersCPU: Xeon 2.5GHz, E5, 24 Core, Memory: 64GB, Network:10Gb Disk: 3TB * 6 SATA/HDD (7200 RPM) -테스트데이터: TPC-H Scale 100 or 1000 -VersionHadoop: cdh-4.3.0, Hive: 0.10.0-cdh4.3.0, Impala: impalad_version_1.1.1_RELEASETajo: 0.2-SNAPSHOT
  • 12. ©2014 Gruter. All rights reserved. Performance(2) •실제사용데이터및질의 –데이터 •1.7TB (4.1B rows, Q1) •8 or less GB (results of Q1, rest of Qs) –Query •Q1: scan using about 20 text pattern matching filters •Q2: 7 unions with joins •Q3: join •Q4: group by and order by •Q5: 30 text pattern matching filters with OR conditions, group by, having, and order by 4 12 http://www.slideshare.net/gruter/tajo-case-studybayareahugmeetupevent20131105
  • 13. ©2014 Gruter. All rights reserved. 1445.69 895.96 789.09 0 200 400 600 8001000 1200 1400 Q1: scan using about 20 text pattern matching filtersHive Impala Tajo Q1 –filter scan • processing time (sec.) • NB: * Tajo showed enhanced performance due to dynamic task scheduling
  • 14. ©2014 Gruter. All rights reserved. 63.64 9.11 38.64 010 20 30 4050 60 70 Q2: 7 unions with joins Hive Impala Tajo Q2 –unions, joinsprocessing time (sec.) NB: *Tajo materializing all query results to HDFS, as is the main goal*unions are processed in sequence in Tajo now (parallel processing is coming soon) • •
  • 15. ©2014 Gruter. All rights reserved. 101.45 36.81 31.92 020 40 60 80 100 Q3: joinHiveImpalaTajo Q3 –join processing time (sec.) • • NB: *Tajo has an optimal selection/projection push down
  • 16. ©2014 Gruter. All rights reserved. 24.7 0.45 0.65 0 5 10 15 20 25 Q4: group by and order byHiveImpalaTajo Q4 –group by and sort processing time (sec.) • •
  • 17. ©2014 Gruter. All rights reserved. 128.78 17.03 6.03 0 20 40 60 80 100 120Q6: 30 Text pattern matching filters with OR conditions, group by, having, and order by resulting in smaller set of outputHiveImpala Tajo Q5 –filters, group by, having and sort processing time (sec.) • • Q5:
  • 18. ©2014 Gruter. All rights reserved. Replace Commercial Data Warehouse •ETL(Batch) Processing: 120+ queries, ~4TB read/day •OLAP(Interactive) Processing: 500+ queriesOperational Systems Integration Layer Data Warehouse Data Mart Marketing Sales ERPSCM ODSStaging AreaStrategicMartsDataVault Front-End Analytics Reports OLAPVisualizationDataMining
  • 19. ©2014 Gruter. All rights reserved. Tajo-as-a-Service on AWS
  • 20. ©2014 Gruter. All rights reserved. Active Open Source Community •Fully community-driven open source •Stable development team –17 committers + many contributors
  • 21. ©2014 Gruter. All rights reserved. Future Works •2014 4Q –HBaseintergation –In/Exists SubQuery –User defined function –Multi-tenant Scheduler •2015 1Q –Authentication and Standard Access Control –Scalar SubQuery –ROLLUP, CUBE •2015 2Q –VectorizedEngine(C++ Operator) 21 Gruter에서진행하는마일스톤으로Apache Tajo 커뮤니티방향과는다를수있음
  • 23. ©2014 Gruter. All rights reserved. DBMS 실행방식 23 SelectionGroupbyJoin Orderby Eval Filter SCAN ... Basic OperatorSLECT o_custkey, l_lineitem, count(*) FROM lineitemaJOIN orders b ON a.l_orderkey= b.o_orderkeyWHERE a.l_shipdate> ‘2014-09-01’ GROUP BY o_custkey, l_lineitem SCAN (lineitem) SCAN (orders) Fliter Join Groupby Operator 조합으로 Tree 형태의 PhysicalPlan생성 Call() Call() Return Tuple Return Tuple
  • 24. ©2014 Gruter. All rights reserved. Query Execution Flow 24 TajoClientTajoCli TajoJDBC SeqScanExec Non-forward query LogicalPlanner LogicalPlanner Task Scheduler Logical Plan Optimize ScheduleTasks dynamically PhysicalPlanner GlobalPlanner SeqScanExec HashJoinExec SortJoinExecHashAggrExec TaskTableStoreExec TajoMaster Result Run query QueryMaster Forward query Distributed Plan Optimize SubQuery-BSubQuery-A Make SubQuery Resource Manager Get available resource Tajo Worker Run TaskMake Physical Operator Run Operator Input Output Result TajoPullServer Fetcher Intermediate
  • 25. ©2014 Gruter. All rights reserved. Logical Plan/Distributed Plan 25 (A join-groupby-sort query plan) (A distributed query execution plan) select col1, sum(col2) as total, avg(col3) as averagefromr1, r2where r1.col1 = r2.col2 group by col1 order by average;
  • 26. ©2014 Gruter. All rights reserved. Logical Plan Optimizer •Basic Rewrite Rule –Common sub expression elimination –Constant folding (CF), and Null propagation •Projection Push Down (PPD) –push expressions to operators lower as possible –narrow read columns –remove duplicated expressions •if some expressions has common expression •Filter Push Down (FPD) –reduce rows to be processed earlier as possible •Extensible Rewrite Rule –Allow developers to write their own rewrite rules 26
  • 27. ©2014 Gruter. All rights reserved. Logical Plan Optimizer 27 SELECT item_id, order_id sum_price* (1.2 * 0.3) as total, FROM ( SELECT item_id, order_id, sum(price) as sum_price FROM ITEMS GROUP BY item_id, order_id ) a WHERE item_id= 17234 SELECT item_id, order_id, sum(price) * (3.6) FROM ITEMS GROUP BY item_id, order_id WHERE item_id= 17234 Original Rewritten CF + PPD FPD
  • 28. ©2014 Gruter. All rights reserved. Filter Push Down Rule(Outer Join) Preserved Row Table Null Supplying Table Join Predicate Not Pushed(1) Pushed(2) Where Predicate Pushed(3) Not Pushed(4) 28 SELECT * FROM lineitema OUTER JOIN orders b ON a.l_orderkey= o_orderkeyAND a.l_shipdate= ‘2013-01-01’ AND b.o_orderstatus= ‘O’ WHERE a.l_quantity>20 AND b.o_custkey= 100 ① ② ③ ④ SCAN (lineitem) Filter (a.lquantity> 20) SCAN (orders) Filter (b.o_orderstatus=‘O’) JOIN a.l_orderkey= b.o_orderkey a.l_shipdate= ‘2013-01-01’ Filter (b.o_custkey= 100) ExecutionBlock-A ExecutionBlock-B ExecutionBlock-C
  • 29. ©2014 Gruter. All rights reserved. Logical Plan Optimizer •Cost-based Join Order –Don’t need to guess right join orders anymore –Greedy heuristic algorithm •Resulting in a bushy join tree instead of left-deep join tree 29 Left-deep Join Tree Bush Join Tree
  • 30. ©2014 Gruter. All rights reserved. 기타 •Union –Filter push down •Repartitioner –tajo.dist-query.join.partition-volume-mb –tajo.dist-query.groupby.partition-volume-mb –tajo.dist-query.table-partition.task-volume-mb •Broadcast Join –tajo.dist-query.join.broadcast.threshold-bytes •Host/Disk aware scheduler 30
  • 31. ©2014 Gruter. All rights reserved. GRUTER: YOUR PARTNER IN THE BIG DATA REVOLUTION Phone +82-70-8129-2950 Fax+82-70-8129-2952 E-mail contact@gruter.com Webwww.gruter.com Phone +1-415-841-3345