SlideShare une entreprise Scribd logo
1  sur  13
Télécharger pour lire hors ligne
Integrating data stored in
RDBMS and Hadoop
leoricklin@gmail.com
Problem:
How to process data
stored in the
RDBMS (master
tables) and Hadoop
(log tables) within
one platform ?
Pros:
● 過程簡單, 讀取RDBMS成為RDD
後進行操作
Cons:
● 系統設定複雜, 需開通所有spark
nodes 對 RDBMS的接口
● JdbcRDD僅能使用在範圍查詢語法,
且範圍條件欄位限制為Long型態
=> 當作spark nodes分佈存取的依
據
Solution 1: Spark with JdbcRDD
Mysql table schema
mysql> create table calendar (
date_cd int not null
,wd_flag INT not null)
engine=csv;
mysql> select date_cd, wd_flag from calendar where date_cd >= 20130101 and date_cd <=
20141031;
+----------+---------+
| date_cd | wd_flag |
+----------+---------+
| 20141001 | 0 |
| 20141002 | 0 |
....
62 rows in set (0.05 sec)
利用JdbcRDD讀取Mysql
$SPARK_HOME/bin/spark-shell --jars /usr/share/java/mysql-connector-java-5.1.17.jar --
master yarn-client --num-executors 4 --executor-cores 2 --executor-memory 4g --driver-
memory 1g
scala>
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}
val url="jdbc:mysql://mysqlserver.com:3306/db"
val username = "user"
val password = "pwd"
val driverName="com.mysql.jdbc.Driver"
case class CAL(date_cd:String, hd_flag:Int)
val calrdd = new JdbcRDD( sc
, () => DriverManager.getConnection(url,username,password)
, "select date_cd, wd_flag from calendar where date_cd >= ? and date_cd <= ?"
, 20130101, 20141031, 12
, r =>( CAL(r.getString(1),r.getInt(2)) )
)
// calrdd.count: Long = 62
// scheduler.DAGScheduler: Job 0 finished: ... took 1.429489 s
利用HadooopRDD讀取HDFS
case class LOG(req_date:String, operation:String, http_code:Int)
val path = "hdfs://mycluster/tmp/log.CSV"
val logrdd = sc.textFile(path, 12 // ret: RDD[String]
).map( i => Helper.tokenize(i,",",true) // ret: RDD[Array[String]]
).map(ary => Helper.replaceChar(ary,"-","0") // ret: RDD[Array[String]]
).map(ary => toLOG(ary)) // ret: RDD[LOG]
// logrdd.count: Long = 4000000
// scheduler.DAGScheduler: Job 1 finished: ... took 10.404451 s
進行RDD join
import org.apache.spark.SparkContext._
val logkv = logrdd.map( i => (i.req_date.substring(0,8), i) ) // ret: RDD[String, LOG]
val calkv = calrdd.map( i => (i.date_cd, i) ) // ret: RDD[String, CAL]
val joinrdd = logkv.join(calkv)
/*
joinrdd.first: : (String, (LOG, CAL)) = (
20130129,(
LOG(20130129231106,Put,200)
,CAL(20130129,1))
)
scheduler.DAGScheduler: Job 20 finished... took 97.416966 s
*/
.
Solution 2: Spark with JDBC Client
Pros:
● 系統設定簡單, 僅開通driver nodes
對 RDBMS的接口
● 支援原生SELECT語法
● SQL查詢結果可自訂為RDD or
Broadcast variable.
Cons:
● 過程略複雜, 須將ResultSet在
driver node轉為RDD(或BV)後進
行操作
● RDBMS資料量受限於driver node
MEM限制
ref Mysql table schema
利用Jdbc Driver讀取Mysql
val conn:java.sql.Connection = SqlHelper.getConn(driverName, uri, username, password)
val stmt:java.sql.Statement = SqlHelper.getStmt(conn)
val sql="select date_cd,wd_flag from calendar where date_cd >= 20130101 and date_cd <=
20141031"
val ret:java.sql.ResultSet = SqlHelper.getResult(stmt, sql)
val rows = new scala.collection.mutable.ListBuffer[Tuple2[String,Int]]()
while (ret.next) {
rows += calToTup(ret)
}
// rows.size:Int = 62
// rows(0):(String, Int) = (20141001,0)
// return immediately
val calrdd = sc.parallelize(rows, 12)
ref 利用HadooopRDD讀取HDFS
ref 進行RDD join
進行Map-side Join
...
val sqlkv = new scala.collection.mutable.HashMap[String,Int]()
while (ret.next) {
sqlkv += calToTup(ret)
}
val sqlbc = sc.broadcast(sqlkv)
val logkv = logrdd.map( ... ) /* 利用HadooopRDD讀取HDFS並轉換為RDD[String,LOG] */
val result = logkv.mapPartitions( { iter =>
var sqlkv = sqlbc.value
for{
(key, value) <- iter
if(sqlkv.contains(key))
} yield (key, (value, sqlkv.getOrElse(key, () => "")))
})
/*
result.first: (String, (LOG, Any)) = ( 20130129,( LOG(20130129231106,Put,200) ,1 ) )
scheduler.DAGScheduler: Job 0 finished... took 2.051693 s
*/
Summary
● JdbcRDD方式
o Pros: 程式撰寫較簡易
o Cons: 系統設定較複雜
● JDBC Clien方式
o Pros: 系統設定較簡易
o Cons: 程式撰寫較複雜(但可進行性能優化)

Contenu connexe

Tendances

Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed_Hat_Storage
 
How to use postgresql.conf to configure and tune the PostgreSQL server
How to use postgresql.conf to configure and tune the PostgreSQL serverHow to use postgresql.conf to configure and tune the PostgreSQL server
How to use postgresql.conf to configure and tune the PostgreSQL serverEDB
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...
Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...
Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...Red_Hat_Storage
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2hdhappy001
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16Kangaroot
 
Migration DB2 to EDB - Project Experience
 Migration DB2 to EDB - Project Experience Migration DB2 to EDB - Project Experience
Migration DB2 to EDB - Project ExperienceEDB
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudNicolas Poggi
 
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC TechnologiesAccelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC Technologiesinside-BigData.com
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red_Hat_Storage
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containersinside-BigData.com
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoopmcsrivas
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAPEDB
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storageDataWorks Summit
 
How to Design for Database High Availability
How to Design for Database High AvailabilityHow to Design for Database High Availability
How to Design for Database High AvailabilityEDB
 
Best Practices & Lessons Learned from Deployment of PostgreSQL
 Best Practices & Lessons Learned from Deployment of PostgreSQL Best Practices & Lessons Learned from Deployment of PostgreSQL
Best Practices & Lessons Learned from Deployment of PostgreSQLEDB
 

Tendances (20)

Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
 
How to use postgresql.conf to configure and tune the PostgreSQL server
How to use postgresql.conf to configure and tune the PostgreSQL serverHow to use postgresql.conf to configure and tune the PostgreSQL server
How to use postgresql.conf to configure and tune the PostgreSQL server
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...
Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...
Red Hat Storage Day Boston - Red Hat Gluster Storage vs. Traditional Storage ...
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
9/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'169/ IBM POWER @ OPEN'16
9/ IBM POWER @ OPEN'16
 
Migration DB2 to EDB - Project Experience
 Migration DB2 to EDB - Project Experience Migration DB2 to EDB - Project Experience
Migration DB2 to EDB - Project Experience
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC TechnologiesAccelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containers
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
 
How to Design for Database High Availability
How to Design for Database High AvailabilityHow to Design for Database High Availability
How to Design for Database High Availability
 
Best Practices & Lessons Learned from Deployment of PostgreSQL
 Best Practices & Lessons Learned from Deployment of PostgreSQL Best Practices & Lessons Learned from Deployment of PostgreSQL
Best Practices & Lessons Learned from Deployment of PostgreSQL
 

En vedette

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Scala introduction
Scala introductionScala introduction
Scala introductionvito jeng
 
Cloudera 助力台灣大數據產業的發展
Cloudera 助力台灣大數據產業的發展Cloudera 助力台灣大數據產業的發展
Cloudera 助力台灣大數據產業的發展Etu Solution
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
Vpon - 廣告效果導向為基礎的行動廣告系統
Vpon - 廣告效果導向為基礎的行動廣告系統Vpon - 廣告效果導向為基礎的行動廣告系統
Vpon - 廣告效果導向為基礎的行動廣告系統Vpon
 
Strata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkStrata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkAdam Gibson
 
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)Kuo-Chun Su
 
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告Etu Solution
 
翻轉醫療-人類基因大數據解密
翻轉醫療-人類基因大數據解密翻轉醫療-人類基因大數據解密
翻轉醫療-人類基因大數據解密Chung-Tsai Su
 
唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pubChao Zhu
 
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)Amazon Web Services
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
AWS電商和零售業解決方案介紹
AWS電商和零售業解決方案介紹AWS電商和零售業解決方案介紹
AWS電商和零售業解決方案介紹Amazon Web Services
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 

En vedette (16)

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Scala introduction
Scala introductionScala introduction
Scala introduction
 
Cloudera 助力台灣大數據產業的發展
Cloudera 助力台灣大數據產業的發展Cloudera 助力台灣大數據產業的發展
Cloudera 助力台灣大數據產業的發展
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
Vpon - 廣告效果導向為基礎的行動廣告系統
Vpon - 廣告效果導向為基礎的行動廣告系統Vpon - 廣告效果導向為基礎的行動廣告系統
Vpon - 廣告效果導向為基礎的行動廣告系統
 
Strata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkStrata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on Spark
 
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
Hadoop, the Apple of Our Eyes (這些年,我們一起追的 Hadoop)
 
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
Track A-1: Cloudera 大數據產品和技術最前沿資訊報告
 
翻轉醫療-人類基因大數據解密
翻轉醫療-人類基因大數據解密翻轉醫療-人類基因大數據解密
翻轉醫療-人類基因大數據解密
 
唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub
 
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
大數據運算媒體業案例分享 (Big Data Compute Case Sharing for Media Industry)
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
AWS電商和零售業解決方案介紹
AWS電商和零售業解決方案介紹AWS電商和零售業解決方案介紹
AWS電商和零售業解決方案介紹
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 

Similaire à Integrating data stored in rdbms and hadoop

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Nko workshop - node js & nosql
Nko workshop - node js & nosqlNko workshop - node js & nosql
Nko workshop - node js & nosqlSimon Su
 
Oracle sharding : Installation & Configuration
Oracle sharding : Installation & ConfigurationOracle sharding : Installation & Configuration
Oracle sharding : Installation & Configurationsuresh gandhi
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiSpark Summit
 
Bogdan Kecman Advanced Databasing
Bogdan Kecman Advanced DatabasingBogdan Kecman Advanced Databasing
Bogdan Kecman Advanced DatabasingBogdan Kecman
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Sameer Farooqui
 
Bogdan Kecman INIT Presentation
Bogdan Kecman INIT PresentationBogdan Kecman INIT Presentation
Bogdan Kecman INIT Presentationarhismece
 
Improve PostgreSQL replication with Oracle GoldenGate
Improve PostgreSQL replication with Oracle GoldenGateImprove PostgreSQL replication with Oracle GoldenGate
Improve PostgreSQL replication with Oracle GoldenGateBobby Curtis
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
How Spark Does It Internally?
How Spark Does It Internally?How Spark Does It Internally?
How Spark Does It Internally?Knoldus Inc.
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?SegFaultConf
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.Keshav Murthy
 
Introduction to Grunt.js on Taiwan JavaScript Conference
Introduction to Grunt.js on Taiwan JavaScript ConferenceIntroduction to Grunt.js on Taiwan JavaScript Conference
Introduction to Grunt.js on Taiwan JavaScript ConferenceBo-Yi Wu
 

Similaire à Integrating data stored in rdbms and hadoop (20)

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Nko workshop - node js & nosql
Nko workshop - node js & nosqlNko workshop - node js & nosql
Nko workshop - node js & nosql
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Oracle sharding : Installation & Configuration
Oracle sharding : Installation & ConfigurationOracle sharding : Installation & Configuration
Oracle sharding : Installation & Configuration
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Bogdan Kecman Advanced Databasing
Bogdan Kecman Advanced DatabasingBogdan Kecman Advanced Databasing
Bogdan Kecman Advanced Databasing
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Bogdan Kecman INIT Presentation
Bogdan Kecman INIT PresentationBogdan Kecman INIT Presentation
Bogdan Kecman INIT Presentation
 
Improve PostgreSQL replication with Oracle GoldenGate
Improve PostgreSQL replication with Oracle GoldenGateImprove PostgreSQL replication with Oracle GoldenGate
Improve PostgreSQL replication with Oracle GoldenGate
 
Node.js
Node.jsNode.js
Node.js
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
How Spark Does It Internally?
How Spark Does It Internally?How Spark Does It Internally?
How Spark Does It Internally?
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.
 
Introduction to Grunt.js on Taiwan JavaScript Conference
Introduction to Grunt.js on Taiwan JavaScript ConferenceIntroduction to Grunt.js on Taiwan JavaScript Conference
Introduction to Grunt.js on Taiwan JavaScript Conference
 

Plus de leorick lin

How to prepare for pca certification 2021
How to prepare for pca certification 2021How to prepare for pca certification 2021
How to prepare for pca certification 2021leorick lin
 
1.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.51.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.5leorick lin
 
1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml1.5.recommending music with apache spark ml
1.5.recommending music with apache spark mlleorick lin
 
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatanalyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatleorick lin
 
Multiclassification with Decision Tree in Spark MLlib 1.3
Multiclassification with Decision Tree in Spark MLlib 1.3Multiclassification with Decision Tree in Spark MLlib 1.3
Multiclassification with Decision Tree in Spark MLlib 1.3leorick lin
 
Email Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML PipelineEmail Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML Pipelineleorick lin
 

Plus de leorick lin (6)

How to prepare for pca certification 2021
How to prepare for pca certification 2021How to prepare for pca certification 2021
How to prepare for pca certification 2021
 
1.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.51.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.5
 
1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml
 
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatanalyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
 
Multiclassification with Decision Tree in Spark MLlib 1.3
Multiclassification with Decision Tree in Spark MLlib 1.3Multiclassification with Decision Tree in Spark MLlib 1.3
Multiclassification with Decision Tree in Spark MLlib 1.3
 
Email Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML PipelineEmail Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML Pipeline
 

Dernier

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Dernier (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Integrating data stored in rdbms and hadoop

  • 1. Integrating data stored in RDBMS and Hadoop leoricklin@gmail.com
  • 2. Problem: How to process data stored in the RDBMS (master tables) and Hadoop (log tables) within one platform ?
  • 3. Pros: ● 過程簡單, 讀取RDBMS成為RDD 後進行操作 Cons: ● 系統設定複雜, 需開通所有spark nodes 對 RDBMS的接口 ● JdbcRDD僅能使用在範圍查詢語法, 且範圍條件欄位限制為Long型態 => 當作spark nodes分佈存取的依 據 Solution 1: Spark with JdbcRDD
  • 4. Mysql table schema mysql> create table calendar ( date_cd int not null ,wd_flag INT not null) engine=csv; mysql> select date_cd, wd_flag from calendar where date_cd >= 20130101 and date_cd <= 20141031; +----------+---------+ | date_cd | wd_flag | +----------+---------+ | 20141001 | 0 | | 20141002 | 0 | .... 62 rows in set (0.05 sec)
  • 5. 利用JdbcRDD讀取Mysql $SPARK_HOME/bin/spark-shell --jars /usr/share/java/mysql-connector-java-5.1.17.jar -- master yarn-client --num-executors 4 --executor-cores 2 --executor-memory 4g --driver- memory 1g scala> import org.apache.spark.rdd.JdbcRDD import java.sql.{Connection, DriverManager, ResultSet} val url="jdbc:mysql://mysqlserver.com:3306/db" val username = "user" val password = "pwd" val driverName="com.mysql.jdbc.Driver" case class CAL(date_cd:String, hd_flag:Int) val calrdd = new JdbcRDD( sc , () => DriverManager.getConnection(url,username,password) , "select date_cd, wd_flag from calendar where date_cd >= ? and date_cd <= ?" , 20130101, 20141031, 12 , r =>( CAL(r.getString(1),r.getInt(2)) ) ) // calrdd.count: Long = 62 // scheduler.DAGScheduler: Job 0 finished: ... took 1.429489 s
  • 6. 利用HadooopRDD讀取HDFS case class LOG(req_date:String, operation:String, http_code:Int) val path = "hdfs://mycluster/tmp/log.CSV" val logrdd = sc.textFile(path, 12 // ret: RDD[String] ).map( i => Helper.tokenize(i,",",true) // ret: RDD[Array[String]] ).map(ary => Helper.replaceChar(ary,"-","0") // ret: RDD[Array[String]] ).map(ary => toLOG(ary)) // ret: RDD[LOG] // logrdd.count: Long = 4000000 // scheduler.DAGScheduler: Job 1 finished: ... took 10.404451 s
  • 7. 進行RDD join import org.apache.spark.SparkContext._ val logkv = logrdd.map( i => (i.req_date.substring(0,8), i) ) // ret: RDD[String, LOG] val calkv = calrdd.map( i => (i.date_cd, i) ) // ret: RDD[String, CAL] val joinrdd = logkv.join(calkv) /* joinrdd.first: : (String, (LOG, CAL)) = ( 20130129,( LOG(20130129231106,Put,200) ,CAL(20130129,1)) ) scheduler.DAGScheduler: Job 20 finished... took 97.416966 s */ .
  • 8. Solution 2: Spark with JDBC Client Pros: ● 系統設定簡單, 僅開通driver nodes 對 RDBMS的接口 ● 支援原生SELECT語法 ● SQL查詢結果可自訂為RDD or Broadcast variable. Cons: ● 過程略複雜, 須將ResultSet在 driver node轉為RDD(或BV)後進 行操作 ● RDBMS資料量受限於driver node MEM限制
  • 10. 利用Jdbc Driver讀取Mysql val conn:java.sql.Connection = SqlHelper.getConn(driverName, uri, username, password) val stmt:java.sql.Statement = SqlHelper.getStmt(conn) val sql="select date_cd,wd_flag from calendar where date_cd >= 20130101 and date_cd <= 20141031" val ret:java.sql.ResultSet = SqlHelper.getResult(stmt, sql) val rows = new scala.collection.mutable.ListBuffer[Tuple2[String,Int]]() while (ret.next) { rows += calToTup(ret) } // rows.size:Int = 62 // rows(0):(String, Int) = (20141001,0) // return immediately val calrdd = sc.parallelize(rows, 12)
  • 12. 進行Map-side Join ... val sqlkv = new scala.collection.mutable.HashMap[String,Int]() while (ret.next) { sqlkv += calToTup(ret) } val sqlbc = sc.broadcast(sqlkv) val logkv = logrdd.map( ... ) /* 利用HadooopRDD讀取HDFS並轉換為RDD[String,LOG] */ val result = logkv.mapPartitions( { iter => var sqlkv = sqlbc.value for{ (key, value) <- iter if(sqlkv.contains(key)) } yield (key, (value, sqlkv.getOrElse(key, () => ""))) }) /* result.first: (String, (LOG, Any)) = ( 20130129,( LOG(20130129231106,Put,200) ,1 ) ) scheduler.DAGScheduler: Job 0 finished... took 2.051693 s */
  • 13. Summary ● JdbcRDD方式 o Pros: 程式撰寫較簡易 o Cons: 系統設定較複雜 ● JDBC Clien方式 o Pros: 系統設定較簡易 o Cons: 程式撰寫較複雜(但可進行性能優化)