Production ETL use cases giving insights into the practical use of Hadoop, presented by Bryan at a Hadoop User Group (HUG) Ireland event hosted by Synchronoss in Dublin on January 11th, 2016.
2. SYNCHRONOSS PROPRIETARY
Company Snapshot
(Q3'2014 revenue)

Market Leader
• Synchronoss provides Personal Cloud and Activation Platforms to Tier-One operators, MSOs and enterprises around the globe

Business Model Highlights
• Monthly subscription fee per active Personal Cloud subscriber (SaaS)
• Revenue model includes a transaction fee for every activation

Tier-One, Blue-Chip Customers

Proven Scale
• 130+ million Cloud subscribers connected to our Personal Cloud around the globe
• Activating millions of devices each week

Strong Financial Position
• Strong, consistent growth in revenue and profitability since the 2006 IPO
• Healthy balance sheet and cash flow
3.
Cloud
Synchronoss is driving the acceleration of the Personal Cloud
market with strong growth across its platform and technology.
2011 vs. Today:
• Customers: 75+ leading global mobile carriers today
• Data Classes Supported / Personal Cloud Usage: 20M contacts (2011) → 30 billion entities today (photos, videos, call logs, contacts, music, documents, messages)
• Ingest Rate: 1 terabyte per month (2011) → 215+ terabytes per day today
• Subscriber Growth: a few thousand new subs per month (2011) → 400K–500K new subs per week today
• 130M+ Cloud subscribers; 3.5 billion addressable market
6.
Writing to HDFS
• hdfs dfs -put <file> <path_on_hdfs>
• hdfs dfs -text <filename.(txt|gz|snappy)>
• HDFS is good for large files, but poor at dealing with many small files (pack them into sequence files)
• Log files – HDFS porter, retries, parallelise, handle corrupted files; file size should match the block size (128 MB block size, ~2.5M rows per file)
Other options:
• NFS mount
• MapR proprietary file system
• Flume
• Camus/Gobblin
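The "file size should match block size" advice above can be sketched as a batcher that rolls over to a new file once records reach the 128 MB block size; with rows of roughly 50 bytes that works out near the ~2.5M rows per file quoted on the slide. This is an illustrative sketch, not code from the talk; all names are invented.

```python
# Sketch: group log records into batches whose serialized size approximates
# the HDFS block size, so each uploaded file maps to roughly one block.
# Illustrative only; the talk did not show this code.

def batch_records(records, target_bytes=128 * 1024 * 1024):
    """Group records (strings) into batches of at most ~target_bytes each."""
    batches, current, size = [], [], 0
    for rec in records:
        rec_size = len(rec.encode("utf-8")) + 1  # +1 for the newline
        if current and size + rec_size > target_bytes:
            batches.append(current)   # roll over to a new file
            current, size = [], 0
        current.append(rec)
        size += rec_size
    if current:
        batches.append(current)
    return batches

# Tiny demo with a 20-byte "block" and 10-byte records:
batches = batch_records(["a" * 9] * 5, target_bytes=20)
print([len(b) for b in batches])  # [2, 2, 1]
```

Each batch would then be written locally and pushed with `hdfs dfs -put`, giving files that fill one block each rather than a spray of small files.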
12.
Set up Hive to Mongo
-- create a Hive table that points to a MongoDB collection view
CREATE EXTERNAL TABLE 10_mongo_handset_state (
  id STRING,
  segment STRUCT<lcid:STRING, action:STRING, type:STRING>,
  ts STRING,
  cd STRING)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES ('mongo.columns.mapping'='{"id":"_id","segment":"sg"}')
TBLPROPERTIES (
  'mongo.uri'='mongodb://ec2-52-55.eu-west-1.compute.amazonaws.com:27017/db.fab09d7f52d3fe1278?readPreference=secondary',
  'mongo.input.query'='{"cd" : { "$gte" : {"$date":1447927200000}, "$lt" : {"$date":1447930800000} }}',
  'mongo.input.split.create_input_splits'='false');
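The `$date` bounds in `mongo.input.query` are epoch milliseconds; decoding them (a quick sanity check, not part of the talk) shows the query selects exactly one hour of data:

```python
from datetime import datetime, timezone

# The $gte/$lt bounds from mongo.input.query, in epoch milliseconds.
lo, hi = 1447927200000, 1447930800000

lo_dt = datetime.fromtimestamp(lo / 1000, tz=timezone.utc)
hi_dt = datetime.fromtimestamp(hi / 1000, tz=timezone.utc)

print(lo_dt)  # 2015-11-19 10:00:00+00:00
print(hi_dt)  # 2015-11-19 11:00:00+00:00
```

One hour of 2015-11-19, consistent with the `'20151119'` pdate partition loaded on the next slide.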
13.
Mongo load to Hive
Now load the MongoDB data into Hive/HDFS:
INSERT OVERWRITE TABLE 10_handset_state PARTITION (pdate, phour)
SELECT
  c,
  IF(segment.lcid IS NULL, '', segment.lcid),
  IF(segment.action IS NULL, '', UPPER(segment.action)),
  IF(segment.type IS NULL, '', LOWER(segment.type)),
  '20151119',
  lpad(CAST(hour(from_unixtime(unix_timestamp(cd, "EEE MMM dd HH:mm:ss z yyyy"))) AS STRING), 2, '0')
FROM 10_mongo_handset_state;
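The phour expression parses the `cd` string (a Java `EEE MMM dd HH:mm:ss z yyyy` timestamp) and zero-pads the hour to two digits. An equivalent sketch in Python, with an invented sample value:

```python
from datetime import datetime

# cd arrives as e.g. "Thu Nov 19 02:15:30 UTC 2015"
# (Java pattern "EEE MMM dd HH:mm:ss z yyyy"); the value is made up.
cd = "Thu Nov 19 02:15:30 UTC 2015"

parsed = datetime.strptime(cd, "%a %b %d %H:%M:%S %Z %Y")
phour = str(parsed.hour).zfill(2)  # the lpad(..., 2, '0') in the HiveQL
print(phour)  # "02"
```

Zero-padding matters because partition values are strings: without it, hour "2" would sort after "10" and the phour=02 partition in the job log below would be named phour=2.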
14.
Mongo load to Hive
INFO : number of splits:1
INFO : 2015-12-09 02:03:50,020 Stage-1 map = 0%, reduce = 0%
INFO : 2015-12-09 02:05:33,136 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 118.25 sec
INFO : MapReduce Total cumulative CPU time: 1 minutes 58 seconds 250 msec
INFO : Ended Job = job_1449567915620_2552
INFO : Loading partition {pdate=20151209, phour=01}
INFO : Time taken for adding to write entity : 0
INFO : Partition default.10001_pc_handset_event{pdate=20151209, phour=01}
stats: [numFiles=1, numRows=1797013, totalSize=313391267, rawDataSize=311594254]
15.
HBase Overview
• NoSQL distributed, scalable database modelled on Google's Bigtable
• Key/value store
• Data persisted to HDFS
• Resilient, highly available (HA)
• Sparse
• Automatic sharding
18.
Hive-HBase
-- Hive external table over an HBase-managed table
CREATE EXTERNAL TABLE IF NOT EXISTS hbase_user_profile_uploads
  (key string, size BIGINT, number int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,ul:size,ul:num")
TBLPROPERTIES ("hbase.table.name" = "user_profile_uploads");

-- sample key: '0ab94b27311b468186f5d!20130604!HANDSET!APPLE/IPHONE!image/jpeg'
INSERT OVERWRITE TABLE hbase_user_profile_uploads
SELECT concat(userid, '!', pdate, '!', platform, '!', device, '!', fileType),
       fileSize, number
FROM 10_user_profile_uploads
WHERE pdate=20151118;
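The INSERT builds a composite HBase row key by concatenating dimensions with `!` separators, so related rows sort together and prefix scans work. A sketch of the same key construction (the helper name is invented; the sample values come from the slide):

```python
# Build the composite row key used above: userid!pdate!platform!device!fileType.
# The function name is illustrative, not from the talk.
def make_row_key(userid, pdate, platform, device, file_type, sep="!"):
    return sep.join([userid, pdate, platform, device, file_type])

key = make_row_key("0ab94b27311b468186f5d", "20130604",
                   "HANDSET", "APPLE/IPHONE", "image/jpeg")
print(key)  # 0ab94b27311b468186f5d!20130604!HANDSET!APPLE/IPHONE!image/jpeg
```

Putting the highest-cardinality field (userid) first spreads writes across regions while keeping all of one user's uploads contiguous for the scans on the next slide.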
19.
HBase queries
hbase shell> get 'user_profile_uploads', '0ab94b27311b468186f5d!20140513!HANDSET!SAMSUNG/SCH-I545!image/jpeg'
// exact key lookup – very quick – returns 1 row

PrefixFilter is very fast when the prefix sorts near the start of the table:
scan 'user_profile_uploads', {FILTER => "PrefixFilter('0ab94b27311b468186f5d')"}
// 3 row(s) in 0.0630 seconds

However, if the key sorts near the end of the table, the scan takes a long time:
scan 'user_profile_uploads', {FILTER => "PrefixFilter('zb94b27311b468186f5d')"}
// 12 row(s) in 16 seconds

####### The optimum solution is to use STARTROW along with the filter #######
scan 'user_profile_uploads', {STARTROW => 'zb94b27311b468186f5d', FILTER => "PrefixFilter('zb94b27311b468186f5d')"}
// 12 row(s) in 0.1560 seconds
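Why STARTROW helps: HBase stores rows sorted by key, so a PrefixFilter alone still reads every row from the start of the table and merely discards non-matches, while STARTROW seeks straight to where the prefix begins. The effect can be modelled with a sorted list and `bisect` (a toy model of the behaviour, not HBase code):

```python
import bisect

# Toy model: HBase keeps rows sorted by key.
rows = sorted(["apple!1", "apple!2", "pear!1", "zebra!1", "zebra!2"])

def scan_with_prefix(rows, prefix):
    """PrefixFilter only: examines every row from the table start."""
    examined, matches = 0, []
    for key in rows:
        examined += 1
        if key.startswith(prefix):
            matches.append(key)
    return matches, examined

def scan_with_startrow(rows, prefix):
    """STARTROW + PrefixFilter: seek to the prefix, stop when it no longer matches."""
    start = bisect.bisect_left(rows, prefix)
    examined, matches = 0, []
    for key in rows[start:]:
        examined += 1
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches, examined

print(scan_with_prefix(rows, "zebra"))    # (['zebra!1', 'zebra!2'], 5)
print(scan_with_startrow(rows, "zebra"))  # (['zebra!1', 'zebra!2'], 2)
```

Same result, far fewer rows examined, which is exactly the 16 s vs. 0.156 s difference shown above.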
20.
Hadoop
• Linear scalability
• Predictable reporting
• Reproducible and reliable reports
• Democratized data
• Applications were black boxes – no longer so. Out of
the darkness…
• Enables data-driven decision making
• Jump In!