TRHUG 2015 - Veloxity Big Data Migration Use Case

TRHUG 2015
Veloxity Migration Use Case
v1.2

About Me
● Hakan Ilter
○ GittiGidiyor / eBay
■ Software Platform & Research Manager
■ Java, Spring, Microservices
○ devveri.com
■ Big Data Consultant and Blogger
○ Search, Big Data, NoSQL

About Veloxity
● Veloxity
○ Wireless Telecom Company
■ Based in Sunnyvale, California
○ Founded in 2013
■ by two Turkish entrepreneurs
○ CXM solutions
■ Mobile consumer experience management
○ Powerful SDK
○ “Actionable” Analytics

About Data
● Rapidly Growing
○ Now
■ 75K devices
■ 30 GB / day
○ Short-term
■ 750K devices
■ 300 GB / day
○ Mid-term
■ 7M devices
■ 3 TB / day
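The growth figures on this slide imply a roughly constant volume per device. The sketch below checks that arithmetic. It assumes decimal units (1 GB = 10^9 bytes), and the names `STAGES` and `bytes_per_device` are illustrative, not from the talk.

```python
# Per-device daily volume implied by the slide's figures.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes).
STAGES = {
    "now": (75_000, 30e9),
    "short-term": (750_000, 300e9),
    "mid-term": (7_000_000, 3e12),
}


def bytes_per_device(devices: int, bytes_per_day: float) -> float:
    """Average bytes sent per device per day."""
    return bytes_per_day / devices


for name, (devices, volume) in STAGES.items():
    kb = bytes_per_device(devices, volume) / 1e3
    print(f"{name}: {kb:.0f} KB per device per day")
```

Under these assumptions each device contributes about 400 KB per day, so total volume scales almost linearly with the device count.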

Before Migration
● Legacy System
○ RDBMS-Centric Architecture
■ .NET Codebase
■ MSSQL Server
○ Stored Procedures
■ Hundreds of SPs
■ Thousands of lines of code
○ Works fine (for a while)

Before Migration
● Legacy System Problems
○ RDBMS-Centric Architecture
■ .NET doesn’t fit
■ Can’t scale MSSQL Server
○ Stored Procedures
■ Hard to develop/maintain
■ Stored Procedure Hell!
○ Looking for another solution

The answer is Hadoop
● Hadoop
○ MapReduce
■ Can process large amounts of data
○ Hive
■ SQL over unstructured data
○ Impala
■ Massively parallel processing (MPP) SQL engine
○ Cloudera CDH 5.x
■ Enterprise-ready Big Data Platform

Veloxity Big Data v1
● MapReduce + Hive + Impala
○ MapReduce
■ Processes JSON input
■ Creates major tables
■ Parquet columnar format as output
○ Hive
■ Query over raw data
○ Impala
■ Builds aggregation tables
■ Analytics based on these tables

● Spark + Impala
○ Spark
■ Replaces MapReduce
■ Better Developer Productivity
■ Better Performance
■ Rich APIs for Java, Scala, Python
■ In-memory storage
○ Impala
■ Fastest MPP SQL Engine
■ Better than Hive or Spark SQL

Big Data Architecture
● Diagram labels (in extraction order)
○ Devices
○ GZipped JSON data
○ Tomcat Web App
○ Copy to HDFS
○ Hadoop Cluster
○ Build Model with Spark
○ Hive Metastore
○ Build Aggregations with Impala
○ MSSQL Server
○ Analytics App
○ Reporting User
○ REST
○ Impala Queries
○ SQL Queries
○ Import with Sqoop
○ Export with Pig

● Other Tools
○ Java
■ Spring Framework, Tomcat App Server
○ Bash Script
■ For task executions, flows, etc.
■ Because of Oozie!
○ Sqoop
■ Great (only) for imports
○ Pig
■ Good for data cleaning and exports

Performance Comparison
● Data Process & Query Performance
○ Hardware
■ Amazon EC2
■ m3.2xlarge
■ 8 cores, 30 GB RAM, standard disk
■ 1 NameNode, 3 DataNodes
○ Software
■ Cloudera CDH 5.3.2
■ Impala 2.1.2
■ Hive 0.13.1
■ Spark 1.2.0

Data Process Performance
● Input Data
○ 4 GB Gzip compressed
○ 12 GB uncompressed
○ 859 files
● Task
○ Process JSON files
○ Validate each record
○ Fix problems
○ Build a model
○ Save in Parquet format
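The task steps on this slide can be sketched with the Python standard library alone. This is only an illustration, not the Spark job from the talk. It assumes line-delimited JSON, borrows the field names `deviceId`, `rxSpeed`, and `txSpeed` from the query slide, and uses made-up validation and repair rules. The Parquet step is left out because it needs a third-party library.

```python
import gzip
import io
import json

SPEED_FIELDS = ("rxSpeed", "txSpeed")  # field names taken from the query slide


def fix(record: dict) -> dict:
    """Hypothetical repair step: coerce numeric strings to floats."""
    fixed = dict(record)
    for key in SPEED_FIELDS:
        if isinstance(fixed.get(key), str):
            fixed[key] = float(fixed[key])
    return fixed


def is_valid(record: dict) -> bool:
    """Hypothetical validation: string device id, non-negative numeric speeds."""
    if not isinstance(record.get("deviceId"), str):
        return False
    return all(
        isinstance(record.get(key), (int, float)) and record[key] >= 0
        for key in SPEED_FIELDS
    )


def process(stream):
    """Yield cleaned records from a gzipped, line-delimited JSON stream."""
    with gzip.open(stream, "rt", encoding="utf-8") as lines:
        for line in lines:
            try:
                record = fix(json.loads(line))
            except (ValueError, TypeError):
                continue  # skip malformed or unrepairable lines
            if is_valid(record):
                yield record


# Toy input: one repairable record, one invalid record, one malformed line.
raw = "\n".join([
    '{"deviceId": "a", "rxSpeed": "1.5", "txSpeed": 2}',
    '{"deviceId": "b", "rxSpeed": -1, "txSpeed": 2}',
    "not json",
]).encode("utf-8")
records = list(process(io.BytesIO(gzip.compress(raw))))
print(records)
```

Only the first record survives: its string speed is repaired, while the negative speed and the malformed line are dropped.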

Data Process Performance
● Results

Query Performance
● Input Data
○ 542 MB Snappy compressed
○ 1.6 GB uncompressed
○ 11 M rows
○ 468 Parquet files
● Query
SELECT
deviceId, COUNT(*), AVG(rxSpeed), MAX(rxSpeed),
AVG(txSpeed), MAX(txSpeed), SUM(rxData), SUM(txData)
FROM stats
GROUP BY deviceId
ORDER BY deviceId
LIMIT 100
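To see what the benchmark query returns, the same SQL can be run against a toy table. The sketch below uses an in-memory SQLite database as a stand-in for Impala. The column types and sample rows are assumptions made for illustration only.

```python
import sqlite3

QUERY = """
SELECT deviceId, COUNT(*), AVG(rxSpeed), MAX(rxSpeed),
       AVG(txSpeed), MAX(txSpeed), SUM(rxData), SUM(txData)
FROM stats
GROUP BY deviceId
ORDER BY deviceId
LIMIT 100
"""

conn = sqlite3.connect(":memory:")
# Assumed schema: the slides only show the column names used in the query.
conn.execute(
    "CREATE TABLE stats ("
    "deviceId TEXT, rxSpeed REAL, txSpeed REAL, rxData INTEGER, txData INTEGER)"
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
    [
        ("dev-1", 10.0, 2.0, 100, 20),
        ("dev-1", 30.0, 4.0, 300, 40),
        ("dev-2", 5.0, 1.0, 50, 10),
    ],
)

# One output row per device: count, average and peak speeds, total traffic.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The query produces one summary row per device, which is why the `GROUP BY` over 11 M rows is a reasonable stress test for an MPP engine.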

Lessons Learned
● CDH updates are critical
○ Always test first!
○ Use VMs for testing
● Install Spark manually
○ The latest Spark version is 1.5.0
○ CDH 5.4.x still comes with Spark 1.3
● The small files problem
○ Merge small files often
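The "merge small files often" advice can be illustrated with a planning sketch. It groups consecutive files so that each merged output stays at or under a target size. The 128 MB target matches the default HDFS block size in Hadoop 2.x. The function `plan_merges` is hypothetical, and the file sizes assume the 468 Parquet files (542 MB in total) from the query test are equal in size.

```python
def plan_merges(sizes_mb, target_mb=128.0):
    """Greedily group consecutive files so each group stays at or under target_mb.

    Returns a list of groups, each a list of indexes into sizes_mb.
    A single file larger than the target gets a group of its own.
    """
    groups, current, total = [], [], 0.0
    for index, size in enumerate(sizes_mb):
        if current and total + size > target_mb:
            groups.append(current)
            current, total = [], 0.0
        current.append(index)
        total += size
    if current:
        groups.append(current)
    return groups


# Assumption: 468 equal-sized files totalling 542 MB (about 1.16 MB each).
sizes = [542 / 468] * 468
groups = plan_merges(sizes)
print(len(groups), "output files instead of", len(sizes))
```

With these assumed sizes the plan collapses 468 files into 5. The sketch only decides which files belong together and does not perform the merge itself.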

Lessons Learned
● More...
● Partitioning
○ Use partitions wisely
○ Too many partitions = slower queries
● Metadata management
○ Improvement is needed
○ Can’t remove a partition with a query
● Don’t use Google Gson for JSON
○ Extremely slow
○ Use the Boon project instead
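The partitioning warning is easy to see as arithmetic: the partition count is the product of the distinct values of every partition key. The sketch below compares daily partitions with a hypothetical day-plus-device scheme. Neither scheme is confirmed as Veloxity's. The path layout follows the Hive-style `key=value` directory convention.

```python
from datetime import date


def partition_path(day: date) -> str:
    """Hive-style directory for a daily partition (assumed layout)."""
    return f"year={day.year}/month={day.month:02d}/day={day.day:02d}"


def partition_count(days: int, devices: int = 1) -> int:
    """Number of partitions when partitioning by day and, optionally, device."""
    return days * devices


print(partition_path(date(2015, 9, 14)))      # year=2015/month=09/day=14
print(partition_count(365))                   # 365
print(partition_count(365, devices=75_000))   # 27375000
```

A year of daily partitions is a few hundred entries for the metastore to track, while adding the device id as a key pushes the count into the tens of millions.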

● Future Plans
● Vert.x
○ Lightweight, Non-blocking IO
● Apache Kafka
○ Enables streaming data
● Spark Streaming
○ Real-time data processing
● Spark DataFrames
○ No need for other tools (Sqoop, Pig, etc.)

● More...
● Cloudera Kudu
○ New Storage for Fast Analytics on Fast Data
■ https://github.com/cloudera/kudu
● Project Tungsten
○ Bringing Spark Closer to Bare Metal
■ http://bit.ly/1KPpFBC
● Impala Roadmap
○ Nested Types
○ Performance Improvements
