TRHUG 2015
Veloxity Migration Use Case
v1.2
About Me
● Hakan Ilter
○ GittiGidiyor / eBay
■ Software Platform & Research Manager
■ Java, Spring, Microservices
○ devveri.com
■ Big Data Consultant and Blogger
○ Search, Big Data, NoSQL
About Veloxity
● Veloxity
○ Wireless Telecom Company
■ Based in Sunnyvale, California
○ Founded in 2013
■ by two Turkish entrepreneurs
○ CXM solutions
■ Mobile consumer experience management
○ Powerful SDK
○ “Actionable” Analytics
About Data
● Rapidly Growing
○ Now
■ 75K devices
■ 30 GB / day
○ Short-term
■ 750K devices
■ 300 GB / day
○ Mid-term
■ 7M devices
■ 3 TB / day
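As a quick check of the growth figures, the sketch below works out the daily volume per device at each stage. It assumes decimal units (1 GB = 1,000 MB), which the slides do not state. Every stage comes out at roughly 0.4 MB per device per day, so the projections scale volume about linearly with device count.

```python
# Per-device daily volume at each growth stage, using the slide's figures.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1000 GB).
stages = {
    "now": (75_000, 30),              # (devices, GB per day)
    "short-term": (750_000, 300),
    "mid-term": (7_000_000, 3_000),   # 3 TB / day
}


def mb_per_device(devices, gb_per_day):
    """Average MB uploaded per device per day."""
    return gb_per_day * 1000 / devices


per_device = {name: mb_per_device(d, gb) for name, (d, gb) in stages.items()}
for name, mb in per_device.items():
    print(f"{name}: {mb:.2f} MB per device per day")
```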
Before Migration
● Legacy System
○ RDBMS-Centric Architecture
■ .NET Codebase
■ MSSQL Server
○ Stored Procedures
■ Hundreds of SPs
■ Thousands of lines of code
○ Works fine (for a while)
Before Migration
● Legacy System Problems
○ RDBMS-Centric Architecture
■ .NET doesn’t fit
■ Can’t scale MSSQL Server
○ Stored Procedures
■ Hard to develop/maintain
■ Stored Procedure Hell!
○ Looking for another solution
The answer is Hadoop
● Hadoop
○ MapReduce
■ Can process large amounts of data
○ Hive
■ SQL over unstructured data
○ Impala
■ Massively parallel processing (MPP) SQL engine
○ Cloudera CDH 5.x
■ Enterprise-ready Big Data Platform
Veloxity Big Data v1
● MapReduce + Hive + Impala
○ MapReduce
■ Processes JSON input
■ Creates major tables
■ Parquet columnar format as output
○ Hive
■ Query over raw data
○ Impala
■ Builds aggregation tables
■ Analytics based on these tables
Veloxity Big Data v2
● Spark + Impala
○ Spark
■ Replaces MapReduce
■ Better Developer Productivity
■ Better Performance
■ Rich APIs for Java, Scala, Python
■ In-memory storage
○ Impala
■ Fastest MPP SQL Engine
■ Better than Hive or Spark SQL
Big Data Architecture
● Components: Devices, Tomcat Web App, Hadoop Cluster, Hive Metastore, MSSQL Server, Analytics App, Reporting User
● Processing steps: Build Model with Spark, Build Aggregations with Impala
● Connection labels: GZipped JSON data, Copy to HDFS, Import with Sqoop, Export with Pig, Impala Queries, SQL Queries, REST
Veloxity Big Data v2
● Other Tools
○ Java
■ Spring Framework, Tomcat App Server
○ Bash Script
■ For task executions, flows, etc.
■ Because of Oozie!
○ Sqoop
■ Great (only) for imports
○ Pig
■ Good for data cleaning and exports
Performance Comparison
● Data Process & Query Performance
○ Hardware
■ Amazon EC2
■ m3.2xlarge
■ 8 cores, 30 GB RAM, standard disk
■ 1 Name Node, 3 Data Nodes
○ Software
■ Cloudera CDH 5.3.2
■ Impala 2.1.2
■ Hive 0.13.1
■ Spark 1.2.0
Data Process Performance
● Input Data
○ 4 GB Gzip compressed
○ 12 GB uncompressed
○ 859 files
● Task
○ Process JSON files
○ Validate each record
○ Fix problems
○ Build a model
○ Save as Parquet Format
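The task steps (process gzipped JSON, validate each record, fix problems) can be sketched in plain Python as the kind of per-record function a Spark job would map over its input. Several parts are assumptions rather than details from the slides: the input is taken to hold one JSON object per line, the field names are borrowed from the benchmark query on the stats table, and the validation and fix rules (drop records without a deviceId, default missing or non-numeric metrics to 0, clamp negatives to 0) are hypothetical. Writing Parquet is left out.

```python
import gzip
import json

# Field names taken from the benchmark query; the rules below are hypothetical.
METRICS = ("rxSpeed", "txSpeed", "rxData", "txData")


def parse_and_fix(line):
    """Return a cleaned record dict, or None if the record is unusable."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        return None                      # validation: not valid JSON
    if not isinstance(rec, dict) or not rec.get("deviceId"):
        return None                      # validation: deviceId is required
    fixed = {"deviceId": str(rec["deviceId"])}
    for field in METRICS:
        try:
            value = float(rec.get(field, 0))
        except (TypeError, ValueError):
            value = 0.0                  # fix: non-numeric metric becomes 0
        fixed[field] = max(value, 0.0)   # fix: clamp negative values to 0
    return fixed


def process_gzipped(data):
    """Decompress gzipped, newline-delimited JSON and keep the valid records."""
    text = gzip.decompress(data).decode("utf-8")
    parsed = (parse_and_fix(line) for line in text.splitlines() if line.strip())
    return [rec for rec in parsed if rec is not None]


sample = gzip.compress(
    b'{"deviceId": "d1", "rxSpeed": "12.5", "txSpeed": -3, "rxData": 100, "txData": 40}\n'
    b"not json\n"
    b'{"rxSpeed": 1}\n'
)
records = process_gzipped(sample)
print(records)
```

Only the first sample line survives; the other two are dropped by validation.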
Data Process Performance
● Results (chart not captured in this text)
Query Performance
● Input Data
○ 542 MB Snappy compressed
○ 1.6 GB uncompressed
○ 11 M rows
○ 468 Parquet files
● Query
SELECT
  deviceId, COUNT(*), AVG(rxSpeed), MAX(rxSpeed),
  AVG(txSpeed), MAX(txSpeed), SUM(rxData), SUM(txData)
FROM stats
GROUP BY deviceId
ORDER BY deviceId
LIMIT 100
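To show what the benchmark query returns, the sketch below runs the same statement against an in-memory SQLite table. SQLite only stands in for Impala here, and both the column types and the three sample rows are invented for illustration. It says nothing about performance.

```python
import sqlite3

# In-memory SQLite as a stand-in for Impala; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats ("
    "deviceId TEXT, rxSpeed REAL, txSpeed REAL, rxData INTEGER, txData INTEGER)"
)
rows = [
    ("a", 10.0, 2.0, 100, 10),
    ("a", 20.0, 4.0, 200, 30),
    ("b", 5.0, 1.0, 50, 5),
]
conn.executemany("INSERT INTO stats VALUES (?, ?, ?, ?, ?)", rows)

# The query from the slide, unchanged apart from whitespace.
QUERY = """
SELECT
  deviceId, COUNT(*), AVG(rxSpeed), MAX(rxSpeed),
  AVG(txSpeed), MAX(txSpeed), SUM(rxData), SUM(txData)
FROM stats
GROUP BY deviceId
ORDER BY deviceId
LIMIT 100
"""
result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
conn.close()
```

Each output row is one device: record count, average and peak receive speed, average and peak transmit speed, and total bytes received and sent.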
Query Performance
● Results (chart not captured in this text)
Lessons Learned
● CDH updates are critical
○ Always test first!
○ Use VMs for testing
● Install Spark manually
○ The latest Spark version 1.5.0
○ CDH 5.4.x still comes with Spark 1.3
● The small files problem
○ Merge small files often
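The "merge small files" advice can be sketched as a simple compaction pass. This is an illustration on the local filesystem, not the talk's actual job: the function name, the part-NNNNN output naming, and the size threshold are all hypothetical. Byte-level concatenation like this only suits line-oriented text such as newline-delimited JSON; Parquet files would need to be rewritten through the processing engine instead.

```python
import os
import tempfile


def merge_small_files(src_dir, dst_dir, target_bytes):
    """Concatenate files from src_dir into dst_dir, starting a new output
    file whenever the next input would push the current one past target_bytes."""
    os.makedirs(dst_dir, exist_ok=True)
    outputs, out, written = [], None, 0
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        size = os.path.getsize(path)
        if out is None or (written and written + size > target_bytes):
            if out is not None:
                out.close()
            out_path = os.path.join(dst_dir, f"part-{len(outputs):05d}")
            out = open(out_path, "wb")
            outputs.append(out_path)
            written = 0
        with open(path, "rb") as src:
            out.write(src.read())
        written += size
    if out is not None:
        out.close()
    return outputs


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "small")
    os.makedirs(src)
    for i in range(10):
        with open(os.path.join(src, f"f{i:02d}.json"), "wb") as f:
            f.write(b'{"n": %d}\n' % i)      # 9 bytes per file
    merged = merge_small_files(src, os.path.join(tmp, "merged"), target_bytes=40)
    merged_count = len(merged)
    merged_sizes = [os.path.getsize(p) for p in merged]

print(merged_count, merged_sizes)
```

Ten 9-byte inputs with a 40-byte target collapse into three outputs. On a real cluster the target would be chosen relative to the HDFS block size.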
Lessons Learned
● More...
● Partitioning
○ Use partitions wisely
○ Too many partitions = slower queries
● Metadata management
○ Improvement is needed
○ Can’t remove a partition with a query
● Don’t use Google Gson for JSON
○ Extremely slow
○ Use Boon Project instead
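A rough calculation shows why partition choice matters. In Hive and Impala each partition is a separate directory with its own metastore entry, so the number of partitions is the product of the partition-key cardinalities. The two layouts compared below (by day, and by day plus device) are hypothetical; the slides do not say which keys Veloxity used. The device count is the "Now" figure from the About Data slide, and one year of retention is an assumption.

```python
DEVICES = 75_000   # "Now" figure from the About Data slide
DAYS = 365         # assumption: one year of retained data


def partition_count(*cardinalities):
    """Number of partitions is the product of the partition-key cardinalities."""
    total = 1
    for c in cardinalities:
        total *= c
    return total


by_day = partition_count(DAYS)
by_day_and_device = partition_count(DAYS, DEVICES)
print(f"by day: {by_day:,} partitions")
print(f"by day and device: {by_day_and_device:,} partitions")
```

Partitioning by a high-cardinality key such as the device ID multiplies the partition count by tens of thousands, which is the kind of layout the "too many partitions" warning points at.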
Veloxity Big Data v3
● Future Plans
● Vert.x
○ Lightweight, Non-blocking IO
● Apache Kafka
○ Enables streaming data
● Spark Streaming
○ Real-time data processing
● Spark DataFrames
○ No need for other tools (Sqoop, Pig, etc.)
Veloxity Big Data v3
● More...
● Cloudera Kudu
○ New Storage for Fast Analytics on Fast Data
■ https://github.com/cloudera/kudu
● Project Tungsten
○ Bringing Spark Closer to Bare Metal
■ http://bit.ly/1KPpFBC
● Impala Roadmap
○ Nested Types
○ Performance Improvements
Thanks!
