A Scalable Data Transformation Framework
using the Hadoop Ecosystem
Raj Nair
Director–Data Platform
Kiru Pakkirisamy
CTO
AGENDA
• About Penton and Serendio Inc
• Data Processing at Penton
• PoC Use Case
• Functional Aspects of the Use Case
• Big Data Architecture, Design and Implementation
• Lessons Learned
• Conclusion
• Questions
About Penton
• Professional information services company
• Provide actionable information to five core markets
– Agriculture
– Transportation
– Natural Products
– Infrastructure
– Industrial Design & Manufacturing
Success Stories
• EquipmentWatch.com: Prices, Specs, Costs, Rental
• Govalytics.com: Analytics around Gov’t capital spending down to county level
• SourceESB: Vertical Directory, electronic parts
• NextTrend.com: Identify new product trends in the natural products industry
About Serendio
Serendio provides Big Data Science
Solutions & Services for
Data-Driven Enterprises.
www.serendio.com
Data Processing at Penton
What got us thinking?
• Business units process data in silos
• Heavy ETL
– Hours to process, in some cases days
• Not even using all the data we want
• Not logging what we needed to
• Can’t scale for future requirements
[Figure: Data Processing Pipeline. Labels: data; New features, New Insights, New Products, Biz Value; Shoehorning, Assembly Line processing, Staging, Perftuning]
The Data Processing Pipeline
Penton examples
• Daily Inventory data, ingested throughout the day
(tens of thousands of parts)
• Auction and survey data gathered daily
• Aviation Fleet data, varying frequency
Pipeline stages: Ingest, store -> Clean, validate -> Apply Business Rules -> Map -> Analyze -> Report -> Distribute
• Slow Extract, Transform and Load = Frustration + missed business SLAs
• Won’t scale for future
• Various data formats, mostly unstructured
Current Design
• Survey data loaded as CSV files
• Data needs to be scrubbed/mapped
• All CSV rows loaded into one table
• Once scrubbed/mapped, data is loaded into main tables
• Not all rows are loaded, some may be used in the future
What were our options?
Adopt Hadoop Ecosystem
- M/R: Ideal for Batch Processing
- Flexible for storage
- NoSQL: scale, usability and flexibility
Expand RDBMS options
- Expensive
- Complex
[Diagram labels: Unstructured data, Business Rules, HBase, Oracle, SQL Server, Drools]
POC Use Case
Primary Use Case
• Daily model data – upload and map
– Ingest data, build buckets
– Map data (batch and interactive)
– Build Aggregates (dynamic)
Issue: Mapping time
Functional Aspects
Data Scrubbing
• Standardized names for fields/columns
• Example - Country
– Unites States of America -> USA
– United States -> USA
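The scrubbing step above can be sketched as a simple rule lookup. This is only an illustration: the deck uses Drools for rule-based scrubbing, and the plain dictionary, the `COUNTRY_RULES` name and the `scrub_country` helper below are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical rule table: raw spelling (lower-cased) -> standardized value.
# The two entries are the examples given on the slide.
COUNTRY_RULES = {
    "unites states of america": "USA",
    "united states": "USA",
}


def scrub_country(raw: str) -> str:
    """Return the standardized country name, or the trimmed input if no rule matches."""
    # Collapse repeated whitespace and ignore case before looking up a rule.
    key = " ".join(raw.split()).lower()
    return COUNTRY_RULES.get(key, raw.strip())


if __name__ == "__main__":
    for value in ["Unites States of America", "  united   states ", "Canada"]:
        print(repr(value), "->", scrub_country(value))
```

Values without a matching rule pass through unchanged, so new rules can be added without touching the function.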
Data Mapping
• Converting Fields -> IDs
– Manufacturer - Caterpillar -> 25
– Model - Caterpillar/Front Loader -> 300
• Requires the use of lookup tables and partial/fuzzy
matching strings
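A minimal sketch of lookup-table mapping with a fuzzy fallback, assuming an exact match is tried first. The deck does not say which matching algorithm was used; the standard-library `difflib.get_close_matches` call, the `map_to_id` helper and the 0.8 cutoff are assumptions chosen for illustration. The IDs are the ones shown on the slide.

```python
from difflib import get_close_matches
from typing import Dict, Optional

# Lookup tables with the example values from the slide.
MANUFACTURER_IDS = {"Caterpillar": 25}
MODEL_IDS = {"Caterpillar/Front Loader": 300}


def map_to_id(value: str, lookup: Dict[str, int], cutoff: float = 0.8) -> Optional[int]:
    """Map a field value to its ID: exact match first, then closest fuzzy match."""
    normalized = {name.lower(): ident for name, ident in lookup.items()}
    key = value.strip().lower()
    if key in normalized:
        return normalized[key]
    # Fall back to the single most similar name at or above the cutoff ratio.
    close = get_close_matches(key, list(normalized), n=1, cutoff=cutoff)
    return normalized[close[0]] if close else None


if __name__ == "__main__":
    print(map_to_id("Caterpillar", MANUFACTURER_IDS))
    print(map_to_id("Caterpiller", MANUFACTURER_IDS))  # misspelled input
    print(map_to_id("Komatsu", MANUFACTURER_IDS))      # not in the table
```

Returning `None` for unmatched values leaves room for an interactive mapping step, which the use case lists alongside batch mapping.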
Data Exporting
• Move scrubbed/mapped data to main RDBMS
Key Pain Points
• CSV data table continues to grow
• Large size of the table impacts operations on rows in a single
file
• CSV data could grow rapidly in the future
Criteria for New Design
• Ability to store an individual file and manipulate it easily
– No join/relationships across CSV files
• Solution should have good integration with RDBMS
• Could possibly host the complete application in future
• Technology stack should possibly have advanced analytics
capabilities
A NoSQL model would allow us to quickly retrieve/address an individual file and
manipulate it
Big Data Architecture
Solution Architecture
[Architecture diagram]
• Components
– Existing Business Applications
– RDB-> Data Upload UI
– REST API: CSV and Rule Management Endpoints
– Drools
– MR Jobs
– HBase
– Hadoop HDFS
– Current Oracle Schema: master database of Products/Parts
• Flow labels
– CSV Files
– API Calls
– Launch
– Survey REST
– Push Updates
– Insert Accepted Data
• Use HBase as a store for CSV files
• Data manipulation APIs exposed through REST layer
• Drools for rule-based data scrubbing
• Operations on individual files in UI through HBase Get/Put
• Operations on all/groups of files using MR jobs
HBase Schema Design
• One CSV row per HBase row
• One file per HBase row
– One cell per column qualifier (simple and started the development
with this approach)
– One row per column qualifier (more performant approach)
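One possible reading of the two file-per-row layouts above, sketched with plain dictionaries standing in for the cells of a single HBase row. The qualifier naming (`<row>:<column>` and `<row>`) and both function names are assumptions; the deck does not show the actual qualifiers.

```python
import csv
import io
from typing import Dict


def cell_per_qualifier(csv_text: str) -> Dict[str, str]:
    """Layout A: every CSV cell is stored under its own column qualifier."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {
        f"{index}:{name}": value
        for index, row in enumerate(data)
        for name, value in zip(header, row)
    }


def row_per_qualifier(csv_text: str) -> Dict[str, str]:
    """Layout B: every CSV row is stored whole under one column qualifier."""
    lines = csv_text.strip().splitlines()
    return {str(index): line for index, line in enumerate(lines[1:])}


if __name__ == "__main__":
    sample = "manufacturer,model\nCaterpillar,Front Loader\nCaterpillar,Dozer\n"
    print(len(cell_per_qualifier(sample)), "cells in layout A")
    print(len(row_per_qualifier(sample)), "cells in layout B")
```

Layout B produces far fewer cells per file, which is consistent with the slide calling it the more performant approach, at the cost of parsing the row text on every read.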
HBase Rowkey Design
• Row Key
– Composite
• Created Date (YYYYMMDD)
• User
• FileType
• GUID
• Salting for better region splitting
– One byte
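A minimal sketch of a salted composite row key built from the components listed above. Several details are assumptions not stated in the deck: the salt is placed as a one-byte prefix, it is derived from a CRC32 of the rest of the key, there are 16 salt buckets, and the fields are joined with `|`.

```python
import uuid
import zlib
from datetime import date

SALT_BUCKETS = 16  # hypothetical number of salt values (must fit in one byte)


def build_row_key(created: date, user: str, file_type: str, guid: uuid.UUID) -> bytes:
    """Return <salt byte><YYYYMMDD>|<user>|<file type>|<GUID> as bytes."""
    body = "|".join(
        [created.strftime("%Y%m%d"), user, file_type, str(guid)]
    ).encode("utf-8")
    # Deriving the salt from the key body keeps it reproducible, so a reader
    # that knows the components can rebuild the full key for a Get.
    salt = zlib.crc32(body) % SALT_BUCKETS
    return bytes([salt]) + body


if __name__ == "__main__":
    key = build_row_key(date(2013, 6, 1), "alice", "survey", uuid.uuid4())
    print(key)
```

Because the date leads the key body, files created on the same day would otherwise land in one region; the salt prefix spreads them across up to `SALT_BUCKETS` key ranges.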
HBase Column Family Design
• Column Family
– Data separated from Metadata into two or more
column families
– One cf for mapping data (more later)
– One cf for analytics data (used by analytics
coprocessors)
M/R Jobs
• Jobs
– Scrubbing
– Mapping
– Export
• Schedule
– Manually from UI
– On schedule using Oozie
Sqoop Jobs
• One time
– FileDetailExport (current CSV)
– RuleImport (all current rules)
• Periodic
– Lookup Table Data import
• Manufacture
• Model
• State
• Country
• Currency
• Condition
• Participant
Application Integration - REST
• Hide HBase API/Java APIs from the rest of the
application
• Language independence for PHP front-end
• REST APIs for
– CSV Management
– Drools Rule Management
Lessons Learned
Performance Benefits
• Mapping
– 20000 csv files, 20 million records
– Time taken – 1/3rd of RDBMS processing
• Metrics
– < 10 secs vs (Oracle Materialized View)
• Upload a file
– < 10 secs
• Delete a file
– < 10 secs
HBase Tuning
• Heap Size for
– RegionServer
– MapReduce Tasks
• Table Compression
– SNAPPY for Column Family holding csv data
• Table data caching
– IN_MEMORY for lookup tables
Application Design Challenges
• Pagination – implemented using intermediate REST layer and
scan.setStartRow.
• Translating SQL queries
– Used Scan/Filter and Java (especially on coprocessor)
– No secondary indexes - used FuzzyRowFilter
– Maybe something like Phoenix would have helped
• Some issues in mixed mode. Want to move to 0.96.0 for
better/individual column family flushing but needed to 'port'
coprocessors (to protobuf)
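The start-row pagination mentioned above can be sketched without a cluster. Here a sorted list of row keys stands in for an HBase table, and the hypothetical `scan_page` function plays the role of a scan that begins at a given start row (as `scan.setStartRow` does) and stops after one page. How the REST layer passed the next start row back to the UI is not described in the deck.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def scan_page(
    sorted_keys: List[bytes], start_row: bytes, page_size: int
) -> Tuple[List[bytes], Optional[bytes]]:
    """Return one page of keys from start_row (inclusive) and the next start row."""
    begin = bisect_left(sorted_keys, start_row)
    page = sorted_keys[begin:begin + page_size]
    # Appending a zero byte gives the smallest key strictly greater than the
    # last key returned, so the next page does not repeat it.
    next_start = page[-1] + b"\x00" if len(page) == page_size else None
    return page, next_start


if __name__ == "__main__":
    keys = [b"a", b"b", b"c", b"d", b"e"]
    start: Optional[bytes] = b""
    while start is not None:
        page, start = scan_page(keys, start, page_size=2)
        print(page)
```

This style only supports moving forward page by page, which fits the roadmap's remark that a search-oriented UI may suit the data store better.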
HBase Value Proposition
• Better response in UI for CSV file operations - Operations
within a file (map, import, reject etc) not dependent on the
db size
• Relieve load on RDBMS - no more CSV data tables
• Scale out batch processing performance on the cheap (vs
vertical RDBMS upgrade)
• Redundant store for CSV files
• Versioning to track data cleansing
Roadmap
• Benchmark with 0.96
• Retire Coprocessors in favor of Phoenix (?)
• Lookup Data tables are small. Need to find a better
alternative than HBase
• Design UI for a more Big Data appropriate model
– Search-oriented paradigm rather than exploratory/paginated
– Add REST endpoints to support such UI
Wrap-Up
Conclusion
• PoC demonstrated
– value of the Hadoop ecosystem
– Co-existence of Big data technologies with current solutions
– Adoption can significantly improve scale
– New skill requirements
Thank You
Rajesh.Nair@Penton.com
Kiru@Serendio.com
