A Scalable Data Transformation Framework
@ Penton using the Hadoop Ecosystem
Raj Nair, Director – Data Platform
Kiru Pakkirisamy, CTO
AGENDA
• About Penton and Serendio Inc
• Data Processing at Penton
• PoC Use Case
• Functional Aspects of the Use Case
• Big Data Architecture, Design and Implementation
• Lessons Learned
• Conclusion
• Questions
About Penton
• Professional information services company
• Provide actionable information to five core markets
– Agriculture
– Transportation
– Natural Products
– Infrastructure
– Industrial Design & Manufacturing
Success Stories
• EquipmentWatch.com – Prices, Specs, Costs, Rental
• Govalytics.com – Analytics around Gov’t capital spending down to the county level
• SourceESB – Vertical directory, electronic parts
• NextTrend.com – Identify new product trends in the natural products industry
About Serendio
Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises.
www.serendio.com
Data Processing at Penton
What got us thinking?
• Business units process data in silos
• Heavy ETL
– Hours to process, in some cases days
• Not even using all the data we want
• Not logging what we needed to
• Can’t scale for future requirements
The Data Processing Pipeline
(Diagram) Assembly-line processing of data that drives business value: new features, new insights, new products
Penton examples
• Daily Inventory data, ingested throughout the day
(tens of thousands of parts)
• Auction and survey data gathered daily
• Aviation Fleet data, varying frequency
• Pipeline stages: ingest/store → clean/validate → apply business rules → map → analyze → report → distribute
• Slow Extract, Transform and Load = frustration + missed business SLAs
• Won’t scale for the future
• Various data formats, mostly unstructured
What were our options?
Adopt Hadoop Ecosystem
- M/R: Ideal for Batch Processing
- Flexible for storage
- NoSQL: scale, usability and flexibility
Expand RDBMS options
- Expensive
- Complex
(Technology options shown: HBase, Oracle, SQL Server, Drools)
PoC Use Case
Primary Use Case
• Daily model data – upload and map
– Ingest data, build buckets
– Map data (batch and interactive)
– Build Aggregates (dynamic)
Issue: Mapping time
Functional Aspects
Data Scrubbing
• Standardized names for fields/columns
• Example - Country
– United States of America -> USA
– United States -> USA
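In the PoC this scrubbing is rule-driven (Drools, as described later); the plain-Java sketch below is only an illustration of the kind of normalization such a rule performs, and the class name, map contents and normalizeCountry helper are hypothetical.

    import java.util.HashMap;
    import java.util.Map;

    public class CountryScrubber {
        // Hypothetical synonym table; in the PoC the real mappings are expressed as Drools rules.
        private static final Map<String, String> COUNTRY_SYNONYMS = new HashMap<String, String>();
        static {
            COUNTRY_SYNONYMS.put("united states of america", "USA");
            COUNTRY_SYNONYMS.put("united states", "USA");
            COUNTRY_SYNONYMS.put("usa", "USA");
        }

        /** Returns the standardized country name, or the trimmed input when no rule matches. */
        public static String normalizeCountry(String raw) {
            if (raw == null) return null;
            String standard = COUNTRY_SYNONYMS.get(raw.trim().toLowerCase());
            return standard != null ? standard : raw.trim();
        }
    }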
Data Mapping
• Converting fields -> IDs
– Manufacturer - Caterpillar -> 25
– Model - Caterpillar/Front Loader -> 300
• Requires the use of lookup tables and partial/fuzzy string matching
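A minimal sketch of that lookup-plus-fuzzy-match step, assuming an in-memory lookup table and a plain edit-distance fallback; the class name, table contents and distance threshold are my own, not from the deck.

    import java.util.HashMap;
    import java.util.Map;

    public class ManufacturerMapper {
        // Hypothetical lookup table; in the PoC the lookups are imported from Oracle via Sqoop.
        private static final Map<String, Integer> MANUFACTURER_IDS = new HashMap<String, Integer>();
        static {
            MANUFACTURER_IDS.put("caterpillar", 25);
        }

        /** Exact lookup first, then fall back to the closest fuzzy match within a small edit distance. */
        public static Integer mapManufacturer(String name) {
            if (name == null) return null;
            String key = name.trim().toLowerCase();
            Integer id = MANUFACTURER_IDS.get(key);
            if (id != null) return id;

            int bestDistance = Integer.MAX_VALUE;
            Integer bestId = null;
            for (Map.Entry<String, Integer> e : MANUFACTURER_IDS.entrySet()) {
                int d = levenshtein(key, e.getKey());
                if (d < bestDistance) { bestDistance = d; bestId = e.getValue(); }
            }
            return bestDistance <= 2 ? bestId : null;   // tolerate small typos only
        }

        // Standard dynamic-programming edit distance.
        private static int levenshtein(String a, String b) {
            int[][] dp = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
            for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                            dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            return dp[a.length()][b.length()];
        }
    }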
Data Exporting
• Move scrubbed/mapped data to main RDBMS
Current Design
• Survey data loaded as CSV files
• Data needs to be scrubbed/mapped
• All CSV rows loaded into one table
• Once scrubbed/mapped, the data is loaded into the main tables
• Not all rows are loaded; some may be used in the future
Key Pain Points
• CSV data table continues to grow
• Large size of the table impacts operations on rows in a single
file
• CSV data could grow rapidly in the future
Criteria for New Design
• Ability to store an individual file and manipulate it easily
– No join/relationships across CSV files
• Solution should have good integration with RDBMS
• Could possibly host the complete application in future
• Technology stack should possibly have advanced analytics
capabilities
A NoSQL model would allow us to quickly retrieve/address an individual file and manipulate it
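For illustration, "addressing an individual file" in HBase amounts to a single Get by row key; the sketch below assumes the 0.94-era client API and hypothetical table/column-family names (the actual row key layout appears later in the deck).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FileFetcher {
        /** Fetches one uploaded CSV file by its row key, without touching any other file. */
        public static Result fetchFile(byte[] rowKey) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "csv_files");   // hypothetical table name
            try {
                Get get = new Get(rowKey);
                get.addFamily(Bytes.toBytes("d"));          // hypothetical data column family
                return table.get(get);
            } finally {
                table.close();
            }
        }
    }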
Big Data Architecture
Solution Architecture
(Diagram) Key components and callouts:
• Existing business applications and the Data Upload UI (survey uploads) make API calls to a REST API exposing CSV and rule management endpoints
• HBase on Hadoop HDFS is used as the store for CSV files
• Data manipulation APIs are exposed through the REST layer
• Drools, fronted by REST, performs rule-based data scrubbing
• Operations on individual files in the UI go through HBase Get/Put
• Operations on all files, or groups of files, run as launched MR jobs
• Accepted data is inserted into the current Oracle schema (the master database of products/parts), which also receives pushed updates
HBase Schema Design
• One CSV row per HBase row
• One CSV file per HBase row
– One CSV cell per column qualifier (simple; development started with this approach)
– One CSV row per column qualifier (more performant approach)
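A hedged sketch of the "one CSV file per HBase row, one CSV row per column qualifier" layout; the table name, family name and qualifier scheme are assumptions for illustration, not taken from the deck.

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CsvFileWriter {
        /** Writes one uploaded CSV file into a single HBase row: one column qualifier per CSV line. */
        public static void storeFile(byte[] rowKey, List<String> csvLines) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "csv_files");               // hypothetical table name
            try {
                Put put = new Put(rowKey);
                for (int i = 0; i < csvLines.size(); i++) {
                    // Zero-padded qualifier keeps the lines ordered lexicographically within the row.
                    put.add(Bytes.toBytes("d"),
                            Bytes.toBytes(String.format("line%06d", i)),
                            Bytes.toBytes(csvLines.get(i)));
                }
                table.put(put);
            } finally {
                table.close();
            }
        }
    }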
HBase Row Key Design
• Row Key
– Composite
• Created Date (YYYYMMDD)
• User
• FileType
• GUID
• Salting for better region splitting
– One byte
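A sketch of how such a key might be assembled with the HBase Bytes utility; only the field order and the one-byte salt come from the slide, while the salt derivation itself is an assumption.

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeys {
        /** Builds salt + createdDate(YYYYMMDD) + user + fileType + GUID. */
        public static byte[] build(String createdDate, String user, String fileType, String guid) {
            // Assumed salt scheme: spread rows across regions by hashing the GUID into one byte.
            byte[] salt = new byte[] { (byte) ((guid.hashCode() & 0x7fffffff) % 16) };
            byte[] tail = Bytes.add(Bytes.toBytes(user), Bytes.toBytes(fileType), Bytes.toBytes(guid));
            return Bytes.add(salt, Bytes.toBytes(createdDate), tail);
        }
    }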
HBase Column Family Design
• Column Family
– Data separated from Metadata into two or more
column families
– One cf for mapping data (more later)
– One cf for analytics data (used by analytics
coprocessors)
M/R Jobs
• Jobs
– Scrubbing
– Mapping
– Export
• Schedule
– Manually from UI
– On schedule using Oozie
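The deck does not show its job code; below is a hedged driver sketch for a mapping-style job over the HBase table using TableMapReduceUtil. The class names, table name and column family are hypothetical, and the mapper body is only a placeholder for the real lookup/fuzzy-mapping logic.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class MappingJobDriver {

        /** Hypothetical mapper stub: the real logic would rewrite fields to IDs via the lookup tables. */
        public static class MappingMapper extends TableMapper<ImmutableBytesWritable, Put> {
            @Override
            protected void map(ImmutableBytesWritable key, Result row, Context context)
                    throws IOException, InterruptedException {
                Put put = new Put(key.get());
                // Placeholder: mark the row as processed; the PoC would write mapped IDs instead.
                put.add(Bytes.toBytes("d"), Bytes.toBytes("mapped"), Bytes.toBytes("true"));
                context.write(key, put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "csv-mapping");
            job.setJarByClass(MappingJobDriver.class);

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("d"));   // hypothetical mapping-data column family
            scan.setCaching(500);                 // fewer RPCs per map task
            scan.setCacheBlocks(false);           // avoid polluting the block cache from MR

            TableMapReduceUtil.initTableMapperJob("csv_files", scan, MappingMapper.class,
                    ImmutableBytesWritable.class, Put.class, job);
            TableMapReduceUtil.initTableReducerJob("csv_files", IdentityTableReducer.class, job);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }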
Sqoop Jobs
• One time
– FileDetailExport (current CSV)
– RuleImport (all current rules)
• Periodic
– Lookup Table Data import
• Manufacturer
• Model
• State
• Country
• Currency
• Condition
• Participant
Application Integration - REST
• Hide the HBase/Java APIs from the rest of the application
• Language independence for PHP front-end
• REST APIs for
– CSV Management
– Drools Rule Management
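The REST framework is not named in the deck; assuming a JAX-RS style resource, the wrapper around HBase might look roughly like this. The paths, class names and the stubbed service are hypothetical.

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("/csv")
    public class CsvResource {

        /** Hypothetical stand-in for the HBase-backed service layer (Get/Put/Scan calls). */
        public static class CsvFileService {
            public String fetchFileAsJson(String fileId) {
                return "{\"fileId\":\"" + fileId + "\",\"rows\":[]}";   // placeholder payload
            }
        }

        private final CsvFileService service = new CsvFileService();

        /** The PHP front-end fetches a file as JSON; no HBase/Java types leak across this boundary. */
        @GET
        @Path("/{fileId}")
        @Produces(MediaType.APPLICATION_JSON)
        public String getFile(@PathParam("fileId") String fileId) {
            return service.fetchFileAsJson(fileId);
        }
    }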
Lessons Learned
Performance Benefits
• Mapping
– 20,000 CSV files, 20 million records
– Time taken: one-third of RDBMS processing time
• Metrics
– < 10 secs (vs Oracle materialized view)
• Upload a file
– < 10 secs
• Delete a file
– < 10 secs
HBase Tuning
• Heap Size for
– RegionServer
– MapReduce Tasks
• Table Compression
– SNAPPY for the column family holding CSV data
• Table data caching
– IN_MEMORY for lookup tables
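Assuming the 0.94-era Java admin API, the SNAPPY and IN_MEMORY settings translate to roughly the following; the table and column-family names are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class TableSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // CSV data family: compressed with SNAPPY.
            HTableDescriptor csvTable = new HTableDescriptor("csv_files");
            HColumnDescriptor data = new HColumnDescriptor("d");
            data.setCompressionType(Compression.Algorithm.SNAPPY);
            csvTable.addFamily(data);
            admin.createTable(csvTable);

            // Lookup table family: kept in memory for fast reads.
            HTableDescriptor lookupTable = new HTableDescriptor("lookup_manufacturer");
            HColumnDescriptor lookup = new HColumnDescriptor("l");
            lookup.setInMemory(true);
            lookupTable.addFamily(lookup);
            admin.createTable(lookupTable);

            admin.close();
        }
    }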
Application Design Challenges
• Pagination – implemented using the intermediate REST layer and Scan.setStartRow (see the sketch after this list)
• Translating SQL queries
– Used Scan/Filter and Java (especially on coprocessor)
– No secondary indexes - used FuzzyRowFilter
– Maybe something like Phoenix would have helped
• Some issues in mixed mode; we want to move to 0.96.0 for better per-column-family flushing, but that requires 'porting' coprocessors to protobuf
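As referenced above, a rough sketch of that pagination pattern. The PageFilter is my addition for illustration, since the deck only mentions setStartRow, and the table name is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PageFilter;

    public class FilePager {
        /** Returns up to pageSize rows starting at startRow; callers pass the last key seen to get the next page. */
        public static Result[] page(byte[] startRow, int pageSize) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "csv_files");         // hypothetical table name
            try {
                Scan scan = new Scan();
                if (startRow != null) {
                    scan.setStartRow(startRow);                    // resume where the previous page ended
                }
                scan.setFilter(new PageFilter(pageSize));          // cap rows returned per region
                ResultScanner scanner = table.getScanner(scan);
                try {
                    return scanner.next(pageSize);
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }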
HBase Value Proposition
• Better UI response for CSV file operations – operations within a file (map, import, reject, etc.) are not dependent on the DB size
• Relieve load on RDBMS - no more CSV data tables
• Scale out batch processing performance on the cheap (vs
vertical RDBMS upgrade)
• Redundant store for CSV files
• Versioning to track data cleansing
Roadmap
• Benchmark with 0.96
• Retire Coprocessors in favor of Phoenix (?)
• Lookup Data tables are small. Need to find a better alternative
than HBase
• Design UI for a more Big Data appropriate model
– Search-oriented paradigm rather than exploratory/paginative
– Add REST endpoints to support such UI
Wrap-Up
Conclusion
• PoC demonstrated
– Value of the Hadoop ecosystem
– Co-existence of Big Data technologies with current solutions
– Adoption can significantly improve scale
– New skill requirements
Thank You
Rajesh.Nair@Penton.com
Kiru@Serendio.com