Building scalable,
flexible data pipelines
for Big Data	

Vivek Aanand Ganesan	

vivganes@gmail.com	


1
Agenda
	


§  Introduction
§  Dealing with Legacy
§  Data Lineage and Provenance
§  Data Lifecycle Management
§  Data Pipeline Engineering for fun and profit

2
Big Data Introduction
Current state of Big Data Landscape	


§  Hadoop
§  Solves for the three V’s: Volume, Velocity, and Variety
§  Primarily batch processing for large data sets
§  Hadoop2 YARN: distributed computing platform
§  Not only Hadoop!
§  Real-time systems: Storm, Spark, Samza etc.
§  Wide variety of NoSQL systems: Cassandra, Riak, etc.
§  Don’t forget Legacy!

3
Big Data Promise
Why is Big Data so hot?	


§  This is what the Big Data vendors sell:
§  Throw some data in
§  Analyze it using map/reduce
§  Visualize your analytics/Generate insights
§  Do some Predictive Analytics or Recommendations
§  Profit!!
§  Rinse and Repeat!!!
4
Big Data Problems
Why is Big Data so hard?	


§  Real-life environments are not that simple!
§  For instance, privacy and compliance issues
§  Extract, Transform and Load is non-trivial
§  Building reliable ingest across complex
environments
§  Data Lifecycle Management is not mature yet
5
Legacy Data
Why are Legacy environments important for Big Data?	


§  Outside of Silicon Valley:
§  Companies have been around for a while
§  Have lots of valuable legacy data
§  Some of it in Mainframes
§  Some of it in flat files
§  Some of it in relational DBs
6
Mainframe Data
How would you handle Mainframe data?	


§  The open source Hadoop ecosystem does
not provide a way to import data from
mainframes
§  Only a commercial solution available as of
today
§  Think about that for a second
7
More on Mainframes
Why worry about Mainframe data?	


§  Mainframes still run important systems
§  Separates schema from data (kinda like
Hadoop)
§  COBOL Copybooks
§  Hadoop can offload legacy data processing
§  But, you must first get the data in!!!
8
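As a rough illustration of what "getting the data in" involves, the sketch below decodes one fixed-width EBCDIC record using a hand-written field layout that stands in for a COBOL copybook. The layout, field names and sample record are hypothetical. Python's built-in `cp037` codec is assumed to match the mainframe's code page, and packed-decimal (COMP-3) fields are not handled.

```python
# Hypothetical layout standing in for a COBOL copybook:
#   05 CUST-ID    PIC X(6).
#   05 CUST-NAME  PIC X(10).
#   05 BALANCE    PIC 9(7).   (zoned decimal, two implied decimal places)
LAYOUT = [("cust_id", 6), ("cust_name", 10), ("balance", 7)]


def decode_record(raw: bytes, layout=LAYOUT, codepage: str = "cp037") -> dict:
    """Split one fixed-width EBCDIC record into named, stripped text fields."""
    expected = sum(width for _, width in layout)
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    text = raw.decode(codepage)  # EBCDIC bytes -> Python str
    record, offset = {}, 0
    for name, width in layout:
        record[name] = text[offset:offset + width].strip()
        offset += width
    return record


# Assumed sample data: build an EBCDIC record by encoding plain text.
sample = "C00042JANE DOE  0012550".encode("cp037")
row = decode_record(sample)
row["balance"] = int(row["balance"]) / 100  # apply the implied decimal point
print(row)
```

The schema (the layout) lives apart from the data (the bytes), which is the separation the slide compares to Hadoop.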
Other Legacy issues
Random collection of issues in dealing with legacy data	


§  Unknown or incorrigible schema
§  Invalid data
§  Inconsistent data
§  Missing data
§  Fuzzy data
§  Sparse data
9
Big Data ETL
What is the problem with it?	


§  First of all, the name
§  Extract, Transform, Load was coined in the
old days when data sets were smaller
§  Inherent assumption that the Transformation
will happen out of band
§  Assumption does not hold for Big Data!
10
ELT
Will ELT solve the problem?	


§  Flip the transform and load steps
§  Get the data in and then transform it
§  This way the transform is not out of band
§  Leverage the power of the underlying Big Data
platform to do the transform
§  Makes perfect sense … except when
11
Privacy and Security
Issues with ELT approach for privacy and security	


§  Loading raw data before transforming it poses
privacy and security challenges
§  What if the raw data contains SSNs or Credit
card numbers?
§  What if it is only meant to be seen by a few?
§  Once you load, the data is now available
12
The solution
Deal with it during extraction (as best as you can)	


§  Do a secure extract
§  Perform a security/privacy audit of the raw
data and build in rules to mask/anonymize/
scrub data during the extraction
§  Somewhat solves the security problem but
complicates the Extract step
13
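A minimal sketch of rule-based scrubbing during extraction, assuming the audit has already identified US-style SSNs and 13 to 16 digit card numbers as the sensitive patterns. The regular expressions, the masking format and the `pseudonymize` helper are illustrative assumptions, not a complete privacy solution.

```python
import hashlib
import re

# Assumed patterns: SSNs written as 123-45-6789, and card numbers of
# 13-16 digits optionally separated by spaces or hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def mask_card(match: re.Match) -> str:
    """Keep only the last four digits of a card number."""
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a salted hash so it can still be used as a join key."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def scrub(text: str) -> str:
    """Mask SSNs first, then card numbers, in one raw text field."""
    text = SSN_RE.sub("***-**-****", text)
    return CARD_RE.sub(mask_card, text)


print(scrub("SSN 123-45-6789, card 4111 1111 1111 1111"))
```

Rules like these only work when the sensitive fields are known in advance, which is the limitation the next slide raises.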
Some exceptions
What if you don’t know which parts of the data set need to be protected?


§  Secure extract assumes that the data schema
is known and the privacy levels are known
§  Not a valid assumption at all times
§  For example, what if the legacy data set has
Facebook profile data from before the new privacy
rules went into effect?
14
Data Lineage and Provenance
What is data lineage and data provenance?	


Data Lineage	

Data Lineage records the origin of the data set.
This includes the time, place, original format
and privacy/security information.	

	

Data Provenance	

Records all the change history to the data set.
This includes timestamp, change agent,
purpose, process and edit log.	

15
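One way to picture the two definitions is as a pair of record types: a lineage record written once at the point of origin, and a provenance log that grows with every change. The sketch below is an assumption about how such records could be modeled; the class names, field names and sample values are hypothetical and simply mirror the items listed on the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Lineage:
    """Origin of a data set: recorded once, never modified."""
    origin_time: datetime
    origin_place: str
    original_format: str
    privacy_level: str


@dataclass(frozen=True)
class ProvenanceEntry:
    """One change to the data set."""
    timestamp: datetime
    change_agent: str
    purpose: str
    process: str
    edit_log: str


@dataclass
class DataSetRecord:
    name: str
    lineage: Lineage
    provenance: List[ProvenanceEntry] = field(default_factory=list)

    def record_change(self, agent: str, purpose: str, process: str, edit_log: str) -> ProvenanceEntry:
        """Append a provenance entry stamped with the current UTC time."""
        entry = ProvenanceEntry(datetime.now(timezone.utc), agent, purpose, process, edit_log)
        self.provenance.append(entry)
        return entry


# Hypothetical example values.
record = DataSetRecord(
    name="customer_profiles",
    lineage=Lineage(datetime(2012, 5, 1, tzinfo=timezone.utc), "crm-export", "flat file", "restricted"),
)
record.record_change("etl-job", "mask SSNs", "secure extract", "masked column ssn")
print(record.lineage.privacy_level, len(record.provenance))
```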
Data Lineage
Why is this a big deal?	


§  Let’s go back to the Facebook problem
§  The solution is to record lineage information
§  This protects the consumer of the data set –
assures that the data was available for use as
of the point and time of origin
§  Protect yourself from lawsuits and fines!
16
Metadata
Data about the data	


§  Astute observation: Metadata extraction is an
integral part of managing data and
implementing data lineage and data
provenance
§  It can be rule-based but increasingly more
automated systems are desirable
17
Data Provenance
Why is it important?	


§  Data Lineage solves one piece of the puzzle –
namely, origin and metadata
§  What if data is changed during or after the
extract step?
§  For purposes of audit and traceability, this
must be recorded!
18
Data Provenance Approach
How to implement data provenance?	


§  Can be workflow-based or dataflow-based
§  Workflow-based is much easier
§  Records the changes as part of the
workflow
§  Dataflow-based is much harder
§  Needs to record each and every access to
the data
19
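The workflow-based approach can be sketched as a wrapper around each workflow step that logs what the step did. This is a minimal illustration under stated assumptions: the `tracked` decorator, the in-memory `PROVENANCE_LOG` and the two sample steps are all hypothetical, and a real system would persist the log rather than keep it in a list.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # hypothetical in-memory store, for illustration only


def tracked(purpose: str, agent: str = "pipeline"):
    """Wrap a workflow step so every run appends a provenance entry."""
    def decorator(step):
        @functools.wraps(step)
        def wrapper(rows):
            result = step(rows)
            PROVENANCE_LOG.append({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "change_agent": agent,
                "purpose": purpose,
                "process": step.__name__,
                "edit_log": f"{len(rows)} rows in, {len(result)} rows out",
            })
            return result
        return wrapper
    return decorator


@tracked(purpose="drop records with no customer id")
def drop_missing_ids(rows):
    return [r for r in rows if r.get("cust_id")]


@tracked(purpose="normalize names to upper case")
def upper_names(rows):
    return [{**r, "name": r["name"].upper()} for r in rows]


rows = [{"cust_id": "1", "name": "ann"}, {"cust_id": "", "name": "bob"}]
clean = upper_names(drop_missing_ids(rows))
for entry in PROVENANCE_LOG:
    print(entry["process"], "-", entry["edit_log"])
```

Only changes made through the workflow are captured here. Reads or edits that bypass it go unrecorded, which is why the dataflow-based approach is harder.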
Current Toolset
What exists currently in the open source big data ecosystem?	


§  Nothing really to help with any of this
§  There are commercial products
§  But, no open source tools yet (or at least
none that are in production use that I am
aware of)
§  Would be a great idea to build one!
20
Data Lifecycle Management
Dealing with data throughout its lifecycle	


§  Management of data from ingest to sunset
§  It involves dealing with all of the associated
metadata, lineage and provenance artifacts
§  It also involves moving data around (large
datasets in the Big Data world)
§  That is a data pipeline problem!
21
Data Lifecycle Management Tools
What exists in the Big Data ecosystem to handle this?


§  Current toolset is pretty limited
§  Apache Falcon (Hadoop sub-project) is a step
in this direction but still not widely available
for production use
§  It is possible to roll your own
§  But, it is a significant engineering effort
22
Modern Data Lifecycle Management
Modern data architecture needs modern data lifecycle management	


§  Modern data architecture involves more than
just Hadoop
§  Queuing systems – e.g. Kafka
§  Stream processing – e.g. Storm
§  Real-time systems – e.g. Spark
§  NoSQL systems – e.g. HBase
§  Integration with MPP systems

23
Data Lifecycle Management done right
Data Lifecycle Management across the Big Data Environment	


§  Dealing with the various systems in the Big
Data Landscape
§  Ability to set up schedules and periodic runs
§  Also, provide on-demand data processing
§  Treat data as an asset – apply asset
management practices
24
Data Pipelining for fun and profit
Dealing with data pipelines as a distinct role in the Big Data Engineering world	


§  Data Pipeline Engineering is a legitimate role
in the Big Data environment
§  The complexity and all of the attendant issues
make it a specialty in its own right!
§  It is much more than just ETL
§  Security, Lineage, Provenance and Lifecycle
Management are all essential
25
So you want to be a data pipeline engineer?
What are the tools of the trade and ninja skills to master?


§  Languages
§  Python
§  Java
§  Scala
§  Pig
§  Systems
§  Sqoop
§  Storm
§  Flume
§  Hive/HBase
26
Can you make this easier?
I just want to write some code and be done with it	


§  Pick your language
§  Java
§  Scala
§  Clojure
§  Cascading
§  Data Pipeline framework
§  Full-featured
§  But, wait!
§  What about all the other stuff?
27
Integration and Extensions
Integrate with your favorite tools and extend when needed	


§  Start with a solid pipeline framework like
Cascading (or its offshoots like Scalding or
Cascalog)
§  Integrate with either commercial or open
source tools for specific functionality needed
§  Look at Cascading extensions: Lingual,
Driven and Load
28
Build your own extensions
Extend Cascading with your own requirements	


§  A programming framework such as
Cascading makes it much easier to extend
to build custom data lineage, provenance and
lifecycle management solutions
§  You can also integrate with Security and
Privacy solutions
§  This is a flexible approach
29
Build for scale 
Understanding scale for data pipelines	


§  Scaling data pipelines is quite complex due
to the multiple moving pieces
§  Pipeline is only as fast as the slowest piece
§  Hadoop scales - proven
§  Flume scales - proven
§  Kafka scales - proven
30
Scaling Sqoop
Scaling relational DB load


§  What about Sqoop?
§  Not as easy or straightforward to scale
§  Start slow and incrementally increase load
§  Watch for network statistics and optimize
§  Load aggregates if that is all you need
§  Parallelize as much as possible without killing the DB
31
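"Start slow and incrementally increase load" can be sketched as a simple ramp: raise the number of parallel connections one step at a time and stop once throughput no longer improves by a meaningful margin. The function below is a hedged illustration, not part of Sqoop. `measure` is a hypothetical callable that runs a trial load at a given parallelism and returns rows per second, and `fake_measure` is a stand-in for a database that saturates at four connections.

```python
def ramp_parallelism(measure, start=1, step=1, max_parallel=16, min_gain=0.10):
    """Increase parallelism until throughput stops improving by at least min_gain."""
    best_n, best_rate = start, measure(start)
    n = start + step
    while n <= max_parallel:
        rate = measure(n)
        if rate < best_rate * (1 + min_gain):
            break  # the extra connection did not pay off; stop before hurting the DB
        best_n, best_rate = n, rate
        n += step
    return best_n, best_rate


def fake_measure(n):
    """Stand-in measurement: throughput grows linearly, then saturates at 4 connections."""
    return 1000 * min(n, 4)


print(ramp_parallelism(fake_measure))
```

The parallelism this settles on could then be used as the mapper count for the real import, for example through Sqoop's `--num-mappers` option.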
Scaling Storm
	


§  A lot of these systems depend on ZooKeeper (ZK)
§  Storm also relies on ZeroMQ (this is
changing)
§  Provision for average load (not peak load)
§  Benchmark with typical event size
(compress for larger events)
§  Storm on YARN will solve many issues
32
Scaling Tips
	


§  Measure end-to-end throughput
§  Benchmark and fine-tune the best
performing parts of the pipeline first
§  Scale the slower parts next – increase
incrementally
§  Batch the slower parts – aggregate if you
can and parallelize as much as possible
33
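Because a pipeline is only as fast as its slowest piece, measuring each stage and taking the minimum gives a first estimate of end-to-end throughput. The sketch below assumes per-stage rates in records per second. The stage names and numbers are made up for illustration, and `timed_throughput` is a hypothetical helper for collecting such rates.

```python
import time


def timed_throughput(stage, records):
    """Run one stage over a batch and return (output, records per second)."""
    start = time.perf_counter()
    output = stage(records)
    elapsed = time.perf_counter() - start
    rate = len(records) / elapsed if elapsed > 0 else float("inf")
    return output, rate


def find_bottleneck(rates):
    """Return the slowest stage and its rate, which bounds end-to-end throughput."""
    name = min(rates, key=rates.get)
    return name, rates[name]


# Hypothetical measurements in records per second.
rates = {"sqoop_import": 1200.0, "kafka": 50000.0, "storm": 8000.0, "hdfs_write": 20000.0}
print(find_bottleneck(rates))
```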
Summary	


§  Pipeline Engineering will be one of the most
challenging areas in Big Data with several big
issues remaining to be solved
§  Expect plenty of innovation and action in
this space
§  It is a great place to start a Big Data career
34
Thank You

Questions? Comments?	


Thank You!	

	

Please contact vivganes@gmail.com
with your questions and/or comments.	


35

IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Dernier (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Building Flexible Data Pipelines for Big Data

  • 1. Building scalable, flexible data pipelines for Big Data. Vivek Aanand Ganesan, vivganes@gmail.com
  • 2. Agenda §  Introduction §  Dealing with Legacy §  Data Lineage and Provenance §  Data Lifecycle Management §  Data Pipeline Engineering for fun and profit
  • 3. Big Data Introduction: Current state of the Big Data landscape §  Hadoop §  Solves for the three V’s: Volume, Velocity, and Variety §  Primarily batch processing for large data sets §  Hadoop2 YARN: distributed computing platform §  Not only Hadoop! §  Real-time systems: Storm, Spark, Samza, etc. §  Wide variety of NoSQL systems: Cassandra, Riak, etc. §  Don’t forget Legacy!
  • 4. Big Data Promise: Why is Big Data so hot? §  This is what the Big Data vendors sell: §  Throw some data in §  Analyze it using map/reduce §  Visualize your analytics/Generate insights §  Do some Predictive Analytics or Recommendations §  Profit!! §  Rinse and Repeat!!!
  • 5. Big Data Problems: Why is Big Data so hard? §  Real-life environments are not that simple! §  For instance, privacy and compliance issues §  Extract, Transform and Load is non-trivial §  Building reliable ingest across complex environments §  Data Lifecycle Management is not mature yet
  • 6. Legacy Data: Why are Legacy environments important for Big Data? §  Outside of Silicon Valley: §  Companies have been around for a while §  Have lots of valuable legacy data §  Some of it in Mainframes §  Some of it in flat files §  Some of it in relational DBs
  • 7. Mainframe Data: How would you handle Mainframe data? §  The open source Hadoop ecosystem does not provide a way to import data from mainframes §  Only a commercial solution is available as of today §  Think about that for a second
  • 8. More on Mainframes: Why worry about Mainframe data? §  Mainframes still run important systems §  Separates schema from data (kinda like Hadoop) §  COBOL Copybooks §  Hadoop can offload legacy data processing §  But, you must first get the data in!!!
  • 9. Other Legacy issues: Random collection of issues in dealing with legacy data §  Unknown or incorrigible schema §  Invalid data §  Inconsistent data §  Missing data §  Fuzzy data §  Sparse data
  • 10. Big Data ETL: What is the problem with it? §  First of all, the name §  Extract, Transform, Load was coined in the old days when data sets were smaller §  Inherent assumption that the Transformation will happen out of band §  Assumption does not hold for Big Data!
  • 11. ELT: Will ELT solve the problem? §  Flip the transform and load steps §  Get the data in and then transform it §  This way the transform is not out of band §  Leverage the power of the underlying Big Data platform to do the transform §  Makes perfect sense … except when …
  • 12. Privacy and Security: Issues with the ELT approach for privacy and security §  Loading raw data before transforming it poses privacy and security challenges §  What if the raw data contains SSNs or credit card numbers? §  What if it is only meant to be seen by a few? §  Once you load, the data is available
  • 13. The solution: Deal with it during extraction (as best as you can) §  Do a secure extract §  Perform a security/privacy audit of the raw data and build in rules to mask/anonymize/scrub data during the extraction §  Somewhat solves the security problem but complicates the Extract step
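As a rough illustration of the rule-based secure extract described in slide 13, the sketch below scrubs records as they are extracted. It is a minimal sketch, not the author's implementation: the field names, the `RULES` table, the salt, and the SSN pattern are all hypothetical, and a truncated salted hash is only a stand-in for whatever anonymization a real audit would require.

```python
import hashlib
import re

# Hypothetical pattern: matches text shaped like a US SSN (NNN-NN-NNNN).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(value):
    """Replace anything shaped like an SSN with a fixed mask."""
    return SSN_PATTERN.sub("***-**-****", value)


def pseudonymize(value, salt="example-salt"):
    """Replace a value with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


# Hypothetical rule set produced by a privacy audit: field name -> scrubber.
RULES = {
    "notes": mask_ssn,
    "email": pseudonymize,
}


def secure_extract(records, rules=RULES):
    """Yield scrubbed copies of records; fields without a rule pass through."""
    for record in records:
        yield {
            field: rules[field](value) if field in rules else value
            for field, value in record.items()
        }


raw = [{"id": 1, "notes": "SSN 123-45-6789 on file", "email": "a@b.com"}]
scrubbed = list(secure_extract(raw))
```

Because the rules are keyed by field name, this approach only works when the schema and privacy levels are known up front, which is exactly the limitation the next slide raises.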
  • 14. Some exceptions: What if you don’t know which parts of the data set need to be protected? §  Secure extract assumes that the data schema is known and the privacy levels are known §  Not a valid assumption at all times §  For example, what if the legacy data set has Facebook profile data from before the new privacy rules went into effect?
  • 15. Data Lineage and Provenance: What is data lineage and data provenance? §  Data Lineage: records the origin of the data set. This includes the time, place, original format and privacy/security information. §  Data Provenance: records all the change history of the data set. This includes timestamp, change agent, purpose, process and edit log.
  • 16. Data Lineage: Why is this a big deal? §  Let’s go back to the Facebook problem §  The solution is to record lineage information §  This protects the consumer of the data set: it assures that the data was available for use as of the point in time of origin §  Protect yourself from lawsuits and fines!
  • 17. Metadata: Data about the data §  Astute observation: Metadata extraction is an integral part of managing data and implementing data lineage and data provenance §  It can be rule-based, but more automated systems are increasingly desirable
  • 18. Data Provenance: Why is it important? §  Data Lineage solves one piece of the puzzle, namely origin and metadata §  What if data is changed during or after the extract step? §  For purposes of audit and traceability, this must be recorded!
  • 19. Data Provenance Approach: How to implement data provenance? §  Can be workflow-based or dataflow-based §  Workflow-based is much easier §  Records the changes as part of the workflow §  Dataflow-based is much harder §  Needs to record each and every access to the data
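The workflow-based approach in slide 19 can be sketched as a wrapper that runs each workflow step and appends a provenance entry carrying the fields listed in slide 15 (timestamp, change agent, purpose, process). This is only an illustration under assumptions: `ProvenanceEntry`, `TrackedDataset`, `run_step`, and the row counts used as a crude edit log are hypothetical names and choices, not an existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceEntry:
    timestamp: str   # when the change happened (UTC, ISO 8601)
    agent: str       # who or what made the change
    process: str     # name of the workflow step that ran
    purpose: str     # why the change was made
    rows_in: int     # crude edit log: row count before the step
    rows_out: int    # row count after the step


@dataclass
class TrackedDataset:
    rows: list
    provenance: list = field(default_factory=list)


def run_step(dataset, step, agent, purpose):
    """Apply `step` (a function from rows to rows) and record the change."""
    new_rows = step(dataset.rows)
    entry = ProvenanceEntry(
        timestamp=datetime.now(timezone.utc).isoformat(),
        agent=agent,
        process=step.__name__,
        purpose=purpose,
        rows_in=len(dataset.rows),
        rows_out=len(new_rows),
    )
    # Return a new dataset so earlier history is never mutated in place.
    return TrackedDataset(new_rows, dataset.provenance + [entry])


def drop_empty(rows):
    """Example workflow step: discard empty records."""
    return [row for row in rows if row]


tracked = run_step(
    TrackedDataset([{"a": 1}, {}]),
    drop_empty,
    agent="nightly-etl",
    purpose="remove empty records",
)
```

This only captures changes made through `run_step`; recording every read or write that bypasses the workflow is the dataflow-based problem, which is why the slide calls it much harder.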
  • 20. Current Toolset: What exists currently in the open source big data ecosystem? §  Nothing really to help with any of this §  There are commercial products §  But, no open source tools yet (or at least none that are in production use that I am aware of) §  Would be a great idea to build one!
  • 21. Data Lifecycle Management: Dealing with data throughout its lifecycle §  Management of data from ingest to sunset §  It involves dealing with all of the associated metadata, lineage and provenance artifacts §  It also involves moving data around (large datasets in the Big Data world) §  That is a data pipeline problem!
  • 22. Data Lifecycle Management Tools: What exists in the Big Data ecosystem to handle this? §  Current toolset is pretty limited §  Apache Falcon (Hadoop sub-project) is a step in this direction but still not widely available for production use §  It is possible to roll your own §  But, it is a significant engineering effort
  • 23. Modern Data Lifecycle Management: Modern data architecture needs modern data lifecycle management §  Modern data architecture involves more than just Hadoop §  Queuing systems, e.g. Kafka §  Stream processing, e.g. Storm §  Real-time systems, e.g. Spark §  NoSQL systems, e.g. HBase §  Integration with MPP systems
  • 24. Data Lifecycle Management done right: Data Lifecycle Management across the Big Data Environment §  Dealing with the various systems in the Big Data Landscape §  Ability to set up schedules and periodic runs §  Also, provide on-demand data processing §  Treat data as an asset: apply asset management practices
  • 25. Data Pipelining for fun and profit: Dealing with data pipelines as a distinct role in the Big Data Engineering world §  Data Pipeline Engineering is a legitimate role in the Big Data environment §  The complexity and all of the attendant issues make it a specialty in its own right! §  It is much more than just ETL §  Security, Lineage, Provenance and Lifecycle Management are all essential
  • 26. So you want to be a data pipeline engineer? What are the tools of the trade and ninja skills to master? §  Languages: Python, Java, Scala, Pig §  Systems: Sqoop, Storm, Flume, Hive/HBase
  • 27. Can you make this easier? I just want to write some code and be done with it §  Pick your language: Java, Scala, Clojure §  Cascading: data pipeline framework, full-featured §  But, wait! §  What about all the other stuff?
  • 28. Integration and Extensions: Integrate with your favorite tools and extend when needed §  Start with a solid pipeline framework like Cascading (or its offshoots like Scalding or Cascalog) §  Integrate with either commercial or open source tools for specific functionality needed §  Look at Cascading extensions: Lingual, Driven and Load
  • 29. Build your own extensions: Extend Cascading with your own requirements §  A programming framework such as Cascading makes it much easier to extend to build custom data lineage, provenance and lifecycle management solutions §  You can also integrate with Security and Privacy solutions §  This is a flexible approach
  • 30. Build for scale: Understanding scale for data pipelines §  Scaling data pipelines is quite complex due to the multiple moving pieces §  Pipeline is only as fast as the slowest piece §  Hadoop scales - proven §  Flume scales - proven §  Kafka scales - proven
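The point in slide 30 that a pipeline is only as fast as its slowest piece can be made concrete with a small calculation: for stages running in series, sustained end-to-end throughput is bounded by the minimum stage throughput. The stage names and rates below are made-up placeholders for illustration, not measurements of any real system.

```python
def bottleneck(stage_throughput):
    """Return (stage_name, rate) for the slowest stage of a serial pipeline.

    `stage_throughput` maps stage name -> sustained records per second.
    The slowest stage's rate is the upper bound on end-to-end throughput.
    """
    name = min(stage_throughput, key=stage_throughput.get)
    return name, stage_throughput[name]


# Placeholder numbers only; substitute rates measured on your own pipeline.
rates = {
    "relational_db_import": 5_000,
    "queue": 200_000,
    "batch_processing": 80_000,
}

slowest_stage, max_end_to_end_rate = bottleneck(rates)
```

With these placeholder rates the database import caps the whole pipeline at 5,000 records per second, no matter how much headroom the other stages have.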
  • 31. Scaling Sqoop: Scaling relational DB load §  What about Sqoop? §  Not as easy or straightforward to scale §  Start slow and incrementally increase load §  Watch network statistics and optimize §  Load aggregates if that is all you need §  Parallelize as much as possible without killing the DB
  • 32. Scaling Storm §  A lot of these systems depend on ZooKeeper (ZK) §  Storm also relies on ZeroMQ (this is changing) §  Provision for average load (not peak load) §  Benchmark with typical event size (compress for larger events) §  Storm on YARN will solve many issues
  • 33. Scaling Tips §  Measure end-to-end throughput §  Benchmark and fine-tune the best performing parts of the pipeline first §  Scale the slower parts next, increasing incrementally §  Batch the slower parts: aggregate if you can and parallelize as much as possible
  • 34. Summary §  Pipeline Engineering will be one of the most challenging areas in Big Data with several big issues remaining to be solved §  Expect plenty of innovation and action in this space §  It is a great place to start a Big Data career
  • 35. Thank You! Questions? Comments? Please contact vivganes@gmail.com with your questions and/or comments.