Big Data Ecosystem 
Ivo Vachkov 
Xi Group Ltd.
Big Data ??? 
 Definition 
 The 3Vs: 
 Volume 
 Velocity 
 Variety 
 Added later: 
 Veracity 
 Variability 
 Complexity
Processing Paradigms 
 Batch Processing 
 Large volumes 
 Lower volatility 
 Incremental updates 
 Real-time Processing 
 Smaller volumes 
 Higher volatility 
 Possible full regeneration
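The contrast between the two paradigms can be sketched in a few lines. This is a minimal illustration, assuming an in-memory list of events; the names `batch_count` and `RunningCount` are hypothetical and do not come from any particular framework.

```python
from collections import Counter
from typing import Iterable


def batch_count(events: Iterable[str]) -> Counter:
    """Batch style: scan the whole data set in one pass and build the result."""
    return Counter(events)


class RunningCount:
    """Real-time style: fold each event into the current state as it arrives."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, event: str) -> None:
        self.counts[event] += 1


events = ["login", "click", "click", "logout"]

# Batch: the result exists only after the full scan finishes.
batch_result = batch_count(events)

# Real-time: the state is usable after every single event.
stream = RunningCount()
for event in events:
    stream.update(event)
```

Both paths end with the same counts here; what differs is when the answer becomes available and how much data each step has to touch.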
The Data Path 
 From Collection … 
 … to Processing … 
 … to Query: 
 Consumption 
 Visualization 
 [Predictive] Analysis 
 Monitoring / Validation 
 ETL, anyone?!
The Data Path
Data Path / Collection 
 Multiple sources (RDBMS, Logs, activity streams, message 
queues, time series, etc.) 
 Multiple types (structured, unstructured, free text, bags of 
words, raw, normalized, etc.) 
 Collection starts with raw data and produces digital 
artifacts suitable for machine processing.
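As a rough sketch of that last point, the snippet below turns one raw log line into a structured record and serializes it. The log format, the regular expression, and the `collect` function are assumptions made for illustration only.

```python
import json
import re
from typing import Optional

# Hypothetical raw format: "<timestamp> <LEVEL> <free-text message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def collect(raw_line: str) -> Optional[dict]:
    """Parse one raw line into a dict, or return None if it does not match."""
    match = LINE_RE.match(raw_line.strip())
    if match is None:
        return None
    return match.groupdict()


raw = "2014-03-01T12:00:00Z ERROR disk full on /dev/sda1\n"
record = collect(raw)

# The JSON string is the machine-friendly artifact handed to the next stage.
artifact = json.dumps(record, sort_keys=True)
```

Lines that do not match return `None` rather than raising, so a collector built this way can count and report rejects instead of stopping.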
Data Path / Collection 
 Wide variety of components and technologies: 
 Flat files, binary formats (Avro, CSV, etc.) on a typical file 
system 
 Cluster-specific file systems 
 RDBMS/SQL, NoSQL, NewSQL, MPP DBs, Graph Databases, 
Document Databases 
 Column Stores 
 Key-Value Stores 
 Time Series Stores 
 Streaming and transformation engines
Data Path / Processing 
 Different processing paradigms: 
 Batch Processing 
 Real-time Processing 
 Multiple expected outcomes: 
 Data 
 Action 
 Different destinations: 
 Data stores 
 Data-driven Control Planes
Data Path / Processing 
 Smaller number of technologies: 
 Map / Reduce (Hadoop, CouchDB, MongoDB, Riak) 
 Cluster Computing (PVM, MPI, LAM, OpenMP, etc.) 
 HPC / Supercomputing 
 Data parallelism is the key! 
 Data locality is important!
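A toy word count shows why data parallelism matters for Map / Reduce. This is a sketch under stated assumptions: the partitions are small in-memory lists, and a thread pool stands in for the cluster nodes that a real framework such as Hadoop would use. The function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List


def map_partition(lines: List[str]) -> Counter:
    """Map step: count words using only the data in this partition."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results."""
    return left + right


# Each partition is processed independently, which is what makes the
# map step parallel and lets work run close to where the data lives.
partitions = [
    ["big data is big"],
    ["data locality matters", "big clusters"],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_partition, partitions))

word_counts = reduce(reduce_counts, partials, Counter())
```

No map call reads another partition's data, so adding partitions and workers scales the map step without changing the code.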
Data Path / Processing 
 The importance of M/R 
 Self-hosted solutions: 
 Apache Hadoop 
 Cloudera, Hortonworks, etc. 
 Cloud-based solutions: 
 AWS EMR (+Data Pipeline, +Kinesis, +S3, +Dynamo) 
 Joyent Manta 
 … many others …
Data Path / Query 
 Processing creates digital artifacts 
 Extremely high variety of technologies, components, and 
services to deal with those artifacts: 
 SQL interfaces on top of NoSQL stores 
 NoSQL to NoSQL 
 NoSQL to RDBMS 
 Output to 3rd party API services 
 Output to proprietary interfaces 
 … a lot more …
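One of the paths listed above, NoSQL to RDBMS, can be sketched with the standard library. The documents, the table name `user_clicks`, and its columns are invented for illustration, and an in-memory SQLite database stands in for a real relational store.

```python
import sqlite3

# Hypothetical processing output: schemaless documents, as a document
# or key-value store might hold them.
documents = [
    {"username": "alice", "country": "BG", "clicks": 12},
    {"username": "bob", "country": "BG", "clicks": 3},
    {"username": "carol", "country": "DE", "clicks": 7},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_clicks (username TEXT PRIMARY KEY, country TEXT, clicks INTEGER)"
)

# Named placeholders map each document's keys onto table columns.
conn.executemany(
    "INSERT INTO user_clicks (username, country, clicks) "
    "VALUES (:username, :country, :clicks)",
    documents,
)

# Once loaded, the artifacts can be queried with plain SQL.
rows = conn.execute(
    "SELECT country, SUM(clicks) FROM user_clicks GROUP BY country ORDER BY country"
).fetchall()
conn.close()
```

The same idea applies in the other direction: the point is that the query layer is chosen for the consumer, not for the processing engine.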
Data Path / Query 
 “Query-friendly” stores: 
 Classical RDBMS, NewSQL 
 Big Table & Column Stores 
 Key-Value Stores 
 Search-oriented services 
 Visualization: 
 3rd party services 
 Tableau 
 HTML5 / JavaScript Dashboards 
 Programming languages / Visualization libraries
Data Path / Query 
 Analysis 
 Reports 
 Trends / Predictions 
 Real-time analytics 
 Data-driven Control Plane 
 Classical Business Intelligence 
 Machine Learning (Mahout) 
 Data Science (usually a fancy term for Statistics)
Big Data & Monitoring 
 Infrastructure Monitoring 
 Well understood 
 Many products 
 Full-Stack Application Monitoring 
 Technical challenges 
 No “one size fits all” solutions 
 Data Quality Monitoring 
 Emerging technologies 
 Home-grown solutions
Big Data & Monitoring 
 Infrastructure Monitoring
Big Data & Monitoring 
 Application Monitoring
Big Data & Monitoring 
 Data Quality Monitoring
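A home-grown data quality check can start very small. The sketch below measures how often required fields are missing and flags fields above a threshold; the field names, sample records, and the 25% threshold are assumptions chosen only for the example.

```python
from typing import Dict, Iterable, List


def quality_report(records: Iterable[dict], required: List[str]) -> Dict[str, float]:
    """Return, per required field, the fraction of records where it is missing or empty."""
    rows = list(records)
    total = len(rows)
    report: Dict[str, float] = {}
    for field in required:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        report[field] = missing / total if total else 0.0
    return report


records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3},
    {"id": 4, "email": "d@example.com"},
]

report = quality_report(records, required=["id", "email"])

# Flag any field whose missing ratio exceeds the (hypothetical) threshold.
alerts = [field for field, ratio in report.items() if ratio > 0.25]
```

In practice the report would be emitted as a metric per batch or per time window, so it can be graphed and alerted on like any infrastructure signal.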
… a bag of acronyms … 
 Flume, Scribe, Chukwa, Sqoop, MapReduce, YARN, HDFS, 
HBase, Pig Latin, Hive, HAWQ, Impala, Presto, Phoenix, 
Spire, Drill, Storm, Samza, Malhar, Cassandra, Redis, 
Voldemort, Accumulo, Oozie, Azkaban, Lipstick, Hue, 
OpenTSDB, Mahout, Giraph, Lily, ZooKeeper, Datameer, 
Tableau, Pentaho, Sumo Logic, MongoDB, CouchDB, 
Riak, Pregel, Lucene, Solr, Elasticsearch, Neo4j, OrientDB, 
Memcached, FoundationDB, … 
 AWS: Data Pipeline, EMR, Kinesis, DynamoDB, S3, Redshift, 
ElastiCache, SQS, SWF 
 Joyent: Manta
Piece of advice … 
 Collect relevant data! 
Collecting data for data’s sake only costs money … 
 Use the processing technology that best matches your 
business case! 
Hadoop is pointless if your clients only want fast 
geospatial searches … 
 Consume wisely! 
Knowing that 100% of X is Y means nothing when there 
is only one X …
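The last point can be made concrete with a guard on sample size. This is a minimal sketch; the `share` function and the `MIN_SAMPLE` threshold of 30 are hypothetical and should be replaced by whatever fits the business case.

```python
from typing import Optional

MIN_SAMPLE = 30  # hypothetical threshold, not a universal rule


def share(matching: int, total: int, min_sample: int = MIN_SAMPLE) -> Optional[float]:
    """Return matching/total, or None when the sample is too small to mean anything."""
    if total <= 0 or total < min_sample:
        return None
    return matching / total


# "100% of X is Y" with a single X: refuse to report a percentage.
tiny = share(1, 1)

# A larger sample yields a ratio that is worth showing.
usable = share(45, 60)
```

Returning `None` instead of `1.0` forces the consumer of the number to handle the "not enough data" case explicitly.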
Conclusion 
Q & A


Editor's Notes

  1. Intro, Abstract, Who am I
  2. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. While Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept sharpens the distinction between big data and Business Intelligence with regard to data and their use:[18] Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.; big data uses inductive statistics and concepts from nonlinear system identification[19] to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets with low information density[20] in order to reveal relationships and dependencies and to predict outcomes and behaviors.[19][21] Big data can also be defined as a large volume of unstructured data that cannot be handled by standard database management systems such as DBMS, RDBMS, or ORDBMS.
  3. Two distinct processing paradigms that drive different technologies. Why one? Why the other? Use cases …
  4. Comes from ETL after all: specific, but known.