Introduction to
Big Data
AHMED SHOUMAN
Our agenda
 Demystify the term "Big Data"
 Find out what Hadoop is
 Explore the realms of batch and real-time big data processing
 Explore challenges of size, speed and scale in databases
 Skim the surface of big-data technologies
 Provide ways into the big-data world
Big Data
Demystified
What is big data?
 Big data is a collective term for a set of technologies designed
for storage, querying and analysis of extremely large data sets,
sources and volumes.
 Big data technologies come in where traditional off-the-shelf
databases, data warehousing systems and analysis tools fall
short.
How did we end up with so much data?
 Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine
 Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud
 An Important Side Note
Big Data technologies are based on the concept of clustering - Many computers
working in sync to process chunks of our data.
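The clustering idea in the side note can be sketched on a single machine. This is only an illustration under stated assumptions: worker threads stand in for cluster nodes, summing numbers stands in for real work, and the names `split_into_chunks`, `process_chunk` and `process_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Stand-in for real work: sum the numbers in one chunk."""
    return sum(chunk)


def process_in_parallel(items, chunk_size=1000, workers=4):
    """Hand each chunk to a worker, then combine the partial results."""
    chunks = split_into_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    return sum(partial_sums)
```

Calling `process_in_parallel(list(range(10_000)))` gives the same total as a plain `sum`. In a real cluster each chunk would live on a different computer rather than in a different thread.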
Not just size
 Big data isn't just about data size, but also about data volume,
diversity and inter-connectedness.
Big data is
 Any attribute of our data that challenges either technological capabilities
or business needs, like:
 Scaling, moving, storage and retrieval of ever-growing generated data
 Processing many small data points in real-time
 Analysing diverse semi-structured data from multiple sources
 Querying multiple, diverse data sources in real-time
Breathe... Let's recap
 Lots of data due to technological capabilities and social paradigms
 Not just size! Diversity, volume and inter-connectedness also count
 Scale, speed, processing, querying and analysis
 Challenges technological capabilities or business needs
Hadoop The Elephant in the Room
Everyone talks about Hadoop
 Hadoop is a powerful platform for batch analysis of large
volumes of both structured and unstructured data.
From: Conquering Hadoop with Haskell
Hadoop explained
 Hadoop is a horizontally scalable, fault-tolerant, open-source file system
and batch-analysis platform capable of processing large amounts of data.
 HDFS - Hadoop Distributed File System
 M/R - Hadoop Map-Reduce platform
Hadoop explained
 HDFS is an ever-growing file system. We can store lots and
lots of data on it for later use.
 HDFS is used as the underlying platform for other
technologies like Hadoop M/R, Apache Mahout or HBase.
Hadoop explained
 Imagine we want to look at 30 days' worth of access logs to identify site
usage patterns at a volume of 30M log entries per day.
 Hadoop M/R is a platform that allows us to query HDFS data in parallel for
the purpose of batch (offline) data processing and analysis.
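At 30M entries per day, 30 days of logs come to 900M entries. The map-reduce idea behind such a job can be sketched in plain Python. This is not the Hadoop API: the log format (`ip page status`) and the function names are assumptions made for illustration, and everything runs in one process instead of in parallel over HDFS blocks.

```python
from collections import defaultdict


def map_phase(log_line):
    """Emit (page, 1) for one log line of the assumed form 'ip page status'."""
    parts = log_line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(page, counts):
    """Collapse all counts for one page into a single total."""
    return page, sum(counts)


def count_page_hits(log_lines):
    """Run map, shuffle and reduce over an iterable of log lines."""
    pairs = (pair for line in log_lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(page, counts) for page, counts in grouped.items())
```

On a cluster, many mappers would each read a different slice of the logs and many reducers would each own a subset of the pages. The logic per record stays the same.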
Why is Hadoop so important?
 Scalable and fault-tolerant
 Handles massive amounts of data
 Truly parallel processing
 Data can be semi-structured or unstructured (schemaless)
 Serves as a basis for other technologies (HBase, Mahout, Impala, Shark)
Hadoop - Words of caution
 Complex
 Not for real-time
 Choose a distribution (Cloudera, Hortonworks, MapR) for better interoperability
 Requires trained DevOps for day-to-day operations
Breathe...
 We demystified the term Big Data and took a glimpse at Hadoop. Now what?
 How do I really get into the Big Data world?
The world of big data
 Batch & Data Science
 DBs
 Real-Time
Batch Processing
Hadoop M/R
Batch processing of large data sets
 We collect data for the purpose of providing end-users with better
experience in our business domain. This means we have to constantly
query our data and divine new insights and relevant information.
 The problem is that doing this at very large scale is painful and slow.
How do we do this on Hadoop data?
Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Batch processing of large data sets
 Hadoop gives us the basic tools for large data processing in
the form of M/R.
However, Hadoop M/R is pretty annoying to work with
directly as it lacks a lot of relevant tools for the job (statistical
analysis, machine learning etc.)
Source: http://xiaochongzhang.me/blog/?p=338
Hadoop querying and data science tools
Tool - Purpose
 Hive - Write SQL-like M/R queries on top of Hadoop
 Shark - Hive-compatible, distributed SQL query engine for Hadoop
 Pig - Write scripted M/R queries on top of Hadoop
 Impala - Real-time SQL-like queries of Hadoop
 Mahout - Scalable machine-learning on top of Hadoop M/R
The gentle way in
 Hive and Shark are great places to start due to their SQL-like nature
 Shark is faster than Hive - less frustration
 You need some Hadoop data to work with (consider Avro)
 Remember - it's SQL-like, not SQL
 Start small, locally and grow to production later
 Check out Apache Sqoop for moving processed Hadoop data to your DB
Databases In the big data world
Databases in the big data world
 The Problem: Traditional RDBMSs were not designed for storing, indexing
and querying ever-growing amounts and volumes of data.
 The 3S Challenge:
 Size - How much data is written and read
 Speed - How fast can we write and read data
 Scale - How easily can our DB scale to accommodate more data
The 3S Challenge
 There's no single, simple solution to the 3S challenge. Instead,
solutions focus on making an informed sacrifice in one area in
order to gain in another area.
NoSQL and C.A.P.
 NoSQL is a term referring to a family of DBMS that attempt to resolve the
3S challenge by sacrificing one of three areas:
 Consistency - All clients have the same view of data
 Availability - Each client can always read and write
 Partition Tolerance - System works despite physical network failures
NoSQL and C.A.P.
 C.A.P. means you have to make an informed choice (and sacrifice)
 No single perfect solution
 Opt for mixed solutions per use-case
 Remember we're talking about read/write volume, not just size
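The informed sacrifice can be made concrete with a toy model. The sketch below is a deliberately simplified, hypothetical two-replica key-value store, not how any real NoSQL product works. The class name `TwoNodeStore` and the mode labels `"CP"` and `"AP"` are assumptions for illustration: during a network partition the CP store refuses writes (gives up availability), while the AP store accepts them and lets the replicas diverge (gives up consistency).

```python
class TwoNodeStore:
    """Toy key-value store with two replicas; mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]   # one dict per replica
        self.partitioned = False   # True while the replicas cannot talk

    def write(self, replica_index, key, value):
        """Return True if the write was accepted, False if it was refused."""
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica[key] = value
            return True
        if self.mode == "CP":
            # Refuse the write so both replicas keep the same view.
            return False
        # AP: accept on the reachable replica only; replicas now differ.
        self.replicas[replica_index][key] = value
        return True

    def read(self, replica_index, key):
        """Read the value as seen by one replica."""
        return self.replicas[replica_index].get(key)
```

With `partitioned` set to `True`, a CP store answers `False` to `write(0, "x", 2)` and both replicas still agree, while an AP store accepts the write and the two replicas return different values for `"x"`.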
Confused? Let's take a breath and focus
OK, so where do I go from here?
 Identify your needs and limitations
 Choose a few candidates
 Research & Prototype
 Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB
(omitted due to time constraints).
Real-Time Big Data Now!
Real-Time big data processing
 Processing big data in real-time is about data volumes rather than just size.
For example, given a rate of 100K ops/sec, how do I do the following in
real-time?:
 Find anomalies in a data stream (spam)
 Group check-ins by geo
 Identify trending pages / topics
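The trending-pages task can be sketched as a sliding-window counter. This is a minimal single-process illustration, assuming events arrive in non-decreasing timestamp order. The class name `TrendingCounter` and its parameters are hypothetical, and a real deployment at 100K ops/sec would spread this work across many machines.

```python
from collections import Counter, deque


class TrendingCounter:
    """Count page views seen within the last window_seconds."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, page), oldest first
        self.counts = Counter()  # page -> views inside the window

    def _expire(self, now):
        """Drop events that have fallen out of the window."""
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_page = self.events.popleft()
            self.counts[old_page] -= 1
            if self.counts[old_page] == 0:
                del self.counts[old_page]

    def record(self, timestamp, page):
        """Register one page view at the given time (in seconds)."""
        self.events.append((timestamp, page))
        self.counts[page] += 1
        self._expire(timestamp)

    def top(self, n, now):
        """Return up to n (page, views) pairs, most viewed first."""
        self._expire(now)
        return self.counts.most_common(n)
```

Each event is added once and removed once, so the cost per event stays constant regardless of how long the stream runs. Only the events inside the window are kept in memory.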
Hadoop isn't for real-time processing
 When it comes to data processing and analysis, Hadoop's M/R framework
is wonderful for batch (offline) processing.
 However, processing, analysing and querying Hadoop data in real-time is
quite difficult.
Apache Storm and Apache Spark
 Apache Storm and Apache Spark are two frameworks for large-scale,
distributed data processing in real-time.
 One could say that Storm and Spark are for real-time data processing
what Hadoop M/R is for batch data processing.
Apache Storm - Highlights
 Runs on the JVM (Clojure / Java mix)
 Fully distributed and fault-tolerant
 Highly-scalable and extremely fast
 Interoperability with popular languages (Scala, Python etc.)
 Mature and production ready
 Hadoop interoperability via Storm-YARN
 Stateless / Non-Persistent (Data brought to processors)
Apache Spark - Highlights
 Fully distributed and extremely fast
 Write applications in Java, Scala and Python
 Perfect for both batch and real-time
 Combine Hadoop SQL (Shark), Machine Learning and Data streaming
 Native Hadoop interoperability
 HDFS, HBase, Cassandra, Flume as data sources
 Stateful / Persistent (Processors brought to data)
Storm & Spark - Use Cases
 Continuous/Cyclic Computation
 Real-time analytics
 Machine Learning (e.g. recommendations, personalisation)
 Graph Processing (e.g. social networks) - Only Spark
 Data Warehouse ETL (Extract, Transform, Load)
Recap
Term - Purpose
 Big Data - Collective term for data-processing solutions at scale
 Hadoop - Scalable file-system and batch processing platform
 Batch Processing - Sifting and analysing data offline / in background
 M/R - Parallel, batch data-processing algorithm
 3S Challenge - Size, Speed, Scale of DBs
 C.A.P. - Consistency, Availability, Partition Tolerance
 NoSQL - Family of DBMS that grew due to the 3S Challenge
 NewSQL - Family of DBMS that provide ACID at scale
Questions?!
Feel free to drop me a line:
Email: ahmed.sayed.shouman@gmail.com

Big Data Concepts

  • 2. Our agenda  Demystify the term "Big Data"  Find out what is Hadoop  Explore the realms of batch and real-time big data processing  Explore challenges of size, speed and scale in databases  Skim the surface of big-data technologies  Provide ways into the big-data world
  • 4. What is big data?  Big data is a collective term for a set technologies designed for storage, querying and analysis of extremely large data sets, sources and volumes.  Big data technologies come in where traditional off-the-shelf databases, data warehousing systems and analysis tools fall short.
  • 5. How did we end up with so much data?  Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine  Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud  An Important Side Note Big Data technologies are based on the concept of clustering - Many computers working in sync to process chunks of our data.
  • 6. Not just size  Big data isn't just about data size, but also about data volume, diversity and inter-connectedness.
  • 7. Big data is  Any attribute of our data that challenges either technological capabilities or business needs, like:  Scaling, moving, storage and retrieval of ever-growing generated data  Processing many small data points in real-time  Analysing diverse semi-structured data from multiple sources  Querying multiple, diverse data sources in real-time
  • 8. Breathe... Let's recap  Lots of data due to technological capabilities and social paradigms  Not just size! Diversity, volume and inter-connectedness also count  Scale, speed, processing, querying and analysis  Challenges technological capabilities or business needs
  • 10. Everyone talks about Hadoop  Hadoop is a powerful platform for batch analysis of large volumes of both structured and unstructured data. From: Conquering Hadoop with Haskell
  • 11. Hadoop explained  Hadoop is a horizontally scalable, fault-tolerant, open-source file system and batch-analysis platform capable of processing large amounts of data.  HDFS - Hadoop File System  M/R - Hadoop Map-Reduce platform
  • 12. Hadoop explained  HDFS is an ever-growing file system. We can store lots and lots of data on it for later use.  HDFS is used as the underlying platform for other technologies like Hadoop M/R, Apache Mahout or HBase.
  • 13. Hadoop explained  Imagine we want to look at 30 days' worth of access logs to identify site usage patterns at a volume of 30M log entries per day.  Hadoop M/R is a platform that allows us to query HDFS data in parallel for the purpose of batch (offline) data processing and analysis.
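To make the access-log example concrete: 30 days at 30M entries per day is 900M entries in total. The sketch below imitates the map, shuffle and reduce steps of an M/R job in plain, single-process Python. It is not the Hadoop API; the log format and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical access-log lines in an assumed format: "<day> <client-ip> <path>"
LOG_LINES = [
    "2014-03-01 10.0.0.1 /home",
    "2014-03-01 10.0.0.2 /pricing",
    "2014-03-02 10.0.0.1 /home",
    "2014-03-02 10.0.0.3 /home",
    "2014-03-02 10.0.0.2 /docs",
]

def map_phase(line):
    """Emit one (path, 1) pair per log entry."""
    _day, _ip, path = line.split()
    yield path, 1

def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

hits_per_page = run_job(LOG_LINES)
print(hits_per_page)
```

On a real cluster the map and reduce steps would run in parallel on many machines, each working on its own chunk of the log files; here everything runs in one process so the data flow is easy to follow.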
  • 14. Why is Hadoop so important?  Scalable and fault-tolerant  Handles massive amounts of data  Truly parallel processing  Data can be semi-structured or unstructured (schemaless)  Serves as basis for other technologies (HBase, Mahout, Impala, Shark)
  • 15. Hadoop - Words of caution  Complex  Not for real-time  Choose a distribution (Cloudera, Hortonworks, MapR) for better interoperability  Requires trained DevOps for day-to-day operations
  • 16. Breathe...  We demystified the term Big Data and glanced at Hadoop. Now what?  How do I really get into the Big Data world?
  • 17. The world of big data  Batch & Data Science  DBs  Real-Time
  • 19. Batch processing of large data sets  We collect data for the purpose of providing end-users with a better experience in our business domain. This means we have to constantly query our data and divine new insights and relevant information.  The problem is that doing this at very large scale is painful and slow.
  • 20. How do we do this on Hadoop data? Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial
  • 21. Batch processing of large data sets  Hadoop gives us the basic tools for large data processing in the form of M/R. However, Hadoop M/R is pretty annoying to work with directly as it lacks a lot of relevant tools for the job (statistical analysis, machine learning etc.)
  • 23. Hadoop querying and data science tools  Tool: Purpose  Hive: Write SQL-like M/R queries on top of Hadoop  Shark: Hive-compatible, distributed SQL query engine for Hadoop  Pig: Write scripted M/R queries on top of Hadoop  Impala: Real-time SQL-like queries of Hadoop  Mahout: Scalable machine-learning on top of Hadoop M/R
  • 24. The gentle way in  Hive and Shark are great places to start due to their SQL-like nature  Shark is faster than Hive - less frustration  You need some Hadoop data to work with (consider Avro)  Remember - it's SQL-like, not SQL  Start small, locally and grow to production later  Check out Apache Sqoop for moving processed Hadoop data to your DB
  • 25. Databases In the big data world
  • 26. Databases in the big data world  The Problem: Traditional RDBMS were not designed for storing, indexing and querying growing amounts and volumes of data.  The 3S Challenge:  Size - How much data is written and read  Speed - How fast can we write and read data  Scale - How easily can our DB scale to accommodate more data
  • 27. The 3S Challenge  There's no single, simple solution to the 3S challenge. Instead, solutions focus on making an informed sacrifice in one area in order to gain in another area.
  • 28. NoSQL and C.A.P.  NoSQL is a term referring to a family of DBMS that attempt to resolve the 3S challenge by sacrificing one of three areas:  Consistency - All clients have the same view of data  Availability - Each client can always read and write  Partition Tolerance - System works despite physical network failures
  • 29. NoSQL and C.A.P.  C.A.P. means you have to make an informed choice (and sacrifice)  No single perfect solution  Opt for mixed solutions per use-case  Remember we're talking about read/write volume, not just size
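The C.A.P. trade-off can be seen in a toy model. The sketch below is a minimal illustration, not how any particular NoSQL database works: `TwoNodeStore`, `Replica` and the `"CP"` / `"AP"` mode names are hypothetical. With the network partitioned, the store either refuses writes (keeping consistency, giving up availability) or accepts them on one side (keeping availability, letting the replicas diverge).

```python
class Replica:
    """One copy of the data (hypothetical helper for this sketch)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoNodeStore:
    """Toy two-replica key-value store; `mode` is 'CP' or 'AP' (assumed names)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True while the replicas cannot reach each other

    def write(self, replica, key, value):
        """Write through one replica; return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False  # refuse: replicas stay identical, but this client is not served
            replica.data[key] = value  # accept locally: client is served, replicas now differ
            return True
        replica.data[key] = value
        other.data[key] = value  # healthy network: replicate the write to the other copy
        return True

    def read(self, replica, key):
        return replica.data.get(key)


cp = TwoNodeStore("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(cp.a, "x", 2)

ap = TwoNodeStore("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(ap.a, "x", 2)

print(cp_accepted, cp.read(cp.a, "x"), cp.read(cp.b, "x"))
print(ap_accepted, ap.read(ap.a, "x"), ap.read(ap.b, "x"))
```

Real systems are far more nuanced (quorums, conflict resolution, tunable consistency), but the basic choice under a partition is the same one the slide describes.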
  • 30. Confused? Let's take a breath and focus
  • 32. OK, so where do I go from here?  Identify your needs and limitations  Choose a few candidates  Research & Prototype  Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB (omitted due to time constraints).
  • 34. Real-Time big data processing  Processing big data in real-time is about data volumes rather than just size. For example, given a rate of 100K ops/sec, how do I do the following in real-time:  Find anomalies in a data stream (spam)  Group check-ins by geo  Identify trending pages / topics
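As one way to picture the "trending pages" task, the sketch below keeps hit counts over a sliding time window and updates them as each event arrives, instead of re-scanning stored data. It is a single-process illustration only, not a Storm or Spark program; `TrendingWindow`, its parameters and the sample events are assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class TrendingWindow:
    """Counts page hits over the last `window_seconds` of a stream (hypothetical class)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, page) pairs, oldest first
        self.counts = Counter()

    def add(self, timestamp, page):
        """Record one hit, then drop hits that have fallen out of the window."""
        self.events.append((timestamp, page))
        self.counts[page] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Events at or before (now - window_seconds) are no longer in the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_page = self.events.popleft()
            self.counts[old_page] -= 1
            if self.counts[old_page] == 0:
                del self.counts[old_page]

    def top(self, n):
        """Return the n most-hit pages in the current window."""
        return self.counts.most_common(n)


window = TrendingWindow(window_seconds=60)
for ts, page in [(0, "/a"), (10, "/b"), (20, "/b"), (70, "/c"), (75, "/b")]:
    window.add(ts, page)

print(window.top(2))
```

Each event costs a small, roughly constant amount of work, which is the property that matters at rates like 100K ops/sec; a distributed framework would additionally partition the stream across many workers.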
  • 35. Hadoop isn't for real-time processing  When it comes to data processing and analysis, Hadoop's M/R framework is wonderful for batch (offline) processing.  However, processing, analysing and querying Hadoop data in real-time is quite difficult.
  • 36. Apache Storm and Apache Spark  Apache Storm and Apache Spark are two frameworks for large-scale, distributed data processing in real-time.  One could say that Storm and Spark are to real-time data processing what Hadoop M/R is to batch data processing.
  • 37. Apache Storm - Highlights  Runs on the JVM (Clojure / Java mix)  Fully distributed and fault-tolerant  Highly-scalable and extremely fast  Interoperability with popular languages (Scala, Python etc.)  Mature and production ready  Hadoop interoperability via Storm-YARN  Stateless / Non-Persistent (Data brought to processors)
  • 38. Apache Spark - Highlights  Fully distributed and extremely fast  Write applications in Java, Scala and Python  Perfect for both batch and real-time  Combine Hadoop SQL (Shark), Machine Learning and Data streaming  Native Hadoop interoperability  HDFS, HBase, Cassandra, Flume as data sources  Stateful / Persistent (Processors brought to data)
  • 39. Storm & Spark - Use Cases  Continuous/Cyclic Computation  Real-time analytics  Machine Learning (e.g. recommendations, personalisation)  Graph Processing (e.g. social networks) - Only Spark  Data Warehouse ETL (Extract, Transform, Load)
  • 41. Term: Purpose  Big Data: Collective term for data-processing solutions at scale  Hadoop: Scalable file-system and batch processing platform  Batch Processing: Sifting and analysing data offline / in background  M/R: Parallel, batch data-processing algorithm  3S Challenge: Size, Speed, Scale of DBs  C.A.P.: Consistency, Availability, Partition Tolerance  NoSQL: Family of DBMS that grew due to the 3S Challenge  NewSQL: Family of DBMS that provide ACID at scale
  • 42. Questions?! Feel free to drop me a line: Email: ahmed.sayed.shouman@gmail.com