Real-time big data with Apache Kafka,
Spark Streaming, Scala, and Elasticsearch
By
S Annu Ahmed(122N1A0573)
V Indu Priyanka(122N1A0532)
S Ravindra(122N1A0572)
M Imran Basha(122N1A0556)
P B Sravanthi(122N1A0558)
B Baby Likhitha(122N1A0514)
Contents:
• Overview of project
• What is Big Data?
• Hadoop
• Apache Kafka
• Scala
• Spark Streaming
• Elasticsearch
What is Data?
“A set of values that may be qualitative or quantitative in
nature”
What is Big Data?
“Data so large and voluminous that it overwhelms the
existing data storage and processing infrastructure is said to
be big enough to be called big data”
What is Real-Time Big Data?
“Hadoop is engineered for big data analytics, but it's
not real time. NoSQL is engineered for real-time big data,
but it's operational rather than analytical. NoSQL together
with Hadoop is the key to real-time big data”
Parameters that call for big data:
• Huge amount of data (volume)
• Complex data that includes a lot of unstructured data (variety)
• Speed at which data is generated (velocity)
What We Need?
• Fault tolerant
• Failure detection
• Fast: low latency, distributed, data locality
• Masterless, decentralized cluster membership
• Data centers
• Partition-aware
• Elasticity
• Parallelism
Hadoop
Apache Hadoop is an open-source framework for distributed storage
and processing of large data sets on commodity hardware.
[Figure: Hadoop ecosystem]
Let's recall the basic concepts of
messaging systems
Point-to-point messaging (queue): each message is consumed by one receiver
Publish-subscribe messaging (topic): each message is delivered to every subscriber
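The difference between the two messaging models can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the class names `PointToPointQueue` and `Topic` and the message `"order-1"` are made up for this sketch and do not come from any messaging library.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Removing the message means no other receiver can see it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Each message is delivered to every current subscriber."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, name):
        self._subscribers[name] = []

    def publish(self, message):
        # Every subscriber gets its own copy of the message.
        for inbox in self._subscribers.values():
            inbox.append(message)

    def inbox(self, name):
        return list(self._subscribers[name])


queue = PointToPointQueue()
queue.send("order-1")
first = queue.receive()    # one receiver gets the message
second = queue.receive()   # nothing is left for a second receiver

topic = Topic()
topic.subscribe("billing")
topic.subscribe("audit")
topic.publish("order-1")   # both subscribers receive it
```

With a queue, adding receivers spreads the work; with a topic, adding subscribers adds independent readers of the same messages.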
Apache Kafka
Overview
• An Apache project initially developed at LinkedIn
• Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g.
logs, metrics collections
• Written in Scala
Features
• Persistent messaging
• High throughput
• Supports both queue and topic semantics
• Uses ZooKeeper for forming a cluster of nodes
(producer/consumer/broker) and many more…
Real-time transfer
[Figure: a producer, Kafka brokers, ZooKeeper, and four consumers in two groups (Consumer1 and Consumer2 in Group1; Consumer3 and Consumer4 in Group2), shown as a queue topology and a topic topology, with the label "Update consumed message offset"]
The broker does not push messages to consumers; consumers poll messages from the broker.
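The pull model and the per-group offset can be illustrated with a small in-memory stand-in. This is a sketch only, not the Kafka client API: `ToyBroker`, `poll`, and `commit` are hypothetical names chosen for this example, and a single Python list plays the role of one topic partition.

```python
class ToyBroker:
    """In-memory stand-in for one topic partition: an append-only log."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # committed read position per consumer group

    def append(self, message):
        self.log.append(message)

    def poll(self, group, max_messages=10):
        # The consumer asks for data; the broker never pushes.
        start = self.offsets.get(group, 0)
        return self.log[start:start + max_messages]

    def commit(self, group, count):
        # Record how far this group has read.
        self.offsets[group] = self.offsets.get(group, 0) + count


broker = ToyBroker()
for i in range(3):
    broker.append(f"event-{i}")

# Group1 reads two messages and commits; its next poll resumes at offset 2.
batch = broker.poll("group1", max_messages=2)
broker.commit("group1", len(batch))
group1_next = broker.poll("group1")

# Group2 has its own offset, so it still sees the full log.
group2_all = broker.poll("group2")
```

Consumers that share a group share one read position, which gives queue-like behaviour inside the group, while separate groups each see every message, which gives topic-like behaviour across groups.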
About Apache Spark
• Initially started at UC Berkeley in 2009
• Fast, general-purpose cluster computing system
• Claimed to be up to 10x (on disk) to 100x (in memory) faster than Hadoop MapReduce
• Popular for running iterative machine learning algorithms
• Provides high-level APIs in Java, Scala, and Python
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data
So Why Spark?
[Figures: Hadoop execution flow; Spark execution flow]
• Most machine learning algorithms are iterative, because each iteration
can improve the results
• With a disk-based approach, each iteration's output is written to disk,
which makes it slow
• Spark can keep the working data in memory between iterations
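The cost of the disk-based approach can be pictured with a toy loop in plain Python. This sketch does not use Hadoop or Spark; the `step` function and the file name `state.json` are assumptions made up for the illustration. Both variants compute the same result, but the first one pays a write and a read on every iteration.

```python
import json
import os
import tempfile


def step(values):
    # One iteration of a hypothetical refinement: move each value halfway to 10.
    return [v + (10 - v) / 2 for v in values]


def iterate_on_disk(values, iterations):
    # Each iteration's output is written to a file and read back,
    # in the style of chained disk-based batch jobs.
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "state.json")
        with open(path, "w") as f:
            json.dump(values, f)
        for _ in range(iterations):
            with open(path) as f:
                current = json.load(f)
            with open(path, "w") as f:
                json.dump(step(current), f)
        with open(path) as f:
            return json.load(f)


def iterate_in_memory(values, iterations):
    # The working set stays in memory between iterations.
    for _ in range(iterations):
        values = step(values)
    return values


disk_result = iterate_on_disk([0.0, 4.0], 3)
memory_result = iterate_in_memory([0.0, 4.0], 3)
```

On a real cluster the disk variant also involves replication and job startup, so the gap is larger than this single-machine sketch suggests.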
Resilient Distributed Dataset (RDD)
• Fault tolerant: lost data can be recomputed from the recorded
transformations after a failure
• Created through transformations (map, filter, ...) on data or on other
RDDs
• Immutable
• Partitioned: you can indicate which RDD elements to partition across
machines based on a key in each record
• Can be reused
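The RDD properties listed above can be sketched with a toy class in plain Python. This is not the Spark API: `ToyRDD`, `from_records`, and `compute_partition` are hypothetical names for this illustration, and everything runs in one process. The point is that a derived dataset stores only its parent and a function, so any partition can be rebuilt on demand.

```python
class ToyRDD:
    """Immutable, partitioned dataset that remembers how it was derived."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions  # tuple of tuples, or None if derived
        self._parent = parent
        self._fn = fn                  # per-partition function (the lineage)

    @classmethod
    def from_records(cls, records, num_partitions, key):
        # Place each record in a partition chosen from a key in the record.
        parts = [[] for _ in range(num_partitions)]
        for record in records:
            parts[key(record) % num_partitions].append(record)
        return cls(tuple(tuple(p) for p in parts))

    def map(self, f):
        return ToyRDD(None, self, lambda part: tuple(f(x) for x in part))

    def filter(self, pred):
        return ToyRDD(None, self, lambda part: tuple(x for x in part if pred(x)))

    def num_partitions(self):
        rdd = self
        while rdd._partitions is None:
            rdd = rdd._parent
        return len(rdd._partitions)

    def compute_partition(self, index):
        # Derived data is never stored here; it is replayed from the parent.
        if self._partitions is not None:
            return self._partitions[index]
        return self._fn(self._parent.compute_partition(index))

    def collect(self):
        return [x for i in range(self.num_partitions())
                for x in self.compute_partition(i)]


base = ToyRDD.from_records(range(6), num_partitions=2, key=lambda r: r)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# A "lost" partition is rebuilt by replaying the lineage on that partition only.
recomputed = evens_squared.compute_partition(0)
all_values = evens_squared.collect()
```

Because `base` is never modified, it can be reused as the parent of any number of further transformations.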
Spark Streaming
Makes it easy to build scalable, fault-tolerant
streaming applications.
• Ease of use
• Fault tolerance
• Combine streaming with batch and interactive
queries
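One way to see how streaming and batch code can be combined is the micro-batch idea: cut the stream into small finite batches and run ordinary batch logic on each one. The sketch below is plain Python, not the Spark Streaming API; `micro_batches` and `word_count` are hypothetical helpers, and batches are cut by record count here as a simplification, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice


def word_count(lines):
    # Ordinary batch logic: count words in a finite collection of lines.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream, batch_size):
    # Cut a possibly unbounded iterator into small finite batches.
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


incoming = ["kafka spark", "spark scala", "spark"]
per_batch = [word_count(b) for b in micro_batches(incoming, batch_size=2)]
running_total = sum(per_batch, Counter())
```

The same `word_count` function would work unchanged on a stored data set, which is the sense in which batch and streaming logic can be shared.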
[Figure: Spark Streaming at scale: zillions of bytes, gigabytes per second]
[Figure: Spark Streaming input and output sources, including Kinesis and S3]
Why Scala?
• Functional
• Object-oriented programming
• On the JVM
• Static typing: easier to control performance
Scala
• Scala was created by Martin Odersky; design work began in 2001
and the first version was released in 2003
• Scala is a language that addresses the major needs of
the modern developer.
• It is a statically typed, mixed-paradigm, JVM language
with a succinct, elegant, and flexible syntax, a
sophisticated type system, and idioms that promote
scalability from small, interpreted scripts to large,
sophisticated applications.
Scala (continued)
• Scala is compelling because it feels like a dynamically
typed scripting language, due to its succinct syntax and
type inference.
• Yet Scala gives you all the benefits of static typing, a
modern object model, functional programming, and an
advanced type system.
• Scala's aim to provide advanced constructs for the
abstraction and composition of components is shared by
several recent research efforts.
What is Elasticsearch?
• In short, it can be thought of as “search engine software”
• It gives you a realistic way to run your own search engine
service (like Bing or Google), but over private, sensitive, or
confidential data and documents that you don't want on the public web
• A great extra capability for your company, enterprise, app, startup, or client
• Elasticsearch is an open-source, distributed search server that runs on
top of Lucene; it is written in Java and exposes a REST API
• Apache Lucene is a leading open-source search library that holds its own
even when compared against the most expensive commercial alternatives
• Very fast search
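Much of the search speed comes from the inverted index, the core data structure in Lucene: instead of scanning documents for a term, the engine looks the term up and gets the matching documents directly. The sketch below is a toy version in plain Python, not Lucene or the Elasticsearch API; the function names and sample documents are made up, and real engines add analysis, scoring, and compressed on-disk structures.

```python
from collections import defaultdict


def build_inverted_index(documents):
    # Map each lowercase term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    # Return ids of documents containing every query term (AND semantics).
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Kafka streams activity data",
    2: "Spark processes streaming data",
    3: "Elasticsearch indexes data for search",
}
index = build_inverted_index(docs)
hits = search(index, "streaming data")
```

A query touches only the entries for its own terms, so lookup cost depends on the query rather than on the total size of the document collection.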
Where did Elasticsearch come from?
• Originally there was a search application project called
Compass, which was primarily worked on by @kimchy
• Compass also relied on Lucene, but was not distributed
• kimchy decided to write Elasticsearch to be distributed from the
get-go, so you could say it was built with the cloud in mind
• Add more servers and they play together nicely, and they know
how to work together to split up the workload (search
queries can be resource intensive and expensive in terms of
memory/disk requirements)
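Splitting the workload across servers can be sketched as routing each document to a shard by a hash of its id, then sending each query to every shard and merging the answers. This is a plain-Python illustration, not Elasticsearch code: the function names are hypothetical, and `zlib.crc32` is a stand-in hash chosen for this sketch rather than the hash Elasticsearch itself uses.

```python
import zlib


def shard_for(doc_id, num_shards):
    # Deterministic routing; crc32 is a stand-in hash chosen for this sketch.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_documents(documents, num_shards):
    # Each shard holds only the documents routed to it.
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in documents.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search_all_shards(shards, term):
    # Scatter the query to every shard, then gather and merge the hits.
    hits = []
    for shard in shards:
        hits.extend(doc_id for doc_id, text in shard.items()
                    if term in text.lower().split())
    return sorted(hits)


documents = {"a": "kafka log", "b": "spark job log", "c": "scala code"}
shards = index_documents(documents, num_shards=2)
log_hits = search_all_shards(shards, "log")
```

Each shard only searches its own slice of the data, so adding shards on more servers reduces the work any single server does per query.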
Elasticsearch is an advanced distributed app
• It has some very cool properties and abilities when it
comes to operations that involve lots of nodes
• It scales extremely gracefully, as long as you know what
you are doing when it comes to configuration
• It has its own optimized binary protocol and makes its
own “internal network”
• It is open source