Real-time Analytics
Kafka, Apache Samza, Hadoop Yarn, Druid, Tranquility and Metabase.
Leandro Totino Pereira
Devops/Cloud Engineer
Agenda
 What is Analytics?
 How can we get pattern data?
 Ad-hoc solution
 ETL’s types
 Real-Time Streaming
 What is Kafka?
 Apache Hadoop YARN
 Druid
 Tranquility
 Business intelligence web application
What is analytics?
Analytics is the discovery, interpretation, and communication of meaningful patterns in data. It can be used in the following scenarios:
• Data-driven decisions
• Forecasting future results
• Reporting
• Machine learning
• Metrics/monitoring
• Data optimization
How can we get pattern data?
In computing, extract, transform, load (ETL) refers to a process in database usage, especially in data warehousing. Alternatively, you can use interactive ad-hoc analysis, where a single tool performs the ETL steps across multiple data sources.
Ad-hoc solution
Presto – Multiple database support: MySQL, PostgreSQL, S3, Cassandra, HDFS, etc.
Apache Drill – Multiple NoSQL database support: MongoDB, HBase, HDFS, S3, etc.
Advantages:
• No need to create complex infrastructure for analytics
• No need to extract information to other systems
Disadvantages:
• All ETL steps are done at once
• Data cleansing is complex
• Information is extracted from production servers
ETL’s types
Batch mode extracts data with copy tools, run as jobs, to populate a data warehouse such as HDFS; business analytics are then built on top of it. Real-time streaming, on the other hand, performs the ETL in real time.
Conclusion
In my perspective, batch mode is only for legacy systems that cannot migrate to real-time streaming, or for small ones.
Real-Time Streaming
Real-Time Streaming topology
You can extract data with a tool called Flume or directly from your applications. Flume can collect data from various types of sources and output it to Kafka and HDFS.
What is Kafka?
Kafka is a distributed messaging system providing fast, highly scalable, and redundant messaging through a pub-sub model.
The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers.
• A topic is the container with which messages are associated. It is divided into a number of partitions.
• Each node in the cluster is called a Kafka broker.
• Producers are responsible for publishing data/messages into a topic.
• Consumers are responsible for getting messages from a topic.
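To make the terms above concrete, here is a toy in-memory model of a topic, a broker, and a consumer. It is only a sketch of the pub-sub idea and not the Kafka client API: the class names, the `metrics` topic, the keys, and the hashing scheme are all assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32


class Topic:
    """A toy topic: a named container split into a fixed number of partitions."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # The same key always maps to the same partition (simple stable hash).
        return crc32(key.encode("utf-8")) % len(self.partitions)


class Broker:
    """A toy broker: one node that stores topics and serves writes and reads."""

    def __init__(self):
        self.topics = {}

    def create_topic(self, name, num_partitions=3):
        self.topics[name] = Topic(name, num_partitions)

    def publish(self, topic_name, key, value):
        # Producer side: append the message to the partition chosen by its key.
        topic = self.topics[topic_name]
        partition = topic.partition_for(key)
        topic.partitions[partition].append(value)
        return partition, len(topic.partitions[partition]) - 1  # (partition, offset)

    def fetch(self, topic_name, partition, offset):
        # Consumer side: return every message at or after the given offset.
        return self.topics[topic_name].partitions[partition][offset:]


class Consumer:
    """A toy consumer: remembers its own offset for each partition."""

    def __init__(self, broker, topic_name):
        self.broker = broker
        self.topic_name = topic_name
        self.offsets = defaultdict(int)

    def poll(self):
        messages = []
        topic = self.broker.topics[self.topic_name]
        for partition in range(len(topic.partitions)):
            batch = self.broker.fetch(self.topic_name, partition, self.offsets[partition])
            self.offsets[partition] += len(batch)
            messages.extend(batch)
        return messages


broker = Broker()
broker.create_topic("metrics", num_partitions=2)

# These two calls play the producer role.
broker.publish("metrics", key="compute-3", value="cpu=0.42")
broker.publish("metrics", key="compute-3", value="cpu=0.57")

consumer = Consumer(broker, "metrics")
first = consumer.poll()   # both messages, in publish order (same key, same partition)
second = consumer.poll()  # nothing new since the last poll
print(first)
print(second)
```

Because the consumer keeps its own offsets, a second consumer created later would read the same messages again from the start, which is the main difference from a classic queue.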
Apache Hadoop YARN (Yet Another Resource Negotiator)
Client
Submits an application/job.
Resource Manager
Tracks the Node Managers and the resources available in the cluster. Monitors Application Masters.
Node Manager
Provides computational resources and manages application containers.
Application Master
Negotiates appropriate resources for containers and monitors the containers and their resource consumption.
Container
Runs the application spawned by the Application Master.
What is Samza?
Apache Samza is a distributed stream processing framework (it runs as an application on YARN). It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It is commonly used to transform, clean up, and normalize data before saving it to a data warehouse.
You can transform/clean up data between jobs, forwarding it through Kafka topics. For example, if the message “I'm Leandro and I'm system engineer” reaches Samza job 1, it can be normalized to “name: Leandro, and I'm system engineer”, and Samza job 2 can then transform it to “name: Leandro, job: system engineer”.
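The two-step example above can be sketched with plain Python functions. This is not the Samza API (Samza tasks are written for the JVM); the function names are hypothetical, the regular expressions only handle this one message shape, and Python lists stand in for the Kafka topics between jobs.

```python
import re


def job1_extract_name(message):
    """First job: turn the leading "I'm <name>" into a "name:" field."""
    match = re.match(r"I'm (\w+) (.*)", message)
    if match is None:
        return message  # pass through anything we do not recognize
    name, rest = match.groups()
    return f"name: {name}, {rest}"


def job2_extract_job(message):
    """Second job: turn the remaining "and I'm <job>" into a "job:" field."""
    match = re.match(r"(name: \w+), and I'm (.+)", message)
    if match is None:
        return message
    name_part, job = match.groups()
    return f"{name_part}, job: {job}"


# Lists stand in for the Kafka topics that connect the jobs.
raw_topic = ["I'm Leandro and I'm system engineer"]
intermediate_topic = [job1_extract_name(m) for m in raw_topic]
final_topic = [job2_extract_job(m) for m in intermediate_topic]

print(intermediate_topic[0])
print(final_topic[0])
```

Keeping each job small and chaining them through topics means a step can be changed or replayed without touching the others.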
Samza Hadoop Integration
The YARN web UI shows a lot of information about your cluster, such as resource usage and availability, the number of jobs and their status, and information about application masters and containers.
Samza workflow
You start a job on the YARN grid by running the Samza script run-job.sh with a specific configuration file for each job. In the config file you must set the job name, the location of the YARN package file, the task class that provides the process method, the Kafka input topic name, etc.
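As a rough sketch of the settings listed above, the snippet below renders a small job configuration as `.properties` text. The key names follow Samza's configuration naming as far as I know and should be checked against the Samza version in use; every value except the YARN job factory class (the job name, package path, task class, and input topic) is a hypothetical placeholder.

```python
def render_properties(settings):
    """Render a flat dict as Java-style .properties text, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


job_settings = {
    # Run the job on YARN.
    "job.factory.class": "org.apache.samza.job.yarn.YarnJobFactory",
    # Hypothetical job name.
    "job.name": "normalize-messages",
    # Hypothetical location of the packaged job that YARN downloads.
    "yarn.package.path": "hdfs://namenode/path/to/job-package.tar.gz",
    # Hypothetical task class that provides the process method.
    "task.class": "com.example.NormalizeTask",
    # Hypothetical input, written as <system>.<stream>.
    "task.inputs": "kafka.raw-messages",
}

properties_text = render_properties(job_settings)
print(properties_text)
```

A real job typically needs more settings than these (for example the Kafka system definition), so this is only the part the text above names.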
Druid – Real-time and historical data warehouse
Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data
aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of
data. Druid is most commonly used to power user-facing analytic applications.
Sub-second OLAP Queries
Druid’s unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and extremely fast aggregations.
Real-time Streaming Ingestion
Druid employs lock-free ingestion to allow for simultaneous ingestion and querying of high dimensional, high volume data sets. Explore events immediately after they occur.
Power Analytic Applications
Druid has numerous features built for multi-tenancy. Power user-facing analytic applications designed to be used by thousands of concurrent users.
Cost Effective
Druid is extremely cost effective at scale and has numerous features built in for cost reduction. Trade off cost and performance with simple configuration knobs.
Highly Available
Druid is used to back SaaS implementations that need to be up all the time. Druid supports rolling updates so your data is still available and queryable during software updates. Scale up or down without data loss.
Scalable
Existing Druid deployments handle trillions of events, petabytes of data, and thousands of queries every second.
Source: http://druid.io/druid.htm
Druid architecture
Druid Components
Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve
queries over those segments. The nodes have a shared nothing architecture and know how to load segments, drop segments, and
serve queries on segments.
Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering
queries and gathering and merging results. Broker nodes know what segments live where.
Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new
segments, drop old segments, and move segments to load balance.
Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is
common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing
segments off to historical nodes. Data is queryable as soon as it is ingested by the realtime processing logic. The hand-off process is also
lossless; data remains queryable throughout the entire process.
Querying Druid data
Requests and responses are in JSON format. In this example, we are getting the values of the metrics field for the host compute-3.
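A minimal sketch of what such a JSON request might look like, assuming a Druid native timeseries query. The datasource name `metrics`, the dimension `host`, the metric `value`, the granularity, and the interval are assumptions for illustration, not the query from the original slide. Broker nodes accept native queries as an HTTP POST to `/druid/v2/`; the snippet only builds the request body and sends nothing.

```python
import json


def build_timeseries_query(datasource, host, interval, metric):
    """Build a Druid native timeseries query restricted to a single host."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "minute",
        "intervals": [interval],
        # Keep only rows whose "host" dimension equals the requested host.
        "filter": {"type": "selector", "dimension": "host", "value": host},
        # Sum the metric within each time bucket.
        "aggregations": [
            {"type": "doubleSum", "name": metric, "fieldName": metric}
        ],
    }


query = build_timeseries_query(
    datasource="metrics",                      # hypothetical datasource
    host="compute-3",
    interval="2017-01-01T00:00:00Z/2017-01-02T00:00:00Z",  # hypothetical interval
    metric="value",                            # hypothetical metric name
)
body = json.dumps(query, indent=2)
print(body)
```

The response to a timeseries query is also JSON: a list of time buckets, each carrying the aggregated values.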
Tranquility – Sending events to Druid
Tranquility is a tool that reads the final processed data from Kafka topics and writes it into Druid datasources.
You must know the structure of the incoming data and how it will be stored in the Druid datasource; therefore, you must map dimensions and metrics in the Tranquility configuration file.
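To illustrate why the event structure has to be known up front, here is a small sketch that checks an incoming event against a declared mapping of dimensions and metrics. It is not Tranquility's configuration format; the field names (`timestamp`, `host`, `service`, `value`) and the `split_event` helper are hypothetical.

```python
TIMESTAMP_FIELD = "timestamp"      # hypothetical timestamp field
DIMENSIONS = ["host", "service"]   # hypothetical dimensions (values to filter and group by)
METRICS = ["value"]                # hypothetical metrics (numbers to aggregate)


def split_event(event):
    """Split an event into timestamp, dimensions, and metrics per the mapping."""
    expected = [TIMESTAMP_FIELD, *DIMENSIONS, *METRICS]
    missing = [field for field in expected if field not in event]
    if missing:
        raise ValueError(f"event is missing mapped fields: {missing}")
    return {
        "timestamp": event[TIMESTAMP_FIELD],
        "dimensions": {name: str(event[name]) for name in DIMENSIONS},
        "metrics": {name: float(event[name]) for name in METRICS},
    }


row = split_event({
    "timestamp": "2017-01-01T00:00:00Z",
    "host": "compute-3",
    "service": "cpu",
    "value": 0.42,
})
print(row)
```

An event that lacks a mapped field is rejected here instead of being stored with gaps, which is the kind of mismatch the mapping is meant to prevent.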
Business intelligence web application
Business intelligence web applications let users explore and visualize the data in the data warehouse and create reports easily.
Superset – An amazing tool developed by Airbnb that lets users create great reports, but we ran into some limitations when querying raw data rather than aggregated data. Installation requires many Python pip modules.
Tableau – We didn't have an opportunity to test it, but it is an enterprise/commercial solution and looks like the most complete.
Metabase – Easy to install and operate. Setting up reports is pretty straightforward.
Metabase - Open source business intelligence tool
Get the jar file, run it, and access it at:
https://<Address>:3000
Add a database/datasource connection in the web UI.
Use Ask Question to build a report/analysis.
Thank you!
Questions?
More information:
Linkedin:
https://www.linkedin.com/in/leandro-totino-pereira
Facebook:
https://www.facebook.com/leandro.totinopereira

Real time analytics

  • 1. Real-time Analytics Kafka, Apache Samza, Hadoop Yarn, Druid, Tranquility and Metabase. Leandro Totino Pereira Devops/Cloud Engineer
  • 2. Agenda  What is Analytics?  How can we get pattern data?  Ad-hoc solution  ETL’s types  Real-Time Streaming  What is Kafka?  Apache Hadoop YARN  Druid  Tranquility  Business intelligence web application
  • 3. What is analytics? Data-driven decisions Forecast future results Reporting Machine Learning Metrics/Monitoring Optimize data Analytics is the discovery, interpretation, and communication of meaningful patterns in data and it can be used in the following scenarios.
  • 4. How can we get pattern data? In computing, extract, transform, load (ETL) refers to a process in database usage and especially in data warehousing or you can get by interactive Ad-hoc analysis where a unique solution does ETL from multiples data source.
  • 5. Ad-hoc solution. Presto: multiple database support (MySQL, PostgreSQL, S3, Cassandra, HDFS, etc.). Apache Drill: multiple NoSQL database support (MongoDB, HBase, HDFS, S3, etc.). Advantages: no need to create complex infrastructure for analytics; no need to extract information to other systems. Disadvantages: all ETL steps are done at once; data cleansing is complex; information is extracted from production servers.
  • 6. ETL types. Batch mode extracts data using copy tools, run as jobs, to populate a data warehouse such as HDFS, on top of which business analytics can then be built. Real-time streaming, on the other hand, performs the ETL in real time. Conclusion: in my perspective, batch mode is only for legacy systems that cannot migrate to real-time streaming, or for small ones.
  • 8. Real-time streaming topology. You can extract data with a tool called Flume or directly from your applications. Flume can take data from various types of sources and output it to Kafka and HDFS.
  • 9. What is Kafka? Kafka is a distributed messaging system providing fast, highly scalable and redundant messaging through a pub-sub model. The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers. A topic is the container with which messages are associated; it is divided into a number of partitions. Each node in the cluster is called a Kafka broker. Producers are responsible for publishing data/messages into a topic. Consumers are responsible for getting messages from a topic.
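The relationship between topics, partitions, producers and consumers can be illustrated with a small in-memory model. This is only a sketch of the concepts, not the Kafka client API: the `Topic` and `Consumer` classes are hypothetical, and CRC32 is used as a stand-in for the hash that a real producer applies to the message key.

```python
import zlib
from collections import defaultdict


class Topic:
    """Hypothetical in-memory stand-in for a topic split into partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Producer side: the same key always maps to the same partition,
        # so messages for one key keep their relative order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


class Consumer:
    """Reads every partition in order and tracks its own offsets."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        messages = []
        for index, log in enumerate(self.topic.partitions):
            start = self.offsets[index]
            messages.extend(log[start:])
            self.offsets[index] = len(log)  # remember how far we have read
        return messages


topic = Topic("metrics", num_partitions=3)
for i in range(4):
    topic.publish("compute-3", f"cpu={i}")

consumer = Consumer(topic)
first = consumer.poll()   # everything published so far
second = consumer.poll()  # nothing new since the last poll
print(first)
print(second)
```

In a real cluster the partitions are spread over several brokers and replicated, which is where the scalability and redundancy mentioned above come from.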
  • 10. Apache Hadoop YARN (Yet Another Resource Negotiator). Client: submits an application/job. Resource Manager: checks the Node Managers and the available resources in the cluster; monitors the Application Masters. Node Manager: provides computational resources and manages application containers. Application Master: negotiates appropriate resources for containers and monitors the containers and their resource consumption. Container: runs the application spawned by the Application Master.
  • 11. What is Samza? Apache Samza is a distributed stream processing framework (it runs as an application on YARN). It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It is commonly used to transform, clean up, and normalize data before it is saved to the data warehouse. You can transform/clean up data across several jobs, forwarding it from one job to the next through Kafka topics. For example, if the message "I'm Leandro and I'm system engineer" reaches Samza job 1, that job can normalize it to "name: Leandro, and I'm system engineer", and Samza job 2 can then transform it to 'name: Leandro, job: "system engineer"'.
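The two-job example can be sketched as two small functions chained together. This is an illustration in plain Python, not Samza code (real Samza tasks are written against its Java API); the function names and the lists standing in for Kafka topics are assumptions.

```python
import re


def normalize_name(message):
    """Job 1: turn the leading "I'm <name>" into a "name:" field."""
    match = re.match(r"^I'm (\w+) and (.*)$", message)
    if match is None:
        return message  # pass unrecognized messages through unchanged
    name, rest = match.groups()
    return f"name: {name}, and {rest}"


def normalize_job(message):
    """Job 2: turn the remaining "I'm <job>" into a "job:" field."""
    match = re.match(r"^name: (\w+), and I'm (.+)$", message)
    if match is None:
        return message
    name, job = match.groups()
    return f'name: {name}, job: "{job}"'


# Plain lists stand in for the Kafka topics between the jobs.
raw_topic = ["I'm Leandro and I'm system engineer"]
intermediate_topic = [normalize_name(m) for m in raw_topic]
clean_topic = [normalize_job(m) for m in intermediate_topic]

print(intermediate_topic[0])  # name: Leandro, and I'm system engineer
print(clean_topic[0])         # name: Leandro, job: "system engineer"
```

Keeping each step in its own job means each one can be scaled, restarted or replaced independently, with the intermediate topic acting as a buffer between them.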
  • 12. Samza Hadoop integration. The YARN web UI shows a lot of information about your cluster, such as resources used and available, the number of jobs and their status, and information about application masters and containers.
  • 13. Samza workflow. You start a job on the YARN grid by running the Samza script run-job.sh with a specific configuration file for each job. In the config file you must set the job name, the location of the YARN package file, the task class that contains the process method, the Kafka input topic name, etc.
  • 14. Druid: real-time and historical data warehouse. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data. Druid is most commonly used to power user-facing analytic applications. Sub-second OLAP queries: Druid's unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and extremely fast aggregations. Real-time streaming ingestion: Druid employs lock-free ingestion to allow for simultaneous ingestion and querying of high dimensional, high volume data sets. Explore events immediately after they occur. Power analytic applications: Druid has numerous features built for multi-tenancy. Power user-facing analytic applications designed to be used by thousands of concurrent users. Cost effective: Druid is extremely cost effective at scale and has numerous features built in for cost reduction. Trade off cost and performance with simple configuration knobs. Highly available: Druid is used to back SaaS implementations that need to be up all the time. Druid supports rolling updates so your data is still available and queryable during software updates. Scale up or down without data loss. Scalable: existing Druid deployments handle trillions of events, petabytes of data, and thousands of queries every second. Source: http://druid.io/druid.htm
  • 16. Druid components. Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve queries over those segments. The nodes have a shared-nothing architecture and know how to load segments, drop segments, and serve queries on segments. Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering queries and gathering and merging results. Broker nodes know what segments live where. Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new segments, drop old segments, and move segments to load balance. Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing segments off to historical nodes. Data is queryable as soon as it is ingested by the realtime processing logic. The hand-off process is also lossless; data remains queryable throughout the entire process.
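The scatter/gather role of the broker can be illustrated with a toy model. This is a conceptual sketch only, not Druid code: the `HistoricalNode` and `Broker` classes, the segment identifiers and the sample rows are all hypothetical.

```python
from collections import Counter


class HistoricalNode:
    """Hypothetical node that serves queries over its local segments."""

    def __init__(self, segments):
        # segment id -> list of (host, value) rows
        self.segments = segments

    def query(self, segment_id):
        # Partial aggregate computed on a single segment.
        totals = Counter()
        for host, value in self.segments[segment_id]:
            totals[host] += value
        return totals


class Broker:
    """Hypothetical broker that knows which node holds which segment."""

    def __init__(self, nodes):
        self.segment_locations = {
            segment_id: node for node in nodes for segment_id in node.segments
        }

    def sum_by_host(self):
        merged = Counter()
        # Scatter: ask the owning node for each segment's partial result.
        for segment_id, node in self.segment_locations.items():
            # Gather and merge: add the partial sums together.
            merged.update(node.query(segment_id))
        return dict(merged)


node_a = HistoricalNode({"2017-01-01": [("compute-3", 1), ("compute-4", 4)]})
node_b = HistoricalNode({"2017-01-02": [("compute-3", 5)]})
result = Broker([node_a, node_b]).sum_by_host()
print(result)
```

Because segments are immutable and each node only works on its own copies, the partial results can be computed independently and merged without coordination between historical nodes.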
  • 17. Querying Druid data. Requests and responses are in JSON format. The example query retrieves the metric values for host compute-3.
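The query on the slide itself is not reproduced in this transcript, so the following is only a sketch of what such a request might look like, using Druid's native JSON timeseries query with a selector filter on the host. The datasource name `metrics`, the metric name `value`, the granularity and the interval are assumptions chosen for illustration.

```python
import json


def build_timeseries_query(data_source, host, metric, interval):
    """Build a Druid native timeseries query filtered to one host."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "minute",
        "intervals": [interval],
        # Keep only the rows whose "host" dimension equals the given host.
        "filter": {"type": "selector", "dimension": "host", "value": host},
        # Sum the metric within each time bucket.
        "aggregations": [
            {"type": "doubleSum", "name": metric, "fieldName": metric}
        ],
    }


query = build_timeseries_query(
    data_source="metrics",  # assumed datasource name
    host="compute-3",
    metric="value",         # assumed metric name
    interval="2017-01-01T00:00:00Z/2017-01-02T00:00:00Z",  # assumed interval
)
body = json.dumps(query, indent=2)
print(body)
```

A body like this would be sent with an HTTP POST and `Content-Type: application/json` to the broker's `/druid/v2/` endpoint; the response is a JSON array with one entry per time bucket.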
  • 18. Tranquility: sending events to Druid. Tranquility is a tool that reads the final processed data from Kafka topics and writes it into Druid datasources. You must know the structure of the incoming data and how it will be stored in the Druid datasource, so you must map dimensions and metrics in the Tranquility configuration file.
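The idea of mapping an incoming event to a timestamp, dimensions and metrics can be sketched as follows. This is not Tranquility's configuration format or API; the `split_event` helper and the field names in the sample event are hypothetical and only show why the structure of the data must be known in advance.

```python
def split_event(event, timestamp_field, dimensions, metrics):
    """Split a flat event into the parts a Druid datasource distinguishes."""
    expected = [timestamp_field, *dimensions, *metrics]
    missing = [field for field in expected if field not in event]
    if missing:
        # An event that does not match the mapping cannot be stored correctly.
        raise KeyError(f"event is missing fields: {missing}")
    return {
        "timestamp": event[timestamp_field],
        # Dimensions are the attributes you filter and group by.
        "dimensions": {name: str(event[name]) for name in dimensions},
        # Metrics are the numeric values you aggregate.
        "metrics": {name: float(event[name]) for name in metrics},
    }


event = {
    "timestamp": "2017-01-01T00:00:00Z",
    "host": "compute-3",
    "service": "cpu",
    "value": 42,
}
row = split_event(
    event,
    timestamp_field="timestamp",
    dimensions=["host", "service"],
    metrics=["value"],
)
print(row)
```

If an upstream job renames or drops a field, the mapping no longer matches, which is why the event layout and the mapping have to be kept in sync.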
  • 19. Business intelligence web application. Business intelligence web applications let users explore and visualize the data in the data warehouse and create reports easily. Superset: an amazing tool developed by Airbnb that lets users create great reports, but we ran into some limitations when querying raw, non-aggregated data. Its installation requires many Python pip modules. Tableau: we did not have an opportunity to test it, but it is an enterprise/commercial solution and looks like the most complete one. Metabase: easy to install and operate. Setting up reports is pretty straightforward.
  • 20. Metabase: open source business intelligence tool. Get the jar file, run it, and access it at https://<Address>:3000. Add a database/datasource connection in the web UI. Use "Ask a question" to build a report/analysis.