Getting real-time analytics for device, application, and business monitoring from trillions of events and petabytes of data, as companies such as Netflix, Uber, Alibaba, PayPal, eBay, and Metamarkets do.
2. Agenda
What is Analytics?
How can we get pattern data?
Ad-hoc solution
Types of ETL
Real-Time Streaming
What is Kafka?
Apache Hadoop YARN
Druid
Tranquility
Business intelligence web application
3. What is analytics?
Data-driven decisions
Forecast future results
Reporting
Machine Learning
Metrics/Monitoring
Optimize data
Analytics is the discovery, interpretation, and communication of meaningful patterns in data; it can be used in the scenarios listed above.
4. How can we get pattern data?
In computing, extract, transform, load (ETL) refers to a process in database usage, especially in data warehousing. Alternatively, you can find patterns through interactive ad-hoc analysis, where a single tool performs all the ETL steps across multiple data sources.
5. Ad-hoc solution
Presto – multiple database support: MySQL, PostgreSQL, S3, Cassandra, HDFS, etc.
Apache Drill – multiple NoSQL database support: MongoDB, HBase, HDFS, S3, etc.
Advantages:
• No need to create a complex infrastructure for analytics
• No need to extract information to other systems
Disadvantages:
• All ETL steps are done at once
• Data cleansing is complex
• Information is extracted directly from production servers
6. Types of ETL
Batch mode extracts data with copy tools running as jobs to populate a data warehouse such as HDFS, and business analytics are built on top afterwards; real-time streaming ETL, on the other hand, extracts, transforms, and loads the data continuously as it arrives.
Conclusion
In my perspective, batch mode is only for legacy systems that cannot migrate to real-time streaming, or for small ones.
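The batch flow described above can be sketched as a toy ETL in Python (all names and data here are illustrative, not from a real system):

```python
# Toy batch-ETL sketch: extract rows from a source, transform
# (cleanse/normalize), and load the result into a "warehouse" table.

source_rows = [
    {"host": "compute-3", "cpu": "0.91"},
    {"host": "COMPUTE-3", "cpu": "bad"},   # dirty row to be cleansed away
]

def transform(row):
    """Normalize one row; return None for rows that fail cleansing."""
    try:
        return {"host": row["host"].lower(), "cpu": float(row["cpu"])}
    except ValueError:
        return None  # cleansing step: drop unparsable rows

# "Load": keep only the rows that survived the transform
warehouse = [t for t in map(transform, source_rows) if t is not None]
print(warehouse)  # [{'host': 'compute-3', 'cpu': 0.91}]
```

A real batch job would read from a database or log files and write to HDFS, but the extract → transform → load shape is the same.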
8. Real-Time Streaming topology
You can extract data with a tool called Flume, or send it from your applications directly. Flume is able to collect data from various types of sources and output it to Kafka and HDFS.
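A Flume agent of the kind described above could look roughly like this (a hedged sketch: the agent name `a1`, log path, topic, and host names are all assumptions, but the property keys follow Flume's standard configuration):

```properties
# Hypothetical Flume agent "a1": tail an application log and fan out
# to both a Kafka topic and HDFS.
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 h1

# Source: tail the application log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1 c2

# Channels buffer events between source and sinks
a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Sink 1: Kafka
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = broker1:9092
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.channel = c1

# Sink 2: HDFS, partitioned by day
a1.sinks.h1.type = hdfs
a1.sinks.h1.hdfs.path = hdfs://namenode:8020/flume/app-logs/%Y-%m-%d
a1.sinks.h1.hdfs.fileType = DataStream
a1.sinks.h1.hdfs.useLocalTimeStamp = true
a1.sinks.h1.channel = c2
```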
9. What is Kafka?
Kafka is a distributed messaging system providing fast, highly scalable, and redundant messaging through a pub-sub model.
The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers.
A topic is the container with which messages are associated; it is divided into a number of partitions.
Each node in the cluster is called a Kafka broker.
A producer is responsible for publishing data/messages to a topic.
A consumer is responsible for reading messages from a topic.
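These terms can be illustrated with a minimal in-memory model in Python (a sketch of the concepts only, not a real Kafka client — for that you would use a library such as kafka-python):

```python
from collections import defaultdict

# Minimal in-memory model of Kafka's core terms: a topic is split into
# partitions, producers append messages, and each consumer tracks its
# own read offset per partition.

class Topic:
    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Like Kafka, route by key so the same key always lands in the
        # same partition, preserving per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition):
        msgs = self.topic.partitions[partition][self.offsets[partition]:]
        self.offsets[partition] += len(msgs)
        return msgs

topic = Topic("events")
p = topic.append("host-1", "cpu=0.9")   # producer side
consumer = Consumer(topic)
print(consumer.poll(p))   # ['cpu=0.9']
print(consumer.poll(p))   # [] -- the offset has already advanced
```

In real Kafka the partitions live on brokers and offsets are tracked per consumer group, but the topic/partition/offset mechanics are the same idea.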
10. Apache Hadoop YARN (Yet Another Resource Negotiator)
Client
Submits an application/job.
Node Manager
Provides computational resources and manages application containers.
Application Master
Monitors the containers and their resource consumption; negotiates appropriate resources for containers.
Container
Runs the application spawned by the Application Master.
Resource Manager
Tracks Node Managers and the available resources in the cluster; monitors Application Masters.
11. What is Samza?
Apache Samza is a distributed stream processing framework (an application managed by YARN).
It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance,
processor isolation, security, and resource management. It's commonly used to transform,
clean up, and normalize data before saving it to the data warehouse.
You can transform/clean up data between jobs by forwarding it through Kafka topics. For example, if the message "I'm Leandro and I'm a system engineer" gets to Samza job 1, it can normalize it to "name: Leandro, I'm a system engineer", and Samza job 2 transforms that to "name: Leandro, job: system engineer".
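The two-stage normalization described above can be sketched in pure Python (real Samza tasks are written in Java against the StreamTask API; the function names and parsing rules here are assumptions for illustration):

```python
import re

# Stage 1 (Samza job 1): extract the name, keep the rest of the sentence.
def job1_normalize(message: str) -> str:
    m = re.match(r"I'm (\w+) and (.*)", message)
    if not m:
        return message  # pass unrecognized messages through unchanged
    return f"name: {m.group(1)}, {m.group(2)}"

# Stage 2 (Samza job 2): turn the remaining free text into a record.
def job2_structure(message: str) -> dict:
    m = re.match(r"name: (\w+), I'm (?:a )?(.*)", message)
    if not m:
        return {"raw": message}
    return {"name": m.group(1), "job": m.group(2)}

stage1 = job1_normalize("I'm Leandro and I'm a system engineer")
print(stage1)                  # name: Leandro, I'm a system engineer
print(job2_structure(stage1))  # {'name': 'Leandro', 'job': 'system engineer'}
```

In the real pipeline, job 1 would publish `stage1` to an intermediate Kafka topic that job 2 consumes.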
12. Samza Hadoop Integration
In the YARN web UI we can see a lot of information about the cluster, such as: resource usage and availability, the number of jobs and their status, and information about Application Masters and containers.
13. Samza work-Flow
You start a job on the YARN grid by running the Samza script run-job.sh with a specific configuration file for each job. In the config file you must set the job name, the location of the YARN package file, the task class whose process method handles each message, the Kafka input topic name, etc.
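Such a per-job configuration file might look like this (a sketch: the job name, package path, class name, topic, and host names are assumptions, but the property keys are standard Samza config keys):

```properties
# Hypothetical Samza job config, passed to run-job.sh via --config-path
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=normalize-events

# Job package that YARN distributes to the containers
yarn.package.path=hdfs://namenode:8020/samza/normalize-events.tar.gz

# Task class whose process() method is called for each message
task.class=com.example.NormalizeEventsTask
# Input stream(s), as system.stream
task.inputs=kafka.raw-events

# The Kafka system the job reads from / writes to
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=zk1:2181
systems.kafka.producer.bootstrap.servers=broker1:9092
```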
14. Druid – Real-time and historical data warehouse
Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data
aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of
data. Druid is most commonly used to power user-facing analytic applications.
Sub-second OLAP queries – Druid's unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and extremely fast aggregations.
Real-time streaming ingestion – Druid employs lock-free ingestion to allow for simultaneous ingestion and querying of high-dimensional, high-volume data sets. Explore events immediately after they occur.
Power analytic applications – Druid has numerous features built for multi-tenancy. Power user-facing analytic applications designed to be used by thousands of concurrent users.
Cost effective – Druid is extremely cost effective at scale and has numerous features built in for cost reduction. Trade off cost and performance with simple configuration knobs.
Highly available – Druid is used to back SaaS implementations that need to be up all the time. Druid supports rolling updates so your data is still available and queryable during software updates. Scale up or down without data loss.
Scalable – Existing Druid deployments handle trillions of events, petabytes of data, and thousands of queries every second.
Source: http://druid.io/druid.htm
16. Druid Components
Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve
queries over those segments. The nodes have a shared nothing architecture and know how to load segments, drop segments, and
serve queries on segments.
Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering
queries and gathering and merging results. Broker nodes know what segments live where.
Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new
segments, drop old segments, and move segments to load balance.
Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is
common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing
segments off to historical nodes. Data is queryable as soon as it is ingested by the realtime processing logic. The hand-off process is also
lossless; data remains queryable throughout the entire process.
17. Querying Druid data
Requests and responses are in JSON format. Here we are getting values of the field metrics from host compute-3.
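A query like the one described could be built as a Druid native "timeseries" query; the sketch below constructs the JSON in Python (the datasource, metric field, and interval are assumptions, but the query shape follows Druid's native query format):

```python
import json

# Hypothetical Druid timeseries query: sum the metric "value" from
# datasource "metrics" where dimension host = compute-3.
query = {
    "queryType": "timeseries",
    "dataSource": "metrics",
    "granularity": "minute",
    "filter": {"type": "selector", "dimension": "host", "value": "compute-3"},
    "aggregations": [
        {"type": "doubleSum", "name": "value_sum", "fieldName": "value"}
    ],
    "intervals": ["2016-01-01T00:00/2016-01-02T00:00"],
}

payload = json.dumps(query)

# You would POST this payload to a broker node, e.g. (not executed here):
#   requests.post("http://broker:8082/druid/v2/?pretty", data=payload,
#                 headers={"Content-Type": "application/json"})
```

The broker answers with a JSON array of timestamped result rows.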
18. Tranquility – Sending events to Druid
Tranquility is a tool which takes the final processed data from Kafka topics and writes it into Druid databases/datasources.
You must know what data structure is coming and how it is going to be saved into the Druid datasource; therefore you must map the dimensions and metrics in the Tranquility configuration file.
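The dimension/metric mapping could look roughly like the fragment below (a hedged sketch: the datasource name, columns, and metrics are assumptions; the nesting follows Tranquility's dataSchema-style configuration):

```json
{
  "dataSources": {
    "metrics": {
      "spec": {
        "dataSchema": {
          "dataSource": "metrics",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {"column": "timestamp", "format": "auto"},
              "dimensionsSpec": {"dimensions": ["host", "service"]}
            }
          },
          "metricsSpec": [
            {"type": "count", "name": "count"},
            {"type": "doubleSum", "name": "value_sum", "fieldName": "value"}
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "hour",
            "queryGranularity": "none"
          }
        }
      }
    }
  }
}
```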
19. Business intelligence web application
Business intelligence web applications permit users to explore and visualize the data in the data warehouse and create reports easily.
Superset – An amazing tool developed by Airbnb which lets users create awesome reports, but we hit some limitations when querying raw (non-aggregated) data. Its installation requires many Python pip modules.
Tableau – We didn't have an opportunity to test it, but it's an enterprise/commercial solution and looks like the most complete one.
Metabase – It's easy to install and operate. Setting up reports is pretty straightforward.
20. Metabase - Open source business intelligence tool
Get the jar file, run it, and access it at
https://<Address>:3000
Add a database/datasource
connection on the web UI.
Ask a question to build a
report/analysis.