COMMERCIAL IN CONFIDENCE Copyright 2018 FUJITSU LIMITED
Real-Time Web Server Log Analytics Using Apache Spark - Kafka
Ankit Gupta
Data, Big Data & Modern Big Data Approaches
Concept | Traditional Data | Traditional Big Data | Modern Big Data (Spark)
Data Sources | Relational, files, message queues | Relational, files, message queues, data service | Relational, files, message queues, data service, NoSQL
Integration Analysis | Minimal | Medium | Faster time to market; modeled by analytical transformations
Real-time | Minimal real time | Minimal real time | In real time or die
Data Access | Primarily batch | Batch | Micro batch (Spark Streaming)
Open Source Technologies | Fully embraced | Minimal | TCO rules
Need for Real-Time Analytics
When referring to “analytics,” people often think of manipulating an existing set of structured data to yield insights. “Real-time
analytics” takes this definition a step further: new data is constantly appended to the existing data set, and the updated data set is
continuously re-analyzed for new insights. For analytics to be real-time, data must be ingested
immediately upon creation and results delivered within seconds, so that those interpreting the data can react right away.
The use cases below show why real-time analytics is critical to performance and user experience, and highlight the capabilities
that enable it in each layer of your system or application:
• The Application Layer
With your developer team preparing for a big push to production, you’re worried about the possibility of unforeseen issues
immediately following the deployment. Testing in development will never provide an exact replica of what will happen in production.
Therefore, the more you are able to view and monitor your logs in real-time, the faster you will be able to address and rectify issues.
While big issues may be easy to spot, real-time analytics can also help you identify small issues building over time that could
eventually slow down your application and user experience. While batch-processed analytics can only give you a historical
analysis of your system's data, real-time analytics lets you identify anomalous patterns in your data as they occur. Using a
log analytics tool that offers “anomaly alerts” can help you identify early warning signs of larger issues.
Need for Real-Time Analytics (continued)
• The Database Layer
Imagine over the course of several minutes, your popular e-commerce application hasn’t received any orders. Where’s the first
place you’d look for a possible issue? You may first check to see if your website is still reachable from a browser. Then, you
may check your server logs. Or perhaps you check your APM tool? Or a web analytics tool? Are they all saying the same thing?
Or nothing at all? When you notice there aren’t any errors in your code and traffic to your website appears to have remained
steady, you decide to investigate your database. Only then, after wasting time investigating other scenarios, do you see your
database was improperly configured in the last deployment and has reached its row limit. How many sales have you lost while
guessing where to investigate? Without log-based, real-time analytics, database errors can go undiscovered and are often noticed
only after a period of conspicuous inactivity and investigation. With a real-time aggregated log analytics service, database
errors stream into the same single view as the rest of your system’s log events as they occur. Alerts on database errors can
be generated just as easily as alerts for the rest of your environment. Tools that offer custom tagging of specific event types
can also help you spot database-specific errors as they occur.
• Server/Hosting Layer
Let’s say your mobile app was just featured on Product Hunt and you’re suddenly experiencing a spike in traffic. Luckily, your
app runs in an auto scaling environment and handles the load without issue. When the traffic later subsides and your servers
scale back, you decide to analyze the distribution of 400 errors over time. But how will you access data from the servers that
scaled down? If you weren’t sending those log files to a central location in real-time, your data is forever lost. In this scenario,
centralizing your logs in real-time is crucial to capturing all relevant data.
Use Case Model 1: Web Server Log Analysis / Potential Security Log Sources
Web server log analysis and statistics generator: we analyze the web server logs to compute the following statistics
for further data analysis, reports, and dashboards:
• Response counts by different HTTP response codes
• Response content size
• IP address of the clients to assess where the highest web traffic is coming from
• Top end point URLs to identify which services are accessed more than others
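The four statistics listed above can be sketched in plain Python (not the Spark job from the slides). The slides do not name a log format, so this sketch assumes Apache Common Log Format; `LOG_PATTERN`, `parse_line`, `summarize`, and the sample lines are hypothetical names and data chosen for illustration.

```python
import re
from collections import Counter

# Assumption: logs are in Apache Common Log Format (not stated in the slides).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent; count it as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

def summarize(lines, top_n=3):
    """Compute the four statistics from the list above for parsable lines."""
    records = [r for r in map(parse_line, lines) if r is not None]
    sizes = [r["size"] for r in records]
    content_size = None
    if sizes:
        content_size = {"min": min(sizes), "max": max(sizes),
                        "avg": sum(sizes) / len(sizes)}
    return {
        "status_counts": Counter(r["status"] for r in records),
        "content_size": content_size,
        "top_ips": Counter(r["ip"] for r in records).most_common(top_n),
        "top_endpoints": Counter(r["endpoint"] for r in records).most_common(top_n),
    }

# Hypothetical sample data, including one malformed line that is skipped.
SAMPLE_LINES = [
    '10.0.0.1 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2018:13:55:37 +0000] "GET /api/orders HTTP/1.1" 500 120',
    '10.0.0.1 - - [10/Oct/2018:13:55:38 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.3 - - [10/Oct/2018:13:55:39 +0000] "POST /login HTTP/1.1" 404 -',
    'not a log line',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

In a streaming job the same per-line parsing would run on each micro batch, with the counters maintained across batches rather than over a fixed list.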
Events to look for in security logs, with typical log keywords:
• Successful user login: “Accepted password”, “Accepted publickey”, “session opened”
• Failed user login: “authentication failure”, “failed password”
• User log-off: “session closed”
• User account change or deletion: “password changed”, “new user”, “delete user”
• Sudo actions: “sudo: … COMMAND=…”, “FAILED su”
• Service failure: “failed” or “failure”
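As a rough illustration of how the keywords above could drive event tagging, the sketch below classifies a log line by case-insensitive substring match. It is a simplification: the sudo entry is reduced to the literal "sudo:", the generic "failed"/"failure" category is checked last so that more specific categories win, and the sample lines are hypothetical.

```python
from collections import Counter

# Keywords taken from the checklist above. Order matters: the generic
# "service_failure" keywords are last so specific categories match first.
EVENT_KEYWORDS = {
    "successful_login": ["accepted password", "accepted publickey", "session opened"],
    "failed_login": ["authentication failure", "failed password"],
    "logoff": ["session closed"],
    "account_change": ["password changed", "new user", "delete user"],
    "sudo_action": ["sudo:", "failed su"],  # simplified from "sudo: ... COMMAND=..."
    "service_failure": ["failed", "failure"],
}

def classify(line):
    """Return the first category whose keyword occurs in the line, else None."""
    lowered = line.lower()
    for category, keywords in EVENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None

# Hypothetical sample lines, written for this example only.
SAMPLE_AUTH_LINES = [
    "Oct 10 13:55:36 host sshd[123]: Accepted password for alice from 10.0.0.5 port 22 ssh2",
    "Oct 10 13:56:01 host sshd[124]: Failed password for root from 10.0.0.9 port 22 ssh2",
    "Oct 10 13:57:12 host sudo: alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/bin/ls",
    "Oct 10 13:58:00 host systemd[1]: nginx.service: Failed with result 'exit-code'.",
    "Oct 10 13:59:00 host cron[9]: job started",
]

if __name__ == "__main__":
    print(Counter(classify(line) for line in SAMPLE_AUTH_LINES))
```

Substring matching is cheap enough to run on every incoming line; a production rule set would likely need per-source patterns to limit false positives from the generic "failed" keyword.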
Use Case Model 2: Checklist for Security on Windows
Look at both inbound and outbound activities.
Examples below show log excerpts from Cisco ASA logs; other devices have similar functionality.
• Traffic allowed on firewall: “Built … connection”, “access-list … permitted”
• Traffic blocked on firewall: “access-list … denied”, “deny inbound”, “Deny … by”
• Bytes transferred (large files?): “Teardown TCP connection … duration … bytes …”
• Bandwidth and protocol usage: “limit … exceeded”, “CPU utilization”
• Detected attack activity: “attack from”
• User account changes: “user added”, “user deleted”, “User priv level changed”
• Administrator access: “AAA user …”, “User … locked out”, “login failed”
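The checklist patterns above use “…” as a placeholder for variable text. One possible way to apply them, sketched here in plain Python, is to turn each pattern into a regular expression in which every “…” matches any run of characters. The helper names and the sample lines are hypothetical; the samples are simplified and are not verbatim Cisco ASA output.

```python
import re

def compile_pattern(pattern):
    """Compile a checklist pattern; each '…' becomes a non-greedy wildcard."""
    parts = [re.escape(part.strip()) for part in pattern.split("…")]
    return re.compile(".*?".join(parts), re.IGNORECASE)

# Patterns copied from the checklist above, grouped by event type.
FIREWALL_PATTERNS = {
    "traffic_allowed": ["Built … connection", "access-list … permitted"],
    "traffic_blocked": ["access-list … denied", "deny inbound", "Deny … by"],
    "bytes_transferred": ["Teardown TCP connection … duration … bytes …"],
    "bandwidth_protocol": ["limit … exceeded", "CPU utilization"],
    "attack": ["attack from"],
    "account_change": ["user added", "user deleted", "User priv level changed"],
    "admin_access": ["AAA user …", "User … locked out", "login failed"],
}

COMPILED = {
    event: [compile_pattern(p) for p in patterns]
    for event, patterns in FIREWALL_PATTERNS.items()
}

def match_events(line):
    """Return every event type with at least one pattern found in the line."""
    return [
        event for event, regexes in COMPILED.items()
        if any(regex.search(line) for regex in regexes)
    ]

# Hypothetical, simplified sample lines (not verbatim device output).
SAMPLE_FIREWALL_LINES = [
    "Built inbound TCP connection 1001 for outside:203.0.113.7/4431 to inside:10.0.0.5/443",
    "access-list OUTSIDE_IN denied tcp outside/203.0.113.9(5555) -> inside/10.0.0.5(22)",
    "Teardown TCP connection 1001 for outside:203.0.113.7/4431 to inside:10.0.0.5/443 duration 0:00:30 bytes 5120",
    "User admin locked out after 3 failed attempts",
]

if __name__ == "__main__":
    for line in SAMPLE_FIREWALL_LINES:
        print(match_events(line), line)
```

Returning a list rather than a single label lets one line count toward several event types, which suits inbound and outbound views built from the same stream.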
Use Case: Background
• We'll look at a web server log analytics use case to show how Spark Streaming can run analytics on data
streams that are generated continuously, computing the following statistics for further data analysis,
reports, and dashboards:
• IP address of the clients to assess where the highest web traffic is coming from.
• Top end point URLs to identify which services are accessed more than others.
• Streaming Data Analytics: Spark Streaming is an extension of the core Spark API that makes it easy to build fault-tolerant
processing of real-time data streams. Streaming data is a continuous sequence of data records generated by
sources such as sensors, server traffic, and online searches. Examples include user activity on
websites, monitoring data, server logs, and other event data. Streaming data processing applications help with live
dashboards, real-time online recommendations, and instant fraud detection.
Spark Streaming divides the live stream of data into batches (called micro batches) of a pre-defined interval
(‘N’ seconds) and treats each batch as a Resilient Distributed Dataset (RDD). We can then process these
RDDs using operations such as map, reduce, reduceByKey, join, and window. The results of these RDD operations are returned
in batches. We usually store these results in a data store for further analytics, to generate reports and dashboards, or
to send event-based alerts.
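The micro-batch model described above can be imitated in a few lines of plain Python. This is a simulation of the idea only, not the Spark Streaming API: `micro_batches`, `reduce_by_key`, and `windowed_counts` are hypothetical helpers that stand in for batching by interval, a map plus reduceByKey step, and a sliding window over the most recent batches. Events are assumed to arrive ordered by timestamp.

```python
from collections import Counter, deque

def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive batches of
    batch_interval seconds. Empty intervals yield empty batches."""
    batch, batch_end = [], batch_interval
    for timestamp, value in events:
        while timestamp >= batch_end:  # close every interval that has elapsed
            yield batch
            batch, batch_end = [], batch_end + batch_interval
        batch.append(value)
    yield batch

def reduce_by_key(pairs):
    """Sum the values of (key, value) pairs per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def windowed_counts(batches, window_length):
    """For each batch, yield (counts for this batch, counts over the last
    window_length batches). The window slides by one batch."""
    window = deque(maxlen=window_length)
    for batch in batches:
        counts = reduce_by_key((ip, 1) for ip in batch)  # map, then reduce by key
        window.append(counts)
        yield counts, sum(window, Counter())

# Hypothetical events: (arrival time in seconds, client IP address).
EVENTS = [
    (0, "10.0.0.1"), (1, "10.0.0.2"), (1, "10.0.0.1"),
    (2, "10.0.0.1"), (3, "10.0.0.3"), (5, "10.0.0.2"),
]

if __name__ == "__main__":
    for per_batch, per_window in windowed_counts(micro_batches(EVENTS, 2), 2):
        print(dict(per_batch), dict(per_window))
```

With a 2-second interval the six events fall into three batches, and the window of two batches always covers the last 4 seconds of traffic.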
Kafka-Spark Streaming Architecture
[Architecture diagram. Labels: Data Producer; Ingestion Layer, Aggregation Layer, Analysis Layer, Storage Layer]
Kafka-Mechanism
Applications (producers) send messages (records)
to a Kafka node (broker), and these messages are
processed by other applications called consumers.
Messages are stored in a topic, and consumers
subscribe to the topic to receive new messages.
Apache Kafka is a distributed streaming platform that lets you publish and subscribe to streams of records (similar to a message queue or
enterprise messaging system), store streams of records in a fault-tolerant, durable way, and process streams of records as they occur.
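The producer, topic, and consumer roles described above can be modelled with a small in-memory class. This is an illustration of the publish/subscribe idea only, not the Kafka client API: the `Broker` class and its methods are hypothetical, and it leaves out partitions, replication, and consumer groups. It does show two properties of the model: a topic is an append-only log, and each consumer tracks its own read position (offset).

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for a single broker (illustration only)."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> append-only record log
        self._offsets = defaultdict(int)  # (consumer, topic) -> next offset to read

    def send(self, topic, record):
        """Producer side: append a record and return its offset in the topic."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def poll(self, consumer, topic, max_records=10):
        """Consumer side: return records not yet seen by this consumer."""
        start = self._offsets[(consumer, topic)]
        records = self._topics[topic][start:start + max_records]
        self._offsets[(consumer, topic)] = start + len(records)
        return records

if __name__ == "__main__":
    broker = Broker()
    broker.send("weblogs", "GET /index.html 200")
    broker.send("weblogs", "GET /api/orders 500")
    print(broker.poll("dashboard", "weblogs"))  # both records
    print(broker.poll("dashboard", "weblogs"))  # nothing new
    print(broker.poll("alerts", "weblogs"))     # a second consumer sees both records
```

Because records stay in the log after being read, a consumer added later (here "alerts") can still process everything from the start of the topic.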
Spark-Mechanism
The main() method of the program runs in the driver. The driver is the process that runs the user code (the driver
program), which creates the SparkContext, creates RDDs, and performs transformations and actions.
The driver program splits the Spark application into tasks and schedules them to run on executors. The task scheduler
resides in the driver and distributes tasks among workers. The two key roles of the driver are:
-> Converting the user program into tasks.
-> Scheduling tasks on executors.
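The two driver roles above can be mimicked with Python's standard thread pool. This is an analogy rather than Spark code: threads stand in for executors, `make_tasks` stands in for converting the program into tasks (one per partition), and `run_job` stands in for scheduling those tasks and combining their partial results. All names and the sample data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def make_tasks(records, num_partitions):
    """Driver role 1: split the job into one task input per partition."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def count_status(partition):
    """The work a single task performs on its partition."""
    return Counter(partition)

def run_job(records, num_partitions=3, num_executors=2):
    """Driver role 2: schedule tasks on executors, then merge partial results."""
    tasks = make_tasks(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(count_status, tasks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Hypothetical input: HTTP status codes extracted from log lines.
STATUS_CODES = [200, 200, 404, 500, 200, 404, 200]

if __name__ == "__main__":
    print(run_job(STATUS_CODES))
```

There are more tasks (3) than executors (2) on purpose: as in the description above, the scheduler hands out tasks as workers become free rather than binding one task to one worker.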
Technologies Used
• Zookeeper
• Apache Kafka
• Kafka Clients- Producer/Consumer
• Kafka Connect
• Apache Spark Streaming
• Scala
• Power BI – Visualization
Environment- Cloudera 6-Node Cluster
Kafka Producer Configuration
Clickstream Data Generated from Weblog server
Submit jar file in Client mode to the Spark cluster
Visualization of Spark Streaming Applications
The first visualization is the DAG (Directed Acyclic Graph)
Processed DStream Batch
Statistics during the execution
While the data stream is sent to Kafka and processed by the Spark Streaming consumer, the execution statistics include the input rate
(events per second) and the processing time in milliseconds.
Output Stored in HDFS
Dashboard – ClickStream Analytics on PowerBI
Apache Spark Streaming -Real time web server log analytics

Apache Spark Streaming - Real time web server log analytics

  • 1. COMMERCIAL IN CONFIDENCE Copyright 2018 FUJITSU LIMITED Real time web server log Analytics Using Apache Spark - Kafka Ankit Gupta
  • 2. Data, Big Data & Modern Big Data Approaches
      CONCEPT | TRADITIONAL DATA | TRADITIONAL BIG DATA | MODERN BIG DATA (Spark)
      Data Sources | Relational; Files; Message queues | Relational; Files; Message queues; Data service | Relational; Files; Message queues; Data service; NoSQL
      Integration Analysis | Minimal | Medium | Faster time to market; Modeled by analytical transformations
      Real-time | Minimal real time | Minimal real time | In real time or die
      Data Access | Primarily batch | Batch | Micro batch (Spark Streaming)
      Open Source Technologies | Fully embraced | Minimal | TCO rules
  • 3. Need of Real time Analytics
      When referring to “analytics,” people often think of manipulating an existing set of structured data to yield insights. “Real-time analytics” takes this definition a step further by accounting for the constant appending of new data to the existing data set and continuously re-analyzing the new dataset for new insights. But for analytics to be real-time, data needs to be ingested immediately upon creation and results delivered in a matter of seconds, so that those interpreting the data can react right away.
      The following use cases show why real-time analytics is critical to performance and user experience, and highlight key capabilities that enable real-time analytics in each layer of your system or application:
      • The Application Layer
      With your developer team preparing for a big push to production, you’re worried about the possibility of unforeseen issues immediately following the deployment. Testing in development will never provide an exact replica of what will happen in production. Therefore, the more you are able to view and monitor your logs in real time, the faster you will be able to address and rectify issues.
      While big issues may be easy to spot, real-time analytics can also help you identify small issues building over time that could eventually slow down your application and degrade the user experience. Batch-processed analytics can only ever give you a historical analysis of your system’s data, whereas real-time analytics lets you identify anomalous patterns in your data as they occur. Using a log analytics tool that offers “anomaly alerts” can help you identify early warning signs of larger issues.
  • 4. Need of Real time Analytics …
      • The Database Layer
      Imagine over the course of several minutes, your popular e-commerce application hasn’t received any orders. Where’s the first place you’d look for a possible issue? You may first check to see if your website is still reachable from a browser. Then, you may check your server logs. Or perhaps you check your APM tool? Or a web analytics tool? Are they all saying the same thing? Or nothing at all?
      When you notice there aren’t any errors in your code and traffic to your website appears to have remained steady, you decide to investigate your database. Only then, after wasting time investigating other scenarios, do you see your database was improperly configured in the last deployment and has reached its row limit. How many sales have you lost while guessing where to investigate?
      Without log-based, real-time analytics, database errors can go undiscovered, often only realized after a period of noticeable inactivity and investigating. When using a real-time aggregated log analytics service, database errors stream into the same single view with the rest of your system’s log events as they occur. Alerts on database errors can be generated just as easily as alerts for the rest of your environment. And tools that offer custom tagging of specific event types can also help you spot database-specific errors as they occur.
      • Server/Hosting Layer
      Let’s say your mobile app was just featured on Product Hunt and you’re suddenly experiencing a spike in traffic. Luckily, your app runs in an auto scaling environment and handles the load without issue. When the traffic later subsides and your servers scale back, you decide to analyze the distribution of 400 errors over time. But how will you access data from the servers that scaled down? If you weren’t sending those log files to a central location in real time, your data is forever lost. In this scenario, centralizing your logs in real time is crucial to capturing all relevant data.
  • 5. Use Case Model - 1: Web server Log Analysis / Potential Security Log Sources
      Web server log analysis and statistics generator: we analyze the web server logs to compute the following statistics for further data analysis and to create reports and dashboards:
      • Response counts by different HTTP response codes
      • Response content size
      • IP address of the clients, to assess where the highest web traffic is coming from
      • Top end point URLs, to identify which services are accessed more than others
      Security events and the log strings that indicate them:
      Successful user login | “Accepted password”, “Accepted publickey”, “session opened”
      Failed user login | “authentication failure”, “failed password”
      User log-off | “session closed”
      User account change or deletion | “password changed”, “new user”, “delete user”
      Sudo actions | “sudo: … COMMAND=…”, “FAILED su”
      Service failure | “failed” or “failure”
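The four statistics listed on this slide can be sketched in plain Python. This is a minimal illustration, not the presentation's Spark job: it assumes the logs are in Apache Common Log Format (the slides do not state the format), and `LOG_PATTERN`, `parse_line`, `summarize` and the sample lines are hypothetical names and data introduced here.

```python
import re
from collections import Counter

# Assumption: Apache Common Log Format, e.g.
# 10.0.0.1 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group("size")
    return {
        "ip": match.group("ip"),
        "endpoint": match.group("endpoint"),
        "status": int(match.group("status")),
        "size": 0 if size == "-" else int(size),  # "-" means no body was sent
    }


def summarize(lines, top_n=3):
    """Compute the four statistics named on the slide; unparseable lines are skipped."""
    records = [r for r in map(parse_line, lines) if r is not None]
    sizes = [r["size"] for r in records]
    content_size = None
    if sizes:
        content_size = {
            "min": min(sizes),
            "max": max(sizes),
            "avg": sum(sizes) / len(sizes),
        }
    return {
        "status_counts": Counter(r["status"] for r in records),
        "content_size": content_size,
        "top_ips": Counter(r["ip"] for r in records).most_common(top_n),
        "top_endpoints": Counter(r["endpoint"] for r in records).most_common(top_n),
    }


# Made-up sample data for illustration only.
SAMPLE_LINES = [
    '10.0.0.1 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2018:13:55:37 +0000] "GET /login HTTP/1.1" 404 512',
    '10.0.0.1 - - [10/Oct/2018:13:55:38 +0000] "POST /login HTTP/1.1" 200 128',
    'not a log line',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

In a streaming job the same parsing function would typically be applied to each record of a batch, with the counters replaced by keyed aggregations.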
  • 6. Use Case Model - 2: Checklist for Security On windows
      Look at both inbound and outbound activities. Examples below show log excerpts from Cisco ASA logs; other devices have similar functionality.
      Traffic allowed on firewall | “Built … connection”, “access-list … permitted”
      Traffic blocked on firewall | “access-list … denied”, “deny inbound”, “Deny … by”
      Bytes transferred (large files?) | “Teardown TCP connection … duration … bytes …”
      Bandwidth and protocol usage | “limit … exceeded”, “CPU utilization”
      Detected attack activity | “attack from”
      User account changes | “user added”, “user deleted”, “User priv level changed”
      Administrator access | “AAA user …”, “User … locked out”, “login failed”
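One way to apply this checklist is to turn each phrase into a regular expression and tag incoming log lines by category. The sketch below assumes that each “…” on the slide stands for arbitrary text and that matching should ignore case; `FIREWALL_PATTERNS` and `classify` are hypothetical names, and the lines used in the usage section are made up rather than real Cisco ASA output.

```python
import re

# Categories and phrases copied from the slide.
# Assumption: "…" means "any text", written here as ".*".
FIREWALL_PATTERNS = {
    "traffic_allowed": [r"Built .* connection", r"access-list .* permitted"],
    "traffic_blocked": [r"access-list .* denied", r"deny inbound", r"Deny .* by"],
    "bytes_transferred": [r"Teardown TCP connection .* duration .* bytes"],
    "bandwidth_protocol_usage": [r"limit .* exceeded", r"CPU utilization"],
    "attack_detected": [r"attack from"],
    "user_account_change": [r"user added", r"user deleted", r"User priv level changed"],
    "administrator_access": [r"AAA user", r"User .* locked out", r"login failed"],
}

# Compile once so each log line is only scanned, not re-parsed.
COMPILED = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in FIREWALL_PATTERNS.items()
}


def classify(line):
    """Return every checklist category whose phrases appear in the line."""
    return [
        category
        for category, patterns in COMPILED.items()
        if any(p.search(line) for p in patterns)
    ]


if __name__ == "__main__":
    # Made-up lines for illustration only.
    for sample in [
        "Built outbound TCP connection 42 for outside:10.0.0.5/443",
        "Teardown TCP connection 42 duration 0:00:30 bytes 1024",
        "login failed for user admin",
    ]:
        print(classify(sample), sample)
```

Lines that match no phrase return an empty list, which makes it easy to count only the security-relevant events.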
  • 7. Use Case - Background
      • We'll look at a web server log analytics use case to show how Spark Streaming can help with running analytics on data streams that are generated continuously. The use case computes the following statistics for further data analysis and for creating reports and dashboards:
        • IP address of the clients, to assess where the highest web traffic is coming from.
        • Top end point URLs, to identify which services are accessed more than others.
      • Streaming Data Analytics: Spark Streaming is an extension of the core Spark API that makes it easy to build fault-tolerant processing of real-time data streams. Streaming data is basically a continuous group of data records generated from sources like sensors, server traffic and online searches. Examples of streaming data include user activity on websites, monitoring data, server logs, and other event data. Streaming data processing applications help with live dashboards, real-time online recommendations, and instant fraud detection.
      Spark Streaming divides the live stream of data into batches (called micro batches) of a pre-defined interval (‘N’ seconds) and then treats each batch of data as a Resilient Distributed Dataset (RDD). We can then process these RDDs using operations like map, reduce, reduceByKey, join and window. The results of these RDD operations are returned in batches. We usually store these results in a data store for further analytics, to generate reports and dashboards, or to send event-based alerts.
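The micro-batch idea described on this slide can be illustrated without Spark. The following plain-Python sketch groups timestamped events into fixed intervals and counts values per key inside each batch, which mirrors the shape of a map followed by reduceByKey. It is not the Spark Streaming API; `micro_batches`, `count_by_key` and the events are hypothetical, and unlike Spark this toy version skips intervals that contain no events.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp_seconds, value) events into consecutive fixed-length batches.

    Intervals without any event are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // interval_seconds)
        batches[batch_index].append(value)
    return [batches[index] for index in sorted(batches)]


def count_by_key(batch):
    """Stand-in for mapping each value to (value, 1) and summing the ones per key."""
    return dict(Counter(batch))


# Made-up (timestamp, client IP) events for illustration only.
EVENTS = [
    (0.5, "10.0.0.1"),
    (1.2, "10.0.0.2"),
    (4.9, "10.0.0.1"),
    (5.0, "10.0.0.3"),
    (9.1, "10.0.0.3"),
    (12.0, "10.0.0.1"),
]

if __name__ == "__main__":
    for number, batch in enumerate(micro_batches(EVENTS, interval_seconds=5)):
        print(number, count_by_key(batch))
```

Each printed dictionary corresponds to the result of one batch, which is the unit that would be written to a data store or checked against an alert threshold.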
  • 8. Kafka-Spark Streaming Architecture (diagram labels: INGESTION-LAYER, AGGREGATION-LAYER, ANALYSIS-LAYER, STORAGE-LAYER, DATA PRODUCER)
  • 9. Kafka-Mechanism
      Applications (producers) send messages (records) to a Kafka node (broker), and these messages are processed by other applications called consumers. The messages are stored in a topic, and consumers subscribe to the topic to receive new messages.
      Apache Kafka is a distributed streaming platform that lets you:
      • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
      • Store streams of records in a fault-tolerant, durable way.
      • Process streams of records as they occur.
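The producer, topic and consumer roles can be pictured with a toy in-memory model. This is only a conceptual sketch: it is not the Kafka client API, and it leaves out brokers, partitions, replication and consumer groups. `Topic`, `Consumer`, `publish` and `poll` are hypothetical names chosen for this illustration.

```python
class Topic:
    """Append-only record log; each record's position is its offset."""

    def __init__(self, name):
        self.name = name
        self._log = []

    def publish(self, record):
        """Producer side: append a record and return its offset."""
        self._log.append(record)
        return len(self._log) - 1

    def read_from(self, offset):
        """Return all records stored at or after the given offset."""
        return self._log[offset:]


class Consumer:
    """Subscriber that tracks its own read position in the topic."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        """Fetch records not yet seen by this consumer and advance its offset."""
        records = self.topic.read_from(self.offset)
        self.offset += len(records)
        return records


if __name__ == "__main__":
    weblogs = Topic("weblogs")
    dashboard = Consumer(weblogs)
    alerts = Consumer(weblogs)

    weblogs.publish("GET /index.html 200")
    weblogs.publish("GET /login 404")
    print(dashboard.poll())  # both records

    weblogs.publish("POST /login 200")
    print(dashboard.poll())  # only the new record
    print(alerts.poll())     # all three: records stay stored after being read
```

Because records stay in the log after being read, a consumer that starts late still sees the full stream, which is the property that distinguishes this model from a queue that deletes messages on delivery.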
  • 10. Spark-Mechanism
      The main() method of the program runs in the driver. The driver is the process that runs the user code (called the driver program), which creates the SparkContext, creates RDDs, and performs transformations and actions. The driver program splits the Spark application into tasks and schedules them to run on executors. The task scheduler resides in the driver and distributes tasks among workers. The two key roles of the driver are:
      -> Converting the user program into tasks.
      -> Scheduling tasks on executors.
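The two driver roles named on this slide, splitting work into tasks and scheduling them on executors, can be mimicked on a single machine with a thread pool. This is an analogy only and does not reflect Spark's actual scheduler; `split_into_tasks`, `count_errors` and `run_job` are hypothetical names, and the status codes are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(records, num_partitions):
    """Driver role 1: turn the job into independent tasks, one per partition."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_errors(partition):
    """Work done by one task: count HTTP status codes of 400 and above."""
    return sum(1 for status in partition if status >= 400)


def run_job(records, num_partitions=3, num_executors=2):
    """Driver role 2: schedule the tasks on a pool of workers and combine results."""
    tasks = split_into_tasks(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(count_errors, tasks))
    return sum(partial_counts)


if __name__ == "__main__":
    status_codes = [200, 404, 500, 200, 301, 403]  # made-up data
    print(run_job(status_codes))
```

With more partitions than workers, some tasks wait until a worker is free, which is the same reason a Spark job with many partitions and few executor cores runs its tasks in waves.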
  • 11. Technologies Used • Zookeeper • Apache Kafka • Kafka Clients- Producer/Consumer • Kafka Connect • Apache Spark Streaming • Scala • Power BI – Visualization
  • 14. Clickstream Data Generated from Weblog server
  • 15. Submit jar file in Client mode to the Spark cluster
  • 16. Visualization of Spark Streaming Applications
  • 17. First visualization is the DAG (Directed Acyclic Graph)
  • 19. Statistics during the execution
      While the data stream is being sent to Kafka and processed by the Spark Streaming consumer, the streaming statistics include the input rate (the number of events per second) and the processing time in milliseconds.
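The two statistics mentioned on this slide are simple to derive for a single batch. The sketch below is a hand calculation rather than a Spark API call; `batch_statistics` and the numbers are hypothetical. The extra `keeping_up` flag reflects the usual rule of thumb that a batch should finish processing within its batch interval, otherwise batches queue up.

```python
def batch_statistics(event_count, batch_interval_seconds, processing_seconds):
    """Derive per-batch statistics from raw measurements."""
    return {
        # Input rate: events received per second during the batch interval.
        "input_rate_per_second": event_count / batch_interval_seconds,
        # Processing time reported in milliseconds, as on the slide.
        "processing_time_ms": processing_seconds * 1000.0,
        # True when the batch was processed within its own interval.
        "keeping_up": processing_seconds <= batch_interval_seconds,
    }


if __name__ == "__main__":
    # Made-up measurements: 500 events in a 5 second batch, processed in 1.25 seconds.
    print(batch_statistics(500, 5, 1.25))
```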
  • 21. Dashboard – ClickStream Analytics on PowerBI