Apache Spark Streaming - Real-time Web Server Log Analytics
1. COMMERCIAL IN CONFIDENCE Copyright 2018 FUJITSU LIMITED
Real-time Web Server Log Analytics Using Apache Spark - Kafka
Ankit Gupta
2. Data, Big Data & Modern Big Data Approaches
CONCEPT                  | TRADITIONAL DATA                  | TRADITIONAL BIG DATA                            | MODERN BIG DATA (Spark)
Data Sources             | Relational, Files, Message queues | Relational, Files, Message queues, Data service | Relational, Files, Message queues, Data service, NoSQL
Integration Analysis     | Minimal                           | Medium                                          | Faster time to market; modeled by analytical transformations
Real-time                | Minimal real time                 | Minimal real time                               | In real time or die
Data Access              | Primarily batch                   | Batch                                           | Micro batch (Spark Streaming)
Open Source Technologies | Fully embraced                    | Minimal                                         | TCO rules
3. Need for Real-time Analytics
When referring to “analytics,” people often think of manipulating an existing set of structured data to yield insights. “Real-time analytics” takes this definition a step further by accounting for the constant appending of new data to the existing data set and continuously re-analyzing the combined dataset for new insights. For analytics to be real-time, data must be ingested immediately upon creation and results delivered within seconds, enabling those interpreting the data to react right away.
The use cases below exemplify why real-time analytics are critical to performance and user experience, highlighting the key capabilities that enable real-time analytics in each layer of your system or application:
• The Application Layer
With your development team preparing for a big push to production, you’re worried about unforeseen issues surfacing immediately after the deployment. Testing in development will never provide an exact replica of what will happen in production. Therefore, the more you are able to view and monitor your logs in real time, the faster you will be able to address and rectify issues. While big issues may be easy to spot, real-time analytics can also help you identify small issues building over time that could eventually slow down your application and degrade the user experience. While batch-processed analytics can only ever give you a historical analysis of your system’s data, real-time analytics enables you to identify anomalous patterns in your data as they occur. Using a log analytics tool that offers “anomaly alerts” can help you identify early warning signs of larger issues.
4. Need for Real-time Analytics …
• The Database Layer
Imagine over the course of several minutes, your popular e-commerce application hasn’t received any orders. Where’s the first
place you’d look for a possible issue? You may first check to see if your website is still reachable from a browser. Then, you
may check your server logs. Or perhaps you check your APM tool? Or a web analytics tool? Are they all saying the same thing?
Or nothing at all? When you notice there aren’t any errors in your code and traffic to your website appears to have remained
steady, you decide to investigate your database. Only then, after wasting time investigating other scenarios, do you see your
database was improperly configured in the last deployment and has reached its row limit. How many sales have you lost while
guessing where to investigate? Without log-based, real-time analytics, database errors can go undiscovered, often realized only after a period of noticeable inactivity and investigation. When using a real-time aggregated log analytics service, database
errors stream into the same single view with the rest of your system’s log events as they occur. Alerts on database errors can
be generated just as easily as alerts for the rest of your environment. And tools that offer custom tagging of specific event types
can also help you spot database-specific errors as they occur.
• Server/Hosting Layer
Let’s say your mobile app was just featured on Product Hunt and you’re suddenly experiencing a spike in traffic. Luckily, your app runs in an auto-scaling environment and handles the load without issue. When the traffic later subsides and your servers scale back, you decide to analyze the distribution of HTTP 400 errors over time. But how will you access data from the servers that were scaled down? If you weren’t shipping those log files to a central location in real time, that data is lost forever. In this scenario, centralizing your logs in real time is crucial to capturing all relevant data.
5. Use Case Model - 1: Web Server Log Analysis / Potential Security Log Sources
As a web server log analysis and statistics generator, we analyze the web server logs to compute the following statistics for further data analysis and to create reports and dashboards (a parsing sketch follows the list):
• Response counts by different HTTP response codes
• Response content size
• IP address of the clients to assess where the highest web traffic is coming from
• Top end point URLs to identify which services are accessed more than others
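The statistics above are computed from fields parsed out of each log line. As a minimal, illustrative Scala sketch (not part of the original deck), the LogLineParser object, the LogRecord case class, and the regex below are assumptions based on the Apache common log format:

import scala.util.matching.Regex

object LogLineParser {
  // Hypothetical pattern for the Apache common log format:
  // IP, identd, user, [timestamp], "method endpoint protocol", status, size
  val LogPattern: Regex =
    """(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)""".r

  case class LogRecord(ip: String, endpoint: String, status: Int, bytes: Long)

  def parse(line: String): Option[LogRecord] = line match {
    case LogPattern(ip, _, _, endpoint, _, status, size) =>
      Some(LogRecord(ip, endpoint, status.toInt,
        if (size == "-") 0L else size.toLong))
    case _ => None // skip malformed lines
  }
}

// Example:
// parse("""127.0.0.1 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326""")
// => Some(LogRecord(127.0.0.1, /index.html, 200, 2326))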
Potential security events to look for in system logs:
Successful user login           | “Accepted password”, “Accepted publickey”, “session opened”
Failed user login               | “authentication failure”, “failed password”
User log-off                    | “session closed”
User account change or deletion | “password changed”, “new user”, “delete user”
Sudo actions                    | “sudo: … COMMAND=…”, “FAILED su”
Service failure                 | “failed” or “failure”
6. Use Case Model - 2: Checklist for Security on Network Devices
Look at both inbound and outbound activities.
Examples below show log excerpts from Cisco ASA logs; other devices have similar functionality.
Traffic allowed on firewall      | “Built … connection”, “access-list … permitted”
Traffic blocked on firewall      | “access-list … denied”, “deny inbound”, “Deny … by”
Bytes transferred (large files?) | “Teardown TCP connection … duration … bytes …”
Bandwidth and protocol usage     | “limit … exceeded”, “CPU utilization”
Detected attack activity         | “attack from”
User account changes             | “user added”, “user deleted”, “User priv level changed”
Administrator access             | “AAA user …”, “User … locked out”, “login failed”
7. Use Case - Background
• We'll look at a web server log analytics use case to show how Spark Streaming can help run analytics on data streams that are generated continuously, computing the following statistics for further data analysis and to create reports and dashboards:
• IP address of the clients to assess where the highest web traffic is coming from.
• Top end point URLs to identify which services are accessed more than others.
• Streaming Data Analytics - Spark Streaming is an extension of the core Spark API that makes it easy to build fault-tolerant processing of real-time data streams. Streaming data is a continuous flow of data records generated by sources like sensors, server traffic, and online searches. Examples of streaming data include user activity on websites, monitoring data, server logs, and other event data. Streaming data processing applications enable live dashboards, real-time online recommendations, and instant fraud detection.
Spark Streaming works by dividing the live stream of data into batches (called micro batches) of a pre-defined interval (‘N’ seconds) and treating each batch of data as a Resilient Distributed Dataset (RDD). We can then process these RDDs using operations like map, reduce, reduceByKey, join, and window. The results of these RDD operations are returned in batches. We usually store these results in a data store for further analytics, to generate reports and dashboards, or to send event-based alerts.
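A minimal sketch of this micro-batch pattern in Scala follows; the socket source, 10-second batch interval, and field positions are illustrative assumptions, not the deck's actual code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WebLogStreamStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WebLogAnalytics").setMaster("local[2]")
    // Every 10 seconds, the lines received in that interval become one RDD (a micro batch)
    val ssc = new StreamingContext(conf, Seconds(10))

    // Illustrative source: log lines arriving on a TCP socket (e.g. nc -lk 9999);
    // in the actual pipeline the source would be a Kafka topic
    val lines  = ssc.socketTextStream("localhost", 9999)
    val fields = lines.map(_.split(" ")).filter(_.length > 6)

    // Traffic per client IP (first field of the common log format)
    fields.map(f => (f(0), 1)).reduceByKey(_ + _).print()

    // Hits per endpoint URL (seventh field: the request path)
    fields.map(f => (f(6), 1)).reduceByKey(_ + _).print()

    ssc.start()            // start receiving and processing micro batches
    ssc.awaitTermination()
  }
}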
9. Kafka-Mechanism
Applications (producers) send messages (records) to a Kafka node (broker), where they are stored in a topic; other applications, called consumers, subscribe to the topic to receive new messages.
Apache Kafka is a distributed streaming platform: it lets you publish and subscribe to streams of records (similar to a message queue or enterprise messaging system), store streams of records in a fault-tolerant, durable way, and process streams of records as they occur.
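As an illustrative sketch of the producer side (the broker address, the "weblogs" topic name, and the sample log line are assumptions, not from the deck):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object WebLogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Publish one log line to the (hypothetical) "weblogs" topic;
    // any consumer subscribed to "weblogs" will receive this record
    val line = """127.0.0.1 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326"""
    producer.send(new ProducerRecord[String, String]("weblogs", line))
    producer.close()
  }
}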
10. Spark-Mechanism
The main() method of the program runs in the driver. The driver is the process that runs the user code (the driver program): it creates the SparkContext, creates RDDs, and performs transformations and actions. The driver splits the Spark application into tasks and schedules them to run on the executors; the task scheduler resides in the driver and distributes tasks among the workers. The two key roles of the driver are (see the sketch below):
-> Converting the user program into tasks.
-> Scheduling tasks on the executors.
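A minimal sketch illustrating these roles (the app name and sample data are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  // main() runs in the driver process
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DriverExample").setMaster("local[*]"))

    val lines  = sc.parallelize(Seq("GET /a 200", "GET /b 404")) // driver creates an RDD
    val codes  = lines.map(_.split(" ")(2))                      // transformation: recorded lazily
    val counts = codes.countByValue()                            // action: driver converts the job into
                                                                 // tasks and schedules them on executors
    counts.foreach(println)                                      // e.g. (200,1), (404,1)
    sc.stop()
  }
}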
11. Technologies Used
• Zookeeper
• Apache Kafka
• Kafka Clients- Producer/Consumer
• Kafka Connect
• Apache Spark Streaming
• Scala
• Power BI – Visualization
19. Statistics during the execution
While the data stream is being sent to Kafka and processed by the Spark Streaming consumer, execution statistics can be monitored, including the input rate (the number of events per second) and the processing time of each micro batch in milliseconds.