Case Studies on Big-Data Processing and Streaming - Iranian Java User Group

Case Studies on Big-Data Processing and Data Streaming
By: Amir Sedighi
LinkedIn: http://linkedin.com/in/amirsedighi
Twitter: @amirsedighi

JUG - A.Sedighi - 2015 2 / 48
Background
● BS and MS degrees in Software Engineering
● Senior Software Engineer
– 20+ Years of Programming Experience
● Cross-platform Software Development
– 4+ Years of Big-Data Processing and Machine-Learning Experience
● Log Management and Forensics
● Big-Data Visualization
● Data Warehouse using Big-Data Technologies
● Recommender Systems
● Analytical Real-Time Search Engines
● Integrating Fedora Digital Library with HDFS
● Next Generation Event Processing
● Online Resume
– http://linkedin.com/in/amirsedighi

Outline
● An Introduction to Big-Data Processing
● Big-Data Processing and Data Streaming
– Data Processing
1. TB+ Scale Data Warehouse
2. Analytical Real-Time Search Solution and BI
3. Scalable Recommender System
4. Integrating Fedora Digital Library with HDFS
– Stream and Event Processing
1. Super-Fast Scalable Log Management, Forensics and BI
2. Super-Fast Scalable Fraud Detection

What Is Big-Data?

● "Every 2 Days Humans Create As Much Information As We Did Up To 2003" - Eric Schmidt

Big-Data Characteristics
● Volume
● Variety
● Velocity

You're a Part of It Every Day
● We have the ability to store anything
● Companies and people are generating data like never before in history
– Social Networks
– Online Web Portals
– Log Writers - Our Digital Footprint!

You're a Part of It Every Day
● Big-Data is whatever people do in the digital world, including the footprint (logs) of what people, companies, devices and services do, as well as traditional tabular data stores.

As a Manager, You're Still a Part of It
● “Over half of the business leaders today realize they don't have access to the insights they need to do their job.” - IBM

Vertical or Horizontal?

Scale Up or Scale Out

Linear Scalability

Big-Data Processing Solutions

Q: How to scale linearly on commodity machines?
A: MapReduce
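
To make the MapReduce answer concrete, here is a minimal single-process sketch of the model in plain Java. It is a toy and does not use the Hadoop API; the class name `WordCountSketch` and its method names are hypothetical, chosen only for illustration. The point is that each `map` call and each `reduce` call is independent of the others, which is what allows a real framework to spread them over many commodity machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process illustration of the MapReduce model (NOT the Hadoop API).
class WordCountSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: combine all values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in sequence; a real framework would run
    // the map and reduce calls in parallel on different machines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data", "Big streams")));
    }
}
```

Adding machines helps because the input lines can be split among more `map` workers and the keys among more `reduce` workers without changing the program.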

Q: How to store big data on commodity machines?
A: Distributed File System

Replication → Fault Tolerance
Replication → Data Locality → Utilization
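
The two arrows above can be illustrated with a small sketch. It assumes a deliberately simple round-robin placement, which is not the rack-aware policy a real distributed file system such as HDFS uses; the class `ReplicaPlacementSketch`, its methods and the node names are hypothetical. It shows that with several replicas a block survives a node failure, and that a task can be scheduled on any node that already holds its block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy replica placement; real distributed file systems use rack-aware policies.
class ReplicaPlacementSketch {

    // Returns, for each block, the list of nodes holding a copy of it.
    static List<List<String>> place(int blockCount, List<String> nodes, int replicationFactor) {
        if (replicationFactor < 1 || replicationFactor > nodes.size()) {
            throw new IllegalArgumentException("replication factor must be between 1 and the node count");
        }
        List<List<String>> placement = new ArrayList<>();
        for (int block = 0; block < blockCount; block++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Consecutive nodes, wrapping around, so replicas land on distinct nodes.
                replicas.add(nodes.get((block + r) % nodes.size()));
            }
            placement.add(replicas);
        }
        return placement;
    }

    // Fault tolerance: every block stays readable if it has a replica on a surviving node.
    static boolean allBlocksReadable(List<List<String>> placement, String failedNode) {
        for (List<String> replicas : placement) {
            boolean hasLiveReplica = false;
            for (String node : replicas) {
                if (!node.equals(failedNode)) {
                    hasLiveReplica = true;
                }
            }
            if (!hasLiveReplica) {
                return false;
            }
        }
        return true;
    }

    // Data locality: a task is local if the node already stores the block it reads.
    static boolean isLocal(List<List<String>> placement, int block, String node) {
        return placement.get(block).contains(node);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("n1", "n2", "n3", "n4");
        List<List<String>> placement = place(6, nodes, 3);
        System.out.println(placement);
        System.out.println(allBlocksReadable(placement, "n1"));
    }
}
```

More replicas also give the scheduler more candidate nodes for a local task, which is the link to utilization.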

Big-Data Processing: Most Popular Technologies
● Apache Hadoop Ecosystem
● NoSQL Databases
– HBase
– Cassandra
– MongoDB
– Neo4j
● Elasticsearch
– Lucene
– Solr
● Java

1. TB+ Scale Data Warehouse

DW Solution
● SQL
● ETL
– RDBMS
– NoSQL
– File System
● REST API

REST Admin Panel

Features
● Extendable Capacity for Data Warehousing
● Building Very Large Integrated Databases from Different Technologies/Schemas
– DB2, Oracle, MS-SQL …
– Different Schemas Such as HRMS, Banking, Sales...
– Making Small Dense Integrated RDBMSs
● SQL Language Interface
● Linear Scalability

Main Technologies and Frameworks
● Apache Hadoop
– Sqoop
– YARN/HDFS
– Hive or Drill or Impala
● Microservices Architecture
– Java 1.7
– Spring Boot

2. Analytical Real-Time Scalable Search Solution and BI

TB+ Scale Real-Time Searching
● Indexing Incoming Data on the Fly
● Highly Scalable and Reliable
● Simple or Complex Queries
● REST API
● Schema Agnostic
● Customizable GUI and BI
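
Indexing incoming data on the fly and answering simple or complex queries both rest on an inverted index, the core idea behind Lucene-based search engines. The following is a toy sketch of that idea only; it is not the Lucene or Elasticsearch API, and the class `InvertedIndexSketch` and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; production search engines use far more elaborate structures.
class InvertedIndexSketch {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index a document as soon as it arrives: every term points back to the document id.
    void index(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Simple query: documents containing a single term.
    Set<Integer> search(String term) {
        Set<Integer> docs = postings.get(term.toLowerCase());
        if (docs == null) {
            return Collections.<Integer>emptySet();
        }
        return Collections.unmodifiableSet(docs);
    }

    // More complex query: documents containing all of the given terms (logical AND).
    Set<Integer> searchAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            if (result == null) {
                result = new TreeSet<>(search(term));
            } else {
                result.retainAll(search(term));
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.index(1, "login failed for user admin");
        idx.index(2, "login ok for user amir");
        System.out.println(idx.searchAll("login", "failed"));
    }
}
```

Because a lookup touches only the postings of the queried terms, a document becomes searchable immediately after `index` returns, without rescanning older data.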

Business Intelligence

Rich GUI

● Elasticsearch
– Apache Lucene
– REST
● Kibana

3. Scalable Recommender System

Recommender System
● Value-added Service (Loyalty Services)
● Machine-Learning
– Clustering Across Thousands of Nodes
● Apache Mahout
● Super Fast

How Does It Work?

Technologies and Frameworks
● Microservices Architecture
● Java 1.6
● Apache Mahout
● Redis

4. Fedora Digital Library and HDFS Integration

Migrating from Expensive Servers to Commodity Machines
● Using HDFS as Fedora Digital Library Storage
– Research and Development
– Hadoop 1.2, Later Hadoop YARN 2.2
– Integrating with Solr over HDFS
● Java 1.7
● Fedora
– Islandora
– GSearch

Data Streaming

Big-Data Streaming: Most Popular Technologies
● Piping and Messaging
– Kafka, Flume, Fluentd and ZeroMQ
● Stream Processing
– Storm, Samza and Spark
● Machine Learning
– MLlib and Mahout
● Persisting
– NoSQL DBs
– HDFS

1. Log Management, Forensics and BI

Log Management, Forensics and BI
● Every Digital System Writes to Log Files
– Log Files Are Streams of Data
– Log Files Are Messy
– Log Files Come Very Fast, in an Unpredictable Manner
– Log Files Are About Everything within Your Business
● Log Files Are Full of Insight
– Who Can Hold Them for a Reasonable Period of Time?
– Who Can Search Them Rapidly?
– Who Can Visualize Them Easily (BI)?

Network Topology
[Figure: LB, Masters, Data]

● Logstash
– Flume
● Elasticsearch
● Kibana

Snapshot

2. Fraud Detection

Inputs & Outputs
● Inputs: One or multiple sources generate data continuously, in real time
– Sensor Networks
– Transaction Logs
– Text Streams such as News
– Network Traffic Analysis
● Outputs: Up-to-date answers generated continuously or periodically

Data Processing
● Transient Queries
– Issued once, then forgotten
● Persistent Data
– Stored until deleted by users or apps

Stream Processing
● Transient Data
– Deleted as the window slides forward
– Windows are time-based or count-based
● Persistent Queries
– Generate up-to-date answers as time goes on
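
The time-based and count-based windows named on this slide can be sketched in plain Java. This is a minimal illustration, not the windowing API of Storm, Samza or Spark; the class `SlidingWindowSketch` and its nested classes are hypothetical, and the time-based window assumes events arrive in timestamp order. In both cases the data is transient (old events are dropped as the window slides) while the query, here a running sum, is persistent and returns an up-to-date answer on every event.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sliding windows with a persistent "sum" query; not a stream-engine API.
class SlidingWindowSketch {

    // Count-based window: keeps only the last `size` values.
    static class CountWindow {
        private final int size;
        private final Deque<Long> values = new ArrayDeque<>();
        private long sum = 0;

        CountWindow(int size) {
            this.size = size;
        }

        // Adds a value, evicts the oldest one if the window is full,
        // and returns the up-to-date sum over the window.
        long add(long value) {
            values.addLast(value);
            sum += value;
            if (values.size() > size) {
                sum -= values.removeFirst();
            }
            return sum;
        }
    }

    // Time-based window: keeps events newer than (latest timestamp - widthMillis).
    // Assumption: events arrive in non-decreasing timestamp order.
    static class TimeWindow {
        private final long widthMillis;
        private final Deque<long[]> events = new ArrayDeque<>(); // {timestamp, amount}
        private long sum = 0;

        TimeWindow(long widthMillis) {
            this.widthMillis = widthMillis;
        }

        // Adds an event, drops events that slid out of the window,
        // and returns the up-to-date sum over the window.
        long add(long timestampMillis, long amount) {
            events.addLast(new long[] {timestampMillis, amount});
            sum += amount;
            while (events.peekFirst()[0] <= timestampMillis - widthMillis) {
                sum -= events.removeFirst()[1];
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        TimeWindow lastSecond = new TimeWindow(1000);
        System.out.println(lastSecond.add(0, 10));
        System.out.println(lastSecond.add(500, 20));
        System.out.println(lastSecond.add(1000, 5));
    }
}
```

A rule such as "flag a card whose spending in the last second exceeds a limit" would compare the value returned by `TimeWindow.add` against that limit; the limit and the rule itself are illustrative assumptions.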

Features
● Scalability
● Real-Time (At Most 1 Second of Delay)
● Super-Fast Decision Making
● Implementing Complex Fraud Scenarios as Easily as Defining Queries
● Uniform API for Processing Old or Early Events

● Java 1.7, Scala 2.11
● Apache Flume
● Apache Kafka
● Apache Spark

Where To Start?
● You need a large amount of data
● You need to change your mindset
– Rack Space and Number of Servers, IO and Process Limitations
● You need to understand the fundamentals
– Linux (Bash Script)
– Java is a must, Python works and Scala is an advantage
– SQL and ETL
– MapReduce, Resource Management and Serialization Frameworks
– Apache Hadoop Ecosystem and Successors

Thank You! Questions?
http://slideshare.net/amirsedighi
