In recent years, data science has shifted markedly towards big data processing, and a change in methodology seems inevitable. This change does not have to mean writing off decades of investment in classical data processing technologies and data warehousing. Instead, modern practices make it possible to adapt those investments to an environment in which business data is produced on a massive scale.
In this talk we review some frameworks and solutions for modern big data processing, along with a few case studies carried out in Iran.
2. JUG - A.Sedighi - 2015 2 / 48
Background
● BS and MS degrees in Software Engineering
● Senior Software Engineer
– 20+ Years of Programming Experience
● Cross-platform Software Development
– 4+ Years of Big-Data Processing and Machine-Learning Experience
● Log Management and Forensics
● Big-Data Visualization
● Data Warehouse using Big-Data Technologies
● Recommender Systems
● Analytical Real-Time Search Engines
● Integrating Fedora Digital Library with HDFS
● Next Generation Event Processing
● Online Resume
– http://linkedin.com/in/amirsedighi
Outline
● An Introduction to Big-Data Processing
● Big-Data Processing and Data Streaming
– Data Processing
1. TB+ Scale Data Warehouse
2. Analytical Real-Time Search Solution and BI
3. Scalable Recommender System
4. Integrating Fedora Digital Library with HDFS
– Stream and Event Processing
1. Super Fast Scalable Log Management, Forensics and BI
2. Super Fast Scalable Fraud Detection
You're a Part of It Every Day
● We have the ability to store anything
● Companies and people are generating data like
never before in history
– Social Networks
– Online Web Portals
– Log Writers - Our Digital Footprint!
You're a Part of It Every Day
● Big-Data is whatever people do in the digital world:
the footprint (logs) left by people, companies,
devices and services, as well as traditional
tabular data stores.
As a Manager, You're Still a Part of It
● “Over half of the business leaders today, realize they
don't have access to the insights they need to do their
job.” - IBM
Features
● Extendable Capacity for Data Warehousing
● Making Very Big Integrated Databases Based on Different
Technologies/Schemas
– DB2, Oracle, MS-SQL …
– Different Schemas Such as HRMS, Banking, Sales...
– Making Small Dense Integrated RDBMSs
● SQL Language Interface
● Linear Scalability
Main Technologies and Frameworks
● Apache Hadoop
– Sqoop
– YARN/HDFS
– Hive or Drill or Impala
● Microservices Architecture
– Java 1.7
– Spring Boot
2. Analytical Real-Time Scalable Search Solution and BI
TB+ Scale Real-Time Searching
● Indexing Incoming Data on-the-fly
● Highly Scalable and Reliable
● Simple or Complex Queries
● REST API
● Schema Agnostic
● Customizable GUI and BI
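To make "indexing incoming data on-the-fly" and "schema agnostic" a little more concrete, here is a minimal in-memory sketch of an inverted index in plain Java. It is not how Elasticsearch or Lucene is implemented; the class name `TinyInvertedIndex` and its methods are hypothetical and chosen only for illustration. Each document is a free-form map of fields, so no schema has to be declared before indexing.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy index: maps each lower-cased term to the ids of documents containing it. */
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes every field value of a document; field names are not fixed in advance. */
    void index(int docId, Map<String, String> doc) {
        for (String value : doc.values()) {
            // Split on runs of non-word characters to get simple terms.
            for (String term : value.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                Set<Integer> ids = postings.get(term);
                if (ids == null) {
                    ids = new TreeSet<>();
                    postings.put(term, ids);
                }
                ids.add(docId);
            }
        }
    }

    /** Returns the ids of all documents containing the term (case-insensitive). */
    Set<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase());
        return ids == null
                ? Collections.<Integer>emptySet()
                : Collections.unmodifiableSet(ids);
    }
}
```

A document becomes searchable as soon as `index` returns, which is the property the slide refers to; a real engine adds analysis, ranking, sharding and persistence on top of this idea.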
Main Technologies and Frameworks
● Elasticsearch
– Apache Lucene
– REST
● Kibana
3. Scalable Recommender System
Recommender System
● Value-added Service (Loyalty Services)
● Machine-Learning
– Clustering Across Thousands of Nodes
● Apache Mahout
● Super Fast
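As a rough illustration of what a recommender computes, the sketch below counts how often items appear together in the same basket and suggests the most frequent companion. This is a deliberately simpler technique than the clustering mentioned above, it runs on a single machine, and it does not use Apache Mahout's API; the class `CoOccurrenceRecommender` and its methods are hypothetical names. It assumes each basket lists an item at most once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy recommender based on item co-occurrence counts. */
class CoOccurrenceRecommender {
    // item -> (other item -> number of baskets containing both)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one basket; assumes the basket has no duplicate items. */
    void addBasket(List<String> items) {
        for (String a : items) {
            for (String b : items) {
                if (a.equals(b)) {
                    continue;
                }
                Map<String, Integer> row = counts.get(a);
                if (row == null) {
                    row = new HashMap<>();
                    counts.put(a, row);
                }
                Integer current = row.get(b);
                row.put(b, current == null ? 1 : current + 1);
            }
        }
    }

    /** Returns the item seen most often together with the given item, or null if it is unknown. */
    String recommendFor(String item) {
        Map<String, Integer> row = counts.get(item);
        if (row == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : row.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // ties are broken arbitrarily in this sketch
    }
}
```

The pairwise counting is the part that grows quickly with data volume, which is why distributed frameworks are used for it at scale.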
Migrating from Expensive Servers to Commodity Machines
● Making HDFS as Fedora Digital Library Storage
– Research and Development
– Hadoop 1.2, Later Hadoop YARN 2.2
– Integrating with Solr over HDFS
● Java 1.7
● Fedora
– Islandora
– GSearch
Big-Data Streaming, Most Popular Technologies
● Piping and Messaging
– Kafka, Flume, Fluentd and ZeroMQ
● Stream Processing
– Storm, Samza and Spark
● Machine Learning
– MLlib and Mahout
● Persisting
– NoSQL DBs
– HDFS
1. Log Management, Forensics and BI
Log Management, Forensics and BI
● Every Digital System Writes to Log Files
– Log Files Are Streams of Data
– Log Files Are Messy
– Log Files Arrive Very Fast, in an Unpredictable Manner
– Log Files Are About Everything within Your Business
● Log Files Are Full of Insight, for Those Who Can
– Hold Them For a Reasonable Period of Time
– Search Them Rapidly
– Visualize Them Easily (BI)
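Because log files are messy, the first step before they can be searched or visualized is usually to turn each line into structured fields. The sketch below shows that step in plain Java with a regular expression. The log layout (`<date> <time> <LEVEL> <source> <message>`), the class `LogLineParser` and the field names are assumptions made for this example, not a format taken from the talk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for one assumed log layout. */
class LogLineParser {
    // Assumed layout: "<yyyy-MM-dd HH:mm:ss> <LEVEL> <source> <free-text message>"
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\S+)\\s+(.*)$");

    /** Returns structured fields, or a single "raw" field when the line does not match. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(line.trim());
        if (m.matches()) {
            fields.put("timestamp", m.group(1));
            fields.put("level", m.group(2));
            fields.put("source", m.group(3));
            fields.put("message", m.group(4));
        } else {
            // Keep messy lines instead of dropping them, so nothing is lost.
            fields.put("raw", line);
        }
        return fields;
    }
}
```

Keeping unmatched lines under a `raw` field is a design choice for this sketch: it lets malformed input still be stored and searched later.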
Network Topology
[Figure: cluster topology diagram; labels: LB, Masters, Data]
Main Technologies and Frameworks
● Logstash
– Flume
● Elasticsearch
● Kibana
Inputs & Outputs
● Inputs: One or multiple sources generate data continuously, in
real time
– Sensor Networks
– Transaction Logs
– Text Streams such as News
– Network Traffic Analysis
● Outputs: Up-to-date Answers generated continuously or
periodically
Data Processing
● Transient Query
– Issued once, then forgotten
● Persistent Data
– Stored until deleted by user or apps
Stream Processing
● Transient Data
– Deleted as the Window Slides Forward
● Persistent Queries
– Generate Up-to-date Answers as Time Goes On
● Windows
– Time-Based
– Count-Based
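The idea of a persistent query over transient data can be sketched with a count-based sliding window: the query (here, a running average) stays in place, while events are dropped as soon as they fall out of the window. This is a single-threaded illustration in plain Java, not the API of Storm, Samza or Spark; `CountWindowAverage` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical count-based sliding window that keeps only the last `capacity` events. */
class CountWindowAverage {
    private final int capacity;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    CountWindowAverage(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Feeds one event and returns the up-to-date average over the current window. */
    double onEvent(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > capacity) {
            // Transient data: the oldest event is discarded as the window slides forward.
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }

    /** Number of events currently held in the window. */
    int size() {
        return window.size();
    }
}
```

With a capacity of 3, feeding 1, 2, 3 and then 10 yields averages over (1), (1, 2), (1, 2, 3) and (2, 3, 10): the first event has already been forgotten by the time the fourth arrives. A time-based window works the same way, except that eviction is driven by event timestamps instead of a count.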
Features
● Scalability
● Real-Time (At Most 1 Second of Delay)
● Super Fast Decision Making
● Implementing Complex Fraud Scenarios Is as Easy as Defining Queries
● Uniform API for Processing Old or Early Events
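One way to picture a fraud scenario expressed as a query is a velocity rule: flag a card when it produces more than a given number of transactions within a time window. The sketch below implements that with a time-based sliding window per card. The rule, the class `VelocityRule` and its parameters are hypothetical examples, not the scenarios or the Spark-based implementation used in the case study, and the code assumes timestamps arrive in non-decreasing order for each card.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical rule: more than `maxEvents` transactions per card within `windowMillis`. */
class VelocityRule {
    private final long windowMillis;
    private final int maxEvents;
    // card id -> timestamps of that card's events still inside the window
    private final Map<String, Deque<Long>> recent = new HashMap<>();

    VelocityRule(long windowMillis, int maxEvents) {
        this.windowMillis = windowMillis;
        this.maxEvents = maxEvents;
    }

    /**
     * Registers one transaction and returns true when the card now exceeds the limit.
     * Assumes timestamps are non-decreasing per card.
     */
    boolean isSuspicious(String cardId, long timestampMillis) {
        Deque<Long> times = recent.get(cardId);
        if (times == null) {
            times = new ArrayDeque<>();
            recent.put(cardId, times);
        }
        // Slide the window forward: evict events that are windowMillis old or older.
        while (!times.isEmpty() && timestampMillis - times.peekFirst() >= windowMillis) {
            times.removeFirst();
        }
        times.addLast(timestampMillis);
        return times.size() > maxEvents;
    }
}
```

For example, `new VelocityRule(1000, 2)` would flag the third transaction of the same card within one second, while transactions from other cards, or from the same card after the window has passed, are not flagged.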
Main Technologies and Frameworks
● Java 1.7, Scala 2.11
● Apache Flume
● Apache Kafka
● Apache Spark
Where To Start?
● You Need a Large Amount of Data
● You Need to Change Your Mindset
– Rack Space and Number of Servers, I/O and Processing Limitations
● You need To Understand Fundamentals
– Linux (Bash Script)
– Java Is a Must, Python Works and Scala Is an Advantage
– SQL and ETL
– MapReduce, Resource Management and Serialization Frameworks
– Apache Hadoop Ecosystem and Successors