1. Liferay & Big Data
Getting value from your data
Miguel Ángel Pastor Olivar
Senior Software Engineer
2. About me
Who am I?
• Miguel Ángel Pastor Olivar
• Member of the Liferay core infrastructure team
• Worked in analytics for a long time
– Disclaimer: Not a computer scientist
• Email: miguel.pastor@liferay.com
• @miguelinlas3
#LRNAS2014
3. Synopsis
What are we going to talk about?
• Big Data: what is this about?
• What’s ahead of big data
• Connecting Liferay with this “new” world
• Simple architecture proposal
• Use cases
• Questions (and hopefully answers)
5. Definitions
Big Data
• It is just a buzzword
• Data is so big that regular solutions are:
– Extremely slow
– Too small
– Really expensive
• How we use all the data we already own
It is no more than a buzzword, but we generally associate it with the problem that datasets have become so big that traditional relational databases are no longer able to work with them.
Note that the NoSQL movement has emerged over the last few years and aims to handle all this new semistructured data, new ways of scaling, … in a better way.
6. Definitions
More formally …
• Volume
– Transactions, data streaming from social media, …
• Velocity
– Torrents of data in real time
• Variety
– Numerical data, text, email, video, audio, …
Volume: many factors have contributed to the increase in data volumes: transaction-based data stored through the years, social media, …
Velocity: data streaming is a reality: IoT, smart cities, RFID sensors, … We have to deal with these streams as fast as we can
Variety: tons of different formats that we need to deal with and interconnect to extract useful information
7. Trending
What is trending?
• Data volumes will keep increasing … rapidly
• Less emphasis on formal schemas
• Data driven applications
Data volumes: Facebook has over 800PB of data stored in Hadoop clusters
Formal schemas: data schemas and sources change rapidly, and we need to integrate so many disparate sources of data that we need to rapidly evolve and adapt to the changes
Data driven applications: self-driving cars, smart cities, … generic algorithms and data structures represent the world using data instead of encoding a model of the world within the software itself (some engineering is required though)
9. Business goals
You already own tons of different data
• Get value from it!
• Analyse it so you can …
– Take faster decisions
– Take better decisions
– Improve your users’ experience
• Make more money!
10. Business goals
Popular applications
• Recommender systems:
– Amazon store: you may also like …
• Predicting the future:
– Netflix does autoscaling based on past network traffic data
• Churn models
– Big telco companies build social networks to reduce churn
– Some big banks have tried to do the same
11. Business goals
Popular applications
• Sentiment analysis
– Are people talking about you on the Internet?
– Is it good or bad?
• Real Time Bidding
– Optimise advertising
• Health care
– Improve patients’ health while reducing costs
– Improve quality of life of multiple sclerosis patients
13. Terminology
Concepts
• Storage models
– Where and how we store our relevant information
• Computation models
– How we process and transform all the previous information
• Analytics
– How we can take actions based on the previous steps
14. Big Data architectures
A quick tour of some of today’s popular architectures: mainly Hadoop/HDFS and the libraries built on top of the Hadoop API
16. Data storage: HDFS
Hadoop Distributed File System (HDFS)
• Java based file system
• Scalable, fault-tolerant, distributed storage
• Designed to run on commodity hardware
• Closely related to MapReduce
This is the most popular alternative: it allows you to store your data in a distributed filesystem and execute MapReduce algorithms on top of it
We will see other alternatives to Hadoop which can do much more than MapReduce algorithms
17. Data storage: HDFS
Source: http://hortonworks.com/hadoop/hdfs/
An HDFS cluster comprises a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas
18. Data storage: NoSQL
NoSQL Movement
• Semistructured data
• Focused on
– Horizontal scalability
– Availability
• Different trade-offs: CAP, BASE, …
• Many alternatives: Cassandra, Riak, HBase, …
This “new” movement tries to deal with the huge increase of data (and its variety), focusing on different topics from those addressed by traditional relational databases: horizontal scalability, availability, unstructured data models, …
There are plenty of alternatives: memory based, disk based, key-value, key-document, graph databases, … and the usage of these new databases is increasing in Big Data systems
Some other databases have brought horizontal scalability and availability to the new …
19. Data storage: Apache Cassandra
An example: Apache Cassandra
• P2P architecture, no single point of failure
• Linear scalability
• Larger than memory datasets
• Fully durable
• Tuneable consistency
• Integrated caching
20. Data storage: NewSQL
NewSQL Movement
• Modern relational databases
• Same scalable performance as NoSQL for OLTP
• Maintain ACID guarantees
• A few alternatives: VoltDB, Google Spanner, FoundationDB, …
New designs for traditional databases (the designs differ considerably from one option to another)
Google Spanner uses GPS-based clocks, VoltDB optimises for every specific app by compiling the schema, and so on, …
22. Computation: Apache Hadoop
Apache Hadoop Map Reduce
• Framework:
– Distributed processing
– Large datasets
– Clusters of computers
• Simple programming model
• Coarse grained
• Verbose and hard to use API
29. Computation: Map Reduce
Word count example (animated diagram, final state):
• Input text: Liferay project is the best Open Source project
• Map input, one pair per input split: (index, “…”)
• Sort and shuffle output:
– (best, [1])
– (is, [1])
– (Liferay, [1])
– (Open, [1])
– (project, [1,1])
– (Source, [1])
– (the, [1])
• Reduce output:
– best: 1
– is: 1
– Liferay: 1
– Open: 1
– project: 2
– Source: 1
– the: 1
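The word count walkthrough can be followed in a few lines of plain Python. This is only a single-process sketch of the three phases, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the way the sentence is cut into three input splits are assumptions made for illustration, and the input is assumed to read "Liferay project is the best Open Source project".

```python
from collections import defaultdict


def map_phase(index, line):
    """Map: emit a (word, 1) pair for every word found in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group the emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Case-insensitive ordering, to mirror the order used on the slide.
    return sorted(groups.items(), key=lambda item: item[0].lower())


def reduce_phase(key, values):
    """Reduce: sum all the counts collected for one word."""
    return key, sum(values)


def word_count(splits):
    emitted = []
    for index, line in enumerate(splits):
        emitted.extend(map_phase(index, line))
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted))


# Hypothetical input splits: how the text is divided is an assumption.
splits = ["Liferay project is", "the best", "Open Source project"]
grouped = shuffle([pair for i, s in enumerate(splits) for pair in map_phase(i, s)])
counts = word_count(splits)
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network; here everything happens in one process so the data flow stays visible.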
64. Computation: Apache Hadoop
Apache Hadoop Map Reduce
• Batch model data crunching
• Not so good at event stream processing
• But …
• Many algorithms hard to implement using MapReduce
• Again, API hard to use
• Cascading, Scalding, Cascalog, Impala, …
65. Computation: Apache Storm
Apache Storm
• Distributed realtime computation system
• Easy to reliably process unbounded streams of data
• Multi language support
• Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
66. Computation: Apache Storm
Diagram: a topology made of two spouts and three bolts
Spouts are data sources and bolts are the event processors
There are facilities to support reliable message handling, various sources encapsulated in spouts, and various output targets. Distributed processing is baked in from the start
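The spout and bolt roles can be mimicked with Python generators. This is a single-process sketch and does not use Storm's API: `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical names, and the reliability and distribution features mentioned above are left out entirely.

```python
def sentence_spout(sentences):
    """Spout: a data source that emits tuples into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: process each incoming sentence and emit one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count per word and emit the updated count."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


# Hypothetical events; in a real system the stream would be unbounded.
events = ["page viewed", "page shared", "blog viewed"]
topology = count_bolt(split_bolt(sentence_spout(events)))
running_counts = list(topology)
```

Each bolt consumes tuples as soon as the previous stage emits them, which is the main difference from the batch model: there is no point at which the whole input has to be available.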
67. Computation: Apache Spark
Apache Spark
• Fast and general-purpose cluster computing system
– Developed by Berkeley AMPLab
• High level APIs (not MapReduce)
• Optimised engine: supports general execution graphs
• Higher-level tools:
– Spark SQL, MLlib, Spark Streaming, GraphX (will go deeper later on)
68. Computation: Apache Mahout
Apache Mahout
• Scalable machine learning library
• Built on top of Hadoop
• Some algorithms don’t require Hadoop at all
69. Computation: R
R language
• Focused on:
– Data visualisation
– Statistical computations
– Analysis of data
• Tons of built-in packages
• Connects to Hadoop through Hadoop Streaming
• Not a fast language (compared to proprietary alternatives like SAS)
71. Reference Architecture
How do we proceed?
• Plenty of alternatives
• No silver bullet
• Problems to solve:
– Data integration
– Real time
– Batch processing
87. Reference Architecture
Animated diagram (final state) showing these components:
• Event System
• Relational Database
• Hadoop
• User Tracking
• NoSQL Storage
• System Events
• Search
• Data Logs
• Monitoring
• Data Warehouse
• Streaming
• Social Graph
99. Reference Architecture: Liferay
Liferay
• Tons of data available within the platform
• System events
• User tracking (client side)
– Clicks, navigation, activities, …
• Monitoring (transactions, page load times, …)
• Models (message boards, blogs, wiki, …)
• Custom developments …
103. Reference Architecture: Unified Log Service
Data integration
Source: http://en.wikipedia.org/wiki/Maslow's_hierarchy_of_needs
Effective use of data follows a kind of Maslow's hierarchy of needs.
1. The base of the pyramid involves capturing all the relevant data
2. This data needs to be modelled in a uniform way to make it easy to read and process.
3. Work on infrastructure to process this data in various ways: MapReduce, real-time query systems, etc.
104. Reference Architecture: Unified Log Service
Log structured data flow
• Natural data structure for data flow
Diagram: a data source writes to the end of a log whose entries are numbered 0 to 9, and System A and System B both read from that log
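A log in this sense is just an append-only sequence in which every record gets a position. The sketch below is a minimal in-memory version, assuming that each reading system keeps track of its own position; the `Log` and `Reader` classes are hypothetical names and do not correspond to any particular library.

```python
class Log:
    """Append-only sequence of records, each identified by its offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Write a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Reader:
    """A consuming system that remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for n in range(10):
    log.append(f"event-{n}")

system_a = Reader(log)
system_b = Reader(log)
batch_a = system_a.poll(max_records=10)  # System A catches up completely
batch_b = system_b.poll(max_records=3)   # System B reads at its own pace
```

Because the writer only appends and each reader only moves its own offset, the systems do not need to coordinate with one another: a slow reader simply lags behind.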
105. Distributed log: Apache Kafka
Apache Kafka
• Publish-subscribe as distributed commit log
• Fast
• Scalable
• Durable
• Distributed by design
Fast: Hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable: Elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design: cluster-centric design that offers strong durability and fault-tolerance guarantees.
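The partitioning idea in the notes above can be sketched in a few lines: every record carries a key, and a stable hash of that key decides which partition receives it. This is an illustration only, not the Kafka client API; `partition_for` and `publish` are hypothetical helpers, and `zlib.crc32` is a stand-in for whatever hash function the real partitioner uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the record key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(partitions, key, value):
    """Append the record to the partition chosen for its key."""
    index = partition_for(key, len(partitions))
    partitions[index].append((key, value))
    return index


# Three in-memory lists stand in for partitions hosted on different brokers.
partitions = [[] for _ in range(3)]
first = publish(partitions, "user-42", "login")
second = publish(partitions, "user-42", "view-blog")
publish(partitions, "user-7", "login")
```

Records with the same key always land in the same partition, so their relative order is preserved there, while different keys can be spread across machines.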
106. Distributed log: Apache Kafka
Apache Kafka 1000 feet architecture
Diagram: a producer and a consumer, three brokers (Broker A, Broker B, Broker C), and ZooKeeper
110. Analytics
What are we looking for?
• A few different data sources
• Unified log service in place
• Tons of info ready to be processed:
– Batch processing
– Real time processing
– Machine learning algorithms
– Graph analysis
• Unified programming model?
111. Analytics
Apache Spark
• Fast and general engine for large-scale data processing
• Write your apps in Java, Scala or Python
• Integrated with Hadoop
• Runs on the YARN cluster manager
• Can read any existing Hadoop data (HDFS)
• In memory or on disk
120. Analytics
Apache Spark
• A driver program runs the main function and executes various parallel operations on a cluster
• Main abstraction: Resilient Distributed Datasets (RDD)
– HDFS (or any Hadoop file system)
– Scala collection
• Second abstraction: shared variables
RDD
* A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
* Created by starting with one of the following and transforming it:
- a file in the Hadoop file system (or any other Hadoop-supported file system),
- a Scala collection in the driver program.
* Automatically recovers from node failures
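The three RDD properties in these notes (partitioned, built by transforming a source, recoverable) fit in a toy class. The sketch below is not Spark's API: `ToyRDD` and its methods are hypothetical, everything runs in one process, and recovery is reduced to the idea of recomputing a partition from its source data and the recorded transformations.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned source data plus the chain of
    transformations (the lineage) needed to rebuild any partition."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions
        self.lineage = tuple(lineage)

    @classmethod
    def parallelize(cls, items, num_partitions):
        """Distribute a local collection over a fixed number of partitions."""
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self.source_partitions, self.lineage + (step,))

    def filter(self, predicate):
        step = lambda part: [x for x in part if predicate(x)]
        return ToyRDD(self.source_partitions, self.lineage + (step,))

    def compute_partition(self, index):
        """Rebuild one partition from its source by replaying the lineage.
        This is also what would happen after a partition is lost."""
        part = list(self.source_partitions[index])
        for step in self.lineage:
            part = step(part)
        return part

    def collect(self):
        result = []
        for index in range(len(self.source_partitions)):
            result.extend(self.compute_partition(index))
        return result


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = sum(even_squares.collect())
```

Nothing is computed when `map` and `filter` are called; the work happens only in `collect`, one partition at a time, which is what lets a real engine run the partitions on different nodes.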
123. Analytics
Spark SQL
• Mix SQL queries with Spark programs
• Unified Data Access
• Hive compatibility
• Standard JDBC or ODBC connectivity
• Same engine for both interactive and long running queries
126. Analytics
Spark Streaming
• Build your apps using high-level operators
• Fault tolerance: exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
• Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
• Define your own custom data sources
133. Analytics
GraphX
• API for graphs and graph-parallel computation
• Growing scale and importance
– From social networks to language modelling
• Directed multigraph with properties attached to each vertex and edge
• Growing collection of graph algorithms and builders
137. Examples: Kafka and Liferay
Connecting Liferay and Kafka
• Easy to use
• “Transparent” for the developer
• Runtime pluggable
• Common API: use it through our Message Bus
• You can take a look at Kafka Bridge
151. Examples: Recommender’s goals
You might want to read …
• Blog posts
• Ratings for previous blog posts
• Recommend to the user some entries for future reading
156. Examples: Recommender storage
Animated diagram (final state) showing these elements:
• Blog Rating save/update: UserID::BlogEntryID::Rating::Timestamp
• Blog Entry save/update: BlogEntryID::Title::CategoryNames
• Apache Kafka
• HDFS
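Reading these records back for analysis mostly means splitting on the `::` separator. The parser below is a sketch under several assumptions that the slides do not state: IDs are integers, the timestamp is in epoch seconds, category names are separated by `|`, and titles never contain `::`. The sample values are made up.

```python
from typing import List, NamedTuple


class Rating(NamedTuple):
    user_id: int
    blog_entry_id: int
    rating: float
    timestamp: int


class BlogEntry(NamedTuple):
    blog_entry_id: int
    title: str
    category_names: List[str]


def parse_rating(line):
    """Parse one UserID::BlogEntryID::Rating::Timestamp record."""
    user_id, entry_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(entry_id), float(rating), int(timestamp))


def parse_blog_entry(line):
    """Parse one BlogEntryID::Title::CategoryNames record.
    Assumes '|' separates the category names (not stated in the slides)."""
    entry_id, title, categories = line.strip().split("::")
    return BlogEntry(int(entry_id), title, categories.split("|"))


# Made-up sample records, for illustration only.
rating = parse_rating("42::1001::4.5::1412150400\n")
entry = parse_blog_entry("1001::Getting value from your data::big-data|analytics")
```

A line with the wrong number of fields raises a `ValueError` during unpacking, which is usually preferable to silently loading a malformed record.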
160. Examples: Recommender’s analysis
Collaborative filtering
• Commonly used in recommender systems
• Tries to fill in the missing entries of the user-item association matrix
• MLlib includes the Alternating Least Squares (ALS) algorithm
163. Takeaways
What I would like you to have learned today
• It is not about data size, it’s about how you use it
• You already own tons of data, you just need to get value from it
• There is no silver bullet: you have plenty of alternatives
• JVM-based Big Data technologies are usually a great choice
• Try it yourself!!
165. References
References
• Apache Kafka
• Apache Spark
• Apache Storm
• Apache Hadoop
• Big Data definition at Wikipedia
• What every software engineer should know about a log