Liferay & Big Data 
Getting value from your data 
Miguel Ángel Pastor Olivar 
Senior Software Engineer
About me
Who am I?
• Miguel Ángel Pastor Olivar
• Member of the Liferay core infrastructure team
• Worked in analytics for a long time
– Disclaimer: Not a computer scientist
• Email: miguel.pastor@liferay.com
• @miguelinlas3
#LRNAS2014
Synopsis
What are we going to talk about?
• Big Data: what is this about?
• What’s ahead of big data
• Connecting Liferay with this “new” world
• Simple architecture proposal
• Use cases
• Questions (and hopefully answers)
Big Data?
Definitions
Big Data
• It is just a buzzword
• Data is so big that regular solutions are:
– Extremely slow
– Too small
– Really expensive
• How we use all the data we already own
It is no more than a buzzword, but we generally associate it with the problem that datasets have become so big that traditional relational databases can no longer work with them.
Note that the NoSQL movement has emerged in recent years and aims to handle all this new semi-structured data, new ways of scaling, …
Definitions
More formally …
• Volume
– Transactions, data streaming from social media, …
• Velocity
– Torrents of data in real time
• Variety
– Numerical data, text, email, video, audio, …
1. Many factors have contributed to the increase in data volumes: transaction-based data stored through the years, social media, …
2. Data streaming is a reality: IoT, smart cities, RFID sensors, … We have to deal with these streams as fast as we can.
3. There are tons of different formats that we need to deal with and interconnect to extract useful information.
Trending
What is trending?
• Data volumes will keep increasing … rapidly
• Less emphasis on formal schemas
• Data driven applications
Data volumes: Facebook has over 800PB of data stored in Hadoop clusters.
Formal schemas: data schemas and sources change rapidly, and we need to integrate so many disparate sources of data that we must evolve quickly and adapt to the changes.
Data driven applications: self-driving cars, smart cities, … Generic algorithms and data structures represent the world using data instead of encoding a model of the world within the software itself (some engineering is required, though).
What do you want?
Business goals
You already own tons of different data
• Get value from it!
• Analyse it so you can …
– Make faster decisions
– Make better decisions
– Improve your users' experience
• Make more money!
Business goals
Popular applications
• Recommender system:
– Amazon store: you may also like …
• Predicting the future:
– Netflix autoscales based on past network traffic data
• Churn models
– Big telco companies build social networks to reduce churn
– Some big banks have tried to do the same
Business goals
Popular applications
• Sentiment analysis
– Are people talking about you on the Internet?
– Is it good or bad?
• Real Time Bidding
– Optimise advertising
• Health care
– Improve patients' health while reducing costs
– Improve quality of life of multiple sclerosis patients
Terminology
Terminology
Concepts
• Storage models
– Where and how we store our relevant information
• Computation models
– How we process and transform all the previous information
• Analytics
– How we can take actions based on the previous steps
Big Data architectures
A quick tour of some of today's popular architectures: mainly Hadoop/HDFS and the libraries built on top of the Hadoop API
Storing data
Data storage: HDFS
Hadoop Distributed File System (HDFS)
• Java based file system
• Scalable, fault-tolerant, distributed storage
• Designed to run on commodity hardware
• Closely related to MapReduce
This is the most popular alternative. It allows you to store your data in a distributed filesystem and execute MapReduce algorithms on top of it.
We will see other alternatives to Hadoop which can do much more than MapReduce algorithms.
Data storage: HDFS
Source: http://hortonworks.com/hadoop/hdfs/
An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, or namespace and disk space quotas.
Data storage: NoSQL
NoSQL Movement
• Semistructured data
• Focused on
– Horizontal scalability
– Availability
• Different trade-offs: CAP, BASE, …
• Many alternatives: Cassandra, Riak, HBase, …
This “new” movement tries to deal with the huge increase in data (and its variety) by focusing on topics other than those addressed by traditional relational databases: horizontal scalability, availability, unstructured data models, …
There are plenty of alternatives: memory based, disk based, key-value, key-document, graph databases, … and the usage of these new databases is increasing in Big Data systems.
Some other databases have brought horizontal scalability and availability to the new
Data storage: Apache Cassandra
An example: Apache Cassandra
• P2P architecture, no single point of failure
• Linear scalability
• Larger than memory datasets
• Fully durable
• Tuneable consistency
• Integrated caching
Data storage: NewSQL
NewSQL Movement
• Modern relational databases
• Same scalable performance as NoSQL for OLTP
• Maintain ACID guarantees
• A few alternatives: VoltDB, Google Spanner, FoundationDB, …
New designs for traditional databases (they differ considerably from one option to another).
Google Spanner uses GPS-based clocks, VoltDB optimises for each specific app by compiling the schema, and so on.
Computation and Analytics
Computation: Apache Hadoop
Apache Hadoop MapReduce
• Framework:
– Distributed processing
– Large datasets
– Clusters of computers
• Simple programming model
• Coarse grained
• Verbose and hard to use API
Computation: MapReduce
[Diagram: word count with MapReduce]
• Input: "Liferay project is the best Open Source project"
• Map: each input record arrives as (index, “…”)
• Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1])
• Reduce output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
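The word-count flow in the diagram can be sketched in a few lines of plain Python. This is only an illustration of the map, sort-and-shuffle and reduce steps, not Hadoop's API: the function names are made up for the sketch, and the way the sentence is split into three input records is an assumption.

```python
from collections import defaultdict


def map_phase(index, line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Case-insensitive ordering of the keys, as in the diagram.
    return sorted(grouped.items(), key=lambda item: item[0].lower())


def reduce_phase(key, values):
    """Reduce: collapse the list of values of one key into a single count."""
    return key, sum(values)


def word_count(records):
    emitted = []
    for index, line in records:
        emitted.extend(map_phase(index, line))
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted))


# Hypothetical split of the input sentence into (index, text) records.
records = [
    (0, "Liferay project is"),
    (1, "the best"),
    (2, "Open Source project"),
]
counts = word_count(records)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process so the three steps are easy to follow.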
Computation: Apache Hadoop
Apache Hadoop MapReduce
• Batch model data crunching
• Not so good at event stream processing
• But …
– Many algorithms are hard to implement using MapReduce
– Again, the API is hard to use
• Cascading, Scalding, Cascalog, Impala, …
Computation: Apache Storm
Apache Storm
• Distributed realtime computation system
• Easy to reliably process unbounded streams of data
• Multi language support
• Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
Computation: Apache Storm
[Diagram: Storm topology with two spouts and three bolts]
Spouts are data sources and bolts are the event processors.
There are facilities to support reliable message handling, various sources encapsulated in Spouts and various targets of output. Distributed processing is baked in from the start.
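To make the spout and bolt roles concrete, here is a minimal single-process sketch using Python generators. It does not use Storm's API: `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical names, and the finite list of sentences stands in for what would be an unbounded stream.

```python
def sentence_spout():
    """Spout: a data source that emits events (finite here for the demo)."""
    for sentence in ["big data", "data streams", "big streams of data"]:
        yield sentence


def split_bolt(stream):
    """Bolt: consume sentences and emit one event per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count per word and emit the updated count."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))

# Each event flows through the whole chain as soon as it is emitted;
# keeping the last emitted count per word gives the final totals.
final_counts = dict(topology)
print(final_counts)
```

Storm adds what this sketch leaves out: the spouts and bolts run in parallel across a cluster, and the framework tracks each message so it can be replayed on failure.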
Computation: Apache Spark
Apache Spark
• Fast and general-purpose cluster computing system
– Developed at the Berkeley AMPLab
• High level APIs (not MapReduce)
• Optimised engine: supports general execution graphs
• Higher-level tools:
– Spark SQL, MLlib, Spark Streaming, GraphX (will go deeper later on)
Computation: Apache Mahout
Apache Mahout
• Scalable machine learning library
• Built on top of Hadoop
• Some algorithms don’t require Hadoop at all
Computation: R language
R language
• Focused on:
– Data visualisation
– Statistical computations
– Analysis of data
• Tons of built-in packages
• Connect to Hadoop through Hadoop Streaming
• Not a fast language (compared to proprietary alternatives like SAS)
Reference architecture
Reference Architecture
How do we proceed?
• Plenty of alternatives
• No silver bullet
• Problems to solve:
– Data integration
– Real time
– Batch processing
Reference Architecture
[Diagram: reference architecture. Components shown: Event System, Relational Database, User Tracking, NoSQL Storage, System Events, Search Data, Logs, Monitoring, Hadoop, Data Warehouse, Streaming, Social Graph]
Data sources
Reference Architecture: Liferay
Liferay
• Tons of data available within the platform
• System events
• User tracking (client side)
– Clicks, navigation, activities, …
• Monitoring (transactions, page load times, …)
• Models (message boards, blogs, wiki …)
• Custom developments …
Event system
Reference Architecture: Unified Log Service
Data integration
Source: http://en.wikipedia.org/wiki/Maslow's_hierarchy_of_needs
Effective use of data follows a kind of Maslow's hierarchy of needs.
1. The base of the pyramid involves capturing all the relevant data.
2. This data needs to be modelled in a uniform way to make it easy to read and process.
3. Work on infrastructure to process this data in various ways: MapReduce, real-time query systems, etc.
Reference Architecture: Unified Log Service
Log structured data flow
• Natural data structure for data flow
[Diagram: a Data Source writes to a log whose entries are numbered 0 to 9; System A and System B each read from the log]
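A minimal sketch of the log as a data structure may help here: an append-only list in which every record gets a sequential offset, and every reader keeps its own position. The `Log` and `Reader` classes are hypothetical names used only for this illustration.

```python
class Log:
    """Append-only sequence of records; a record's offset is its position."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Reader:
    """A subscriber that tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for n in range(10):          # the data source writes entries 0..9
    log.append(f"event-{n}")

system_a, system_b = Reader(log), Reader(log)
a_batch = system_a.poll(max_records=7)   # System A is further ahead
b_batch = system_b.poll(max_records=3)   # System B reads at its own pace
print(system_a.offset, system_b.offset)
```

Because the writer never needs to know who is reading, new consuming systems can be added later and start from any offset they like.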
Distributed log: Apache Kafka
Apache Kafka
• Publish-subscribe as distributed commit log
• Fast
• Scalable
• Durable
• Distributed by design
Fast: Hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable: Elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design: cluster-centric design that offers strong durability and fault-tolerance guarantees.
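The idea of partitioning a data stream can be sketched without a broker. In the toy below, a message key is hashed to choose a partition, so all messages with the same key land in the same partition and keep their relative order. The CRC32 hash and the `PartitionedTopic` class are assumptions of this sketch, not Kafka's own partitioner or API.

```python
import zlib


def partition_for(key, num_partitions):
    """Choose a partition from a stable hash of the key (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedTopic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        index = partition_for(key, len(self.partitions))
        self.partitions[index].append((key, value))
        # Return where the message landed: (partition, offset).
        return index, len(self.partitions[index]) - 1


topic = PartitionedTopic(num_partitions=3)
placements = [topic.publish(f"user-{i % 4}", f"click-{i}") for i in range(12)]

for index, partition in enumerate(topic.partitions):
    print(index, partition)
```

Each partition could live on a different machine, which is what lets a stream grow beyond the capacity of a single one.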
Distributed log: Apache Kafka
Apache Kafka 1,000-foot architecture view
[Diagram: a Producer, Brokers A, B and C, a Consumer, and ZooKeeper]
Computation and Analytics
Analytics
What are we looking for?
• A few different data sources
• Unified log service in place
• Tons of info ready to be processed:
– Batch processing
– Real time processing
– Machine learning algorithms
– Graph analysis
• Unified programming model?
Analytics
Apache Spark
• Fast and general engine for large-scale data processing
• Write your apps in Java, Scala or Python
• Integrated with Hadoop
– Runs on the YARN cluster manager
– Can read any existing Hadoop data (HDFS)
• In memory or on disk
Analytics
Apache Spark Main Components
[Diagram: Apache Spark core with Spark SQL, Spark Streaming, MLlib and GraphX on top]
Spark Core
Analytics
Apache Spark
• A driver program runs the main function and executes various parallel operations on a cluster
• Main abstraction: Resilient Distributed Datasets (RDD), created from:
– HDFS (or any Hadoop file system)
– Scala collection
• Second abstraction: shared variables
RDD
* A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
* Created by starting with one of the following and transforming it:
- a file in the Hadoop file system (or any other Hadoop-supported file system),
- a Scala collection in the driver program.
* Automatically recovers from node failures.
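The three RDD properties in the notes (partitioned, built by transformations, recoverable) can be illustrated with a toy model in plain Python. This is not Spark's API: `ToyRDD` is a hypothetical class that keeps the source partitions plus the list of transformations (the lineage), so any lost partition can be recomputed from its source.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus the lineage to rebuild them."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions       # one list per partition
        self._transforms = tuple(transforms)   # lineage, applied in order

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._transforms + (step,))

    def filter(self, predicate):
        step = lambda part: [x for x in part if predicate(x)]
        return ToyRDD(self._source, self._transforms + (step,))

    def compute_partition(self, index):
        """Rebuild one partition from its source by replaying the lineage."""
        part = self._source[index]
        for step in self._transforms:
            part = step(part)
        return part

    def collect(self):
        return [x for i in range(len(self._source)) for x in self.compute_partition(i)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

total = even_squares.reduce(lambda a, b: a + b)
# If the node holding partition 1 failed, only that partition is recomputed.
recovered = even_squares.compute_partition(1)
print(total, recovered)
```

Nothing is computed until `collect` or `reduce` is called, which mirrors the way transformations only describe the work while actions trigger it.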
Spark SQL
Analytics
Spark SQL
• Mix SQL queries with Spark programs
• Unified Data Access
• Hive compatibility
• Standard JDBC or ODBC connectivity
• Same engine for both interactive and long running queries
Spark Streaming
Analytics
Spark Streaming
• Build your apps using high-level operators
• Fault tolerance: exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
• Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
• Define your own custom data sources
MLlib
Analytics
MLlib
• Scalable machine learning library
• Basic statistics
– Summary statistics
– Correlations
– …
• Classification and regression
– Linear models
– Decision trees
– Naive Bayes
Analytics
MLlib
• Clustering
– K-Means
• Collaborative filtering
– Alternating least squares
• Dimensionality reduction
– Singular value decomposition
– Principal component analysis
GraphX
Analytics
GraphX
• API for graphs and graph-parallel computation
• Growing scale and importance
– From social networks to language modelling
• Directed multigraph with properties attached to each vertex and edge
• Growing collection of graph algorithms and builders
Use cases and examples
Connecting Liferay and Kafka
Examples: Kafka and Liferay
Connecting Liferay and Kafka
• Easy to use
• “Transparent” for the developer
• Runtime pluggable
• Common API: use it through our Message Bus
• Take a look at Kafka Bridge
Examples: Kafka and Liferay
[Diagram: a Liferay App uses the Message Bus API in Liferay Core to send a Message Payload to a Kafka Topic; the Kafka Bridge connects Liferay to Apache Kafka]
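The bridge idea, an application that only talks to a message bus while a pluggable listener forwards messages to Kafka, can be sketched as follows. Everything here is hypothetical and written in Python for brevity: it is not Liferay's Message Bus API nor the Kafka Bridge code, the destination and topic names are invented, and an in-memory producer stands in for a real Kafka client so the sketch runs without a broker.

```python
import json


class MessageBus:
    """Hypothetical in-process bus: named destinations with pluggable listeners."""

    def __init__(self):
        self._listeners = {}

    def register_listener(self, destination, listener):
        self._listeners.setdefault(destination, []).append(listener)

    def unregister_listener(self, destination, listener):
        self._listeners.get(destination, []).remove(listener)

    def send_message(self, destination, payload):
        for listener in list(self._listeners.get(destination, [])):
            listener(payload)


class InMemoryProducer:
    """Stand-in for a Kafka producer; it only records what would be sent."""

    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))


class KafkaBridge:
    """Listener that serialises each bus message and forwards it to a topic."""

    def __init__(self, producer, topic):
        self.producer = producer
        self.topic = topic

    def __call__(self, payload):
        value = json.dumps(payload, sort_keys=True).encode("utf-8")
        self.producer.send(self.topic, value)


bus = MessageBus()
producer = InMemoryProducer()
bridge = KafkaBridge(producer, topic="blog-ratings")

# Plug the bridge in at runtime; the application code below does not change.
bus.register_listener("liferay/blog_ratings", bridge)
bus.send_message("liferay/blog_ratings", {"userId": 1, "entryId": 42, "rating": 4})

# Unplug it again: messages are no longer forwarded.
bus.unregister_listener("liferay/blog_ratings", bridge)
bus.send_message("liferay/blog_ratings", {"userId": 2, "entryId": 42, "rating": 5})
print(producer.sent)
```

The application only ever calls `send_message`, which is the sense in which the integration is "transparent" for the developer and pluggable at runtime.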
Recommendation engine
Examples: Recommender’s goals
You might want to read …
• Blog posts
• Ratings for previous blog posts
• Recommend to the user some entries for future reading
Examples: Recommender storage
[Diagram: Blog Rating save/update and Blog Entry save/update events go to Apache Kafka and then to HDFS]
Record formats:
• Blog rating: UserID::BlogEntryID::Rating::Timestamp
• Blog entry: BlogEntryID::Title::CategoryNames
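The two "::"-delimited record layouts shown on the slide are straightforward to parse. The sketch below makes several assumptions that the slide does not state: IDs are integers, the timestamp is an integer, category names are separated by "|", and titles never contain "::". The class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    user_id: int
    blog_entry_id: int
    rating: float
    timestamp: int


@dataclass(frozen=True)
class BlogEntry:
    blog_entry_id: int
    title: str
    category_names: tuple


def parse_rating(line):
    """Parse one UserID::BlogEntryID::Rating::Timestamp record."""
    user_id, entry_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(entry_id), float(rating), int(timestamp))


def parse_blog_entry(line, category_separator="|"):
    """Parse one BlogEntryID::Title::CategoryNames record.

    The category separator is an assumption of this sketch.
    """
    entry_id, title, categories = line.strip().split("::", 2)
    names = tuple(name for name in categories.split(category_separator) if name)
    return BlogEntry(int(entry_id), title, names)


rating = parse_rating("7::42::4.0::1412150400")
entry = parse_blog_entry("42::Getting value from your data::big data|analytics")
print(rating)
print(entry)
```

Keeping the ratings as plain delimited lines means the same files can be read later by batch jobs without any extra schema tooling.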
Examples: Recommender’s analysis
Collaborative filtering
• Commonly used in recommender systems
• Try to fill missing entries in association matrix
• MLlib includes the Alternating Least Squares algorithm (ALS)
Takeaways
Takeaways
What I would like you to have learned today
• It is not about data size, it’s about how you use it
• You already own tons of data, you just need to get value from it
• There is no silver bullet: you have plenty of alternatives
• JVM-based big data technologies are usually a great choice
• Try it yourself!!
References
References
• Apache Kafka
• Apache Spark
• Apache Storm
• Apache Hadoop
• Big Data definition at Wikipedia
• What every software engineer should know about a log
Thank you!
Questions 
(and hopefully answers)

Liferay and Big Data

  • 1. Liferay & Big Data Getting value from your data ! Miguel Ángel Pastor Olivar Senior Software Engineer
  • 2. About me Who am I? ! • Miguel Ángel Pastor Olivar ! • Member of the Liferay core infrastructure team ! • Worked in analytics for a long time – Disclaimer: Not a computer scientist ! • Email: miguel.pastor@liferay.com ! • @miguelinlas3 #LRNAS2014
  • 3. Synopsis What are we going to talk about? ! • Big Data: what is this about? ! • What’s ahead of big data ! • Connecting Liferay with this “new” world ! • Simple architecture proposal ! • Use cases ! • Questions (and hopefully answers) #LRNAS2014
  • 5. Definitions Big Data ! • It is just a buzzword ! • Data is so big that regular solutions are: ! – Extremely slow ! – Too small ! – Really expensive ! • How we use all the data we already own #LRNAS2014 It is no more than a buzzword but we generally associate it with the problem that datasets has become too big that traditional relational databases are not able to longer work with them. ! Note the NoSQL movement has emerged during the last years and pretends to handle in a better way all this new semistructured data, new ways of scaling, …
  • 6. Definitions More formally … ! • Volume – Transactions, data streaming from social media, … ! • Velocity – Torrents of data in real time ! • Variety – Numerical data, text, email, video, audio, … #LRNAS2014 1. Many factors have influenced to increase data volumes: Transaction based data stored through the years, social media, … 2. Data streaming is a reality: IOT, smart cities, RFID sensors, … We have to deal with them as fast as we can 3. Tons of different formats that we need to deal with and interconnect to extract useful information
  • 7. Trending What is trending? ! • Data volumes will keep increasing … rapidly ! • Less emphasis on formal schemas ! • Data driven applications #LRNAS2014 Data volumes: Facebook has over 800PB of data stored in Hadoop clusters !F ormal schemas: data schemas and sources change rapidly, and we need to integrate so many disparate sources of data that we need to rapidly evolve and adapt to the changes ! Self driving cars, smart cities ,… generic algorithm and data structures represent the world using data instead of encoding a model of the world within the software itself (some engineering is required though)
  • 8. What do you want?
  • 9. Business goals You already own tons of different data • Get value from it! • Analyse it so you can … – Make faster decisions – Make better decisions – Improve your users’ experience • Make more money!
  • 10. Business goals Popular applications • Recommender systems: – Amazon store: you may also like … • Predicting the future: – Netflix does autoscaling based on past network traffic data • Churn models – Big telco companies build social networks to reduce churn – Some big banks have tried to do the same
  • 11. Business goals Popular applications • Sentiment analysis – Are people talking about you on the Internet? – Is it good or bad? • Real Time Bidding – Optimise advertising • Health care – Improve patients’ health while reducing costs – Improve quality of life of multiple sclerosis patients
  • 13. Terminology Concepts • Storage models – Where and how we store our relevant information • Computation models – How we process and transform all the previous information • Analytics – How we can take actions based on the previous steps
  • 14. Big Data architectures A quick tour of some of today’s popular architectures: mainly Hadoop/HDFS and the libraries built on top of the Hadoop API
  • 16. Data storage: HDFS Hadoop Distributed File System (HDFS) • Java-based file system • Scalable, fault-tolerant, distributed storage • Designed to run on commodity hardware • Closely related to MapReduce. This is the most popular alternative; it allows you to store your data in a distributed filesystem and execute MapReduce algorithms on top of it. We will see other alternatives to Hadoop which can do much more than MapReduce algorithms
  • 17. Data storage: HDFS Source: http://hortonworks.com/hadoop/hdfs/ An HDFS cluster comprises a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas
  • 18. Data storage: NoSQL NoSQL Movement • Semistructured data • Focused on – Horizontal scalability – Availability • Different trade-offs: CAP, BASE, … • Many alternatives: Cassandra, Riak, HBase, … This “new” movement tries to deal with the huge increase of data (and its variety), focusing on topics different from those addressed by traditional relational databases: horizontal scalability, availability, unstructured data models, … There are plenty of alternatives: memory based, disk based, key-value, key-document, graph databases, … and the usage of these new databases is increasing in Big Data systems. Some other databases have brought horizontal scalability and availability to the new …
  • 19. Data storage: Apache Cassandra An example: Apache Cassandra • P2P architecture, no single point of failure • Linear scalability • Larger than memory datasets • Fully durable • Tuneable consistency • Integrated caching
  • 20. Data storage: NewSQL NewSQL Movement • Modern relational databases • Same scalable performance as NoSQL for OLTP • Maintain ACID guarantees • A few alternatives: VoltDB, Google Spanner, FoundationDB, … New designs for traditional databases (they differ considerably between the options). Google Spanner uses GPS-based clocks, VoltDB optimises for each specific app by compiling the schema, and so on, …
  • 22. Computation: Apache Hadoop Apache Hadoop Map Reduce • Framework: – Distributed processing – Large datasets – Clusters of computers • Simple programming model • Coarse grained • Verbose and hard to use API
  • 63. Computation: Map Reduce Word count example. Input: “Liferay project is the best Open Source project”, split into (index, “…”) records. Map: each record is turned into (word, 1) pairs. Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1]). Reduce: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
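The word count walk-through on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, sort-and-shuffle and reduce steps, not the Hadoop API; the function names and the way the sentence is split into three records are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(index, line):
    """Map: emit a (word, 1) pair for every word of one (index, line) record."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group the values emitted for each key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: collapse the list of ones for a word into its total count."""
    return key, sum(values)


def word_count(records):
    pairs = [pair for index, line in records for pair in map_phase(index, line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical split of the slide's sentence into (index, text) records.
records = [(0, "Liferay project"), (1, "is the best"), (2, "Open Source project")]
counts = word_count(records)
```

In a real cluster the map and reduce functions run on different machines and the framework performs the shuffle over the network; here everything happens in one process so the data flow is easy to follow.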
  • 64. Computation: Apache Hadoop Apache Hadoop Map Reduce • Batch model data crunching • Not so good for event stream processing • But … – Many algorithms are hard to implement using MapReduce – Again, the API is hard to use • Cascading, Scalding, Cascalog, Impala, …
  • 65. Computation: Apache Storm Apache Storm • Distributed realtime computation system • Easy to reliably process unbounded streams of data • Multi-language support • Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
  • 66. Computation: Apache Storm Topology diagram: Spouts feeding Bolts. Spouts are data sources and bolts are the event processors. There are facilities to support reliable message handling, various sources encapsulated in spouts, and various output targets. Distributed processing is baked in from the start
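To make the spout and bolt roles concrete, here is a minimal single-process sketch using Python generators. It is not the Storm API and leaves out reliability and distribution entirely; `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical names chosen for the illustration.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: a data source that emits events into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: turn each incoming sentence into a stream of lower-cased words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    """Bolt: keep a running count and emit the updated (word, count) per event."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    # Wire spout -> split bolt -> count bolt, processing one event at a time.
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


events = run_topology(["big data", "Big streams"])
```

Each event flows through the whole chain as soon as it is emitted, which is the main contrast with the batch model of MapReduce.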
  • 67. Computation: Apache Spark Apache Spark • Fast and general-purpose cluster computing system – Developed at the Berkeley AMPLab • High level APIs (not MapReduce) • Optimised engine: supports general execution graphs • Higher-level tools: – Spark SQL, MLlib, Spark Streaming, GraphX (will go deeper later on)
  • 68. Computation: Apache Mahout Apache Mahout • Scalable machine learning library • Built on top of Hadoop • Some algorithms don’t require Hadoop at all
  • 69. Computation: R language • Focused on: – Data visualisation – Statistical computations – Analysis of data • Tons of built-in packages • Connects to Hadoop through Hadoop Streaming • Not a fast language (compared to proprietary alternatives like SAS)
  • 71. Reference Architecture How do we proceed? • Plenty of alternatives • No silver bullet • Problems to solve: – Data integration – Real time – Batch processing
  • 98. Reference Architecture Diagram components: Relational Database, Event System, Hadoop, User Tracking, NoSQL Storage, System Events, Search Data, Logs, Monitoring, Data Warehouse, Streaming, Social Graph
  • 99. Reference Architecture: Liferay Liferay • Tons of data available within the platform • System events • User tracking (client side) – Clicks, navigation, activities, … • Monitoring (transactions, page load times, …) • Models (message boards, blogs, wiki …) • Custom developments …
  • 103. Reference Architecture: Unified Log Service Data integration Source: http://en.wikipedia.org/wiki/Maslow's_hierarchy_of_needs Effective use of data follows a kind of Maslow's hierarchy of needs. 1. The base of the pyramid involves capturing all the relevant data. 2. This data needs to be modelled in a uniform way to make it easy to read and process. 3. Work on infrastructure to process this data in various ways: MapReduce, real-time query systems, etc.
  • 104. Reference Architecture: Unified Log Service Log structured data flow • Natural data structure for data flow. Diagram: a data source appends writes to a log of numbered entries (0 to 9); System A and System B each read from the log
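The log-structured data flow in the diagram can be sketched as an append-only list in which every reader tracks its own offset. This is a minimal in-memory illustration; the `Log` and `Reader` classes are hypothetical and have nothing to do with any particular product's API.

```python
class Log:
    """Append-only sequence of records; each record is identified by its offset."""

    def __init__(self):
        self._entries = []

    def append(self, record):
        """Append a record at the end and return its offset."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._entries[offset:offset + max_records]


class Reader:
    """A consuming system that remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)
        return records


log = Log()
for n in range(10):            # the data source writes entries 0..9
    log.append(f"event-{n}")

system_a = Reader(log)
system_b = Reader(log)
first_a = system_a.poll(max_records=7)   # System A is further ahead
first_b = system_b.poll(max_records=3)   # System B reads at its own pace
```

Because readers only advance their own offset, a slow consumer never blocks a fast one, and a new system can be attached later and replay the log from offset 0.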
  • 105. Distributed log: Apache Kafka Apache Kafka • Publish-subscribe messaging as a distributed commit log • Fast • Scalable • Durable • Distributed by design. Fast: hundreds of megabytes of reads and writes per second from thousands of clients. Scalable: can be expanded elastically and transparently without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Durable: messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Distributed by design: a cluster-centric design that offers strong durability and fault-tolerance guarantees.
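The idea that a data stream is partitioned and spread over several machines can be illustrated with a keyed, partitioned topic. The sketch below is an assumption-laden toy: `PartitionedTopic` is a hypothetical class, and CRC32 is used only as a convenient stable hash, not as the hash function Kafka's own clients use.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition index (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedTopic:
    """A topic made of several independent append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append the record to its key's partition; return (partition, offset)."""
        index = partition_for(key, len(self.partitions))
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1


topic = PartitionedTopic(num_partitions=3)
placements = [topic.send(f"user-{n % 4}", f"click-{n}") for n in range(12)]
```

Records with the same key always land in the same partition, so their relative order is preserved, while different partitions can live on different brokers and be consumed in parallel.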
  • 106. Distributed log: Apache Kafka Apache Kafka 1000 feet architecture Diagram: Producer, Brokers A, B and C, Consumer, ZooKeeper
  • 110. Analytics What are we looking for? • A few different data sources • Unified log service in place • Tons of info ready to be processed: – Batch processing – Real time processing – Machine learning algorithms – Graph analysis • Unified programming model?
  • 111. Analytics Apache Spark • Fast and general engine for large-scale data processing • Write your apps in Java, Scala or Python • Integrated with Hadoop • Runs on the YARN cluster manager • Can read any existing Hadoop data (HDFS) • In memory or on disk
  • 117. Analytics Apache Spark Main Components: Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX
  • 120. Analytics Apache Spark • A driver program runs the main function and executes various parallel operations on a cluster • Main abstraction: Resilient Distributed Datasets (RDD) – HDFS (or any Hadoop file system) – Scala collection • Second abstraction: shared variables. RDD: a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system) or a Scala collection in the driver program, and transforming it. RDDs automatically recover from node failures
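The RDD idea (a partitioned collection, transformed lazily, that can be recomputed from its source) can be sketched without Spark. `MiniRDD` below is a hypothetical toy class written for this illustration; it runs in one process and only mimics the shape of the abstraction, not Spark's actual API or scheduler.

```python
from functools import reduce


class MiniRDD:
    """Toy stand-in for an RDD: partitions plus a lazy chain of transformations."""

    def __init__(self, partitions, transforms=()):
        self._partitions = partitions
        self._transforms = transforms   # the "lineage": how to rebuild the data

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Create a MiniRDD from a local collection, split round-robin."""
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        return MiniRDD(self._partitions, self._transforms + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self._partitions, self._transforms + (("filter", predicate),))

    def _compute(self, partition):
        """Apply the recorded transformations to one partition, on demand."""
        items = iter(partition)
        for kind, fn in self._transforms:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        """Action: materialise all partitions (order follows the partitioning)."""
        return [x for part in self._partitions for x in self._compute(part)]

    def reduce(self, fn):
        """Action: reduce each non-empty partition, then combine the partials."""
        computed = (self._compute(part) for part in self._partitions)
        partials = [reduce(fn, items) for items in computed if items]
        return reduce(fn, partials)


numbers = MiniRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = even_squares.reduce(lambda a, b: a + b)
```

Nothing is computed until `collect` or `reduce` is called. Since each partition can be rebuilt from the source data and the recorded transformations, a lost partition can be recomputed rather than restored from a replica, which is the intuition behind the "resilient" part of the name.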
  • 123. Analytics Spark SQL • Mix SQL queries with Spark programs • Unified Data Access • Hive compatibility • Standard JDBC or ODBC connectivity • Same engine for both interactive and long running queries
  • 126. Analytics Spark Streaming • Build your apps using high-level operators • Fault tolerance: exactly-once semantics out of the box • Combine streaming with batch and interactive queries • Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ • Define your own custom data sources
  • 127. MLlib
  • 129. Analytics MLlib • Scalable machine learning library • Basic statistics – Summary statistics – Correlations – … • Classification and regression – Linear models – Decision trees – Naive Bayes
  • 130. Analytics MLlib • Clustering – K-Means • Collaborative filtering – Alternating least squares • Dimensionality reduction – Singular value decomposition – Principal component analysis
  • 131. GraphX
  • 133. Analytics GraphX • API for graphs and graph-parallel computation • Growing scale and importance – From social networks to language modelling • Directed multigraph with properties attached to each vertex and edge • Growing collection of graph algorithms and builders
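A directed multigraph with properties on every vertex and edge is easy to picture with a small data structure. The sketch below is a hypothetical in-memory `PropertyGraph`, not the GraphX API; the vertex names and property keys are made up for the example.

```python
from collections import defaultdict


class PropertyGraph:
    """Directed multigraph: parallel edges are allowed, and both vertices
    and edges carry an arbitrary dictionary of properties."""

    def __init__(self):
        self.vertices = {}   # vertex id -> properties
        self.edges = []      # list of (source, destination, properties)

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties

    def add_edge(self, source, destination, **properties):
        self.edges.append((source, destination, properties))

    def out_degrees(self):
        """Count outgoing edges per vertex (parallel edges count separately)."""
        degrees = defaultdict(int)
        for source, _destination, _properties in self.edges:
            degrees[source] += 1
        return dict(degrees)

    def neighbours(self, vertex_id):
        """Distinct vertices reachable through one outgoing edge."""
        return sorted({dst for src, dst, _ in self.edges if src == vertex_id})


graph = PropertyGraph()
graph.add_vertex("alice", role="author")
graph.add_vertex("bob", role="reader")
graph.add_vertex("carol", role="reader")
graph.add_edge("alice", "bob", kind="follows")
graph.add_edge("alice", "bob", kind="commented")   # second edge, same endpoints
graph.add_edge("bob", "carol", kind="follows")
degrees = graph.out_degrees()
```

The two edges from alice to bob are what makes this a multigraph: the same pair of users can be related in more than one way, each relation carrying its own properties.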
  • 134. Use cases and examples
  • 137. Examples: Kafka and Liferay Connecting Liferay and Kafka • Easy to use • “Transparent” for the developer • Runtime pluggable • Common API: use it through our Message Bus • You can take a look at Kafka Bridge
  • 148. Examples: Kafka and Liferay Diagram: Liferay App, Liferay Core, Message Bus API, Kafka Bridge, Kafka Topic, Message Payload, Apache Kafka
  • 151. Examples: Recommender’s goals You might want to read … • Blog posts • Ratings for previous blog posts • Recommend to the user some entries for future reading
  • 159. Examples: Recommender storage Diagram: Blog Rating save/update, Blog Entry save/update, Apache Kafka, HDFS. Record formats: UserID::BlogEntryID::Rating::Timestamp and BlogEntryID::Title::CategoryNames
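The two record layouts on this slide are simple "::"-delimited lines, so a consumer could parse them as sketched here. The field types (integer IDs, float rating, integer timestamp) and the "|" separator between category names are assumptions; the slides only give the field order.

```python
from typing import NamedTuple, Tuple


class Rating(NamedTuple):
    user_id: int
    blog_entry_id: int
    rating: float
    timestamp: int


class BlogEntry(NamedTuple):
    blog_entry_id: int
    title: str
    category_names: Tuple[str, ...]


def parse_rating(line):
    """Parse 'UserID::BlogEntryID::Rating::Timestamp' (numeric types assumed)."""
    user_id, entry_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(entry_id), float(rating), int(timestamp))


def parse_entry(line, category_separator="|"):
    """Parse 'BlogEntryID::Title::CategoryNames'; the separator is hypothetical."""
    entry_id, title, categories = line.strip().split("::", 2)
    names = tuple(name for name in categories.split(category_separator) if name)
    return BlogEntry(int(entry_id), title, names)


rating = parse_rating("42::1001::4.5::1412150400")
entry = parse_entry("1001::Liferay and Kafka::big-data|integration")
```

Keeping the format this plain means the same lines can be read from Kafka by a streaming job or from HDFS by a batch job without a shared schema library.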
  • 160. Examples: Recommender’s analysis Collaborative filtering • Commonly used in recommender systems • Tries to fill in the missing entries of a user-item association matrix • MLlib includes the Alternating Least Squares algorithm (ALS)
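To show what "filling missing entries" means, here is a deliberately tiny alternating least squares sketch in plain Python. It is restricted to a rank-1 factorisation (one latent factor per user and per item) so each update has a closed form; MLlib's ALS supports higher ranks and runs distributed. The function name, the regularisation parameter `reg` and the sample ratings are all assumptions for the illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """Rank-1 ALS on observed ratings given as {(user, item): rating}.

    Alternates between two steps: with item factors fixed, each user factor
    is the solution of a one-dimensional regularised least squares problem,
    and vice versa. Returns (user_factors, item_factors).
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Step 1: fix item factors, solve for every user factor.
        for i in range(n_users):
            num = sum(r * v[j] for (user, j), r in ratings.items() if user == i)
            den = sum(v[j] ** 2 for (user, j), _r in ratings.items() if user == i)
            u[i] = num / (den + reg)
        # Step 2: fix user factors, solve for every item factor.
        for j in range(n_items):
            num = sum(r * u[i] for (i, item), r in ratings.items() if item == j)
            den = sum(u[i] ** 2 for (i, item), _r in ratings.items() if item == j)
            v[j] = num / (den + reg)
    return u, v


# Observed ratings; user 2 has not rated blog entry 1, which is the gap to fill.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
u, v = als_rank1(ratings, n_users=3, n_items=2, reg=1e-9, iterations=100)
predicted = u[2] * v[1]   # estimate for the missing (user 2, entry 1) rating
```

The sample data is exactly rank 1 (every user rates entry 1 twice as high as entry 0), so the estimate for the missing cell approaches 3 × 2 = 6. Real rating data is noisy, which is why a non-trivial `reg` and a higher rank are normally used.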
  • 163. Takeaways What I would like you to have learned today • It is not about data size, it’s about how you use it • You already own tons of data, you just need to get value from it • There is no silver bullet: you have plenty of alternatives • JVM-based Big Data technologies are usually a great choice • Try it yourself!!
  • 165. References References • Apache Kafka • Apache Spark • Apache Storm • Apache Hadoop • Big Data definition at Wikipedia • What every software engineer should know about a log