Liferay and Big Data

1. Liferay & Big Data Getting value from your data ! Miguel Ángel Pastor Olivar Senior Software Engineer

2. About me Who am I? ! • Miguel Ángel Pastor Olivar ! • Member of the Liferay core infrastructure team ! • Worked in analytics for a long time – Disclaimer: Not a computer scientist ! • Email: miguel.pastor@liferay.com ! • @miguelinlas3 #LRNAS2014

3. Synopsis What are we going to talk about? ! • Big Data: what is this about? ! • What’s ahead of big data ! • Connecting Liferay with this “new” world ! • Simple architecture proposal ! • Use cases ! • Questions (and hopefully answers) #LRNAS2014

4. Big Data?

5. Definitions Big Data ! • It is just a buzzword ! • Data is so big that regular solutions are: ! – Extremely slow ! – Too small ! – Really expensive ! • How we use all the data we already own #LRNAS2014 It is no more than a buzzword, but we generally associate it with the problem that datasets have become so big that traditional relational databases can no longer work with them. ! Note that the NoSQL movement has emerged in recent years and aims to handle this new semistructured data, new ways of scaling, … in a better way

6. Definitions More formally … ! • Volume – Transactions, data streaming from social media, … ! • Velocity – Torrents of data in real time ! • Variety – Numerical data, text, email, video, audio, … #LRNAS2014 1. Many factors have contributed to the increase in data volumes: transaction-based data stored through the years, social media, … 2. Data streaming is a reality: IoT, smart cities, RFID sensors, … We have to deal with these streams as fast as we can 3. Tons of different formats that we need to deal with and interconnect to extract useful information

7. Trending What is trending? ! • Data volumes will keep increasing … rapidly ! • Less emphasis on formal schemas ! • Data driven applications #LRNAS2014 Data volumes: Facebook has over 800PB of data stored in Hadoop clusters ! Formal schemas: data schemas and sources change rapidly, and we need to integrate so many disparate sources of data that we must evolve quickly and adapt to the changes ! Data driven applications: self-driving cars, smart cities, … generic algorithms and data structures represent the world using data instead of encoding a model of the world within the software itself (some engineering is required though)

8. What do you want?

9. Business goals You already own tons of different data ! • Get value from it! ! • Analyse it so you can … ! – Make faster decisions ! – Make better decisions ! – Improve your users’ experience ! • Make more money! #LRNAS2014

10. Business goals Popular applications ! • Recommender systems: – Amazon store: you may also like … ! • Predicting the future: – Netflix does autoscaling based on past network traffic data ! • Churn models – Big telco companies build social networks to reduce churn – Some big banks have tried to do the same #LRNAS2014

11. Business goals Popular applications ! • Sentiment analysis – Are people talking about you on the Internet? – Is it good or bad? ! • Real Time Bidding – Optimise advertising ! • Health care – Improve patients’ health while reducing costs – Improve quality of life of multiple sclerosis patients #LRNAS2014

12. Terminology

13. Terminology Concepts ! • Storage models • Where and how we store our relevant information ! • Computation models • How we process and transform all the previous information ! • Analytics • How we can take actions based on the previous steps #LRNAS2014

14. Big Data architectures A quick tour of some of today’s popular architectures: mainly Hadoop/HDFS and the libraries built on top of the Hadoop API

15. Storing data

16. Data storage: HDFS Hadoop Distributed File System (HDFS) ! • Java based file system ! • Scalable, fault-tolerant, distributed storage ! • Designed to run on commodity hardware ! • Closely related to MapReduce #LRNAS2014 This is the most popular alternative which allows you to store your data in a distributed filesystem and execute Map Reduce algorithms on top of it ! We will see other alternatives to Hadoop which can do much more than MapReduce algorithms

17. Data storage: HDFS #LRNAS2014 Source: http://hortonworks.com/hadoop/hdfs/ An HDFS cluster is comprised of a NameNode which manages the cluster metadata and DataNodes that store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, or namespace and disk space quotas

18. Data storage: NoSQL NoSQL Movement ! • Semistructured data ! • Focused on ! • Horizontal scalability ! • Availability ! • Different trade-offs: CAP, BASE, … ! • Many alternatives: Cassandra, Riak, HBase, … #LRNAS2014 This “new” movement tries to deal with the huge increase of data (and its variety), focusing on different topics from those addressed by traditional relational databases: horizontal scalability, availability, unstructured data models, … ! There are plenty of alternatives: memory based, disk based, key-value, key-document, graph databases, … and the usage of these new databases is increasing in Big Data systems ! Some other databases have brought horizontal scalability and availability to the new …

19. Data storage: Apache Cassandra An example: Apache Cassandra ! • P2P architecture, no single point of failure ! • Linear scalability ! • Larger than memory datasets ! • Fully durable ! • Tuneable consistency ! • Integrated caching #LRNAS2014

20. Data storage: NewSQL NewSQL Movement ! • Modern relational databases ! • Same scalable performance as NoSQL for OLTP ! • Maintain ACID guarantees ! • A few alternatives: VoltDB, Google Spanner, FoundationDB, … #LRNAS2014 New designs for traditional databases (the designs differ considerably between the options) ! Google Spanner uses GPS-based clocks, VoltDB optimises for every specific app by compiling the schema, and so on, …

21. Computation and Analytics

22. Computation: Apache Hadoop Apache Hadoop Map Reduce ! • Framework: • Distributed processing • Large datasets • Clusters of computers ! • Simple programming model ! • Coarse grained ! • Verbose and hard to use API #LRNAS2014

23-45. Computation: Map Reduce (word-count example, built up step by step over slides 23 to 45) #LRNAS2014 ! Input: Liferay projects is the best Open Source project ! Map input records: (index, “…”) (index, “…”) (index, “…”) (index, “…”) (index, “…”) ! Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1]) ! Reduce output: best: 1 is: 1 Liferay: 1 Open: 1 project: 2 Source: 1 the: 1
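
The word-count walkthrough above can be sketched in a few lines of plain Python. This is only an illustration of the three phases (map, sort and shuffle, reduce), not Hadoop code: the function names are made up for the example, and the input is assumed to contain the word "project" twice, as the count of 2 on the slide implies. The split of the sentence into two records is also an assumption.

```python
from collections import defaultdict

def map_phase(records):
    # Each input record is (index, line); emit one (word, 1) pair per word.
    for _index, line in records:
        for word in line.split():
            yield word, 1

def shuffle_phase(pairs):
    # Group the emitted values by key and sort the keys,
    # producing entries such as ("project", [1, 1]).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items(), key=lambda kv: kv[0].lower())

def reduce_phase(grouped):
    # Sum the list of ones collected for each word.
    return {word: sum(ones) for word, ones in grouped}

# Assumed input: the sentence split into two hypothetical records.
records = [(0, "Liferay project is the best"), (1, "Open Source project")]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

In a real cluster the map and reduce functions run on different machines and the framework performs the shuffle; here everything runs in one process so the data flow is easy to follow.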

64. Computation: Apache Hadoop Apache Hadoop Map Reduce ! • Batch model data crunching ! • Not so good for event stream processing ! • But … ! • Many algorithms are hard to implement using MapReduce ! • Again, the API is hard to use ! • Cascading, Scalding, Cascalog, Impala, … #LRNAS2014

65. Computation: Apache Storm Apache Storm ! • Distributed realtime computation system ! • Easy to reliably process unbounded streams of data ! • Multi language support ! • Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, … #LRNAS2014

66. Computation: Apache Storm (topology diagram: two spouts, three bolts) #LRNAS2014 Spouts are data sources and bolts are the event processors ! There are facilities to support reliable message handling, various sources encapsulated in Spouts and various targets of output. Distributed processing is baked in from the start
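
The spout and bolt roles described above can be pictured with Python generators. This is a conceptual sketch only and does not use the Storm API: the function names and the sample sentences are invented for the example, and a real spout would emit an unbounded stream across many machines.

```python
def sentence_spout():
    # Spout: a data source. Here it is a short, finite list for illustration.
    for sentence in ["liferay and kafka", "kafka and storm"]:
        yield sentence

def split_bolt(stream):
    # Bolt: consumes sentences and emits individual words.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: keeps a running count and emits (word, count) after every event.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
events = list(topology)
print(events)
```

Because each stage handles one event at a time, results are available as soon as an event arrives, which is the main contrast with the batch model of MapReduce.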

67. Computation: Apache Spark Apache Spark ! • Fast and general-purpose cluster computing system • Developed at the Berkeley AMPLab ! • High level APIs (not MapReduce) ! • Optimised engine: supports general execution graphs ! • Higher-level tools: • Spark SQL, MLlib, Spark Streaming, GraphX (will go deeper later on) #LRNAS2014

68. Computation: Apache Mahout Apache Mahout ! • Scalable machine learning library ! • Built on top of Hadoop ! • Some algorithms don’t require Hadoop at all #LRNAS2014

69. Computation: R R language ! • Focused on: • Data visualisation • Statistical computations • Analysis of data ! • Tons of built-in packages ! • Connect to Hadoop through Hadoop Streaming ! • Not a fast language (compared to proprietary alternatives like SAS) #LRNAS2014

70. Reference architecture

71. Reference Architecture How do we proceed? ! • Plenty of alternatives ! • No silver bullet ! • Problems to solve: ! • Data integration ! • Real time ! • Batch processing #LRNAS2014

72-95. Reference Architecture (diagram, built up step by step over slides 72 to 95) #LRNAS2014 Components shown: Relational Database, User Tracking, NoSQL Storage, System Events, Search Data, Logs, Event System, Hadoop, Monitoring, Data Warehouse, Streaming, Social Graph

96. Data sources

99. Reference Architecture: Liferay Liferay ! • Tons of data available within the platform • System events ! • User tracking (client side) • Clicks, navigation, activities, … ! • Monitoring (transactions, page load times, …) ! • Models (message boards, blogs, wiki …) ! • Custom developments … #LRNAS2014

100. Event system

103. Reference Architecture: Unified Log Service Data integration Source: http://en.wikipedia.org/wiki/Maslow's_hierarchy_of_needs #LRNAS2014 Effective use of data follows a kind of Maslow's hierarchy of needs. ! 1. The base of the pyramid involves capturing all the relevant data 2. This data needs to be modelled in a uniform way to make it easy to read and process. ! 3. Work on infrastructure to process this data in various ways: MapReduce, real-time query systems, etc.

104. Reference Architecture: Unified Log Service Log structured data flow ! • Natural data structure for data flow #LRNAS2014 (diagram: a data source appends its writes to the end of a log with offsets 0 to 9, while System A and System B each read from their own position in the log)
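
The log-structured data flow above can be sketched as an append-only list in which each reading system keeps its own offset. The class names `Log` and `Reader` and the sample events are assumptions made for this illustration; they are not part of any library mentioned in the talk.

```python
class Log:
    """Append-only sequence of records; the offset is the position in the list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Reader:
    """Each consuming system tracks its own offset, independently of the others."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for i in range(10):          # the data source writes offsets 0..9
    log.append(f"event-{i}")

system_a, system_b = Reader(log), Reader(log)
a_batch = system_a.poll(max_records=8)  # System A has consumed offsets 0..7
b_batch = system_b.poll(max_records=3)  # System B is further behind, at offset 3
print(system_a.offset, system_b.offset)
```

Since readers never modify the log, a slow consumer does not hold back a fast one, and a new system can be attached later and replay the data from offset 0.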

105. Distributed log: Apache Kafka Apache Kafka ! • Publish-subscribe as distributed commit log ! • Fast ! • Scalable ! • Durable ! • Distributed by design #LRNAS2014 Fast: Hundreds of megabytes of reads and writes per second from thousands of clients. ! Scalable: Elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers ! Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. ! Distributed by Design: cluster-centric design that offers strong durability and fault-tolerance guarantees.
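
The note on scalability (streams partitioned over a cluster, read by coordinated consumers) can be illustrated with a small in-memory model. This is a sketch and not the Kafka client API: `Topic`, `assign`, the topic name and the CRC32-based partitioning are all assumptions chosen for the example, with CRC32 standing in for whatever hash a real client uses.

```python
import zlib


class Topic:
    """A topic modelled as a fixed number of independent append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Deterministic hash, so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)


def assign(num_partitions, consumers):
    # Spread the partitions round-robin over the consumers of one group.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


topic = Topic("user-tracking", num_partitions=3)
for user, action in [("alice", "click"), ("bob", "view"), ("alice", "scroll")]:
    topic.send(user, action)

groups = assign(3, ["consumer-1", "consumer-2"])
print(groups)
```

Ordering is preserved only within a partition, which is why events for the same key are routed to the same partition.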

106. Distributed log: Apache Kafka Apache Kafka 1000-foot architecture #LRNAS2014 (diagram: Producer, Consumer, Broker A, Broker B, Broker C, ZooKeeper)

107. Computation and Analytics

110. Analytics What are we looking for? • A few different data sources ! • Unified log service in place ! • Tons of info ready to be processed: • Batch processing • Real time processing • Machine learning algorithms • Graph analysis ! • Unified programming model? #LRNAS2014

111. Analytics Apache Spark • Fast and general engine for large-scale data processing ! • Write your apps in Java, Scala or Python ! • Integrated with Hadoop ! • Runs on the YARN cluster manager ! • Can read any existing Hadoop data (HDFS) ! • In memory or on disk #LRNAS2014

112-117. Analytics Apache Spark Main Components (diagram, built up step by step over slides 112 to 117) #LRNAS2014 Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX

118. Spark Core

119. Analytics Apache Spark Components #LRNAS2014 Apache Spark Spark SQL Spark Streaming MLlib GraphX

120. Analytics Apache Spark • A driver program runs the main function and executes various parallel operations on a cluster ! • Main abstraction: Resilient Distributed Datasets (RDD) • HDFS (or any Hadoop file system) ! • Scala collection ! • Second abstraction: shared variables #LRNAS2014 An RDD is: ! * a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel ! * created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or a Scala collection in the driver program, and transforming it ! * able to recover automatically from node failures
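
To make the RDD idea in the notes above more concrete, here is a toy, single-process model of a partitioned collection whose transformations are recorded rather than executed immediately. `MiniRDD` is a name invented for this sketch and is not the Spark API; it only mimics the shape of the real abstraction (partitions, transformations, and recomputation from recorded lineage).

```python
import functools


class MiniRDD:
    """Toy model: source data split into partitions plus a recorded lineage."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # source data, split into chunks
        self._lineage = lineage        # transformations recorded, not yet run

    @classmethod
    def parallelize(cls, data, num_partitions):
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(chunks)

    def map(self, fn):
        return MiniRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute(self, partition):
        # Replaying the lineage on the source partition is also how a
        # lost partition could be rebuilt after a failure.
        items = list(partition)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        return [x for p in self._partitions for x in self._compute(p)]

    def reduce(self, fn):
        return functools.reduce(fn, self.collect())


numbers = MiniRDD.parallelize(list(range(1, 11)), num_partitions=3)
total = (numbers.map(lambda x: x * x)
                .filter(lambda x: x % 2 == 0)
                .reduce(lambda a, b: a + b))
print(total)
```

Nothing is computed until `collect` or `reduce` is called; in a real cluster each partition would be processed on a different node.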

121. Spark SQL

123. Analytics Spark SQL • Mix SQL queries with Spark programs ! • Unified Data Access ! • Hive compatibility ! • Standard JDBC or ODBC connectivity ! • Same engine for both interactive and long running queries #LRNAS2014

124. Spark Streaming

126. Analytics Spark Streaming • Build your apps using high-level operators ! • Fault tolerance: exactly-once semantics out of the box ! • Combine streaming with batch and interactive queries ! • Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ ! • Define your own custom data sources #LRNAS2014

127. MLlib

129. Analytics MLlib • Scalable machine learning library ! • Basic statistics • Summary statistics • Correlations • …. ! • Classification and regression • Linear models • Decision trees • Naive Bayes #LRNAS2014

130. Analytics MLlib • Clustering • K-Means ! • Collaborative filtering • Alternating least squares ! • Dimensionality reduction ! • Singular value decomposition ! • Principal component analysis #LRNAS2014

131. GraphX

133. Analytics GraphX • API for graphs and graph-parallel computation ! • Growing scale and importance • From social networks to language modelling ! • Directed multigraph with properties attached to each vertex and edge ! • Growing collection of graph algorithms and builders #LRNAS2014

134. Use cases and examples

135. XXX Remove this slide!! ! For NAS all the following examples will depend on how much free time I get to work on them (I actually need to write one more) until the day of the presentation :( but I guess it should be fine to show some snippets within the slides ! Not all of them will be included, just putting a few ideas

136. Connecting Liferay and Kafka

137. Examples: Kafka and Liferay Connecting Liferay and Kafka • Easy to use ! • “Transparent” for the developer ! • Runtime pluggable ! • Common API: use it through our Message Bus ! • You can take a look at Kafka Bridge #LRNAS2014

138-149. Examples: Kafka and Liferay (diagram, built up step by step over slides 138 to 149) #LRNAS2014 Components shown: Liferay Core, Liferay App, Message Bus API, Kafka Bridge, Kafka Topic, Message Payload, Apache Kafka
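
One way to picture the bridge idea (application code talks only to a message bus, and a bridge plugged in at runtime forwards messages to Kafka) is the sketch below. Everything in it is hypothetical: `MessageBus`, `KafkaBridge`, `InMemoryProducer`, the destination name and the topic name are invented for illustration and do not reflect the actual Liferay Message Bus API or the real Kafka Bridge code.

```python
class MessageBus:
    """Hypothetical stand-in for a message bus: named destinations with listeners."""

    def __init__(self):
        self._listeners = {}

    def register(self, destination, listener):
        self._listeners.setdefault(destination, []).append(listener)

    def unregister(self, destination, listener):
        self._listeners.get(destination, []).remove(listener)

    def send(self, destination, payload):
        for listener in list(self._listeners.get(destination, [])):
            listener(payload)


class KafkaBridge:
    """Forwards bus messages to a producer-like object exposing send(topic, payload)."""

    def __init__(self, producer, topic):
        self.producer = producer
        self.topic = topic

    def __call__(self, payload):
        self.producer.send(self.topic, payload)


class InMemoryProducer:
    """Test double; a real deployment would use an actual Kafka client here."""

    def __init__(self):
        self.sent = []

    def send(self, topic, payload):
        self.sent.append((topic, payload))


bus = MessageBus()
producer = InMemoryProducer()
bridge = KafkaBridge(producer, topic="blog-events")

bus.register("liferay/blogs", bridge)         # bridge plugged in at runtime
bus.send("liferay/blogs", {"entryId": 1001})  # app code only sees the bus
bus.unregister("liferay/blogs", bridge)       # and can be unplugged again
bus.send("liferay/blogs", {"entryId": 1002})  # no longer forwarded
print(producer.sent)
```

The application never imports anything Kafka-specific, which is the sense in which such a bridge would be "transparent" for the developer.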

150. Recommendation engine

151. Examples: Recommender’s goals You might want to read … • Blog posts ! • Ratings for previous blog posts ! • Recommend to the user some entries for future reading #LRNAS2014

152-159. Examples: Recommender storage (diagram, built up step by step over slides 152 to 159) #LRNAS2014 Elements shown: Blog Rating save/update, Blog Entry save/update, Apache Kafka, HDFS ! Record formats: UserID::BlogEntryID::Rating::Timestamp and BlogEntryID::Title::CategoryNames
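
The two record formats shown above can be parsed with a few lines of Python. The field types are assumptions, since the slides only give the field names: IDs are taken to be integers, the timestamp to be seconds since the Unix epoch, and category names to be comma-separated. The function and class names are likewise invented for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BlogRating:
    user_id: int
    blog_entry_id: int
    rating: float
    timestamp: datetime


@dataclass(frozen=True)
class BlogEntry:
    blog_entry_id: int
    title: str
    category_names: tuple


def parse_rating(line):
    # Format: UserID::BlogEntryID::Rating::Timestamp
    user_id, entry_id, rating, ts = line.strip().split("::")
    return BlogRating(int(user_id), int(entry_id), float(rating),
                      datetime.fromtimestamp(int(ts), tz=timezone.utc))


def parse_entry(line):
    # Format: BlogEntryID::Title::CategoryNames
    # maxsplit=2 leaves any "::" inside the last field untouched.
    entry_id, title, categories = line.strip().split("::", 2)
    names = tuple(c for c in categories.split(",") if c)
    return BlogEntry(int(entry_id), title, names)


rating = parse_rating("42::1001::4.5::1412121600")
entry = parse_entry("1001::Liferay and Big Data::bigdata,kafka")
print(rating)
print(entry)
```

A delimiter such as `::` is convenient for line-oriented files in HDFS, but it breaks if a title itself contains the delimiter, so a production format would need escaping or a structured encoding.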

160. Examples: Recommender’s analysis Collaborative filtering • Commonly used in recommender systems ! • Tries to fill in the missing entries of a user-item association matrix ! • MLlib includes the Alternating Least Squares algorithm (ALS) #LRNAS2014
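
To show what "filling missing entries" means, here is a deliberately tiny alternating least squares sketch in plain Python. It is not the MLlib implementation: it uses a single latent factor per user and per item (rank 1), so each least-squares step has a closed form, and the function name, parameters and sample ratings are all assumptions made for the illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """ratings: dict {(user, item): rating} holding only the observed entries."""
    x = [1.0] * n_users  # one latent factor per user
    y = [1.0] * n_items  # one latent factor per item
    for _ in range(iterations):
        # Fix the item factors and solve a 1-D regularised
        # least-squares problem for each user.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * y[i] for i, r in obs)
            den = sum(y[i] ** 2 for i, _ in obs) + reg
            x[u] = num / den
        # Fix the user factors and solve for each item.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * x[u] for u, r in obs)
            den = sum(x[u] ** 2 for u, _ in obs) + reg
            y[i] = num / den
    return x, y


# Observed ratings; user 2 has not rated item 1 yet.
ratings = {(0, 0): 1.0, (0, 1): 2.0,
           (1, 0): 2.0, (1, 1): 4.0,
           (2, 0): 3.0}
x, y = als_rank1(ratings, n_users=3, n_items=2)
predicted = x[2] * y[1]  # estimate for the missing (user 2, item 1) entry
print(round(predicted, 2))
```

In this sample every user rates item 1 twice as high as item 0, so the estimate for the missing entry lands close to 6. Real recommenders use more latent factors and far sparser data, which is where a distributed implementation becomes useful.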

161. Examples: Recommender’s analysis #LRNAS2014

162. Takeaways

163. Takeaways What I hope you’ve learned today • It is not about data size, it’s about how you use it ! • You already own tons of data, you just need to get value from it ! • There is no silver bullet: you have plenty of alternatives ! • JVM-based big data technologies are usually a great choice ! • Try it yourself!! #LRNAS2014

164. References

165. References References • Apache Kafka ! • Apache Spark ! • Apache Storm ! • Apache Hadoop ! • Big Data definition at Wikipedia ! • What every software engineer should know about a log #LRNAS2014

166. Thank you!

167. Questions (and hopefully answers)
