Liferay & Big Data
Getting value from your data
Miguel Ángel Pastor Olivar
miguel.pastor@liferay.com
Who am I?
• Some random guy
• Member of the Liferay core infrastructure team
• Disclaimer: not a computer scientist
• @miguelinlas3
What are we going to talk about?
• Big Data: what is it about?
• Simple architecture proposal
• Use cases
• Questions (and hopefully answers)
Big Data?
• Data is so big that regular solutions are:
– Extremely slow
– Too small
– Really expensive
• How we use all the data we already own
• Volume
– Transactions, data streaming from social media, …
• Velocity
– Torrents of data in real time
• Variety
– Numerical data, text, email, video, audio, …
Popular usages
• Recommender systems
• Predicting the future:
– Netflix does autoscaling based on past network traffic data
• Churn models
– Big telco companies build social networks to reduce churn
• Sentiment analysis
– Are people talking about you on the Internet?
• Real Time Bidding
– Optimise advertising
• Health care
– Improve patients' health while reducing costs
– Improve the quality of life of multiple sclerosis patients
Terminology
• Storage models
– How to store relevant information
• Computation models
– Process and transform all the information
• Analytics
– How we can take actions based on the previous steps
Big Data Architectures
Data storage
Hadoop Distributed File System (HDFS)
• Java-based file system
• Scalable, fault-tolerant, distributed storage
• Designed to run on commodity hardware
• Closely related to MapReduce
Source: http://hortonworks.com/
NoSQL storage
• Semistructured data
• Focused on:
– Horizontal scalability
– Availability
• Different trade-offs: CAP, BASE, …
NewSQL storage
• Modern relational databases
• Same scalable performance as NoSQL for OLTP
• Maintain ACID guarantees
• A few alternatives: VoltDB, Google Spanner, FoundationDB, …
Computation and analytics
Apache Hadoop
Apache Hadoop MapReduce
• Distributed processing
• Large datasets
• Clusters of computers
• Simple programming model
• Verbose, hard-to-use API
[Figure: MapReduce word count]
Input: "Liferay projects is the best Open Source project", passed to the map step as (index, "…") records
Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1])
Output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
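The three stages of the word count can be sketched in plain Python. This is only an illustration of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and splitting the sentence into two input records is an assumption. Note that a plain whitespace tokenizer keeps "projects" and "project" apart, so it does not merge them into a single key.

```python
from collections import defaultdict


def map_phase(index, line):
    """Map: emit a (word, 1) pair for every word of one input record."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group the emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return dict(sorted(grouped.items()))


def reduce_phase(word, values):
    """Reduce: collapse the list of ones into a single count."""
    return word, sum(values)


def word_count(lines):
    # Each record is an (index, line) pair, mirroring the (index, "…") records.
    pairs = []
    for index, line in enumerate(lines):
        pairs.extend(map_phase(index, line))
    return dict(reduce_phase(w, vs) for w, vs in shuffle(pairs).items())


# Hypothetical split of the example sentence into two input records.
counts = word_count(["Liferay projects is the best", "Open Source project"])
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process to keep the data flow visible.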
• Batch model data crunching
• Not so good at event stream processing
• But …
• Many algorithms are hard to implement using MapReduce
• Cascading, Scalding, Cascalog, Impala, …
Apache Storm
• Distributed realtime computation system
• Makes it easy to reliably process unbounded streams of data
• Multi-language support
• Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
[Figure: Storm topology with two spouts feeding three bolts]
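The spout and bolt roles can be sketched with plain Python generators. This is not Storm's API; the names `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical, and the spout is a short finite list where a real stream would be unbounded.

```python
from collections import Counter


def sentence_spout():
    """Spout: a source of tuples (finite here, unbounded in a real stream)."""
    for sentence in ["big data is big", "data streams never end"]:
        yield sentence


def split_bolt(sentences):
    """Bolt: consume sentences and emit one tuple per word."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def count_bolt(words):
    """Bolt: keep a running count and emit the updated (word, count) pair."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Wire the topology: spout -> split bolt -> count bolt.
updates = list(count_bolt(split_bolt(sentence_spout())))
```

Each tuple flows through the whole chain as soon as it is produced, which is the main contrast with the batch model: results are updated per event instead of once per job.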
Apache Spark
• Fast and general-purpose cluster computing
• Developed by the Berkeley AMPLab
• High-level APIs (not MapReduce)
• Optimised engine:
– Supports general execution graphs
• Higher-level tools:
– Spark SQL, MLlib, Spark Streaming, GraphX
Apache Mahout
• Scalable machine learning library
• Built on top of Hadoop
• Some algorithms don’t require Hadoop at all
R language
• Focused on:
– Data visualisation
– Statistical computations
– Analysis of data
• Tons of built-in packages
• Connects to Hadoop through Hadoop Streaming
• Not a fast language
Reference Architecture
[Figure: reference architecture. Labels: RDBMS, Event Broker, Hadoop, User Tracking, NoSQL Storage, System Events, Search, Data, Logs, Monitoring, Data Warehouse, Streaming, Social Graph]
Datasources
• System events
• User tracking (client side)
– Clicks, navigation, activities, …
• Monitoring (transactions, page load times, …)
• Models (message boards, blogs, wiki, …)
• Custom developments …
Event broker
[Figure: a data source writes to a log with positions 0 to 9; System A and System B each read from the log]
Apache Kafka
• Publish-subscribe as a distributed commit log
• Fast
• Scalable
• Durable
• Distributed by design
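The commit log idea behind publish-subscribe can be sketched in a few lines of Python. This is a single-process illustration, not Kafka's client API: the classes `CommitLog` and `Consumer` and their methods are hypothetical. The point it shows is that the log is append-only and that each reader keeps its own offset, so two systems can consume the same records at different speeds.

```python
class CommitLog:
    """Append-only sequence of records, each addressed by its offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Records are only ever added at the end; the offset is the position.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Reading never removes anything, so many readers can share the log.
        return self._records[offset:offset + max_records]


class Consumer:
    """A reader that tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for i in range(10):
    log.append(f"event-{i}")  # the data source writes positions 0 to 9

system_a = Consumer(log)
system_b = Consumer(log)
batch_a = system_a.poll(max_records=4)   # System A reads slowly
batch_b = system_b.poll(max_records=10)  # System B reads everything at once
```

Because consumption is just a moving offset, a slow or restarted consumer does not affect the others; it simply resumes from the last position it recorded.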
[Figure: Kafka cluster with a producer, a consumer, Brokers A, B and C, and ZooKeeper]
Computation and analytics
Batch processing?
Real time processing?
Machine learning algorithms?
Graph analysis?
Unified programming model?
Apache Spark
• Fast and general engine for large-scale data processing
• Write your apps in Java, Scala or Python
• Runs on the YARN cluster manager
• Can read any existing Hadoop data (HDFS)
• In memory or on disk
Apache Spark Main Components
[Figure: Spark SQL, Spark Streaming, MLlib and GraphX shown as components of Apache Spark]
Spark Core
• A driver program runs the main function and executes various parallel operations on a cluster
• Resilient Distributed Datasets (RDD), created from:
– A file in HDFS (or any Hadoop file system)
– A Scala collection
• Second abstraction: shared variables
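Spark evaluates RDD transformations lazily: nothing is computed until an action asks for a result. The sketch below imitates that behaviour in plain Python. It is not Spark's API and runs on one machine; the class `LazyDataset` is hypothetical, and its `map`, `filter` and `collect` methods only borrow the names of the corresponding RDD operations.

```python
class LazyDataset:
    """Records a chain of transformations and runs it only on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded steps over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


evaluated = []


def square(x):
    evaluated.append(x)  # track when the function really runs
    return x * x


numbers = LazyDataset(range(1, 6))           # built from a local collection
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
seen_before_action = list(evaluated)         # still empty: nothing ran yet
result = even_squares.collect()
```

Keeping the recipe instead of the data is also what lets a real RDD be recomputed after a failure, which is where the "resilient" in the name comes from.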
Spark SQL
• Mix SQL queries with Spark programs
• Unified data access
• Hive compatibility
• Standard JDBC or ODBC connectivity
• Same engine for both interactive and long-running queries
Spark Streaming
• Build your apps using high-level operators
• Fault tolerance: exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
• Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
• Define your own custom data sources
Spark MLlib
• Basic statistics
– Summary statistics
– Correlations
– …
• Classification and regression
– Linear models
– Decision trees
– Naive Bayes
• Clustering
– K-Means
• Collaborative filtering
– Alternating least squares
• Dimensionality reduction
– Singular value decomposition
– Principal component analysis
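As an example of what one of these algorithms does, here is a minimal K-Means sketch on one-dimensional points in plain Python. It is not MLlib's implementation or API: the function `kmeans_1d`, the sample points and the starting centroids are all assumptions chosen for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical data: two obvious groups around 2 and 11.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```

A distributed version runs the assignment step in parallel over partitions of the data and only ships the per-cluster sums and counts back, which is why the algorithm fits a cluster engine well.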
Spark GraphX
• Graphs API and graph-parallel computation
• Growing scale and importance
– From social networks to language modelling
• Directed multigraph with properties attached to each vertex and edge
• Growing collection of graph algorithms and builders
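A directed multigraph with properties can be sketched as a vertex table plus an edge list. The example below is plain Python, not GraphX's API; the `Edge` class, the helper functions and the sample users are hypothetical. Two edges share the same source and destination, which is what makes it a multigraph.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int    # source vertex id
    dst: int    # destination vertex id
    attr: str   # property attached to the edge


# Vertex id -> property attached to the vertex (hypothetical users).
vertices = {1: "alice", 2: "bob", 3: "carol"}

# Parallel edges between 1 and 2 are allowed.
edges = [
    Edge(1, 2, "follows"),
    Edge(1, 2, "messaged"),
    Edge(2, 3, "follows"),
    Edge(3, 1, "follows"),
]


def out_degrees(edges):
    """Count outgoing edges per vertex."""
    return Counter(e.src for e in edges)


def triplets(vertices, edges):
    """Join each edge with the properties of both of its endpoints."""
    return [(vertices[e.src], e.attr, vertices[e.dst]) for e in edges]


degrees = out_degrees(edges)
views = triplets(vertices, edges)
```

Storing vertices and edges as two flat collections is what allows a graph to be partitioned across a cluster and processed with the same operators as any other dataset.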
Live demo!
Building a message classifier
Takeaways
• Not about data size, but how you use it
• You already own tons of data, you just need to get value from it
• There is no silver bullet: you have plenty of alternatives
• JVM-based Big Data technologies are usually a great choice
• Try it yourself!!
References
• Apache Kafka
• Apache Spark
• Apache Storm
• Apache Hadoop
• Big Data definition at Wikipedia
• Liferay Kafka Bridge
• What every software engineer should know about a log
Thank you!!
Questions 
(and hopefully answers)

Liferay & Big Data Dev Con 2014
