SlideShare une entreprise Scribd logo
1  sur  16
Nils Kübler, MeMo News, @nkuebler

MeMo News: Online Media
Monitoring with Hadoop
14.05.12
Classic Media Monitoring




                           14.05.12
Online Media Monitoring




                          14.05.12
Social Media Monitoring




                          14.05.12
Web Crawler Types

                                Intrinsic
                                Quality


                     Research                  Focused
                     Crawlers                  Crawlers




                                    General
                                    Crawler
              Archive
                                                     News agents
              Crawlers




                                   Mirroring

    Representational               systems                         Freshness
    Quality

                                                                               14.05.12
Monitoring with MeMo News




                            14.05.12
Search Engine Architecture


Offline                                      Online


 Retrieving   Gathering   Indexing   Index       Search




                                                          14.05.12
Offline-Part: Technology-
Stack




                            14.05.12
Offline-Part: Technology-
Stack




                            14.05.12
The MeMo Newsagent




                     14.05.12
Cluster: Masters




                   14.05.12
Cluster: Workers




                   14.05.12
Scaling the Newsagent




                        14.05.12
FAQ

QUESTIONS?
Attributions

Photos:
http://www.flickr.com/photos/foreignoffice/4036442
903/
http://www.flickr.com/photos/hytok/2640161873/
http://www.flickr.com/photos/videocrab/116136642/
http://commons.wikimedia.org/wiki/File:Crystal_Cle
ar_app_personal.png




                                                     14.05.12

Contenu connexe

En vedette

Hortonworks Technical Workshop: Apache Ambari
Hortonworks Technical Workshop:   Apache AmbariHortonworks Technical Workshop:   Apache Ambari
Hortonworks Technical Workshop: Apache AmbariHortonworks
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaSkillspeed
 
Designing Puppet: Roles/Profiles Pattern
Designing Puppet: Roles/Profiles PatternDesigning Puppet: Roles/Profiles Pattern
Designing Puppet: Roles/Profiles PatternPuppet
 
Big Data Analytics : A Social Network Approach
Big Data Analytics : A Social Network ApproachBig Data Analytics : A Social Network Approach
Big Data Analytics : A Social Network ApproachAndry Alamsyah
 
Social media mining PPT
Social media mining PPTSocial media mining PPT
Social media mining PPTChhavi Mathur
 
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...Sonatype
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingBart Vandewoestyne
 
Big data and Social Media Analytics
Big data and Social Media AnalyticsBig data and Social Media Analytics
Big data and Social Media AnalyticsSimplify360
 
Hadoop Monitoring best Practices
Hadoop Monitoring best PracticesHadoop Monitoring best Practices
Hadoop Monitoring best PracticesEdward Capriolo
 

En vedette (15)

Hortonworks Technical Workshop: Apache Ambari
Hortonworks Technical Workshop:   Apache AmbariHortonworks Technical Workshop:   Apache Ambari
Hortonworks Technical Workshop: Apache Ambari
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Designing Puppet: Roles/Profiles Pattern
Designing Puppet: Roles/Profiles PatternDesigning Puppet: Roles/Profiles Pattern
Designing Puppet: Roles/Profiles Pattern
 
Social media with big data analytics
Social media with big data analyticsSocial media with big data analytics
Social media with big data analytics
 
Big Data Analytics : A Social Network Approach
Big Data Analytics : A Social Network ApproachBig Data Analytics : A Social Network Approach
Big Data Analytics : A Social Network Approach
 
Social media mining PPT
Social media mining PPTSocial media mining PPT
Social media mining PPT
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Big data and Social Media Analytics
Big data and Social Media AnalyticsBig data and Social Media Analytics
Big data and Social Media Analytics
 
Hadoop Monitoring best Practices
Hadoop Monitoring best PracticesHadoop Monitoring best Practices
Hadoop Monitoring best Practices
 

Similaire à 14.05.2012 Social Media Monitoring with Hadoop (Nils Kübler, MeMo News)

IT Vulnerability & Tools Watch 2011
IT Vulnerability & Tools Watch 2011IT Vulnerability & Tools Watch 2011
IT Vulnerability & Tools Watch 2011WASecurity
 
Cloud Native Computing: What does it mean, and is your app Cloud Native?
Cloud Native Computing: What does it mean, and is your app Cloud Native?Cloud Native Computing: What does it mean, and is your app Cloud Native?
Cloud Native Computing: What does it mean, and is your app Cloud Native?Michael O'Sullivan
 
Capacity Planning Free Solution
Capacity Planning Free SolutionCapacity Planning Free Solution
Capacity Planning Free Solutionluanrjesus
 
Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...
Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...
Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...Burr Sutter
 
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleJAXLondon_Conference
 
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"Daniel Bryant
 
Dipping Your Toes Into Cloud Native Application Development
Dipping Your Toes Into Cloud Native Application DevelopmentDipping Your Toes Into Cloud Native Application Development
Dipping Your Toes Into Cloud Native Application DevelopmentMatthew Farina
 
Big Data as a Service: A Neo-Metropolis Model Approach for Innovation
Big Data as a Service: A Neo-Metropolis Model Approach for InnovationBig Data as a Service: A Neo-Metropolis Model Approach for Innovation
Big Data as a Service: A Neo-Metropolis Model Approach for InnovationSoftServe
 
DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...
DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...
DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...Daniel Bryant
 
StackEngine Demo - Boston
StackEngine Demo - BostonStackEngine Demo - Boston
StackEngine Demo - BostonBoyd Hemphill
 
Trends in Human-Computer Interaction in Information Seeking
Trends in Human-Computer Interaction in Information SeekingTrends in Human-Computer Interaction in Information Seeking
Trends in Human-Computer Interaction in Information SeekingRich Miller
 
Production-Ready_Microservices_excerpt.pdf
Production-Ready_Microservices_excerpt.pdfProduction-Ready_Microservices_excerpt.pdf
Production-Ready_Microservices_excerpt.pdfajcob123
 
Qconny2014dmarsh 140613080328-phpapp02
Qconny2014dmarsh 140613080328-phpapp02Qconny2014dmarsh 140613080328-phpapp02
Qconny2014dmarsh 140613080328-phpapp02재구 김
 
Semantic Web Landscape 2009
Semantic Web Landscape 2009Semantic Web Landscape 2009
Semantic Web Landscape 2009LeeFeigenbaum
 
Emerging trends in software development: The next generation of storage
Emerging trends in software development: The next generation of storageEmerging trends in software development: The next generation of storage
Emerging trends in software development: The next generation of storageDonnie Berkholz
 
3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...
3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...
3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...Florian Blum
 
server to cloud: converting a legacy platform to an open source paas
server to cloud:  converting a legacy platform to an open source paasserver to cloud:  converting a legacy platform to an open source paas
server to cloud: converting a legacy platform to an open source paasTodd Fritz
 

Similaire à 14.05.2012 Social Media Monitoring with Hadoop (Nils Kübler, MeMo News) (20)

IT Vulnerability & Tools Watch 2011
IT Vulnerability & Tools Watch 2011IT Vulnerability & Tools Watch 2011
IT Vulnerability & Tools Watch 2011
 
Cloud Native Computing: What does it mean, and is your app Cloud Native?
Cloud Native Computing: What does it mean, and is your app Cloud Native?Cloud Native Computing: What does it mean, and is your app Cloud Native?
Cloud Native Computing: What does it mean, and is your app Cloud Native?
 
Capacity Planning Free Solution
Capacity Planning Free SolutionCapacity Planning Free Solution
Capacity Planning Free Solution
 
Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...
Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...
Teaching Elephants to Dance (Federal Audience): A Developer's Journey to Digi...
 
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
 
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
 
Dipping Your Toes Into Cloud Native Application Development
Dipping Your Toes Into Cloud Native Application DevelopmentDipping Your Toes Into Cloud Native Application Development
Dipping Your Toes Into Cloud Native Application Development
 
Big Data as a Service: A Neo-Metropolis Model Approach for Innovation
Big Data as a Service: A Neo-Metropolis Model Approach for InnovationBig Data as a Service: A Neo-Metropolis Model Approach for Innovation
Big Data as a Service: A Neo-Metropolis Model Approach for Innovation
 
DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...
DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...
DevoxxUK 2016: "DevOps: Microservices, containers, platforms, tooling... Oh y...
 
StackEngine Demo - Boston
StackEngine Demo - BostonStackEngine Demo - Boston
StackEngine Demo - Boston
 
2016 open-source-network-softwarization
2016 open-source-network-softwarization2016 open-source-network-softwarization
2016 open-source-network-softwarization
 
2016 open-source-network-softwarization
2016 open-source-network-softwarization2016 open-source-network-softwarization
2016 open-source-network-softwarization
 
Trends in Human-Computer Interaction in Information Seeking
Trends in Human-Computer Interaction in Information SeekingTrends in Human-Computer Interaction in Information Seeking
Trends in Human-Computer Interaction in Information Seeking
 
Production-Ready_Microservices_excerpt.pdf
Production-Ready_Microservices_excerpt.pdfProduction-Ready_Microservices_excerpt.pdf
Production-Ready_Microservices_excerpt.pdf
 
Qconny2014dmarsh 140613080328-phpapp02
Qconny2014dmarsh 140613080328-phpapp02Qconny2014dmarsh 140613080328-phpapp02
Qconny2014dmarsh 140613080328-phpapp02
 
DevOps Note 20120224
DevOps Note 20120224DevOps Note 20120224
DevOps Note 20120224
 
Semantic Web Landscape 2009
Semantic Web Landscape 2009Semantic Web Landscape 2009
Semantic Web Landscape 2009
 
Emerging trends in software development: The next generation of storage
Emerging trends in software development: The next generation of storageEmerging trends in software development: The next generation of storage
Emerging trends in software development: The next generation of storage
 
3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...
3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...
3-D Cloud Monitoring: Enabling Effective Cloud Infrastructure and Application...
 
server to cloud: converting a legacy platform to an open source paas
server to cloud:  converting a legacy platform to an open source paasserver to cloud:  converting a legacy platform to an open source paas
server to cloud: converting a legacy platform to an open source paas
 

Plus de Swiss Big Data User Group

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useSwiss Big Data User Group
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorSwiss Big Data User Group
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisSwiss Big Data User Group
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesSwiss Big Data User Group
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningSwiss Big Data User Group
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseSwiss Big Data User Group
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexitySwiss Big Data User Group
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceSwiss Big Data User Group
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketSwiss Big Data User Group
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridSwiss Big Data User Group
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseSwiss Big Data User Group
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computingSwiss Big Data User Group
 

Plus de Swiss Big Data User Group (20)

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to use
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operator
 
Data Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2CData Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2C
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data Analysis
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Educating Data Scientists of the Future
Educating Data Scientists of the FutureEducating Data Scientists of the Future
Educating Data Scientists of the Future
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data Warehouse
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexity
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
In-Store Analysis with Hadoop
In-Store Analysis with HadoopIn-Store Analysis with Hadoop
In-Store Analysis with Hadoop
 
Big Data Visualization With ParaView
Big Data Visualization With ParaViewBig Data Visualization With ParaView
Big Data Visualization With ParaView
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 

Dernier

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Dernier (20)

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

14.05.2012 Social Media Monitoring with Hadoop (Nils Kübler, MeMo News)

Notes de l'éditeur

  1. Welcome, my name is Nils Kübler, I work for MeMo News in Kreuzlingen. This is my first english presentation so it is not that fluent, please excuse that. Today I am going to introduce you our first hadoop projject, which is our Media Monitoring Platform at   MeMo news . In this presentation, I will first show you what Media Monitoring is about and how online and social monitoring works . second i will come on some of the core features of the memonews-platform , because they are important to understand some of our design decisions. In the third part I will come to the basics of our architecture ,  will introduce our technology stack and how our software uses the technology. In the final I will give you an overview of our little cluster and will wrap-up by showing how our software currently scales .
  2. Media Monitoring means monitoring the output of print, online and broadcast data.   It helps the customers to validate the results of marketing campains or to react early on changes in the public opinion .  I will explain you classic and online media monitoring types: The classical approach , is to systematically record radio or television broadasts and to collect clippings from print media and scan them for keywords .
  3. In the online world, there are two common forms: the online media monitoring , where online-sources such as news-portals or web-blogs are used as sources.
  4. And another form is the Social Media monitoring , which primary monitors social networks such as twitter, facebook or even forums.  MeMo News is both , an online media monitor and also a social media monitoring tool. Q: Has anybody an idea what type of software could be used for such a thing?  A: By downloading the internet :) Q: And how is that done? A: with a web-crawler
  5. So what's a web-crawler? A web-crawler downloads contents from the web, gathers data from the downloaded contents and persists them somewhere. Most likely in na search-index , to make it searchable.  A web crawler has to balance between quality of the downloaded contents and the freshness . We also distinquish between different aspects of  quailty.  For example, an archive crawler tries to copy objects of the web as good as possible. This means they have a good representional quality . A Research crawler on the other hand, tries to give weight on data that is relevant to the user. This means, they have good instrinsic quality . The crawler type which fits the online-monitoring task best, is the newsagent . The newsagent primary focuses on having very fresh information . It is also sometimes called "near-realtime" search. Before we going deeper into the details of our newsagent , I will explain you the most important use-cases with the MeMo-News platform, because they are important to understand our architecture.
  6. Monitoring with memonews works by creating so called search agents . The user creates the agents in the platform . Each agent consists a name and a query . The queries are then used to generate mail-reports or realtime alerts when articles that match are found. Additionally, the user is also able to login to the web-platform , where he can navigate through the agents , look at the most recent results and may also put results to different archives , which will never get delelted.   Because of this, we can never delete anything from our search index, that means it grows and grows.
  7. Here you can see the mask for editing an example agent for monitoring articles mentioning the Swiss Hadoop User group. Lets take a deeper look into the architecture of a search-engine.
  8. A search engine consists of two parts: The online and offline part. The online part is responsible for the user-interaction with our system.  The offline part is about everything that happens under the hood. That is: retrieving the contents from the web, gathering the important contents from it, and store them to our index.  In the following we focus primary on the offline part, which is running on the hadoop platform in memonews.
  9. Our Technology-Stack consists of 5 Maint-Parts: Data-Storage is provided by HDFS, and we use coordination via zookeeper. HBase extends HDFS by prioding Random Data Access and MapReduce allowes us to process both: HDFS and HBase Data in an asynchronous way.  The most important component though, is the solr , which isn't part of the core hadoop platform. We won't get deeper into those technologies here, but i would like to ask if there is interest about some of these topics for future presentations? 
  10. Our Newsagent directly uses all of these parts:  - We make heavy use of zookeeper to coordinate our downloaders. - We persist everithing we download on HBase . - This Data will also get Indexed to SOLR - And We make also heavy use of Mapreduce for Job-Scheduling and for asynchronous anlyzation tasks, such as priority calculation for downloads. OK, then lets take a look into our newsagent
  11. About every 3 Minutes or so, we make a full scan of all our known sources via mapreduce , and check if they need to be downloaded. If so, a job will be generated on zookeeper . This is what we call scheduling . We have a distributed application called the http-loader which is responsible to download sources. Every Http-Loader knows exactly for which jobs he is responsible, and will execute it, as soon as possible. When the source is downloaded, the new articles are stored on hbase.  On each update on hbase, the lily-Rowlog is triggered, which will start an update of the search-index and also do a so called prospective search  to send realtime-alerts to users.  During the Prospective search  we check every new article for matches with any of the existing agents. We used the lily-rowlog as trigger-mechanismn, because the previous version of hbase did not provide coprocessors. Coprocessors are like triggers in RDBMS, they enable to execute custom code as soon as data in hbase changes.  In a future version we will replace that, we already got an experimental implementation for that. Is there any interest about prospecivte search, Coprocessors  or lily rowlog  for a future presentation? OK, then let's take a short look in our current cluster setup
  12. I have separated the setup in  two slides, this here are our masters. We got 2 Virtual Master hosts, where we have each Master running in it's own Virtual Machine.  There is always only one VM active for each service, the other  one will only get activated when the active VM is down.  So even when one Virtual Master Host fail, the other one is able to take over all master-services. This gives us a pretty good reliability. The SOLR-Server need to run on a standalone machine, because it requires so much resources. Here we follow th same pattern, when the active Server fails, the other one will take over the work.  ... pause ... Currently one machine is suitable to hold our SOLR shards, but we will soon need to distribute them across multiple machines, because our index grows and grows.
  13. Our "unit of scale" is the Worker. Each worker has the typical hadoop-services installed: Datanode, Tasktracker and Regionserver. And currently 3 of the Workers are also providing our Zookeeper-Quorum, which we may migrate some time to another place. We also have two own processes runng on each  worker: The http-loader and the index-updater, which i introduced already on a previous slide. ... pause ... So, we already reached the last slide, where we can see how our  newsagent currently scales.
  14. As you can see, with 1 worker we download around 15 articles per second.  With 2 workers, this number is around doubled.  The scalling seems to be reduced with 3 or 4 workers, but this is only one measurement. We are positive that we can handle that, even when this graphic may suggest that we will reach an scalability-problem soon.    And we definitly will need to scale our system more as we want to crawl more sources. Another point why we will need to introduce much more workers is, that our analyzis tasks will grow and grow as we introduce more and more analytic tasks, such as sentiment analyzis. .. pause ... so thats it. Thank you for your attention ...
  15. ... any questions?
  16. As you can see, with 1 worker we download around 15 articles per second.  With 2 workers, this number is around doubled.  The scalling seems to be reduced with 3 or 4 workers, but this is only one measurement. We are positive that we can handle that, even when this graphic may suggest that we will reach an scalability-problem soon.    And we definitly will need to scale our system more as we want to crawl more sources. Another point why we will need to introduce much more workers is, that our analyzis tasks will grow and grow as we introduce more and more analytic tasks, such as sentiment analyzis. .. pause ... so thats it. Thank you for your attention ...