School of something
          Computing
FACULTY OF ENGINEERING
           OTHER




Distributed Data Mining for User Sensemaking
        in Online Collaborative Spaces


                             Submitted to:
               DicoSyn2012 Workshop @ CSCW’12


Presented By: Ahmad Ammari
RF in User & Community Modelling
OUTLINE
• The Big Data “Problem” in Online Collaborative Spaces
• What is User Sensemaking and How is Big Data Affecting it?
• Can Distributed Data Mining Help?
 • What is Hadoop & Map / Reduce?
 • What is Mahout?
• Proposed Approach to support User Sensemaking in OCS
 • Content Pre-Processing
 • Content Clustering
 • Topic Modelling
• Case Study: Making Sense of Online Forums
 • How are Discussions currently Organized? Clusters vs. Categories
 • Which Content to Mine? Mining the Right Discussion Parts
 • How Can This Help Sensemaking? Some Usage Scenarios
How “Big” is Big Data?
• Emails
  •   90 Trillion – The Number of Emails Sent on the Internet in 2009
  •   107 Trillion – The Number of Emails Sent on the Internet in 2010
• Websites
  •   234 Million – The Number of Websites by Dec 2009
  •   255 Million – The Number of Websites by Dec 2010
• Social Media
  •   152 Million – The Number of Blogs on the Internet in 2010
  •   25 Billion – The Number of sent Tweets on Twitter in 2010
• Multi Media
  •   5 Billion – The Number of Photos Hosted by Flickr (Sep 2010)
  •   2 Billion – The Number of Videos Watched per Day on YouTube


What about Online CS?

                They are Big Too!
 Top 10 biggest Internet forums




What about Online CS?

               They are Big Too!
 Stack Exchange Family of Forums




Why is it a Problem?




           Where should I post my
          programming question to
            get relevant replies?




Why is it a Problem?




            Where to find a solution
          to my MS Outlook Problem?




Why is it a Problem?




                  What are the discussions
                      really about?




     I cannot make sense
        of Big Content!

Why is Making Sense of Big Data
        neither Easy nor Fast?
•       Because it’s Big and still increasing!
•       Because it’s Diverse!
    •     Stack Exchange Suite of Forums has more than
          50 Different Technical Discussion Forums
    •     WebProWorld Technical Forums has more than
          40 Discussion Categories
•       Because it’s Dynamic!
    •     294 Billion – The Average Number of Email Messages per Day
    •     21.4 Million – The Number of Added Websites in 2010
    •     96,101 New Blogs in last 24 hours (8th Dec 2011)
    •     190 Million – The Number of Tweets per day
          in June 2011
•       Because it’s Noisy!
    •     200 billion – The number of spam emails per day in 2009
    •     262 billion – The number of spam emails per day in 2010
But What is “Sensemaking”?!
• Creating a representation of a collection of information [Russell et al, 1993]
    •     Focused on the context of understanding large document collections. [Paul et al, 2011]
•       Transforming Information into Knowledge [Pirolli & Card, 2005]
    •     Seeking, filtering, searching for relations, extracting, schematizing
•       Understanding connections among people, places, and events [Klein et al,
        2006]




Our Solution!




Large-Scale Data Processing
• Quick Data Processing
• Scalable Data Processing
• Robust Data Processing

Knowledge Discovery in Big Content
• Analysis of Unstructured Data
• Machine Intelligence to Support Humans

What is Hadoop?
•       A framework for storing and processing big data on lots of commodity
        machines
    •     Up to 4,000 machines in a cluster
    •     Up to 20 PB in a cluster
•       Open Source Apache project
•       Implemented in Java
•       Contains Many Sub-Projects:
    •     Map/Reduce – Software Framework for Distributed Processing of Large Datasets
    •     HDFS – Hadoop Distributed File System
    •     Hadoop Common – Provides Access to the File Systems Supported by Hadoop
    •     Chukwa – Data Collection System for Managing Large Distributed Systems
    •     HBase – Scalable, Distributed Database that Supports Structured Data Storage
    •     Hive – Data Warehouse Infrastructure that provides Data Summarization & Ad Hoc Querying
    •     Pig – High-Level Data-Flow Language & Execution Framework for Parallel Computation
    •     ZooKeeper – High-Performance Coordination Service for Distributed Applications

We focused on distributed computation with Map/Reduce
Who Uses Hadoop?

Why Do They Use Hadoop?

Hadoop Map/Reduce

• Simply: A parallel programming model and an associated
  implementation
• Abstract model: hides many system-level details from the
  programmer
• Move-code-to-data philosophy: computation on a piece of data takes
  place on the same machine where that piece resides
• A Map/Reduce Job runs in Phases; each Phase runs in Parallel
  across all Nodes in the Hadoop Cluster
• Main Phases: Mapping, Reducing
• Are there Other Phases? Yes!
  • Shuffling & Sorting, Combining, Partitioning
  • But the Programmer writes only the “Mapper” and “Reducer” functions!
Hadoop Map/Reduce




        More formally,
        • Map(k1, v1) → list(k2, v2)
          • Shuffle & Sort(list(k2, v2)) → k2, list(v2)
        • Reduce(k2, list(v2)) → list(k3, v3)
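
The three signatures above can be sketched in plain Python as a single-process word count. This is only an illustration of the programming model, not Hadoop code: the function names `map_fn`, `shuffle_and_sort`, `reduce_fn` and `run_job`, and the sample posts, are assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the text."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle_and_sort(pairs):
    """Group all values by key and return the groups ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word."""
    return [(word, sum(counts))]

def run_job(documents):
    """Run map, shuffle & sort, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_fn(k, v) for k, v in documents.items())
    grouped = shuffle_and_sort(mapped)
    return list(chain.from_iterable(reduce_fn(k, vs) for k, vs in grouped))

# Hypothetical input: two short forum posts keyed by an invented id.
docs = {"post-1": "Hadoop runs map tasks", "post-2": "map then reduce"}
word_counts = run_job(docs)
```

In a real cluster the map and reduce calls would run in parallel on different nodes and the framework would perform the shuffle; here they run one after another only to make the data flow visible.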
What is Mahout?
•       Open source machine learning library from Apache
•       Began life in 2008 as a subproject of Apache’s Lucene Search Engine
•       In 2009 absorbed the Taste open source collaborative filtering project
•       In April 2010 became a stand-alone Project
•       Written in Java
•       ML algorithms mainly for
    •     Recommender Engines (CF-based)
    •     Clustering
    •     Classification
•       Pre-Processing algorithms for Unstructured Data
•       Scalability is achieved by Map/Reduce Implementations of ML Algorithms

We focused on Mahout Clustering and Pre-Processing Implementations in Map/Reduce
Sensemaking-Support with DDM

• INPUT: Collaboration Content (Discussions)
• Content Pre-Processing: Prepare Content for Mining
• Content Clustering: Derive Groups of Similar Content
• Topic Modelling: Identify Fine-Grained Topics and Generate Topic Clouds
• OUTPUT: Topic Clouds

Content Pre-Processing

• Apache Lucene Text Analysis
  •   Tokenization, Non-Letter Removal, Lowercasing, Stop Word Removal
• TF-IDF Weighting: Computing Numerical Weights for Content Terms
• n-gram Collocations
  •   Multi-Term Phrases having high probability of occurring together
  •   Examples: “Social Media”, “Data Mining”, “Machine Learning”
• Normalization
  •   decreasing the magnitude of large document vectors & increasing the magnitude
      of small ones
  •   p-norm
  •   p depends on similarity measure used
  •   With Text Content, best similarity measures are Euclidean & Cosine → p = 2
  •   Example: the 2-norm of a 3-dimensional
         vector, [x, y, z], is √(x² + y² + z²)
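
The pre-processing steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the Lucene or Mahout implementation: the stop-word list, the sample posts and the function names are invented for the example, and the weight uses the textbook form tf × log(N / df), which may differ from the exact formula Mahout applies. n-gram collocations are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"the", "a", "of", "to", "is", "in", "and", "my"}

def tokenize(text):
    """Lowercase, keep runs of letters only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def p_normalize(vector, p=2):
    """Divide every weight by the vector's p-norm; a zero vector is returned unchanged."""
    norm = sum(abs(w) ** p for w in vector.values()) ** (1.0 / p)
    if norm == 0:
        return dict(vector)
    return {term: w / norm for term, w in vector.items()}

# Hypothetical forum posts used only to exercise the functions.
posts = [
    "Outlook is not sending my email",
    "Backup of the hard disk to an external disk",
    "Outlook backup file is corrupted",
]
vectors = [p_normalize(v) for v in tfidf(posts)]
```

After normalization with p = 2 every document vector has length 1, so long and short posts become comparable under Euclidean or cosine similarity.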
Content Clustering

Discovering Clusters of “similar” Points

Figure: EM algorithm applied to a 2-component Gaussian mixture model on the Old Faithful Geyser dataset (http://bit.ly/oldfaithful)

K-Means Clustering

Map/Reduce Implementation in Mahout
1. Starting with three random points as centroids
2. Map stage: assigns each point to the cluster nearest to it
3. Reduce stage: the associated points are averaged out to produce the new location of the centroid
4. After each iteration, the final configuration is fed back into the same loop until the centroids come to rest at their final positions
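
The four steps above can be sketched as a single-process Python loop with separate map and reduce stages. This is an illustration under stated assumptions, not Mahout's implementation: the function names and sample points are invented, and the sketch starts from two fixed centroids instead of three random ones so that the result is reproducible.

```python
import math
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def map_stage(points, centroids):
    """Emit (cluster_id, point) for every point."""
    return [(nearest(p, centroids), p) for p in points]

def reduce_stage(assignments, centroids):
    """Average the points assigned to each cluster to get its new centroid."""
    groups = defaultdict(list)
    for cluster_id, point in assignments:
        groups[cluster_id].append(point)
    new_centroids = list(centroids)  # a cluster with no points keeps its old centroid
    for cluster_id, members in groups.items():
        new_centroids[cluster_id] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids

def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map and reduce until no centroid moves by more than tol."""
    for _ in range(max_iter):
        updated = reduce_stage(map_stage(points, centroids), centroids)
        moved = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if moved <= tol:
            break
    return centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Each pass of the loop corresponds to one Map/Reduce job: the map stage only needs the current centroids, so it can run independently on every data partition.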
Canopy Clustering
• Fast approximate clustering technique
• Divide the input set of points into overlapping clusters known as canopies
• In Mahout, it is used to estimate the approximate cluster centroids (or canopy
  centroids) using two distance thresholds, T1 and T2, with T1 > T2
1. Start with a point and mark it as part of a canopy
2. All the points within distance T2 are removed from the data set and prevented from becoming new canopies.
3. The points within the outer circle are also put in the same canopy, but they’re allowed to be part of other canopies. The assignment process is done in a single pass on a mapper.
4. The reducer computes the average of the centroid and merges close canopies
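
A sequential sketch of the canopy procedure above, in Python. It assumes the outer circle corresponds to T1 and the inner circle to T2, uses Euclidean distance, and runs in one process; the mapper/reducer split and the merging of close canopies described in step 4 are not reproduced. The function names and sample points are invented for the illustration.

```python
import math

def build_canopies(points, t1, t2):
    """Return a list of (center, members) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # step 1: a remaining point starts a new canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # step 3: inside the outer circle, joins this canopy
            if d >= t2:
                still_remaining.append(p)  # step 2: only points outside T2 can start canopies later
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

def centroid(members):
    """Component-wise average of the canopy members."""
    return tuple(sum(c) / len(members) for c in zip(*members))

# Hypothetical points on a line; (2.0, 0.0) ends up in two canopies.
points = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (10.0, 0.0), (10.5, 0.0)]
canopies = build_canopies(points, t1=3.0, t2=1.0)
centroids = [centroid(members) for _, members in canopies]
```

The resulting canopy centroids could then serve as the initial centroids for K-Means, replacing random starting points.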
Sensemaking in Online Forums

•   Illustration of the Approach to support user sensemaking in Online Forums
•   Content Collection from WebProWorld Technical Forums
•   Large Forum (1000s of Discussion Threads)
•   Discussions are Organized into Categories (Subforums) Defined by Forum
    Designers
•   Four subforums were chosen for the experiment:
      • Two subforums representing fairly specialized categories – SEO (Search
          Engine Optimization) and e-Commerce
      • Two subforums representing broad categories – IT and Computer
          Assistance
• Objectives for the experiment
     •   Investigate the extent of sensemaking support needed for the public
         technical forum
     •   Determine which content representation for clustering is more appropriate
         to derive topic clouds for the sensemaker
     •   Illustrate how the output of the approach could provide sensemaking
         support
Clusters vs Categories




Distribution of Four Categories in Four Mahout-based Clusters by Title

Distribution of Four Categories in Four Mahout-based Clusters by Title and First Post

Content Representation




The smaller the average DBI, the better the model is for achieving a coherent set of similar discussions.

Clustering models having item distribution values closer to 1.0 will derive minor distinct clusters with topic-specific discussions.

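
The slide does not expand "DBI"; assuming it refers to the Davies-Bouldin index, a common cluster-validity measure, the score can be sketched in Python as below. The function names and the two toy clusterings are invented for the illustration, and the sketch assumes at least two non-empty clusters with distinct centroids.

```python
import math

def centroid(members):
    """Component-wise average of the cluster members."""
    return tuple(sum(c) / len(members) for c in zip(*members))

def davies_bouldin(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Hypothetical clusterings: compact, well-separated groups vs. spread-out, close groups.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 4.0)], [(3.0, 0.0), (3.0, 4.0)]]
dbi_tight = davies_bouldin(tight)
dbi_loose = davies_bouldin(loose)
```

Compact, well-separated clusters give a lower score than spread-out, neighbouring ones, which matches the reading that a smaller average DBI indicates a more coherent set of discussions.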
Example Topic Clouds

Enabled Discovery of Topic-Specific Discussions not Obvious in Category Names:
• Disk & Keyboard Problems
• Security Issues
• Hard Disk Backup
• MS Outlook File Problems
• Certificates and Skills in Web Design
• Photo features in social networks (Facebook)
• Optimizing Search Engines for Blog Search
• Design of Data Warehousing Systems

Cross Validated Statistics Forum




Conclusion

• Big Data creates a Big Challenge to sensemaking in Online
  Collaborative Spaces
• Distributed Data Mining with Hadoop Map/Reduce and Mahout is
  exploited to support user sensemaking by summarizing the huge
  content found in Large-scale Discussion Forums
• Cluster Analysis shows that Different User-created Categories may
  contain similar Collaborative Content, making it difficult for users
  to find the content that addresses their problems / interests
• Clustering of content represented by titles produces more coherent
  clusters with more ability to uncover fine-grained discussions that are
  buried in the huge amount of content
• Mahout is not currently perfect!
   • Lack of Clustering Validity Measures
   • Lack of Dimension Reduction Algorithms (e.g. LSI) important to
       improve clustering results
   • Lack of GUI Support
                          Thank You

                            Ahmad Ammari
                         A.Ammari@leeds.ac.uk

Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Dernier (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Distributed data mining

  • 1. School of Computing, Faculty of Engineering. Distributed Data Mining for User Sensemaking in Online Collaborative Spaces. Submitted to: DicoSyn2012 Workshop @ CSCW’12. Presented by: Ahmad Ammari, RF in User & Community Modelling
  • 2. OUTLINE
      • The Big Data “Problem” in Online Collaborative Spaces
      • What is User Sensemaking and How is Big Data Affecting It?
      • Can Distributed Data Mining Help?
          • What is Hadoop & Map/Reduce?
          • What is Mahout?
      • Proposed Approach to Support User Sensemaking in OCS
          • Content Pre-Processing
          • Content Clustering
          • Topic Modelling
      • Case Study: Making Sense of Online Forums
          • How are Discussions Currently Organized? Clusters vs. Categories
          • Which Content to Mine? Mining the Right Discussion Parts
          • How Can This Help Sensemaking? Some Usage Scenarios
  • 3. How “Big” is Big Data?
      • Emails
          • 90 Trillion – The Number of Emails Sent on the Internet in 2009
          • 107 Trillion – The Number of Emails Sent on the Internet in 2010
      • Websites
          • 234 Million – The Number of Websites by Dec 2009
          • 255 Million – The Number of Websites by Dec 2010
      • Social Media
          • 152 Million – The Number of Blogs on the Internet in 2010
          • 25 Billion – The Number of Tweets Sent on Twitter in 2010
      • Multimedia
          • 5 Billion – The Number of Photos Hosted by Flickr (Sep 2010)
          • 2 Billion – The Number of Videos Watched per Day on YouTube
  • 4. What about Online Collaborative Spaces? They are Big Too! (Figure: Top 10 biggest Internet forums)
  • 5. What about Online Collaborative Spaces? They are Big Too! (Figure: Stack Exchange Family of Forums)
  • 6. Why is it a Problem? “Where should I post my programming question to get relevant replies?”
  • 7. Why is it a Problem? “Where can I find a solution to my MS Outlook problem?”
  • 8. Why is it a Problem? “What are the discussions really about? I cannot make sense of Big Content!”
  • 9. Why is Making Sense of Big Data neither Easy nor Fast?
      • Because it’s Big and still increasing!
      • Because it’s Diverse!
          • Stack Exchange Suite of Forums has more than 50 Different Technical Discussion Forums
          • WebProWorld Technical Forums has more than 40 Discussion Categories
      • Because it’s Dynamic!
          • 294 Billion – The Average Number of Email Messages per Day
          • 21.4 Million – The Number of Websites Added in 2010
          • 96,101 – The Number of New Blogs in the 24 Hours before 8th Dec 2011
          • 190 Million – The Number of Tweets per Day in June 2011
      • Because it’s Noisy!
          • 200 Billion – The Number of Spam Emails per Day in 2009
          • 262 Billion – The Number of Spam Emails per Day in 2010
  • 10. But What is “Sensemaking”?!
      • Creating a representation of a collection of information [Russell et al, 1993]
      • Focused on the context of understanding large document collections [Paul et al, 2011]
      • Transforming Information into Knowledge [Pirolli & Card, 2005]
          • Seeking, filtering, searching for relations, extracting, schematizing
      • Understanding connections among people, places, and events [Klein et al, 2006]
  • 11. Our Solution!
      • Large-Scale Data Processing: Quick Data Processing, Scalable Data Processing, Robust Data Processing
      • Knowledge Discovery in Big Content: Analysis of Unstructured Data, Machine Intelligence to Support Humans
  • 12. What is Hadoop?
      • A framework for storing and processing big data on lots of commodity machines
          • Up to 4,000 machines in a cluster
          • Up to 20 PB in a cluster
      • Open Source Apache project
      • Implemented in Java
      • Contains Many Sub-Projects:
          • Map/Reduce – Software Framework for Distributed Processing of Large Datasets
          • HDFS – Hadoop Distributed File System
          • Hadoop Common – Provides Access to the File Systems Supported by Hadoop
          • Chukwa – Data Collection System for Managing Large Distributed Systems
          • HBase – Scalable, Distributed Database that Supports Structured Data Storage
          • Hive – Data Warehouse Infrastructure that Provides Data Summarization & Ad Hoc Querying
          • Pig – High-Level Data-Flow Language & Execution Framework for Parallel Computation
          • ZooKeeper – High-Performance Coordination Service for Distributed Applications
      • We focused on distributed computation with Map/Reduce
  • 14. Why Do They Use Hadoop?
  • 15. Hadoop Map/Reduce
      • Simply: A parallel programming model and an associated implementation
      • Abstract model: hides many system-level details from the programmer
      • Move-code-to-data philosophy: computation on a piece of data takes place on the same machine where that piece resides
      • A Map/Reduce Job runs in Phases; each Phase runs in Parallel across all Nodes in the Hadoop Cluster
      • Main Phases: Mapping, Reducing
      • Are there Other Phases? Yes!
          • Shuffling & Sorting, Combining, Partitioning
      • But the Programmer writes the “Mapper” and “Reducer” functions only!
  • 17. Hadoop Map/Reduce, more formally:
      • Map(k1, v1) → list(k2, v2)
      • Shuffle & Sort(list(k2, v2)) → k2, list(v2)
      • Reduce(k2, list(v2)) → list(k3, v3)
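The three signatures above can be illustrated with a small in-memory simulation. This is only a sketch of the programming model in plain Python, not Hadoop's API: the function names (`map_reduce`, `word_count_mapper`, `word_count_reducer`) and the sample posts are hypothetical, and a real job would distribute each phase across cluster nodes.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run Map, Shuffle & Sort, and Reduce sequentially on one machine."""
    # Map: each (k1, v1) record yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # Shuffle & Sort: group all v2 values under their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce: each (k2, list(v2)) group yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def word_count_mapper(doc_id, text):
    # Hypothetical mapper: emit (word, 1) for every word in a post.
    for word in text.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    # Hypothetical reducer: sum the partial counts for one word.
    yield word, sum(counts)


posts = [
    ("post-1", "hadoop runs map reduce"),
    ("post-2", "mahout runs on hadoop"),
]
counts = dict(map_reduce(posts, word_count_mapper, word_count_reducer))
```

Only the mapper and reducer are problem-specific here; the grouping step in the middle plays the role that the framework handles on the programmer's behalf.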
  • 19. Our Solution!
      • Large-Scale Data Processing: Quick Data Processing, Scalable Data Processing, Robust Data Processing
      • Knowledge Discovery in Big Content: Analysis of Unstructured Data, Machine Intelligence to Support Humans
  • 20. What is Mahout?
      • Open source machine learning library from Apache
          • Began life in 2008 as a subproject of Apache’s Lucene Search Engine
          • In 2009 absorbed the Taste open source collaborative filtering project
          • In April 2010 became a stand-alone Project
      • Written in Java
      • ML algorithms mainly for
          • Recommender Engines (CF-based)
          • Clustering
          • Classification
      • Pre-Processing algorithms for Unstructured Data
      • Scalability is achieved by Map/Reduce Implementations of ML Algorithms
      • We focused on Mahout Clustering and Pre-Processing Implementations in Map/Reduce
  • 21. Sensemaking-Support with DDM. INPUT: Collaboration Content (Discussions)
  • 22. Sensemaking-Support with DDM. Content Pre-Processing: Prepare Content for Mining
  • 23. Sensemaking-Support with DDM. Content Clustering: Derive Groups of Similar Content
  • 24. Sensemaking-Support with DDM. Topic Modelling: Identify Fine-Grained Topics and Generate Topic Clouds
  • 25. Sensemaking-Support with DDM. OUTPUT: Topic Clouds
  • 26. Content Pre-Processing
      • Apache Lucene Text Analysis
          • Tokenization, Non-Letter Removal, Lower Case Filtration, Stop Word Removal
      • TF-IDF Weighting: Computing Numerical Weights for Content Terms
      • n-gram Collocations
          • Multi-Term Phrases having a high probability of occurring together
          • Examples: “Social Media”, “Data Mining”, “Machine Learning”
      • Normalization
          • Decreasing the magnitude of large document vectors & increasing the magnitude of small ones
          • p-norm
          • p depends on the similarity measure used
          • With Text Content, the best similarity measures are Euclidean & Cosine → p = 2
          • Example: the 2-norm of a 3-dimensional vector, [x, y, z], is $\sqrt{x^2 + y^2 + z^2}$
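The pre-processing steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the Lucene or Mahout implementation: the stop-word list, the sample documents, and the `tf * log(N / df)` weighting variant are all chosen here for brevity, and n-gram collocations are left out.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (real analyzers ship larger ones).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "my", "on"}


def tokenize(text):
    """Non-letter removal, lower-casing, tokenization, stop-word removal."""
    words = re.sub(r"[^A-Za-z]+", " ", text).lower().split()
    return [w for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors, one dict per document (tf * log(N / df))."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def l2_normalize(vec):
    """Scale a sparse vector to unit 2-norm (p = 2)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}


docs = [
    "Outlook cannot open the PST file",
    "Backup of my hard disk failed",
    "Hard disk errors in Outlook",
]
vectors = [l2_normalize(v) for v in tfidf_vectors(docs)]
```

After normalization every document vector has length 1, so long and short posts contribute comparably when Euclidean or cosine similarity is computed.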
  • 27. Content Clustering: Discovering Clusters of “similar” Points. (Figure: the EM algorithm fitted to a 2-component Gaussian mixture model on the Old Faithful Geyser dataset, http://bit.ly/oldfaithful)
  • 28. K-Means Clustering: Map/Reduce Implementation in Mahout
      1. Start with three random points as centroids
      2. Map stage: assign each point to the cluster nearest to it
      3. Reduce stage: the associated points are averaged out to produce the new location of the centroid
      4. After each iteration, the final configuration is fed back into the same loop until the centroids come to rest at their final positions
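The four steps above can be sketched as a single-machine loop in which the map and reduce stages are ordinary functions. This is a minimal illustration, not Mahout's code: the names `kmeans_iteration` and `kmeans`, the convergence tolerance, and the choice to pass the initial centroids in explicitly (instead of picking random points) are assumptions made so that the example is deterministic.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans_iteration(points, centroids):
    # Map stage: emit (cluster index, point) for every point.
    assigned = defaultdict(list)
    for p in points:
        assigned[nearest(p, centroids)].append(p)

    # Reduce stage: average the points of each cluster into a new centroid.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = assigned.get(i)
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid in place
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=20, tol=1e-6):
    """Feed each configuration back into the loop until centroids stop moving."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        moved = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if moved < tol:
            break
    return centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final = kmeans(points, [(0, 0), (10, 10)])
```

In a distributed run each iteration would be one Map/Reduce job, with the centroids from the previous job supplied as input to the next.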
  • 29. Canopy Clustering
      • Fast approximate clustering technique
      • Divides the input set of points into overlapping clusters known as canopies
      • In Mahout, it is used to estimate the approximate cluster centroids (or canopy centroids) using two distance thresholds, T1 and T2, with T1 > T2
      1. Start with a point and mark it as part of a canopy
      2. All the points within distance T2 are removed from the data set and prevented from becoming new canopies
      3. The points within the outer circle (distance T1) are also put in the same canopy, but they are allowed to be part of other canopies. The assignment process is done in a single pass on a mapper
      4. The reducer computes the average of the centroids and merges close canopies
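A single-machine sketch of the canopy-building pass described above may help make the roles of T1 and T2 concrete. It is not Mahout's implementation: the function name `canopy_clusters` is hypothetical, the first remaining point is simply taken as the next canopy center, and the reducer-side averaging and merging of close canopies is omitted.

```python
import math


def canopy_clusters(points, t1, t2):
    """Build overlapping canopies using a loose (t1) and a tight (t2) threshold."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    remaining = list(points)
    canopies = []
    while remaining:
        # Step 1: take a point and start a canopy around it.
        center = remaining.pop(0)
        members = [center]
        still_available = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Step 3: inside the outer circle, so it joins this canopy.
                members.append(p)
            if d >= t2:
                # Step 2: only points outside the inner circle may still
                # start or join other canopies.
                still_available.append(p)
        remaining = still_available
        canopies.append((center, members))
    return canopies


points = [(0, 0), (1, 0), (3, 0), (10, 0)]
canopies = canopy_clusters(points, t1=4, t2=2)
```

In this sample the point (3, 0) falls between the two thresholds of the first canopy, so it belongs to that canopy and also becomes the center of the next one, which is the overlap the slide describes.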
  • 30. Sensemaking in Online Forums
      • Illustration of the Approach to support user sensemaking in Online Forums
      • Content Collection from WebProWorld Technical Forums
          • Large Forum (1000s of Discussion Threads)
          • Organizes Discussions into Categories (Subforums) Defined by Forum Designers
      • Four subforums were chosen for the experiment:
          • Two subforums representing fairly specialized categories – SEO (Search Engine Optimization) and e-Commerce
          • Two subforums representing broad categories – IT and Computer Assistance
      • Objectives for the experiment
          • Investigate the extent of sensemaking support needed for the public technical forum
          • Determine which content representation for clustering is more appropriate to derive topic clouds for the sensemaker
          • Illustrate how the output of the approach could provide sensemaking support
  • 31. Clusters vs Categories
      • Figure: Distribution of Four Categories in Four Mahout-based Clusters by Title
      • Figure: Distribution of Four Categories in Four Mahout-based Clusters by Title and First Post
  • 32. Content Representation
      • The smaller the average DBI, the better the model is for achieving a coherent set of similar discussions.
      • Clustering models having item distribution values closer to 1.0 will derive minor distinct clusters with topic-specific discussions.
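The slide does not expand "DBI"; assuming it refers to the Davies-Bouldin index, a standard cluster-validity measure where lower values indicate tighter, better-separated clusters, it can be computed as in the sketch below. The function names and the two toy clusterings are hypothetical, and the sketch expects at least two non-empty clusters with distinct centroids.

```python
import math


def centroid(cluster):
    """Mean point of a non-empty cluster of equal-length tuples."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index: mean over clusters of the worst-case ratio of
    (scatter_i + scatter_j) to the distance between centroids i and j."""
    cents = [centroid(c) for c in clusters]
    # Scatter: average distance of a cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, cents[i]) for p in c) / len(c)
        for i, c in enumerate(clusters)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 4)], [(5, 0), (5, 4)]]
dbi_tight = davies_bouldin(tight)
dbi_loose = davies_bouldin(loose)
```

The compact, well-separated clustering scores lower than the spread-out one, which matches the slide's reading that a smaller index points to a more coherent set of similar discussions.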
  • 33. Example Topic Clouds: Enabled Discovery of Topic-Specific Discussions not Obvious in Category Names:
      • Disk & Keyboard Problems
      • Security Issues
      • Hard Disk Backup
      • MS Outlook File Problems
      • Certificates and Skills in Web Design
      • Photo Features in Social Networks (Facebook)
      • Optimizing Search Engines for Blog Search
      • Design of Data Warehousing Systems
  • 35. Conclusion
      • Big Data creates a Big Challenge to sensemaking in Online Collaborative Spaces
      • Distributed Data Mining with Hadoop Map/Reduce and Mahout is exploited to support user sensemaking by summarizing the huge content found in Large-scale Discussion Forums
      • Cluster Analysis shows that Different User-created Categories may contain similar Collaborative Content, making it difficult for users to find the content that addresses their problems / interests
      • Clustering of content represented by titles produces more coherent clusters that are better able to uncover fine-grained discussions buried in the huge amount of content
      • Mahout is not currently perfect!
          • Lack of Clustering Validity Measures
          • Lack of Dimension Reduction Algorithms (e.g. LSI), which are important for improving clustering results
          • Lack of GUI Support
  • 36. Thank You. Ahmad Ammari, A.Ammari@leeds.ac.uk